The engine_c=True fast path
Control(engine_c=True) is the default since v0.5.11. The fit runs
through a monolithic Cython kernel that does the entire S-init,
survivor refinement, MM iteration, optional D-scale
(Koller & Stahel 2014), and vcov_avar1 inside one nogil C block
with a single workspace allocation.
The kernel mirrors the structure of robustbase/src/lmrob.c:
LAPACK via
scipy.linalg.cython_lapack(dgesvfor the p-subset solve and the(X'WX)inversion,dgelsfor IRWLS,dsyevfor the vcov posdefify).Subset draws via numpy’s
bitgen_tcapsule (Floyd’s combination algorithm).Per-family
psi/psi'/chi/chi'kernels inlined ascdefhelpers; family dispatch is a small constant-folded enum.
When to use it
Turn it on when you want the fastest end-to-end fit and you can tolerate basin-of-attraction drift (see below):
from pylmrob import lmrob, Control
fit = lmrob(formula, df, control=Control(engine_c=True), seed=42)
It currently covers all five lmrob-supported psi families
(bisquare, hampel, optimal, lqq, ggw) and both
cov=".vcov.avar1" and the setting="KS2014" / "KS2011" SMDM
pipeline.
Behaviour at larger n
engine_c=True runs the monolithic Cython kernel only when
n*p^2 < 100,000. Above that, the threaded default path is faster
because the monolithic kernel is a single non-parallel call, whereas
the default _resample_chunk parallelises across a thread pool.
The fallback is automatic: setting Control(engine_c=True) also
enables n_workers=0 (auto threading) on the default path when the
problem is large. Users do not need to choose a path manually.
Concrete example at n=5000, p=30, nResample=500 (single-thread OpenBLAS):
config |
wall-clock |
|---|---|
|
~2200 ms |
|
~860 ms |
|
~870 ms (auto fallback) |
What you give up
The Cython subset-draw uses Floyd’s combination algorithm via numpy’s
bitgen_t. That’s mathematically equivalent to
np.random.Generator.choice(n, p, replace=False) but the byte sequence
consumed from the BitGenerator differs, so the basin of attraction can
shift slightly. On tiny-n problems (e.g. n=21 stackloss with some
seeds) the shifted basin occasionally lands on a fit where X'WX is
singular and vcov_avar1 fails. lmrob() catches that
FloatingPointError once and retries with engine_c=False so the
fit always succeeds. The fit’s coef/scale stay within ~1e-3 rerr of
the numpy path on those edge cases. To get byte-identical fits with
the pre-engine_c releases, pass Control(engine_c=False).
Note that Control(rng="R") forces engine_c=False automatically: the
monolithic engine owns its own BitGenerator and can’t share R-mode’s
unif_rand state.
Benchmark
Single-threaded BLAS, 7-rep median, four classical datasets × three settings, on a 16-core OpenBLAS Linux x86_64 box. R wall-clock on the same fits is 3-7 ms.
dataset / setting |
default |
engine_c |
speedup |
vs R |
|---|---|---|---|---|
stackloss MM |
22.4 ms |
4.0 ms |
5.6x |
1.3x R |
stackloss KS2014 |
62.8 ms |
7.1 ms |
8.8x |
1.4x R |
stackloss KS2011 |
70.4 ms |
7.2 ms |
9.8x |
1.4x R |
delivery MM |
24.7 ms |
5.1 ms |
4.8x |
1.7x R |
delivery KS2014 |
68.8 ms |
8.7 ms |
7.9x |
1.7x R |
phosphor MM |
23.5 ms |
3.7 ms |
6.4x |
1.2x R |
phosphor KS2014 |
69.5 ms |
6.1 ms |
11.4x |
1.2x R |
phosphor KS2011 |
74.2 ms |
6.2 ms |
12.0x |
1.2x R |
salinity MM |
26.8 ms |
5.2 ms |
5.2x |
1.7x R |
salinity KS2014 |
76.0 ms |
9.9 ms |
7.7x |
1.7x R |
Reproduce with scripts/bench_engine_c.py.
What’s still pure Python
The remaining Python work per fit (with the default engine_c=True) is:
the formula parser (about 0.3 ms via a hand-written fast path for the common
y ~ x1 + x2 + ...pattern; complex formulas go throughformulaic)Controlfield unpacking and a few small ndarray allocationsthe final
LmRobResultsconstruction
The user-facing array API (LmRob.fit(X, y)) skips formula parsing
entirely. The total wall-clock floor is approximately the underlying
BLAS work (the same R does internally) plus one Python/C boundary
cross. We pay that crossing because pylmrob keeps a Python
public API; R can stay in C from R_lmrob_S all the way to the
returned SEXP.