FAQ / troubleshooting
“Algorithm did not converge”
The MM IRWLS hit max_it without converging. The three usual causes:
Near-singular design. Check
np.linalg.cond(fit.design_x_); if it’s much above 1e10 you have near-collinear columns. Drop or merge.Bad initial scale. Increase
Control(nResample=1000)or switch toControl(setting="KS2014")(which uses a more stable D-scale).Contamination above 50%. Past the breakdown point of the M-estimator. Robust regression can’t fix it; you need a different model.
“X’ W X is singular; consider cov=‘.vcov.w’”
Almost every observation has been downweighted to zero, so the asymptotic covariance is rank-deficient. Either:
Switch to
Control(cov=".vcov.w")(more stable when many weights hit zero), orBootstrap:
fit.confint(method="bootstrap", n_boot=2000).
The default engine_c=True path already retries with the NumPy basin
on this error. If it bubbles up, both basins hit the same wall.
“Wald CI looks suspiciously narrow”
For small n or heavy contamination the asymptotic Wald CI
underestimates. Use the bootstrap; on n < 50 it’s typically 2-3×
wider:
fit.confint(method="bootstrap", n_boot=2000)
Confidence vs prediction vs tolerance interval
Confidence (
predict(interval="confidence")): range for the mean response at the new row. Shrinks asngrows.Prediction (
predict(interval="prediction")): range for a single new observation. Strictly wider; doesn’t shrink withnbecause per-observation noise stays.Tolerance: covers a fraction of the population. Not built in; compose your own using
fit.predict_std(kind="prediction")+ a tolerance factor.
fit.predict_std(new, kind=...) returns the SE directly if you’d
rather build bands at a non-t distribution.
How do I get bit-identical fits to R?
pylmrob’s default uses NumPy’s PCG64 BitGenerator. R uses Mersenne
Twister, so the fast-S subsets differ. Agreement on the validation
corpus is rtol=1e-3 on default settings.
For tighter agreement, drive the resample loop through R’s exact
unif_rand stream:
fit = lmrob(formula, df, control=Control(rng="R"), seed=42)
Stackloss-class agreement: rtol=1e-4 across seeds. The residual
gap is BLAS-precision (different LAPACK implementations accumulate
different ULPs in IRWLS); see numerical-notes
for the worked investigation. Truly bit-identical fits require
sharing R’s BLAS — use rpy2 if you need that.
Lower-level helpers: r_set_seed, r_sample_noreplace,
r_subsample_nonsingular give you the byte-identical uniform stream
and subset draws standalone (see rng-r-perf).
How do I pick nResample?
Default nResample=500 matches R and covers the standard cases.
Bump to 1000-2000 if:
The data is small with a difficult contamination pattern.
You’re on
init="M-S"and want tighter R agreement.
Cost is linear; with n_workers=4 even 2000 is sub-second.
How do I make it faster?
engine_c=True (default) already runs the whole fit in one Cython
block. Beyond that:
Control(n_workers=4)for OpenMP parallel resampling. 2-3× speedup onn > 5000.For many small fits in a loop, the formula parser adds ~3 ms each. Use
LmRob(...).fit(X, y)(raw arrays) to skip it.
See engine_c for the long version.
LmRob vs lmrob?
Same machinery, two faces:
lmrob(formula, data, ...): R-style. Most expressive (formulas, factors, transforms).LmRob(control).fit(X, y): sklearn-style. Drops intoPipeline,cross_val_score,GridSearchCV. The underlying result is atest.result_.
predict() is returning weird values
Two pitfalls:
Array path but factor design. A raw NumPy
X_newmust match the encoded design (intercept included). Pass a DataFrame and let the stored formula spec do the encoding.Missing columns. The DataFrame needs the same column names the formula referenced. Check
new.columns.
Why does setting="KS2014" use psi="lqq" by default?
It’s Koller & Stahel’s recommended pairing. lqq has slightly heavier
downweighting at the bend than bisquare, which complements the D-scale
refinement. Override with Control(setting="KS2014", psi="bisquare")
if you want.
See what the fitter is doing
Control(trace_lev=2) # per-iteration output, matches R's trace.lev=2
trace_lev=3 and =4 are progressively noisier.
Or for a post-hoc diagnostic:
fit.summary(detail="full") # appends init method, scale, engine info
Where do I report bugs?
GitHub issues.
Include pylmrob.__version__, a minimal reproducer, and (if you have
R handy) what robustbase::lmrob says on the same data.