M-S init for factor designs

Robust regression with a mix of continuous and categorical predictors uses a different initial estimator than the pure-continuous case. lmrob switches automatically when it sees factor columns in the design; Control(init="auto") does it for you, but understanding the mechanics helps when reading the trace output or debugging convergence.

The problem

A full S-estimator on a design with k factor levels needs subsamples that include at least one observation from each level. With small per-level counts this becomes unlikely; the resampling either fails or lands on degenerate subsets. Maronna and Yohai (2000) solve this with a two-step estimator: an L1 estimate for the factor effects, then an M estimate over the continuous part conditional on the factor effects.

A small worked example

import numpy as np
import pandas as pd
from pylmrob import Control, lmrob

rng = np.random.default_rng(0)
n = 80
group = np.repeat(["A", "B", "C", "D"], n // 4)
x = rng.standard_normal(n)
# True model: y = 1.5 * x + group effects (-2, 0, +1, +3)
group_effect = {"A": -2.0, "B": 0.0, "C": 1.0, "D": 3.0}
y = 1.5 * x + np.array([group_effect[g] for g in group]) + 0.4 * rng.standard_normal(n)
# Inject 6 outliers in group D.
y[60:66] += 25.0
df = pd.DataFrame({"x": x, "group": pd.Categorical(group), "y": y})

Default `lmrob` (auto-picks M-S)

fit = lmrob("y ~ x + group", df, control=Control(), seed=42)
print(fit.summary())

The output includes method='MM' and init='M-S' in the header. lmrob saw the factor column and chose the M-S path automatically.

Forcing pure-S init

fit_s = lmrob("y ~ x + group", df, control=Control(init="S"), seed=42)

This may or may not work, depending on the seed: fast-S needs subsets that span the factor levels. On a small design (n=80, 4 levels), the failure rate is non-trivial. The M-S init avoids that.

Inspecting which observations got downweighted

print("Observations with weight < 0.1:")
print(np.where(fit.rweights_ < 0.1)[0])

The 6 outliers in group D show up here.

When this matters

Whenever your design has categorical predictors with few levels per group.
Whenever fast-S is reporting “no non-singular subsample found” warnings.
ANCOVA-style models with a continuous covariate and a treatment factor.

R-side reference: lmrob.control(method="MM", init="M-S"). pylmrob’s default Control(init="auto") makes the same choice.