The problem: extraction is a pattern, not an event

Model extraction steals a model by querying its API at scale and learning from the responses. Every individual request is well-formed and authorized, so no single-request rule fires. The signal lives in the distribution of requests over time: rate, regularity, breadth of targets, session structure. The detection problem is therefore a behavioral one, not a content one.

Design constraint: metadata only

The system is restricted to request metadata and never inspects prompt or response content. Two reasons, one principle. Privacy: an abuse detector that has to read user prompts is itself a liability. Portability: behavioral shape generalizes across deployments in a way that content signatures do not. The operating thesis was "see the pattern, not the content," and the constraint is what forced the design to earn it.

v1: a 14-feature behavioral bank

The first build engineered fourteen behavioral features into a per-entity score. Grouped by what they measure:

volume and rate
request frequency and burstiness per entity over rolling windows.
token ratios
input-to-output size relationships that distinguish probing from normal use.
timing regularity
inter-arrival periodicity, which separates scripted extraction from human cadence.
session structure
session length, target breadth, and access patterns across endpoints.
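As a rough illustration of how features like these can be derived from metadata alone, here is a minimal sketch. The `RequestMeta` record, its field names, and the four features computed are assumptions made for this example; they are not the fourteen features of the v1 bank.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class RequestMeta:
    """Metadata for one request. Deliberately has no prompt or response field."""
    entity: str       # hypothetical account or API-key identifier
    timestamp: float  # seconds since epoch
    endpoint: str     # which endpoint or target was hit
    tokens_in: int
    tokens_out: int


def entity_features(requests: list[RequestMeta], window_s: float) -> dict[str, float]:
    """Summarize one entity's requests inside a single window of window_s seconds."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(requests, key=lambda r: r.timestamp)
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    mean_gap = mean(gaps) if gaps else 0.0
    # Coefficient of variation of inter-arrival gaps:
    # 0 for perfectly periodic traffic, larger for irregular cadence.
    gap_cv = pstdev(gaps) / mean_gap if mean_gap > 0 else 0.0
    total_in = sum(r.tokens_in for r in ordered)
    total_out = sum(r.tokens_out for r in ordered)
    return {
        "rate_per_s": len(ordered) / window_s,
        "gap_cv": gap_cv,
        "out_in_ratio": total_out / total_in if total_in else 0.0,
        "endpoint_breadth": float(len({r.endpoint for r in ordered})),
    }


# Toy input: six requests exactly 10 s apart, cycling over three endpoints.
scripted = [RequestMeta("acct-1", 10.0 * i, f"/v1/e{i % 3}", 50, 500) for i in range(6)]
feats = entity_features(scripted, window_s=60.0)
```

For this toy input the gaps are all identical, so `gap_cv` is 0.0, the output-to-input token ratio is 10.0, and the endpoint breadth is 3.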

On synthetic data, v1 separated attackers from normal users cleanly. Synthetic data is exactly where this kind of detector flatters itself, so the real test was real data.

The reality check: 0.45 AUC on real data

16.9M
real LANL authentication events
87 / 500
entities actually compromised (ground truth)
0.45
v1 ROC AUC, below the 0.50 chance line

Evaluated against the labeled Los Alamos authentication dataset, v1 scored 0.45 AUC, worse than a coin flip. Diagnosis: 7 of the 14 features were inverted on real traffic, meaning their high values pointed at benign entities rather than compromised ones. They had encoded a synthetic attacker's silhouette, and on real data that silhouette belonged to legitimate power users, so the detector flagged heavy-but-benign accounts and cleared the actual intruders. Well-engineered, pointed backwards.
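One way to run this kind of diagnosis is to score each feature on its own and flag any whose standalone AUC falls below 0.50. The sketch below assumes that approach; the function names and the toy values are illustrative and are not taken from the LANL evaluation. AUC is computed here as the fraction of positive/negative pairs ranked correctly, with ties counted as half, which is simple but quadratic in the number of entities.

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def inverted_features(columns: dict[str, list[float]], labels: list[int]) -> list[str]:
    """Names of features whose standalone AUC is below the 0.5 chance line."""
    return [name for name, col in columns.items() if roc_auc(col, labels) < 0.5]


# Toy values only: 1 = compromised, 0 = benign.
labels = [1, 1, 0, 0, 0]
columns = {
    "volume": [1.0, 2.0, 8.0, 9.0, 3.0],          # benign entities score higher
    "baseline_shift": [5.0, 4.0, 1.0, 2.0, 0.0],  # compromised entities score higher
}
flagged = inverted_features(columns, labels)  # ["volume"]
```

Under this definition, a feature with AUC a scores 1 - a when its sign is flipped, so a value well below 0.50 indicates a signal pointing the wrong way rather than no signal at all.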

v2: subtract the memorized features

The fix was subtractive, not additive. Rather than tune fourteen features, I removed the ones that had memorized a specific attack shape and kept the one signal that held up across the domain shift: change from an entity's own baseline. A feature that says "this account is behaving unlike its own history" carries no assumption about what an attack looks like, so it survives contact with a new environment. That raised real-data performance from 0.45 to 0.68 AUC.
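The text does not say how change from baseline is measured, so the following is only a sketch under one assumption: the score is the absolute z-score of an entity's current window against its own past windows. The function name and the `min_std` floor are hypothetical.

```python
from statistics import mean, pstdev


def baseline_deviation(history: list[float], current: float, min_std: float = 1e-9) -> float:
    """Absolute z-score of the current window against the entity's own past windows."""
    if len(history) < 2:
        raise ValueError("need at least two past windows")
    mu = mean(history)
    # Floor the spread so a perfectly flat history does not divide by zero.
    sigma = max(pstdev(history), min_std)
    return abs(current - mu) / sigma


# A heavy but steady account: high volume, yet no change from its own history.
heavy = baseline_deviation([900.0, 1000.0, 1100.0], current=1000.0)

# A quiet account that suddenly quadruples: low volume, large change.
quiet = baseline_deviation([9.0, 10.0, 11.0], current=40.0)
```

Here `heavy` is 0.0 and `quiet` is well above 30, even though the heavy account sends 25 times more requests. That is the property described above: the score depends on each entity's own history, not on a fixed picture of what attack volume looks like.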

The transferable lesson: detectors that memorize attack silhouettes do not survive a domain change; detecting deviation from a per-entity baseline does. Subtraction beat addition.

Documented evasions

A change-from-baseline detector has known blind spots, and they are published rather than hidden:

feature normalization
shape one driving signal to look ordinary and the score drops below threshold.
traffic dilution
blend a small amount of attack traffic into a large volume of normal traffic to disappear into the entity's own baseline.
low-and-slow
extend the campaign so the per-window deviation never crosses the threshold.
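The low-and-slow evasion can be made concrete with a small calculation. This sketch assumes a z-score detector with a frozen baseline and a fixed threshold, which may not match the real scoring; `windows_to_evade` is a hypothetical helper. It asks how many windows an attacker must spread a fixed number of extra requests over so that no single window reaches the threshold.

```python
import math
from statistics import pstdev


def windows_to_evade(history: list[float], extra_total: float, threshold: float) -> int:
    """Fewest windows over which extra_total added requests keep every window's z-score under threshold.

    Assumes the baseline is frozen at `history` and normal traffic sits at its mean,
    so a window with e extra requests has z-score e / sigma.
    """
    sigma = pstdev(history)
    headroom = threshold * sigma  # extra requests per window must stay strictly below this
    if headroom <= 0:
        raise ValueError("flat baseline or non-positive threshold leaves no headroom")
    return math.floor(extra_total / headroom) + 1


# Toy baseline of about 100 requests per window; attacker wants 10,000 extra requests.
k = windows_to_evade([90.0, 100.0, 110.0, 100.0], extra_total=10_000, threshold=3.0)  # 472
```

With these toy numbers the campaign needs 472 windows, each adding about 21 requests. The detector never fires, and the only cost to the attacker is time.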

Why it matters

Weaponizing a frontier model requires access to it first, and access at extraction scale leaves a behavioral trail even when every request is individually clean. PARALLAX is the tripwire at that gate: 0.68 AUC on real data, metadata-only by construction, open source, and explicit about the three ways it can still be beaten.