The Origin
AI platforms can't tell when someone is systematically extracting model behavior through the API. Distillation attacks don't trip content filters because there's nothing malicious in any single query. The abuse lives in the pattern, not the content.
Short input, max output. Thousands of single-turn conversations. Mechanical timing. Cheap models only. Every request is benign. The aggregate is theft.
So the question: what if you could detect the extraction without ever reading the prompts?
The Thesis
See the pattern, not the content. Build behavioral detection that operates entirely on metadata — request velocity, token ratios, session cadence, model targeting patterns. Privacy-preserving by architecture, not policy. You never need to see what someone asked. You need to see how they asked it.
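To make "how they asked it" concrete, here is a minimal sketch of metadata-only feature extraction. Everything below (`RequestMeta`, `metadata_features`, the specific feature names) is invented for illustration, not PARALLAX's actual API; the point is that every input is metadata and prompt text never appears.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class RequestMeta:
    timestamp: float      # epoch seconds
    input_tokens: int
    output_tokens: int
    model: str

def metadata_features(requests: list[RequestMeta]) -> dict:
    """Behavioral features computed from API metadata only.
    The prompt and completion text are never read."""
    gaps = [b.timestamp - a.timestamp for a, b in zip(requests, requests[1:])]
    return {
        # short prompts, max-length outputs -> high output/input ratio
        "token_ratio": mean(r.output_tokens / max(r.input_tokens, 1)
                            for r in requests),
        # mechanical timing -> low coefficient of variation in gaps
        "timing_cv": pstdev(gaps) / mean(gaps) if len(gaps) > 1 else 0.0,
        # model targeting: share of traffic on the single most-used model
        "model_concentration": max(
            sum(r.model == m for r in requests)
            for m in {r.model for r in requests}
        ) / len(requests),
    }
```

A distillation run like the one described above (short in, max out, mechanical cadence, one cheap model) pins all three features to their extremes.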
The Build
14 detectors across two tiers, built in roughly a week. 92 tests, 94% coverage. Request velocity, token ratios, timing regularity, session structure, model targeting, entropy analysis, cross-account clustering — the full behavioral surface. Full detector list on GitHub.
composite = Σ(score × weight × confidence) / Σ(weight)
threat: NONE < 0.25 ≤ LOW < 0.50 ≤ MEDIUM < 0.70 ≤ HIGH < 0.85 ≤ CRITICAL
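Transcribed into code, the composite and the threat bands look like this (function names are mine; the arithmetic follows the formula above):

```python
def composite_score(signals):
    """signals: iterable of (score, weight, confidence) triples.
    composite = sum(score * weight * confidence) / sum(weight)."""
    num = sum(s * w * c for s, w, c in signals)
    den = sum(w for _, w, _ in signals)
    return num / den if den else 0.0

def threat_level(score):
    """Map a composite score onto the five threat bands."""
    for level, upper_bound in [("NONE", 0.25), ("LOW", 0.50),
                               ("MEDIUM", 0.70), ("HIGH", 0.85)]:
        if score < upper_bound:
            return level
    return "CRITICAL"
```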
Synthetic archetypes showed clean separation. Perfect class boundary at threshold 0.40 on generated data. Zero false positives, zero false negatives. It looked great on paper.
The Reality Check
Ran it against the LANL Cyber1 dataset. 16.9 million real authentication events from Los Alamos National Laboratory. 500 users. 87 labeled compromised. 187K sessions.
Worse than a coin flip. The detectors tuned on synthetic data were actively anti-correlated on real data. Seven of fourteen detectors were hurting, not helping. Normal users scored higher than compromised users. The engine was penalizing the wrong group.
The detectors designed for AI platform abuse — volume anomaly, temporal clustering, entropy analysis, distribution divergence — all pointed the wrong way on authentication data. They measured "how unusual is this user" and the most unusual users were the legitimate power users, not the attackers.
The Fix
Diagnosed root causes in one session. Three iterations.
v2: Built T1-009 Host Fan-Out detector for lateral movement. Added 4-hour sliding window scoring instead of whole-profile aggregation. Zeroed three dead-weight detectors. AUC: 0.48. Still below random — anti-correlated detectors drowning the signal.
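The sliding-window change can be sketched as follows (an illustrative Python sketch, not the engine's implementation): score each trailing 4-hour window and report the worst one, so a short burst of attack behavior is not averaged away by months of normal use.

```python
from collections import deque
from statistics import mean

WINDOW_SECONDS = 4 * 3600  # 4-hour sliding window

def peak_window_score(events, score_fn):
    """events: (timestamp_seconds, value) pairs sorted by time.
    Returns the highest score_fn result over any trailing window,
    instead of one score for the whole profile."""
    window = deque()
    peak = 0.0
    for t, v in events:
        window.append((t, v))
        # evict events that have aged out of the window
        while window and window[0][0] < t - WINDOW_SECONDS:
            window.popleft()
        peak = max(peak, score_fn([x for _, x in window]))
    return peak
```

With `mean` as the scorer, a two-event burst at 0.9 surfaces as a 0.9 peak even though the whole-profile average sits at 0.5.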
v3: Zeroed all 10 non-contributing detectors. Redistributed weight to the 5 correctly-oriented signals, proportional to their measured separation delta.
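The v3 reweighting amounts to a few lines. This is an illustrative sketch: PARALLAX's shipped weights (e.g. 0.30 for T2-006) also reflect tier and confidence, not the delta alone.

```python
def redistribute_weights(separation_deltas):
    """separation_deltas: detector name -> (compromised mean - normal mean).
    Zero every detector whose delta is non-positive (noise or
    anti-correlated), then reweight the survivors in proportion
    to their measured separation delta."""
    positive = {k: d for k, d in separation_deltas.items() if d > 0}
    total = sum(positive.values())
    return {k: (positive.get(k, 0.0) / total if total else 0.0)
            for k in separation_deltas}
```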
The key finding: pattern-matching detectors don't transfer across domains, but behavioral shift detection does. That one signal — "this user's behavior just changed" — worked everywhere.
T2-006 Behavioral Shift: compromised mean 0.71 vs normal 0.43. Delta +0.29. Drives most of the separation. Weight: 0.30.
T1-009 Host Fan-Out: compromised mean 0.79 vs normal 0.66. Delta +0.13. Lateral movement leaves a destination diversity signature. Weight: 0.25.
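The two surviving signals can be sketched roughly as follows. These are illustrative stand-ins for the shipped detectors, with invented normalization; the real T2-006 and T1-009 need not match this math.

```python
from statistics import mean

def behavioral_shift(baseline, recent):
    """T2-006-style signal: mean normalized drift of a user's recent
    features from their own baseline, clamped to [0, 1].
    Inputs are feature-name -> value dicts."""
    if not baseline:
        return 0.0
    drift = mean(abs(recent.get(k, 0.0) - v) / (abs(v) + 1.0)
                 for k, v in baseline.items())
    return min(drift, 1.0)

def host_fan_out(destinations, cap=20):
    """T1-009-style signal: distinct destination hosts in a window,
    saturating at `cap`. Lateral movement fans out across hosts;
    normal use mostly does not."""
    return min(len(set(destinations)) / cap, 1.0)
```

Note the asymmetry: `behavioral_shift` compares a user only to themselves, which is exactly why it transfers across domains while population-level anomaly detectors do not.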
The approach that worked: diagnose which signals are noise, silence them, amplify what remains. Subtraction beat addition.
The Evasion Findings
Adversarial testing surfaced three gaps, published openly: honesty about failure modes is more credible than claiming perfect detection.
Why It Matters Now
Anthropic announced that Mythos-class models can autonomously discover thousands of zero-day vulnerabilities. The attack surface for AI platforms is about to explode.
Before a Mythos-class model gets weaponized, someone needs access to it — and that access creates behavioral patterns. PARALLAX detects the precursor. Distillation, credential fraud, resource abuse at scale. The first gate in the kill chain.
Where It Stands
0.68 AUC on real data. Honest about limitations. Open source. Three clear paths to improvement:
- Signal correlation bonuses to resist single-signal evasion
- Windowed session-level scoring to catch blended behavior attacks
- Auth-failure-sequence detector for credential spraying patterns
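One possible shape for the first path, a signal correlation bonus (every constant and name here is an assumption for illustration, not a tuned PARALLAX value):

```python
def correlation_bonus(detector_scores, fire_threshold=0.5,
                      per_pair_bonus=0.05, cap=0.15):
    """Boost the composite when multiple independent detectors fire
    at once, so suppressing any single signal is not enough to evade.
    Returns an additive bonus in [0, cap]."""
    firing = sum(1 for s in detector_scores if s >= fire_threshold)
    # one bonus increment per detector firing beyond the first
    return min(per_pair_bonus * max(firing - 1, 0), cap)
```

The design intent: an evader who zeroes their strongest single signal still pays for the remaining co-firing detectors, raising the cost of single-signal evasion.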