The problem: attacks at the infrastructure layer
PARALLAX watches what users do to an AI platform; KESTREL watches the machines underneath it. GPU clusters hijacked for mining, a rogue training job burning compute on a stolen account for days, a recon tool enumerating permissions across every region at 2 a.m. None of that surfaces in application behavior. It surfaces in the cloud control plane's own audit logs, which is where KESTREL looks.
The approach: self-referential baselines
The detector is unsupervised and self-referential. For each account it builds a rolling per-feature baseline and flags activity that deviates beyond a z-score threshold from that account's own history. It needs no labeled training data and no global fixed thresholds, so a small account and a large one are each judged against themselves rather than against a one-size-fits-all cutoff. Every finding is tagged with the relevant MITRE ATLAS technique and exports directly as Sigma rules that a SIEM can ingest.
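To make the self-referential idea concrete, here is a minimal sketch of a rolling per-account, per-feature z-score check. It is not KESTREL's actual code: the class name RollingBaseline, the window size, the threshold of 3.0, the warm-up length, and the feature name are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class RollingBaseline:
    """Hypothetical sketch: score each value against the account's own history."""

    def __init__(self, window=24, z_threshold=3.0, min_history=8, min_sigma=1e-6):
        self.z_threshold = z_threshold
        self.min_history = min_history  # observations required before scoring
        self.min_sigma = min_sigma      # floor that avoids division by zero
        # (account, feature) -> most recent `window` observations
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, account, feature, value):
        """Return (z_score, is_anomaly), then record the value.

        z_score is None while the account is still warming up.
        """
        past = self.history[(account, feature)]
        z, anomalous = None, False
        if len(past) >= self.min_history:
            mu = mean(past)
            sigma = max(pstdev(past), self.min_sigma)
            z = (value - mu) / sigma
            anomalous = abs(z) > self.z_threshold
        past.append(value)
        return z, anomalous


baseline = RollingBaseline()

# Made-up hourly API-call counts for a small and a large account.
for calls in [10, 12, 9, 11, 10, 13, 9, 11]:
    baseline.observe("small-account", "api_calls_per_hour", calls)
for calls in [10000, 10400, 9700, 10100, 9900, 10300, 9800, 10200]:
    baseline.observe("large-account", "api_calls_per_hour", calls)

# Roughly the same absolute jump (about +200 calls) lands very differently.
print(baseline.observe("small-account", "api_calls_per_hour", 200))
print(baseline.observe("large-account", "api_calls_per_hour", 10250))
```

In this sketch the jump to 200 calls is flagged for the small account, while 10,250 calls sits within normal variation for the large one. A fixed global cutoff could not separate those two cases. One simplification to note: the sketch appends every value to the history, including anomalous ones, so a sustained attack would eventually shift the baseline.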
v1: three workload detectors
The first build shipped three detectors aimed at compute-layer abuse, all reading GPU telemetry such as utilization, power draw, and training durations.
On synthetic data it performed cleanly, which with anomaly detectors is a warning sign rather than a result.
The reality check: precision 1.0, recall 0.0005
Run against 34,427 real CloudTrail events containing genuine attacks (IAM enumeration, privilege escalation, recon scanning), v1 hit perfect precision and near-zero recall: everything it flagged was a real attack, but it flagged almost none of them. Flawless aim, wrong target.
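A short worked example shows how perfect precision and near-zero recall can coexist. The counts below are hypothetical, chosen only to reproduce the shape of the v1 result; they are not the actual confusion matrix from the evaluation, and the function name precision_recall is an illustrative choice.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall); a zero denominator yields 0.0."""
    flagged = true_pos + false_pos       # everything the detector alerted on
    actual = true_pos + false_neg        # everything that really was an attack
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: one correct alert, no false alarms, 1,999 missed attacks.
precision, recall = precision_recall(true_pos=1, false_pos=0, false_neg=1999)
print(precision, recall)  # prints: 1.0 0.0005
```

Precision only measures the alerts that were raised, so a detector that almost never fires can score 1.0 while missing nearly everything. That is why the two numbers have to be read together.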
The diagnosis: an instrument-to-data mismatch
The detectors were built for GPU telemetry: utilization, power draw, training durations. CloudTrail is an API audit log. 99.98% of the events were API calls and exactly zero were GPU metrics. The instrument was sound but pointed at data that was not in the stream. Tellingly, the baseline engine itself had already noticed the attacker scanning at off-hours; nothing in v1 was wired to listen to that signal.
v2: detect the signal that is actually present
v2 retargets the same baselining engine at features the audit log actually carries: API-call burst rate, cross-region fan-out, off-hours spikes, and walls of access-denied responses. Recall went from 0.0005 to 0.766, a 1,532x improvement, with precision held at 1.0 (zero false positives). The fix was not heavier machine learning; it was meeting the data where it lives: diagnose the mismatch, then build the detector that fits the signal in front of you. Subtraction before addition, again. Stack: a Python CLI over Flask and SQLite.
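The four audit-log features can be sketched as a single pass over CloudTrail records. This is an illustration under stated assumptions, not the v2 implementation: the function name extract_features, the feature names, the per-minute burst window, the 08:00 to 18:00 UTC definition of business hours, and the set of error codes treated as "access denied" are all hypothetical. The record fields eventTime, awsRegion, and errorCode are standard CloudTrail fields.

```python
from collections import Counter
from datetime import datetime

# Assumption: error codes counted as access-denied responses (not exhaustive).
DENIED_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def extract_features(events, business_hours=(8, 18)):
    """Summarize one account's CloudTrail records for one time window."""
    per_minute = Counter()
    regions = set()
    off_hours = denied = total = 0
    start, end = business_hours
    for ev in events:
        # CloudTrail timestamps are UTC, e.g. "2024-01-01T02:00:01Z".
        ts = datetime.strptime(ev["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
        per_minute[ts.replace(second=0)] += 1
        regions.add(ev["awsRegion"])
        if not (start <= ts.hour < end):
            off_hours += 1
        if ev.get("errorCode") in DENIED_CODES:
            denied += 1
        total += 1
    return {
        "burst_rate": max(per_minute.values(), default=0),  # peak calls in one minute
        "region_fanout": len(regions),                      # distinct regions touched
        "off_hours_calls": off_hours,
        "denied_ratio": denied / total if total else 0.0,
    }


# Made-up records resembling a 2 a.m. cross-region permission sweep.
sample = [
    {"eventTime": "2024-01-01T02:00:01Z", "awsRegion": "us-east-1", "errorCode": "AccessDenied"},
    {"eventTime": "2024-01-01T02:00:05Z", "awsRegion": "eu-west-1", "errorCode": "AccessDenied"},
    {"eventTime": "2024-01-01T02:00:09Z", "awsRegion": "ap-south-1"},
    {"eventTime": "2024-01-01T02:01:30Z", "awsRegion": "us-east-1", "errorCode": "AccessDenied"},
]
print(extract_features(sample))
```

Each value in the returned dictionary is a plain number, so it could be fed to a per-account rolling baseline as one more feature. The point of the design is that the detection logic stays the same and only the inputs change.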