KESTREL: Cloud AI Workload Anomaly Detection

The problem: attacks at the infrastructure layer

PARALLAX watches what users do to an AI platform; KESTREL watches the machines underneath it. GPU clusters hijacked for mining, a rogue training job burning compute on a stolen account for days, a recon tool enumerating permissions across every region at 2 a.m. None of that surfaces in application behavior. It surfaces in the cloud control plane's own audit logs, which is where KESTREL looks.

The approach: self-referential baselines

The detector is unsupervised and self-referential. For each account it builds a rolling per-feature baseline and flags activity that deviates beyond a z-score threshold from that account's own history. No labeled training data and no global fixed thresholds, so a small account and a large one are each judged against themselves rather than against a one-size cutoff. Every finding is tagged with the relevant MITRE ATLAS technique and exports directly to Sigma rules a SIEM can ingest.

v1: three workload detectors

The first build shipped three detectors aimed at compute-layer abuse:

GPU hijacking

utilization and power-draw anomalies signalling unauthorized compute use.

rogue training

unexpected long-running training jobs on an account's baseline.

cloud-API abuse

anomalous control-plane API behavior against the account's history.

On synthetic data it performed cleanly, which with anomaly detectors is a warning sign rather than a result.

The reality check: precision 1.0, recall 0.0005

34,427

real CloudTrail events, 2,167 of them attacks

1.0

v1 precision: zero false positives

0.0005

v1 recall: 1 of 2,167 attack events caught

Run against 34,427 real CloudTrail events containing genuine attacks (IAM enumeration, privilege escalation, recon scanning), v1 hit perfect precision and near-zero recall. Flawless aim, wrong target.

The diagnosis: an instrument-to-data mismatch

The detectors were built for GPU telemetry: utilization, power draw, training durations. CloudTrail is an API audit log. 99.98% of the events were API calls and exactly zero were GPU metrics. The instrument was sound but pointed at data that was not in the stream. Tellingly, the baseline engine itself had already noticed the attacker scanning at off-hours; nothing in v1 was wired to listen to that signal.

v2: detect the signal that is actually present

0.766

v2 recall, up from 0.0005

1,532x

recall improvement, precision held at 1.0

220

findings; every account v1 missed was caught

v2 retargets the same baselining engine at features the audit log actually carries: API-call burst rate, cross-region fan-out, off-hours spikes, and walls of access-denied responses. Recall went from 0.0005 to 0.766, a 1,532x improvement, with precision held at 1.0 and zero false positives. The fix was not heavier machine learning, it was meeting the data where it lives: diagnose the mismatch, then build the detector that fits the signal in front of you. Subtraction before addition, again. Stack: a Python CLI over Flask and SQLite.

KESTREL