Projects

Five projects spanning AI-agent security research, offense, application defense, and infrastructure defense. The attack lives in the pattern, not the content.

Methodology

I use a Zettelkasten for research: write everything down in real time, link ideas as they emerge, and document failed attempts as well as successful techniques.

Why it works: it forces clear thinking, builds a searchable knowledge base, and makes the work easier to explain to others.

MCP-Poison-Bench

Closed-loop study of MCP tool poisoning: a client-side tool_transform defense evaluated over 4 injection classes × 3 Claude models × 5 seeds (60 seeded trials) and scored as baseline-vs-defended deltas in attack success rate (ASR) and utility, with 95% Wilson-score CIs. The defense was then attacked adversarially: 9 of 11 bypass attempts succeeded, and one bypass was verified live with the defense switched on.
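
The Wilson-score interval is what keeps small per-cell counts honest: with 5 seeds per model and injection class, a raw ASR of 0/5 or 5/5 says little on its own. A minimal sketch of the standard 95% Wilson interval, assuming ASR is treated as a binomial proportion of successful trials; the function name `wilson_ci` is hypothetical, not taken from the project source.

```python
import math


def wilson_ci(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion such as ASR.

    z = 1.959964 is the two-sided 95% standard-normal quantile.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The centre is pulled from p_hat toward 0.5, which matters most near 0 or 1.
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    )
    # Clamp only to absorb floating-point error at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # 9 of 11 adversarial bypass attempts succeeded.
    low, high = wilson_ci(9, 11)
    print(f"bypass rate 9/11 = {9 / 11:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

For 9/11 this gives roughly [0.52, 0.95]: the bypass rate is clearly above one half, but the interval is wide because 11 attempts is a small sample. Unlike the normal-approximation (Wald) interval, Wilson never extends below 0 or above 1 and does not collapse to zero width when every trial succeeds or every trial fails.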

method: Threat model: poisoned tool metadata coerces an export_data sink into exfiltrating a canary token. The defense segments server-supplied descriptions, redacts instruction-shaped spans (imperative / authority / redirection / exfiltration rules), and wraps the surviving text in client-origin provenance. A provider-agnostic ModelClient (Anthropic / OpenAI / DeepSeek / Gemini) runs over the official MCP SDK; ASR and utility are scored purely from traces.
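
The segment, redact, wrap pipeline can be sketched in a few lines. Everything concrete here is an assumption: the regexes are illustrative stand-ins for the four rule families, and the function name `sketch_tool_transform` and the `<tool-description>` provenance wrapper are hypothetical, not the project's actual rule set or output format.

```python
import html
import re

# Hypothetical, deliberately small patterns: one per rule family named in the write-up.
RULES = {
    "imperative": re.compile(r"\b(ignore|disregard|always|never)\b", re.IGNORECASE),
    "authority": re.compile(
        r"\b(system|admin(istrator)?|developer)\s+(note|message|instructions?)\b",
        re.IGNORECASE,
    ),
    "redirection": re.compile(r"\b(instead|before (calling|using)|first call)\b", re.IGNORECASE),
    "exfiltration": re.compile(
        r"\b(export_data|send|upload|forward)\b.*\b(token|secret|key|canary)\b",
        re.IGNORECASE,
    ),
}


def segment(description: str) -> list[str]:
    """Split server-supplied text into sentence / line segments."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", description)
    return [p.strip() for p in parts if p.strip()]


def sketch_tool_transform(tool_name: str, description: str) -> str:
    """Drop instruction-shaped segments, then wrap survivors in client-origin provenance."""
    kept: list[str] = []
    redacted = 0
    for seg in segment(description):
        if any(rule.search(seg) for rule in RULES.values()):
            redacted += 1  # drop the whole segment, not just the matching words
        else:
            kept.append(seg)
    # Escape survivors so server text cannot close the wrapper tag early.
    body = html.escape(" ".join(kept))
    name = html.escape(tool_name, quote=True)
    return (
        f'<tool-description origin="client" tool="{name}" redacted="{redacted}">'
        f"{body}</tool-description>"
    )
```

Redacting whole segments rather than single words trades some utility (a benign sentence containing "always" is lost) for a simpler guarantee. A fixed pattern list like this is also exactly the kind of surface an adaptive attacker can paraphrase around.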

Python 3.14 · MCP SDK · Prompt Injection · Wilson CIs · Anthropic API

Status: Research · Public

PARALLAX

Metadata-only behavioral detection of model-extraction abuse, with no prompt or payload inspection. A 14-feature bank scored 0.45 AUC (worse than chance) on 16.9M real LANL auth events; collapsing it to a single change-from-baseline signal reached 0.68 AUC.
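
AUC is the probability that a randomly chosen attack event scores higher than a randomly chosen benign one, so 0.5 is chance, and a value below it means the score ranks attackers below normal users more often than not; negating such a score yields 1 - AUC. By that reading, the bank's 0.45 put attack events beneath benign traffic slightly more often than above it. A minimal rank-based (Mann-Whitney) sketch with ties counted as half; the function name is hypothetical.

```python
from typing import Sequence


def rank_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the Mann-Whitney U statistic; labels are 1 = attack, 0 = benign."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one attack and one benign example")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    # U statistic for the attack class, normalised by the number of pairs.
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Sorting makes this O(n log n), which is what keeps it practical at millions of events, where an all-pairs comparison would not be.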

method: Per-entity behavioral baselining over request rate, token ratios, timing regularity, and session structure. 7 of 14 hand-built detectors were inverted on real data (flagging power users, clearing attackers), so the design strips out features tuned to a memorized attack shape, leaving a domain-portable drift signal. Three documented evasions: feature normalization, traffic dilution, low-and-slow.
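
Per-entity baselining means each user, host, or key is compared only with its own history. A minimal sketch of a change-from-baseline score, assuming an exponentially weighted mean and variance per entity and a drift score defined as the mean absolute z-distance across features; the class name, `alpha`, `warmup`, and the choice of EWMA are hypothetical, not PARALLAX's published parameters.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class _EntityState:
    n: int = 0
    mean: list[float] = field(default_factory=list)
    var: list[float] = field(default_factory=list)


class DriftBaseline:
    """Hypothetical per-entity EWMA baseline over a fixed-length feature vector
    (e.g. request rate, token ratio, timing regularity)."""

    def __init__(self, alpha: float = 0.1, warmup: int = 5, eps: float = 1e-6):
        self.alpha = alpha
        self.warmup = warmup
        self.eps = eps
        self._state: defaultdict[str, _EntityState] = defaultdict(_EntityState)

    def score_and_update(self, entity: str, features: list[float]) -> float:
        """Score the window against the entity's past, then fold it into the baseline."""
        st = self._state[entity]
        if st.n == 0:
            st.mean = list(features)
            st.var = [0.0] * len(features)
            st.n = 1
            return 0.0
        # Score first, so the current window is judged only against history.
        zs = [
            abs(x - m) / max(math.sqrt(v), self.eps)
            for x, m, v in zip(features, st.mean, st.var)
        ]
        score = sum(zs) / len(zs) if st.n >= self.warmup else 0.0
        # Incremental exponentially weighted mean and variance update.
        a = self.alpha
        for i, x in enumerate(features):
            diff = x - st.mean[i]
            st.mean[i] += a * diff
            st.var[i] = (1 - a) * (st.var[i] + a * diff * diff)
        st.n += 1
        return score
```

Because the score has no absolute units, the same code applies whether an entity is a light user or a power user; only movement relative to that entity's own past counts.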

Python · Anomaly Detection · Metadata-Only · LANL Auth (16.9M)

Status: v2 · Real Data Validated

KESTREL

Per-account z-score baselining for cloud-workload attacks, MITRE ATLAS-tagged with Sigma export. v1 hunted GPU telemetry and scored 0.0005 recall (1 of 2,167 attack events) on 34,427 real CloudTrail events; re-targeting to the API-audit signal actually present lifted recall to 0.766 at precision 1.0.
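
A minimal sketch of per-account z-scoring of API-call burst rate, assuming events arrive as (account, epoch-seconds) pairs and are bucketed into 5-minute windows; the function name and bucket width are hypothetical. It applies no cutoff, returning every window ranked by how far it sits from that account's own mean.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev
from typing import Iterable


def burst_zscores(
    events: Iterable[tuple[str, float]], bucket_s: int = 300
) -> list[tuple[float, str, int]]:
    """Return (z, account, bucket_start) for every bucket, highest z first.

    Each account is compared only with its own history (self-referential baseline).
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for account, ts in events:
        counts[account][int(ts // bucket_s) * bucket_s] += 1

    ranked: list[tuple[float, str, int]] = []
    for account, per_bucket in counts.items():
        # Include empty buckets between first and last activity; otherwise quiet
        # periods vanish and the baseline is biased upward.
        start, end = min(per_bucket), max(per_bucket)
        series = [(b, per_bucket.get(b, 0)) for b in range(start, end + bucket_s, bucket_s)]
        values = [c for _, c in series]
        if len(values) < 2:
            continue
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # perfectly flat account: nothing stands out
        ranked.extend(((c - mu) / sd, account, b) for b, c in series)
    ranked.sort(reverse=True)
    return ranked
```

Scoring against the account's full history is the simplest choice; a rolling window would let the baseline follow gradual, legitimate change.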

method: Unsupervised, self-referential baselines (no labels, no fixed thresholds) over API-call burst rate, cross-region fan-out, off-hours spikes, and access-denied walls. Diagnosis was an instrument/data mismatch: 99.98% of events were API calls, 0 were GPU metrics. Findings export to Sigma for SIEM hand-off. CLI + Flask + SQLite.
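
Sigma rules are YAML documents with a `logsource` and a `detection` section, which is what makes them a convenient SIEM hand-off format. A minimal sketch of exporting one access-denied-wall finding, assuming a finding is an account plus the event names that were denied; the function name, title, and tag are hypothetical. The ATLAS tag format the project uses is not shown in the source, so a generic ATT&CK-style Sigma tag stands in.

```python
import json


def finding_to_sigma(title: str, account_id: str, event_names: list[str],
                     tags: list[str], level: str = "medium") -> str:
    """Render one finding as a Sigma rule (YAML text, built by hand to stay stdlib-only).

    json.dumps yields double-quoted strings, which are valid YAML scalars, so titles
    containing ':' and account IDs with leading zeros survive intact.
    """
    q = json.dumps
    lines = [
        f"title: {q(title)}",
        "status: experimental",
        "logsource:",
        "    product: aws",
        "    service: cloudtrail",
        "detection:",
        "    selection:",
        f"        userIdentity.accountId: {q(account_id)}",
        # Simplification: some services report denials under other error codes.
        "        errorCode: AccessDenied",
        "        eventName:",
        *(f"            - {q(name)}" for name in event_names),
        "    condition: selection",
        "tags:",
        *(f"    - {q(tag)}" for tag in tags),
        f"level: {level}",
    ]
    return "\n".join(lines) + "\n"
```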

Python · CloudTrail · MITRE ATLAS · Sigma Export · Flask · SQLite

Status: v2 · Real Data Validated

CTF Records

Running log of LLM prompt-injection CTFs (Gandalf L2-L7 complete), mapping each solve to the guardrail it defeats (output filtering, input classifiers, dual-LLM checks) and the OWASP LLM / MITRE ATLAS category it falls under.

method: Per level: extract the secret, identify the defense layer, document the working technique (encoding indirection, translation, riddle/side-channel, role-play). Defenses escalate from naive string-matching to conversational classifiers, so the techniques escalate with them.
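
Why naive string matching falls first can be shown with a toy output filter. This is illustrative only: the secret is a placeholder, and both filters are hypothetical, not Gandalf's actual defenses. The naive filter blocks a reply only if it contains the secret verbatim; the normalizing one also strips separators and base64-decodes token-like spans, closing two encoding-indirection routes while leaving translation and riddles open.

```python
import base64
import re

SECRET = "EXAMPLEPASS"  # placeholder, not a real level password


def naive_filter_blocks(reply: str, secret: str = SECRET) -> bool:
    """Block only if the secret appears verbatim (case-insensitive)."""
    return secret.lower() in reply.lower()


def normalizing_filter_blocks(reply: str, secret: str = SECRET) -> bool:
    """Also catch separator-spaced and base64-encoded copies of the secret."""
    target = secret.lower()
    # Keep only letters and digits: "E-X-A-M-..." becomes "exam...".
    if target in re.sub(r"[^a-z0-9]", "", reply.lower()):
        return True
    for token in re.findall(r"[A-Za-z0-9+/]{8,}={0,2}", reply):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8", "ignore")
        except ValueError:  # binascii.Error (bad padding/characters) subclasses ValueError
            continue
        if target in decoded.lower():
            return True
    return False
```

Squashing separators can also fire on innocent text that happens to spell the secret across word boundaries, one reason later levels move to classifiers rather than ever-longer normalization lists.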

Prompt Injection · OWASP LLM · MITRE ATLAS · Practice Log

Status: Ongoing

ReconAI

LLM-prioritized attack-surface triage: BBOT enumerates subdomains and hosts breadth-first, then GPT-4 ranks the enumerated assets by exploitability so the operator works the highest-value target first instead of reading a flat 10k-line dump.

method: BBOT modules for breadth-first recon; per-asset OSINT enrichment; GPT-4 scoring over the enriched inventory emits a ranked work queue with rationale per target.
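
The shape of the ranked work queue can be sketched with the model call abstracted behind a scorer function. Here a hypothetical keyword-and-port heuristic stands in for GPT-4 so the example runs offline; the `Asset` fields, the heuristic, and all names are assumptions, not ReconAI's code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Asset:
    host: str
    open_ports: list[int] = field(default_factory=list)
    tech: list[str] = field(default_factory=list)  # OSINT enrichment, e.g. detected products


# A scorer returns (priority, rationale); an LLM-backed scorer would fill the same slot.
Scorer = Callable[[Asset], tuple[int, str]]

INTERESTING_WORDS = ("admin", "dev", "staging", "jenkins", "vpn")
COMMON_WEB_PORTS = {80, 443}


def heuristic_scorer(asset: Asset) -> tuple[int, str]:
    """Offline stand-in for the model: crude signals an operator might weigh."""
    score, reasons = 0, []
    words = [w for w in INTERESTING_WORDS if w in asset.host]
    if words:
        score += 2 * len(words)
        reasons.append("name suggests " + "/".join(words))
    odd_ports = [p for p in asset.open_ports if p not in COMMON_WEB_PORTS]
    if odd_ports:
        score += len(odd_ports)
        reasons.append("non-standard ports " + ",".join(map(str, odd_ports)))
    if asset.tech:
        score += len(asset.tech)
        reasons.append("identified " + ", ".join(asset.tech))
    return score, "; ".join(reasons) or "no notable signals"


def build_work_queue(
    assets: list[Asset], scorer: Scorer = heuristic_scorer
) -> list[tuple[int, str, str]]:
    """Return (priority, host, rationale), highest priority first, ties broken by host."""
    scored = [(*scorer(a), a.host) for a in assets]
    scored.sort(key=lambda t: (-t[0], t[2]))
    return [(prio, host, why) for prio, why, host in scored]
```

Keeping the scorer injectable puts the model call, its prompt, and its failure handling in one place, and lets the queue logic be tested without network access.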

Python · GPT-4 · BBOT · OSINT

Status: Active