AI Agents vs Code Vulnerabilities: Was Anthropic Mythos a Big Deal or Fear-mongering?
Then I read AISLE's proof more carefully and got a lot less comfortable. Their examples were too scoped and narrow: they pointed models at the exact spots and asked whether they could see issues with the code. That tells me little about repo-scale discovery, tool use, prioritization, or whether an agent can find the path that actually matters in a messy real codebase.
I do this kind of work in practice; in one project, for example, we used ordinary GitHub Copilot plus specially crafted agent skills to scout for vulnerabilities. So I used that gap in AISLE's research as the reason to run my own test. I benchmarked 15 models across 21 GitHub Copilot CLI agent runs on real worktrees pinned to a vulnerable commit in a codebase with a little over 2,000 files and roughly 350,000 lines of code (Python, YAML, back end and front end, Docker, CI/CD pipelines, etc.). Mythos Preview itself was not tested. The point was to test the middle ground AISLE left open: harder than pre-isolated snippets, clearly short of Mythos-style end-to-end exploitation, but still real enough that agents had to work through the repo, find the chain, explain it, and keep the main risk from getting buried.
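For a sense of the setup, the harness itself was simple: one detached git worktree per run, each pinned to the same vulnerable commit, with the agent launched inside it. Below is a minimal Python sketch of that scaffolding, under stated assumptions: the repo path, the commit hash, the model identifier, and the `run_agent` launcher are all hypothetical placeholders, and the actual Copilot CLI invocation, prompt, and skill files are deliberately not reproduced here.

```python
# Minimal sketch of the benchmark scaffolding: one detached git worktree
# per agent run, all pinned to the same vulnerable commit so every model
# sees an identical repo state. Paths, the commit hash, and the launcher
# are hypothetical placeholders, not the real benchmark configuration.
import subprocess
from pathlib import Path

REPO = Path("/srv/target-repo")   # hypothetical: the ~2,000-file codebase
VULN_COMMIT = "abc1234"           # hypothetical: commit with the known vuln chain
RUNS = 21                         # one isolated worktree per agent run

def make_worktree(run_id: int) -> Path:
    """Create a detached worktree pinned to the vulnerable commit."""
    wt = REPO.parent / f"wt-run-{run_id:02d}"
    subprocess.run(
        ["git", "-C", str(REPO), "worktree", "add", "--detach",
         str(wt), VULN_COMMIT],
        check=True,
    )
    return wt

def run_agent(worktree: Path, model: str) -> None:
    """Placeholder: launch the Copilot CLI agent inside `worktree` with
    `model`. The real invocation (flags, prompt, skills) is omitted here."""
    ...

if __name__ == "__main__":
    for i in range(RUNS):
        wt = make_worktree(i)
        run_agent(wt, model="model-under-test")  # hypothetical model id
```

Worktrees are the cheap way to get this isolation: each run gets its own working directory without re-cloning 350,000 lines of code, and detaching at the pinned commit guarantees no run can drift onto a branch where the vulnerability is already fixed.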