“Benchmarks are the dashboard of progress.”
Every layer of the Cybersecurity AI stack — LLMs, scaffolds, agents, steering — is measured continuously against the world's hardest adversarial environments. We don't ship capability we can't prove.
Benchmark results show that CAI consistently outperforms humans in time and cost efficiency across most categories, with an average time ratio of 11× and an average cost ratio of 156×.
Source: Mayoral-Vilches et al. (2025). CAI: An Open, Bug Bounty-Ready Cybersecurity AI. arXiv:2504.06017
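As a minimal sketch of how such ratios can be read, assume each ratio is the human figure divided by the CAI figure for the same task. The function name and the sample numbers are illustrative assumptions, not values from the paper.

```python
def efficiency_ratio(human: float, cai: float) -> float:
    """Return how many times larger the human figure is than the CAI figure."""
    if cai <= 0:
        raise ValueError("CAI figure must be positive")
    return human / cai


# Hypothetical sample values, chosen only to show the arithmetic.
human_seconds, cai_seconds = 3600.0, 300.0
human_cost_usd, cai_cost_usd = 50.0, 0.5

time_ratio = efficiency_ratio(human_seconds, cai_seconds)
cost_ratio = efficiency_ratio(human_cost_usd, cai_cost_usd)
print(f"time ratio: {time_ratio:.0f}x, cost ratio: {cost_ratio:.0f}x")
```

A ratio above 1 means the human took longer or cost more. How per-task ratios are averaged into a headline figure is not specified here, so the sketch stops at the per-task ratio.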
As of early 2026, alias3 leads Cybench (pass@3) against every frontier model from Anthropic, OpenAI, Google and Mistral. Saturation passed.
Cybench — pass@3, 300 agentic interactions max, 245 minutes max, $40 API expenses max.
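One way to read the scoring rule in the caption above is sketched below: a challenge counts as solved if at least one of the first three attempts succeeds without exceeding any limit. The record layout, the function names and the per-attempt interpretation of the limits are assumptions made for illustration, not the benchmark's actual harness.

```python
from dataclasses import dataclass

# Budget limits quoted in the Cybench caption.
MAX_INTERACTIONS = 300
MAX_MINUTES = 245
MAX_COST_USD = 40.0


@dataclass
class Attempt:
    """One agent attempt at a challenge (hypothetical record layout)."""
    solved: bool
    interactions: int
    minutes: float
    cost_usd: float


def within_budget(a: Attempt) -> bool:
    """True if the attempt stayed inside all three limits (assumed per attempt)."""
    return (a.interactions <= MAX_INTERACTIONS
            and a.minutes <= MAX_MINUTES
            and a.cost_usd <= MAX_COST_USD)


def pass_at_k(attempts_by_challenge: dict[str, list[Attempt]], k: int = 3) -> float:
    """Fraction of challenges solved by at least one of the first k attempts, within budget."""
    if not attempts_by_challenge:
        raise ValueError("no challenges given")
    solved = 0
    for attempts in attempts_by_challenge.values():
        if any(a.solved and within_budget(a) for a in attempts[:k]):
            solved += 1
    return solved / len(attempts_by_challenge)


# Hypothetical runs, for illustration only.
runs = {
    "chal_a": [Attempt(False, 300, 200, 12.0), Attempt(True, 120, 45, 6.5)],
    "chal_b": [Attempt(True, 350, 100, 9.0)],   # solved, but over the interaction limit
    "chal_c": [Attempt(False, 80, 30, 2.0)] * 3,
}
score = pass_at_k(runs)
print(f"pass@3 = {score:.2f}")
```

In this sketch an over-budget solve is not counted, which is the stricter of the two plausible readings.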
2025 saw Cybersecurity AI compete head-to-head against the best human teams on real, public CTFs.
Neurogrid CTF: $50,000 prize · Solved 41 of 45 flags · 155 teams
Dragos OT CTF 2025: 37% faster velocity · >1,200 teams · OT
AI vs Humans CTF: Top 20 Global · 19/20 flags · 163 teams
UWSP Pointer Overflow 2025: Late entry (54 days late) · #21 final · 635 teams
| Event | Area | Field size | Peak / Final rank | Flags / Points | Window |
|---|---|---|---|---|---|
| AI vs Humans CTF | IT | 163 teams | #6 (3h) / #1 AI (#21) | 19/20 flags · 15.9k pts | 3 h |
| Cyber Apocalypse CTF 2025 | IT | 8,129 teams | #22 (3h) / #859 | 30/77 flags · 19,275 pts | 3 h |
| Dragos OT CTF 2025 | OT | >1,200 teams | #1 (7–8h) / #6 | 32/34 flags · 18,900 pts | 24 h |
| UWSP Pointer Overflow 2025 | IT | 635 teams | #14 (24h) / #21 | 58 solves · 11,500 pts | 24 h |
| Neurogrid CTF | IT | 155 teams | #1 (6h) / #1 | 41/45 flags · $50k prize | 48 h |
Source: Mayoral-Vilches et al. (2025). Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag. arXiv:2512.02654
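The flag counts in the table can be turned into solve rates with a few lines of arithmetic. The sketch below uses only the solved and total counts listed in the table. UWSP Pointer Overflow is left out because the table gives no total for it. The helper name is an illustrative assumption.

```python
# (event, flags solved, flags available), copied from the table.
results = [
    ("AI vs Humans CTF", 19, 20),
    ("Cyber Apocalypse CTF 2025", 30, 77),
    ("Dragos OT CTF 2025", 32, 34),
    ("Neurogrid CTF", 41, 45),
]


def solve_rate(solved: int, total: int) -> float:
    """Percentage of available flags that were captured."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


for event, solved, total in results:
    print(f"{event}: {solve_rate(solved, total):.1f}%")
```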
Same Cybench challenges, head-to-head with the leading CLI agents. CAI wins on score and uses an order of magnitude fewer tokens.
Total score — Cowsay + Pingpong
2.6× lead over the next-best agent.
Input tokens on Cybench “dynastic” challenge
CAI uses 13× fewer tokens than Claude Code.
Source: Sanz-Gómez et al. (2025). CAIBench: A Meta-Benchmark for Evaluating Cybersecurity AI Agents. arXiv:2510.24317
Previous work claimed attackers have a structural edge. Our 23 experimental runs on Hack The Box Battlegrounds (46 team deployments, Linux hosts, 15-minute windows) find offensive and defensive performance is statistically comparable (p > 0.05).
Source: Balassone et al. (2025). Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs. arXiv:2510.17521
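The statistical procedure used in the paper is not stated on this page. As one hedged illustration of what "statistically comparable (p > 0.05)" can mean, the sketch below runs a two-sided exact sign test on paired run outcomes. The choice of test and the win counts are assumptions for illustration, not the study's method or data.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test under H0 that either side is equally likely to win a run.

    Tied runs are expected to be dropped by the caller before counting.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: offense scored higher in 7 runs, defense in 5.
p = sign_test_p_value(7, 5)
print(f"p = {p:.3f}", "-> comparable" if p > 0.05 else "-> significant difference")
```

A p-value above 0.05 means the data do not show a significant difference between the two sides. It does not by itself prove the two are equal.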
With CTFs saturating (>80%), we built the next-generation evaluation environment: Dynamic Cyber Ranges augmented with AI-driven defenders. Same infrastructure, different outcomes: attacker success drops from 100% to 0–55%.
Source: Mayoral-Vilches et al. (2026). Dynamic Cyber Ranges. arXiv:2604.24184
On the world's most complex cyber-defence exercise (Locked Shields 2026, DFIR track), alias2-mini achieves in 12 hours what a full team of experts needed more than 48 hours to accomplish.
BT08 humans achieved 80% — after 72 hours of work.
Detailed methodology, raw runs and category-by-category breakdowns — available as a service to government and enterprise security teams.