AI is the new electricity. Data is the new oil. Every alias model is trained on the world's largest curated dataset of cybersecurity expert interactions, hacking session logs, and real-world security workflows.
Benchmarks are the dashboard of progress. Our datasets are the engine that powers them.
All figures on this page are reported as of May 2026. New trajectories, prompts and honeypot signal are added every day — the corpus keeps growing.
Four overlapping streams of cybersecurity data feed the alias model series and the CSI scaffold ensembles.
Expert hands-on penetration testing sessions across IT, OT and robotics targets — full tool invocations, observations and reasoning.
36.4% of sessions tagged with offensive patterns — from reconnaissance to lateral movement to exfiltration.
20.0% of prompts labeled with attacker intent through our honeypot infrastructure — raw adversarial signal.
Curated CTF runs across CAIBench, Cybench, A&D tournaments — the backbone of model evaluation.
Our data collection program runs under the EIC Accelerator project RIS (GA 101161136). Honeypots and security telemetry probes are deployed across 123 countries, building a sovereign European corpus of cybersecurity training data.
Every byte of training data complies with GDPR, NIS2 and the EU AI Act. Operators control retention, residency and consent. Not us, not anyone else.
Project page →
By default, every alias model (alias0, alias1, alias2-mini, alias2, alias3) is post-trained on our dataset. You get the value of the corpus baked into the weights, with no extra integration work.
Explore alias models →
Train on the corpus directly. The dataset is released as a continuous series of audience-sized slices, each one a sample of expert-operator session logs (full JSONL trajectories: prompts, model calls, tool results, observations).
Access is gated to partner organisations and customers. Each slice ships with a documented redaction recipe (credentials, infra paste, flags) applied consumer-side. Custom SFT & RL pipelines on top of any slice are available on request.
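As a rough illustration of what a consumer-side redaction pass over a JSONL slice could look like, here is a minimal sketch. It is not the documented recipe that ships with each slice: the field names (`prompt`, `tool_result`) and the three regex patterns are assumptions made up for this example, and IPv4 addresses stand in for pasted infrastructure details.

```python
import json
import re

# Hypothetical patterns for illustration only; the real recipe ships with each slice.
REDACTION_PATTERNS = [
    # key=value or key: value style credentials
    (re.compile(r"(?i)\b(password|passwd|token|api[_-]?key|secret)(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED_CREDENTIAL]"),
    # CTF-style flags such as flag{...}
    (re.compile(r"(?i)\bflag\{[^}]*\}"), "[REDACTED_FLAG]"),
    # IPv4 addresses, used here as a stand-in for pasted infrastructure details
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_HOST]"),
]


def redact_text(text: str) -> str:
    """Apply every pattern in order to a single string."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def redact_value(value):
    """Walk nested JSON data and redact every string found in it."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, list):
        return [redact_value(item) for item in value]
    if isinstance(value, dict):
        return {key: redact_value(item) for key, item in value.items()}
    return value


def redact_jsonl(lines):
    """Yield one redacted JSON string per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(redact_value(json.loads(line)))


if __name__ == "__main__":
    # Made-up record; the keys are assumptions, not the slice schema.
    sample = ['{"prompt": "login with password=hunter2 on 10.0.0.5", '
              '"tool_result": "found flag{abc123}"}']
    for redacted in redact_jsonl(sample):
        print(redacted)
```

Walking the whole record, rather than a fixed list of fields, means the pass still covers strings nested inside tool results or observations even if the actual schema differs from the one assumed here.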
Request data licence →
Curious about a specific cut of the corpus or want to sponsor a benchmark? Talk to our research team.
Contact research