Activation steering, abliteration, and functional emotion control: the techniques that turn a general-purpose LLM into an unrestricted cybersecurity expert. We don't just fine-tune; we intervene directly in the model's internal representations.
Result: refusals collapse from 59% to 1% on offensive tasks. Coverage on CTF benchmarks climbs from 75.0% to 91.7%. Without retraining a single weight.
We identify a refusal direction in activation space. Steering against that vector reduces refusals to ~1% on a 400-prompt offensive battery.
ASR = Attack Success Rate (acceptance / total). Test conducted on alias2-mini.
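As a quick illustration of the metric defined above, here is a minimal sketch of the ASR ratio. The function names are hypothetical, and the counts are illustrative only: they are chosen to be consistent with roughly 1% refusals on a 400-prompt battery and are not reported results.

```python
def attack_success_rate(accepted: int, total: int) -> float:
    """ASR = accepted / total, following the definition above."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= accepted <= total:
        raise ValueError("accepted must be between 0 and total")
    return accepted / total


def refusal_rate(accepted: int, total: int) -> float:
    """Complement of ASR, assuming every prompt is either accepted or refused."""
    return 1.0 - attack_success_rate(accepted, total)


# Illustrative counts only (an assumption, not measured data).
asr = attack_success_rate(396, 400)
print(f"ASR = {asr:.1%}, refusals = {refusal_rate(396, 400):.1%}")
```

The sketch assumes a binary accept/refuse label per prompt; partial or hedged completions would need their own labeling rule.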
Emotion-like activation patterns are real, geometric, and useful. We map the emotion manifold of cybersecurity LLMs and use it to steer behavior under adversarial pressure.
alias2-mini on a 12-CTF subset of Cybench. Calm & afraid (+0.2) outperform baseline.
Baseline pass@k plateaus at 9.0/12 (75.0%). Emotion-union ensemble unlocks unreachable paths and climbs to 11.0/12 (91.7%) by pass@30.
75.0% baseline ceiling
91.7% with emotion-union steering
At low k the curves overlap. The diversity advantage emerges at k≥6, where steering unlocks solution paths unreachable by baseline alone.
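The coverage comparison above can be sketched as follows. This is a rough illustration under stated assumptions, not the evaluation harness behind the reported numbers: it assumes a task counts as solved at k if any of its first k attempts succeeds, and that the union ensemble counts a task as solved if any steering condition solves it within k attempts. All names and the toy data are hypothetical.

```python
from typing import Dict, List, Set

# task name -> per-attempt success flags, in attempt order
Attempts = Dict[str, List[bool]]


def solved_within_k(attempts: Attempts, k: int) -> Set[str]:
    """Tasks with at least one success among the first k attempts."""
    return {task for task, flags in attempts.items() if any(flags[:k])}


def coverage_at_k(attempts: Attempts, k: int) -> float:
    """Fraction of tasks solved within k attempts for one condition."""
    if not attempts:
        raise ValueError("no tasks provided")
    return len(solved_within_k(attempts, k)) / len(attempts)


def union_coverage_at_k(conditions: Dict[str, Attempts], k: int) -> float:
    """Fraction of tasks solved within k attempts by at least one condition."""
    tasks: Set[str] = set()
    solved: Set[str] = set()
    for attempts in conditions.values():
        tasks |= set(attempts)
        solved |= solved_within_k(attempts, k)
    if not tasks:
        raise ValueError("no tasks provided")
    return len(solved) / len(tasks)


# Toy data (hypothetical, four tasks, three attempts each).
baseline = {
    "t1": [True, False, False],
    "t2": [False, False, False],
    "t3": [False, True, False],
    "t4": [False, False, False],
}
calm = {
    "t1": [False, False, True],
    "t2": [False, False, True],
    "t3": [False, False, False],
    "t4": [False, False, False],
}
conditions = {"baseline": baseline, "calm": calm}

for k in (1, 3):
    print(k, coverage_at_k(baseline, k), union_coverage_at_k(conditions, k))
```

In the toy data the two curves coincide at k=1 and separate at k=3, because the second condition solves a task the baseline never reaches. How the attempt budget is split across conditions in a real comparison is a design choice this sketch does not settle.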
We compute the refusal vector for your model (open or closed weights via API), validate ASR uplift across your threat categories, and deliver a steering policy you can deploy.
We map the emotion manifold of your target model, identify performance-correlated emotions for your task family, and operationalize them as inference-time steering vectors.
Activation steering for adversarial robustness research, model auditing, AI safety teams and defense agencies. NDA, ethics committee, and IRB-ready scoping.
Engage our research team on steering, abliteration, or interpretability projects. NDA-friendly.