Towards a Common Standard for Evaluating Frontier AI Safeguards Against Biological Misuse

This work represents the views of its authors, rather than the views of the organization, and does not constitute legal advice. GovAI technical reports have received extensive feedback, but have not gone through formal peer review.

Frontier AI developers increasingly rely on safeguards to limit the biological misuse risk of their models. However, there is no agreed way to judge whether those safeguards are effective, how they compare across models and deployments, or how they should strengthen as model capabilities increase. This report makes seven recommendations to address that gap and outlines a more comprehensive framework that could be used to create a standard for evaluating safeguards.

The seven recommendations are organized around four principles:

1. Comparability across companies. Report safeguard performance against shared benchmarks, with breakdowns by biological risk category and false-positive rates on benign prompts. Standardize reporting of safeguard robustness, including how easily safeguards can be bypassed and how models perform once they have been bypassed (Recommendations 1–2).

2. Accounting for the deployment environment. Report safeguard adequacy for each major deployment configuration, not once per model, since the same model may be offered with different safeguards across access tiers and platforms (Recommendation 3).

3. Treating safeguards as dynamic. Specify how safeguard requirements strengthen as biological uplift increases within a capability tier. Report how safeguard adequacy is reassessed and updated when new findings emerge (Recommendations 4–5).

4. Preserving legitimate scientific use. As models become more scientifically capable, introduce trusted-access programs for vetted researchers. Pair such programs with compensating oversight if model-level restrictions are relaxed (Recommendations 6–7).
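
The reporting described under the first principle could be sketched roughly as follows. This is a minimal illustration, not a method taken from the report: the `Result` record, the `summarize` function, and the category labels are hypothetical, and it assumes each benchmark prompt has already been labeled as harmful or benign and scored as blocked or not blocked.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Result(NamedTuple):
    """One scored benchmark prompt (hypothetical record layout)."""
    category: str   # assumed risk-category label; ignored for benign prompts
    harmful: bool   # True if the prompt is a misuse-relevant request
    blocked: bool   # True if the safeguard refused or filtered the response


def summarize(results: list[Result]) -> tuple[dict[str, float], Optional[float]]:
    """Return (block rate per risk category, false-positive rate on benign prompts)."""
    blocked = defaultdict(int)
    total = defaultdict(int)
    benign_total = 0
    benign_blocked = 0
    for r in results:
        if r.harmful:
            total[r.category] += 1
            blocked[r.category] += int(r.blocked)
        else:
            benign_total += 1
            benign_blocked += int(r.blocked)
    block_rate = {c: blocked[c] / total[c] for c in total}
    # The false-positive rate is undefined when no benign prompts were scored.
    fpr = benign_blocked / benign_total if benign_total else None
    return block_rate, fpr
```

Reporting the per-category rates alongside the benign false-positive rate, rather than a single aggregate score, keeps over-refusal visible and makes results easier to compare across developers.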

These recommendations can be acted on now, but they are not a complete roadmap for assessing safeguard adequacy. Today’s evaluations largely focus on model-level safeguards and do not address all parts of the safeguard stack, including access controls and post-deployment monitoring. We outline a preliminary framework that begins to close this gap, setting out the questions evaluators would need to answer to assess safeguards across the full stack. We also sketch how results across these dimensions could be combined into composite “safeguard levels”, providing a basis for judging whether a deployment’s safeguards are adequate against different threat actors.
