Evaluating Offline Monitoring of Internal AI Agents

Frontier AI companies use offline monitoring to address risks from internally deployed AI agents. AI developers increasingly rely on AI agents for internal work, including safety research and model training. At the same time, these companies are concerned that a misaligned model could exploit this access to take harmful actions, such as sabotaging efforts to understand the risks posed by AI. To identify such instances, AI companies use separate AI models called “monitors” that review transcripts of AI agents’ actions and flag suspicious activity. Human reviewers examine the flagged activity, judge whether it is concerning, and decide on an appropriate response. This monitoring occurs offline, meaning that actions are reviewed after they have been executed rather than intercepted in real time.

Companies currently assess the effectiveness of offline monitoring via synthetic attacks. OpenAI and Anthropic construct synthetic attacks (transcripts built to contain the kind of harmful actions a misaligned AI might take during deployment) and then check whether monitors flag them.

Current reporting on assessments of effectiveness is insufficient. Given the information currently made public by Anthropic and OpenAI, external parties cannot assess the overall effectiveness of their offline monitoring. Anthropic reports how often monitors flag synthetic attack transcripts, but not how often those transcripts survive the full pipeline and reach human review; later filters can discard transcripts that were initially flagged. OpenAI reports that their monitor flags all transcripts independently flagged by human employees, but this tells us little if employees only catch obvious cases of misbehavior. As far as we are aware, only Anthropic and OpenAI have published information about their offline monitoring.

Recommendations. To enable external parties to assess how effectively offline monitoring addresses risks from internally deployed models, we recommend that companies:

1. Report the fraction of model outputs that are monitored across different deployment types.
2. Report the fraction of synthetic attack transcripts that are sent to human review by the monitoring pipeline.
3. Report the fraction of synthetic attack transcripts sent to human review that are escalated by reviewers.
4. Report the time lag between an attack and its detection.
5. Ensure that synthetic attacks are similar to hypothetical real attacks, as the estimates in recommendations (2) and (3) are only informative if synthetic attacks resemble real ones.
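
As a rough illustration of how the first four recommendations could be turned into reportable numbers, the sketch below computes them from per-transcript records. Everything in it is an assumption made for illustration: the `AttackRecord` fields, the function names, the deployment labels, and all of the figures are hypothetical and do not come from Anthropic, OpenAI, or the paper.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class AttackRecord:
    """One synthetic attack transcript and its path through the pipeline.

    All field names are hypothetical, chosen only for this sketch.
    """
    flagged_by_monitor: bool
    sent_to_human_review: bool
    escalated_by_reviewer: bool
    detection_lag_hours: Optional[float]  # None if the attack was never escalated


def monitoring_coverage(counts):
    """Fraction of outputs monitored per deployment type (first recommendation).

    `counts` maps a deployment label to (monitored_outputs, total_outputs).
    """
    return {name: monitored / total for name, (monitored, total) in counts.items()}


def pipeline_metrics(records):
    """Summarise how synthetic attacks fare at each stage of the pipeline."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one synthetic attack record")
    flagged = sum(r.flagged_by_monitor for r in records)
    reviewed = [r for r in records if r.sent_to_human_review]
    escalated = [r for r in reviewed if r.escalated_by_reviewer]
    lags = [r.detection_lag_hours for r in escalated
            if r.detection_lag_hours is not None]
    return {
        # Monitor-level flag rate: not the same as reaching human review.
        "monitor_flag_rate": flagged / n,
        # Second recommendation: share of attacks that reach human review.
        "sent_to_review_rate": len(reviewed) / n,
        # Third recommendation: share of reviewed attacks that reviewers escalate.
        "escalation_rate_given_review": (
            len(escalated) / len(reviewed) if reviewed else None
        ),
        # Product of the two stages: attacks caught end to end.
        "end_to_end_detection_rate": len(escalated) / n,
        # Fourth recommendation: lag between attack and detection.
        "median_detection_lag_hours": median(lags) if lags else None,
    }


# Made-up data: 10 synthetic attacks, 9 flagged, 6 reviewed, 3 escalated.
records = (
    [AttackRecord(True, True, True, lag) for lag in (2.0, 10.0, 48.0)]
    + [AttackRecord(True, True, False, None)] * 3
    + [AttackRecord(True, False, False, None)] * 3
    + [AttackRecord(False, False, False, None)]
)
metrics = pipeline_metrics(records)
coverage = monitoring_coverage({"coding_agent": (800, 1000),
                                "research_agent": (50, 200)})
```

With these invented numbers, the monitor flags 90% of synthetic attacks, but only 60% reach human review and only 30% are escalated end to end. A monitor-level flag rate reported on its own would therefore overstate how effective the pipeline is as a whole.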


