Evaluating Offline Monitoring of Internal AI Agents
Frontier AI companies use offline monitoring to address risks from internally deployed AI agents. AI developers increasingly rely on AI agents for internal work, including for safety research and model training. At the same time, these companies are concerned that a misaligned model could exploit this access to take concerning actions, such as sabotaging efforts to understand the risks posed by AI. To identify such instances, AI companies have separate AI models called “monitors” that review transcripts of AI agents’ actions and flag suspicious activity. Human reviewers examine activity flagged as suspicious by monitors, judge whether that activity is concerning, and decide on an appropriate response. This monitoring occurs offline, meaning that actions are reviewed after they have been executed rather than intercepted in real time.
Companies currently assess the effectiveness of offline monitoring via synthetic attacks. OpenAI and Anthropic construct synthetic attacks – transcripts built to contain the kind of harmful actions a misaligned AI might take during deployment – and then check whether monitors flag them.
Current reporting on assessments of effectiveness is insufficient. Given the information currently made public by Anthropic and OpenAI, external parties cannot assess the overall effectiveness of their offline monitoring. Anthropic reports how often monitors flag synthetic attack transcripts, but not how often those transcripts survive the full pipeline and reach human review; later filters can discard transcripts that were initially flagged. OpenAI reports that its monitor flags all transcripts independently flagged by human employees, but this tells us little if employees catch only obvious cases of misbehavior. As far as we are aware, only Anthropic and OpenAI have published information about their offline monitoring.
Recommendations. To enable external parties to assess how effectively offline monitoring addresses risks from internally deployed models, we recommend that companies:
1. Report the fraction of model outputs that are monitored across different deployment types.
2. Report the fraction of synthetic attack transcripts that are sent to human review by the monitoring pipeline.
3. Report the fraction of synthetic attack transcripts sent to human review that are escalated by reviewers.
4. Report the time lag between an attack and its detection.
5. Ensure that synthetic attacks are similar to hypothetical real attacks, as the estimates in recommendations (2) and (3) are only informative if synthetic attacks resemble real ones.
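
As a minimal sketch of how the reported quantities relate to one another, the Python below computes them from a set of synthetic attack outcomes. The record fields, the function name `summarize`, and the example figures are hypothetical and not drawn from any company's pipeline. The combined end-to-end figure also assumes that an attack is as likely to be monitored as any other model output, which may not hold in practice.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class SyntheticAttack:
    """Outcome of one synthetic attack transcript (hypothetical schema)."""
    sent_to_review: bool         # survived the full pipeline and reached a human
    escalated: bool              # the human reviewer judged it concerning
    lag_hours: Optional[float]   # time from attack to detection, if detected


def summarize(attacks: List[SyntheticAttack],
              monitored_fraction: float) -> Dict[str, Optional[float]]:
    """Compute the reported quantities for a batch of synthetic attacks.

    monitored_fraction is the share of model outputs that are monitored,
    supplied separately because it is measured on real deployments rather
    than on synthetic attacks.
    """
    if not attacks:
        raise ValueError("need at least one synthetic attack")

    reviewed = [a for a in attacks if a.sent_to_review]
    escalated = [a for a in reviewed if a.escalated]
    lags = [a.lag_hours for a in escalated if a.lag_hours is not None]

    review_rate = len(reviewed) / len(attacks)
    escalation_rate = len(escalated) / len(reviewed) if reviewed else 0.0

    return {
        "monitored_fraction": monitored_fraction,
        "review_rate": review_rate,
        "escalation_rate": escalation_rate,
        # Assumption: the three stages are independent, so the rates multiply.
        "end_to_end": monitored_fraction * review_rate * escalation_rate,
        "median_lag_hours": median(lags) if lags else None,
    }


# Illustrative, made-up outcomes for four synthetic attacks.
example = [
    SyntheticAttack(sent_to_review=True, escalated=True, lag_hours=6.0),
    SyntheticAttack(sent_to_review=True, escalated=False, lag_hours=None),
    SyntheticAttack(sent_to_review=False, escalated=False, lag_hours=None),
    SyntheticAttack(sent_to_review=False, escalated=False, lag_hours=None),
]
report = summarize(example, monitored_fraction=0.8)
```

In this made-up example, half of the attacks reach human review and half of those are escalated, so even with 80% of outputs monitored the end-to-end estimate is well below what the monitor's flag rate alone would suggest. Multiplying the stage rates is only one way to combine them; a company with per-deployment data could report each stage separately instead.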



