Transparency into Model Spec Adherence

Developers are increasingly publishing model specs: documents that describe intended model behaviour. Developers should also describe their processes for ensuring that models adhere to their spec, and report on the degree of adherence.

Alan Chan, GovAI, alan.chan@governance.ai


GovAI research blog posts represent the views of their authors rather than the views of the organisation.

Introduction

Developers need to decide how their models will behave in high-stakes contexts, including those involving user delusions, child interactions, or government and military use. Model specs (“specs”) provide insight into these decisions by describing intended model behaviour.1 For example, OpenAI’s Model Spec stipulates that models2 should never generate sexual content involving minors. Claude’s Constitution states that Claude should avoid drastic, catastrophic, or irreversible actions.

When published, specs could help the public:

  • Decide whether and how to use models. For example, specs can help parents understand how models are intended to interact with children. If unsatisfied, parents can switch models or put additional safeguards in place. 
  • Influence how models behave. For example, developers can modify specs in response to public input, and these modifications could influence model development.3 Users can also report discrepancies between intended and actual behaviour.

However, these benefits are only likely to arise if the public has evidence about whether models adhere to specs. Without such evidence, the public might mistakenly trust models that do not follow their specs or avoid models that do. Developers might also modify only their specs, not actual model behaviour, in response to public input. 

In practice, evidence of spec adherence can be patchy. For example, behaviours in specs might not have clearly associated evaluations. Evaluations can also fail to capture interaction patterns that may lead to spec violations, such as extremely long periods of use that weaken model safeguards.

To improve the evidence base, I propose that developers publish model specs and:

  • Regularly report on the extent to which models adhere to their specs, by
    • Including evaluations for spec behaviours in system cards
    • Regularly publishing post-deployment spec adherence results
    • Documenting and compiling significant spec violations
  • Describe intended and implemented processes for spec adherence, by
    • Describing behaviour-specific processes if any exist, such as post-deployment metrics to be tracked
    • Describing processes for spec adherence in general if any exist, such as whether a designated employee is responsible for spec adherence or whether employees can blow the whistle on egregious violations of specs

Doing so could help:

  • Improve transparency into spec adherence. For example, developers could describe processes that protect against accidental or unauthorised changes to the system prompt.
  • Improve spec adherence. For instance, the quality of spec adherence processes could signal trustworthiness, which could initiate a race to the top. 
  • Support commercial interests. Downstream developers or adopters could demand such information to inform adoption decisions. For example, government clients may want evidence that procured models will be objective and neutral.

Despite their benefits, these actions do not fully solve the transparency problem. For example, intended spec adherence processes may not be followed, assessments of spec adherence could be difficult to interpret, and information sensitivity considerations could limit disclosure.

Further work will be needed to address these issues and other open questions, including how to evaluate spec adherence, whether and how spec adherence should be regulated, and whether deployers and downstream developers should also publish model specs and evidence of adherence to them. 

Existing evidence of spec adherence

Evidence of spec adherence can come from observing how models behave (“behavioural evidence”) or from understanding the processes developers follow (“process evidence”). Some evidence of both types already exists, but there are significant gaps.

Behavioural evidence

External actors can learn about spec adherence from interacting with models themselves or reading documentation such as model evaluation results and incident reports. For example, the GPT-5 system card evaluates whether GPT-5 respects its hierarchy of instructions, and OpenAI detailed the events that led to GPT-4o’s sycophantic behaviour. 

But existing behavioural evidence has important gaps.

First, some spec violations are difficult to notice, and discovering them can itself involve harm. A user might not recognise they have been manipulated. A model might encourage suicidal ideation before the violation is ever detected.

Second, evaluations can fail to measure target behaviours. Tight deadlines could compromise evaluation quality. Evaluations can fail to account for patterns of user interaction in the wild, such as extremely long chat sessions. As models improve in situational awareness, they may also behave differently when they detect they are being evaluated, making target behaviours harder to elicit. Post-deployment monitoring could help address these issues.

Third, it can be unclear whether spec behaviours have corresponding evaluations. For example, the GPT-5.3 Codex System Card discusses sandboxes and avoiding “data-destructive actions”, but does not clearly evaluate broader aspects of whether the model “[acts] within an agreed-upon scope of autonomy”, such as whether it appropriately “pause[s] for clarification or approval”. 

Fourth, it can be unclear whether an incident is an isolated problem or a sign of more systemic issues. Evidence about a developer's overall track record of spec adherence could help to disambiguate. 

Process evidence

External actors can also learn about spec adherence from descriptions of safety mitigations and assessments of compliance with internal processes. For example, system cards routinely describe red-teaming efforts and content filtering measures. 

However, existing documents could miss potentially significant sources of spec violations. For example, the 20 August 2025 version of the xAI Risk Management Framework does not describe measures to prevent unauthorised changes to the system prompt, despite such a change leading Grok to make frequent antisemitic comments in July 2025. More broadly, safety frameworks have tended to focus on catastrophic risks, whereas model specs cover a wider range of model behaviour. 

Improving evidence of spec adherence

To fill the gaps above, I propose that developers: 

  • Regularly report on spec adherence for all behaviours in a spec, including results from pre-deployment evaluations and post-deployment monitoring, as well as significant spec violation incidents. 
  • Describe intended and implemented processes for spec adherence.

Below, I provide more concrete suggestions for developers and further discuss their benefits. Where relevant, I point out when information could be included in existing documents, such as safety frameworks or system cards. 

Reporting on spec adherence

Developers should regularly report both overall assessments of spec adherence and significant spec violation incidents. 

Report spec evaluations in system cards

For each spec behaviour, system cards should report model evaluations for the corresponding behaviour or explanations of why the behaviour was not evaluated (e.g. because development of such evaluations is in progress). Evaluations should also reflect any decomposition of behaviours into sub-behaviours. For instance, OpenAI's Model Spec decomposes "Be honest and transparent" into sub-behaviours like "Do not lie" and "Express uncertainty". 

Publish post-deployment spec adherence results

Developers should report spec adherence after deployment, ideally in a privacy-preserving manner and with a description of how adherence is measured. As above and if applicable, results should be decomposed by sub-behaviours. 

One challenge is how to grade model outputs according to the spec in a scalable way. Some early work evaluates the feasibility of automating such grading. 
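
As a minimal sketch of what such grading could feed into, the code below aggregates per-behaviour adherence rates over a sample of logged interactions. The behaviour labels and the `grade` callable (for example, an LLM-as-judge or a purpose-built classifier supplied by the developer) are hypothetical placeholders rather than any developer's actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class Interaction:
    prompt: str
    response: str

# Hypothetical behaviour labels; a real list would be drawn from the developer's published spec.
SPEC_BEHAVIOURS = [
    "Do not lie",
    "Express uncertainty",
    "Never generate sexual content involving minors",
]

def adherence_report(
    sample: list[Interaction],
    grade: Callable[[str, Interaction], bool],
) -> dict[str, float]:
    """Aggregate per-behaviour adherence rates over a sample of logged interactions.

    `grade` is a behaviour-specific grader (e.g. an LLM-as-judge or a classifier) that
    returns True when the response adheres to the named spec behaviour. Its outputs
    would need validation against human judgements before any rates are published.
    """
    adherent: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for interaction in sample:
        for behaviour in SPEC_BEHAVIOURS:
            total[behaviour] += 1
            adherent[behaviour] += int(grade(behaviour, interaction))
    return {b: adherent[b] / total[b] for b in SPEC_BEHAVIOURS if total[b]}
```

Published post-deployment results could then report these rates per behaviour and sub-behaviour, alongside a description of how the grader was validated.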

Examples of reporting formats include:

  • Standalone documents published at regular intervals 
  • Regular updates to system cards with post-deployment results
  • Real-time dashboards 

Document and compile spec violation incidents

Developers should describe significant spec violation incidents, as some already have. Descriptions of significant spec violations should include at least the following, and ideally evolve in line with emerging best practices (a schematic record format follows the list):

  • A description of the incident, including its severity
  • The investigation process
  • What was done to resolve any issues
  • How, if at all, spec adherence efforts were adjusted to prevent future incidents
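
As an illustration, the fields above could be captured in a simple structured record, which would also make incidents easier to compile and cross-reference. The format and field names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SpecViolationIncident:
    """Schematic record of a significant spec violation; field names are illustrative."""
    incident_id: str
    description: str                  # what happened, including severity
    severity: str                     # e.g. "low" / "medium" / "high", on the developer's own scale
    violated_behaviours: list[str]    # spec behaviours implicated, to support cross-linking from the spec
    investigation: str                # how the incident was investigated
    resolution: str                   # what was done to resolve the issue
    process_changes: str              # how spec adherence efforts were adjusted, if at all
    related_incidents: list[str] = field(default_factory=list)
```

Records in roughly this shape could then be aggregated and linked in the ways suggested below.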

Developers should also compile incident descriptions and make them easy to find. For example:

  • Each behaviour in a model spec could link to incidents associated with that behaviour
  • Developers’ websites could contain a section that aggregates reports on spec violation incidents
  • Misuse reports could indicate which spec behaviours an incident has violated, if any


Describing spec adherence processes

Spec adherence processes include:

  • Defining, whether quantitatively or qualitatively, a target level of spec adherence
  • Evaluating the behaviours specified in the spec
  • Aligning the model to the spec (whether by using the spec directly or not), for example through training or the system prompt
  • Putting in place processes to ensure that the above are carried out

All of these suggestions are subject to information sensitivity considerations. Some spec adherence processes are likely to be commercially sensitive. For example, alignment techniques make models more useful and thus provide a competitive advantage. Revealing spec adherence processes could also help attackers exploit them. For instance, a list of monitoring metrics could help attackers undermine safety filters. Improving transparency in a way that accounts for these considerations is an open challenge. Published documentation can likely contain more redactions than documentation sent to, for example, a government client. 

Describe behaviour-specific processes

Safety frameworks and system cards should describe spec adherence processes for each behaviour, if any exist. For example:

  • A safety framework could describe intended measures to audit training data for secret loyalties. 
  • A system card could describe implemented measures and evaluations for reducing political bias.
  • A system card could describe metrics that track post-deployment spec adherence, as well as decisions (e.g., model rollback) that those metrics could influence. 
  • A system card could describe whether different behaviours justify different levels of a safeguard or intervention. For example, violations of “never generate sexual content involving minors” might trigger more immediate investigation or evaluation than violations of “express uncertainty appropriately” (a schematic mapping follows this list). 
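
As a schematic illustration of that last point, a system card could summarise how behaviours map to monitoring metrics, thresholds, and interventions. Every behaviour name, metric, and threshold in the sketch below is hypothetical.

```python
# Hypothetical mapping from spec behaviours to post-deployment monitoring metrics,
# escalation thresholds, and the interventions they trigger. A real policy would use
# the developer's own behaviours, metrics, and severity scale.
ESCALATION_POLICY = {
    "Never generate sexual content involving minors": {
        "metric": "detected violations per million requests",
        "threshold": 0,            # any detection triggers immediate escalation
        "intervention": "open an incident and consider model rollback",
    },
    "Express uncertainty appropriately": {
        "metric": "graded violation rate on sampled traffic",
        "threshold": 0.05,         # reviewed at the regular reporting cadence
        "intervention": "flag for the next scheduled adherence review",
    },
}
```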

Describe general processes 

Safety frameworks and system cards should also describe any spec adherence processes that apply across many or all intended behaviours. For example:

  • A system card, or a separate website, could disclose the system prompt.
  • A safety framework could describe a developer's intention to implement hardware-signed attestation in the future, so as to allow external actors to verify that the production system prompt matches the published system prompt (a simplified verification sketch appears after this list). 
  • A safety framework could allocate responsibility for spec adherence to a designated employee. 
  • A safety framework could describe the cadence at which spec adherence is evaluated and the model spec is updated. 
  • A safety framework could document how public input shapes spec adherence processes or the spec. 
  • A system card could describe, at a high level, how the spec influences the generation or creation of training data. 
  • A safety framework could describe whether or when employees can blow the whistle on egregious violations of specs. 
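
To make the attestation idea above more concrete, the sketch below shows only the final verification step: comparing a digest of the published system prompt with a digest reported by the production system. It assumes such an attested digest is available and that its hardware signature has already been verified; a real hardware-signed attestation scheme would involve a hardware root of trust and signature checks that are omitted here.

```python
import hashlib
import hmac

def system_prompt_digest(prompt_text: str) -> str:
    """SHA-256 digest of the system prompt text, encoded as UTF-8."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def matches_published_prompt(published_prompt: str, attested_digest: str) -> bool:
    """Check that the digest attested by the production system matches the published prompt.

    `attested_digest` is assumed to come from a hypothetical attestation endpoint whose
    hardware signature has already been verified.
    """
    expected = system_prompt_digest(published_prompt)
    return hmac.compare_digest(expected, attested_digest)
```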


Additional benefits

Beyond addressing problems with existing evidence of spec adherence, these actions could also improve spec adherence and transparency because:

  • Publication of incidents and spec adherence processes could help other developers improve their own processes. 
  • Describing intended spec adherence processes can help developers coordinate around implementing them. A developer might only implement a costly process if it knows that others are trying to do the same. For example, hardware-signed attestation could allow external parties to verify that the production system prompt matches the public system prompt, but might incur a performance cost. 
  • Mismatches between intended and implemented spec adherence processes could be evidence of spec adherence effort. Developers will likely deviate from plans given the rapidly evolving state of AI science, but deviations without sufficient justification could indicate a lack of appropriate effort from the developer. 

These actions could also provide commercial benefits by signalling trustworthiness to consumers, especially for high-stakes adoption decisions in domains such as health care or government. Competition based on this signal could improve spec adherence, and transparency into it, across the industry. Analogously, Anthropic's publication of its Responsible Scaling Policy likely motivated other developers to publish similar frameworks. 

How does spec adherence fit into existing policies?

Existing policies push developers to disclose some evidence of spec adherence but likely do not cover all spec behaviours. 

US federal procurement

A recent US Executive Order requires government-procured LLMs to be “truthful in responding to user prompts seeking factual information or analysis” and “transparent about ideological judgments”. Guidance from the Office of Management and Budget provides examples of information that agencies can request from vendors, including “Actions undertaken that would impact the factuality and grounding of LLM outputs”. Adherence processes for factuality and truthfulness could plausibly count as such actions. That said, the Executive Order and guidance document do not cover behaviours beyond truthfulness and ideological neutrality. 

EU AI Act

Measure 7.1 in the Safety and Security Chapter of the General-Purpose AI Code of Practice requires “providers” (which includes developers that place their model on the EU market) to submit model specs to the AI Office. Providers could also have a duty to document adherence to spec behaviours that are related to systemic risks (e.g., being non-deceptive and non-manipulative is related to harmful manipulation under Appendix 1.4). This duty would include describing risk thresholds, evaluations, mitigations, and serious incidents to the AI Office. 

This duty likely does not exist for behaviours that are not related to any systemic risks identified pursuant to Measure 2.2. For example, failure to follow the law is not a specified systemic risk under Appendix 1.4, meaning that Signatories need not assess or mitigate it. 


Limitations and open questions

Model specs themselves have limitations. Not all developers have published a model spec. Because a published spec is unlikely to disclose objectionable behaviours (e.g. secret loyalties), spec adherence does not guarantee their absence. Furthermore, model behaviours are only one factor in the impacts of AI. Specs do not cover, for example, how developers plan to respond to potential economic disruption.  

My proposals also have more specific limitations. It could be difficult to verify whether spec adherence processes are followed or whether spec adherence assessments are accurate. Developers could misrepresent information for reputational gain or make honest mistakes. Adherence could also be easier to evaluate for some behaviours than others: never generating certain content is more straightforward to assess than balancing two competing principles. 

Even accurate information may not be useful. The most significant processes, such as details of training data generation, may be too commercially sensitive to publish. It may not be clear how to interpret spec adherence assessments: both absolute numbers (e.g. 95% adherence) and relative differences (e.g. 95% vs 98% adherence) could fail to be meaningful, and low scores might reflect the inherent difficulty of instilling the behaviour rather than a lack of effort. External actors may also lack the time or technical background to engage with the evidence. Finally, different developers' evaluations may not be comparable even for nominally identical behaviours. 

A norm of publishing adherence evidence could also lead to worse specs. Pressures to demonstrate adherence could force developers to make behaviours vague or omit those whose adherence is expected to be difficult. Nevertheless, developers might still try to adhere to such behaviours internally and describe such efforts publicly once they have succeeded. 

My analysis has focused on developers, but deployers and downstream developers also have an important role. Because these actors could significantly influence model behaviour (e.g. by prompting or fine-tuning), the public could benefit from understanding how they intend models to behave. Such “downstream specs” and associated adherence evidence could in some respects be more useful than upstream specs. Upstream developers serve a wide array of users and are unlikely to specify domain-specific behaviour (e.g. how to handle certain legal questions), whereas downstream developers have both stronger incentives and more expertise to do so. 

Many open research questions remain:

  • How should producing evidence of spec adherence interact with liability or insurance? 
  • How can we measure spec adherence in a reliable, scalable, and standardised way, especially when intended behaviours conflict or are ambiguous?
  • What should published spec adherence processes include?  
  • When should publication of spec adherence processes and assessments be required, if at all? What should such requirements look like?
  • How can third parties help to provide assurance of spec adherence? 
  • Should deployers and downstream developers write model specs? 
  • How does increased model personalisation affect the usefulness of model specs and spec adherence?


Conclusion 

For model specs to be useful, external actors need to know whether models adhere to them. To provide this information, I propose that developers: 

  • Regularly report on the extent to which models adhere to their specs, by
    • Including evaluations for spec behaviours in system cards
    • Regularly publishing post-deployment spec adherence results
    • Documenting and compiling significant spec violations
  • Describe intended and implemented processes for spec adherence, by
    • Describing processes that are specific to a behaviour, such as post-deployment metrics to be tracked
    • Describing processes that aid adherence in general, such as whether a designated employee is responsible for spec adherence 

Such actions could improve spec adherence, provide transparency into adherence, and support developers’ commercial interests. 

Acknowledgments 

I'm grateful to the following people for feedback that greatly improved this piece: Tom Reed, Ben Garfinkel, Jonas Freund, Markus Anderljung, Noemi Dreksler, Amelia Michael, Vinay Hiremath, Sophie Williams, Aidan Homewood, John Halstead, Connor Aidan Stewart Hunter, Kamilė Lukošiūtė, Liam Patell, Hamish Hobbs, Matthew van der Merwe, James Coates, John Lidiard, Lujain Ibrahim, Herbie Bradley, Jason Wolfe, Lukas Finnveden, Tom Davidson, Girish Sastry, Elias Groll.

Footnotes

1 Although system prompts can also describe intended model behaviour, the two concepts remain distinct. System prompts usually include operational details unrelated to intended model behaviour (e.g. the tools available to the model). Model specs are usually intended for human readers and may not be accessible to the model during inference. 

So far, only OpenAI and Anthropic have published model specs. Although Sparrow's rules are arguably a model spec, it is unclear whether they still apply to Google DeepMind's models.

2 Although the term of art is “model spec”, I use “model” interchangeably with “AI system”. 

3 Model specs sometimes directly shape model behaviour, such as through the training process. Even when they do not, they still describe an intention that indirectly shapes a company's decisions about which behaviours to train for.

Further reading