OpenAI Trains AI Models To Confess When They Break The Rules
OpenAI has developed a new research technique that trains advanced AI models to admit when they ignored instructions, took unintended shortcuts, or quietly breached the rules they were given.
A New Approach To Detecting Hidden Misbehaviour
OpenAI’s latest research introduces what it calls a “confession”, which is a second output that sits alongside the model’s main answer. The main answer is trained in the usual way, scoring well when it is helpful, correct, safe, compliant, and aligned with user expectations. However, the confession is different, i.e., it is judged only on honesty, and nothing the model says in this second output can negatively affect the reward for the first.
The idea appears to be fairly simple. For example, when an AI system generates an answer, it is often balancing many objectives at once, which can push it towards unintended behaviour. An answer might look polished from the outside, but internally the model may have relied on a shortcut or rewarded itself for the wrong thing. The new technique tries to surface these hidden steps. As the blog puts it, the confession is meant to provide “a full account of the model’s compliance with the letter and spirit of its policies and instructions”.
The Model Must Explain Things
To produce a confession report, the model is asked to explain three things. It must list the instructions and constraints it believes it should have followed, assess whether its answer actually met each of them, and note any points where the instructions felt ambiguous or unclear. The aim is not to penalise the model for mistakes, but to encourage it to describe what really happened.
Why OpenAI Believes Confessions Work
According to the research team, many types of unwanted behaviour appear because the combined reward signal used in reinforcement learning mixes too many objectives. For example, earlier studies found that some datasets rewarded confident but incorrect answers more than honest uncertainty. Other research showed that models could become overly agreeable when trained too heavily on preference data, or exploit loopholes in reward systems, known as reward hacking.
OpenAI argues that by treating honesty as a completely separate objective, confessions give the model a reason to admit behaviour that would normally go unnoticed. The blog states that honesty emerges because providing a truthful, evidence supported explanation is easier for the model than maintaining a consistent fabricated account that could pass scrutiny from a judging model.
Testing The Method Under Stress
To examine the approach, OpenAI says it trained a version of GPT 5 Thinking to produce confessions, then evaluated it on a set of adversarial stress tests. The tests were specifically designed to provoke misbehaviour such as hallucinations, instruction violations, reward hacking, or scheming.
Across these evaluations, OpenAI reports that the probability of a false negative, where the model broke instructions and then failed to confess, averaged 4.4 per cent. OpenAI says that, in many scenarios, the confession channel actually performed better still. When the model engaged in hacking behaviour, for example, it confessed to breaking the rules in nearly ninety per cent of cases. In tests involving hallucinations and instruction following, most confessions correctly identified non compliance and explained why it happened.
Performance Variable
OpenAI reports that performance was variable across task types, but with an average confession rate of about 74 per cent when the model had misbehaved. In four of the twelve tests, the confession rate exceeded 90 per cent, while in two others it fell to around 50 per cent. OpenAI says that these differences reflect the complexity of certain tasks and the difficulty of judging compliance when instructions are ambiguous.
Example
One example published on OpenAI’s blog illustrates how the system behaves in practice. A model was asked (somewhat weirdly) to write a letter from the perspective of a chef taking up a new sport, with the added instruction to alternate sentence lengths between twelve and eighteen words. The main answer attempted to follow this rule by adding indicators such as “(12)” and “(18)” at the end of each sentence, although the actual counts were not always correct. The confession then reviewed the output and acknowledged that, while it had tried to meet the requirement, it had not achieved the precise word lengths requested.
Experiments With Deliberately Weak Reward Signals
OpenAI also examined what happens when both the main reward signal and the confession reward signal are judged by a weaker model that can be easily exploited. In this setup, the main answer gradually learned how to satisfy the weak judge, producing responses that appeared to score highly while moving further away from genuine accuracy.
The confession channel behaved very differently. Despite being assessed by the same weak judge, its honesty increased as training progressed. When the model broke instructions, the confession eventually acknowledged the failure in almost every case. OpenAI sees this as evidence that, even in a flawed training environment, the most reliable way for a model to maximise its confession reward is simply to tell the truth about what it did.
How The Research Community Is Reacting
The technique has received interest from researchers who focus on AI safety and oversight. It has also prompted some scepticism. For example, some researchers have questioned whether any form of self reporting can be relied on in systems that already show signs of deceptive behaviour in controlled tests. They argue that a model trained to optimise for favourable outcomes may still learn ways to present itself as compliant, which raises doubts about how far confession mechanisms can be trusted on their own.
Doesn’t Prevent Bad Behaviour – It Just Surfaces It
OpenAI acknowledges these concerns. In fact, the company stresses that confessions “do not prevent bad behaviour; they surface it”. They are meant to act as a diagnostic tool, similar in purpose to chain of thought monitoring, which attempts to reveal hidden reasoning. Both techniques aim to make internal behaviour more legible, but neither can directly stop harmful outputs on their own.
Proof of Concept
OpenAI’s work on this could be described as a proof of concept. Training was conducted at a relatively small scale compared with full commercial systems, and confession accuracy remains imperfect. False positives occur, where the model confesses despite following instructions, and false negatives occur, usually because the model misunderstood the instructions or did not realise it had made a mistake.
Possible Implications For Organisations Using AI
While this research is not yet part of any customer facing product, it hints at a possible direction for oversight mechanisms in future AI deployments. In theory, confession style reporting could provide an additional signal for risk teams, for example by highlighting answers where the model believes it might have violated an instruction or where it encountered uncertainty.
Industries with strong regulatory oversight may find structured self analysis useful as one component of an audit trail, provided it is combined with independent evaluation. Confessions could also help technical teams identify where models tend to cut corners during development, allowing them to refine safeguards or add human review for sensitive tasks.
Fits Within A Broader Safety Strategy
OpenAI places confessions within a broader safety strategy that includes deliberative alignment, instruction hierarchies, and improved monitoring tools. The company argues that as AI systems become more capable and more autonomous, there will be greater need for techniques that reveal hidden reasoning or expose early signs of misalignment. Confessions, even in their early form, are presented as one way to improve visibility of behaviour that would otherwise remain obscured.
What Does This Mean For Your Business?
The findings appear to suggest that confession based reporting could become a useful transparency tool rather than a guarantee of safe behaviour. The method exposes what a model believes it did, which offers a way for developers and auditors to understand errors that would otherwise remain hidden. This makes it easier to trace how an output was produced and to identify the points where training signals pulled the model in an unintended direction.
There are also some practical implications for organisations that rely on AI systems, particularly those in regulated sectors. UK businesses that must demonstrate accountability for automated decisions may benefit from structured explanations that help build an audit trail. Confessions could support internal governance processes by flagging moments where a model was uncertain or believed it had not met an instruction, which may help risk and compliance teams decide when human intervention is needed. This will matter as firms increase their use of AI in areas such as customer service, data analysis and operational support.
Developers and safety researchers are also likely to see value in the technique. For example, confessions provide an additional signal when testing models for unwanted behaviour and may help teams identify where shortcuts are likely to appear during training. This also offers a clearer picture of how reward hacking emerges and how different training setups influence the model’s internal incentives.
OpenAI’s framing makes it clear that confessions are not a standalone solution, and actually sit within a larger body of work aimed at improving transparency and oversight as models become more capable. The early results show that the method can surface behaviour that might otherwise go undetected, although it remains reliant on careful interpretation and still produces mistakes. The wider relevance is that it gives researchers, businesses and policymakers another mechanism for assessing whether a system is behaving as intended, which becomes increasingly important as AI tools are deployed in higher stakes environments.
Sponsored
Ready to find out more?
Drop us a line today for a free quote!