OpenAI o3 and o4-mini Reasoning Models Show Higher Hallucination Rates in Internal Testing

By Ze Research Writer Β· 8 min read

OpenAI's newly released o3 and o4-mini reasoning models demonstrate increased hallucination rates compared to their predecessors, according to the company's own system card documentation, raising questions about the tradeoffs inherent in advanced reasoning architectures.

Executive Brief

OpenAI released its o3 and o4-mini reasoning models on April 16, 2025, accompanied by system card documentation that reveals a counterintuitive finding: these advanced reasoning models hallucinate more frequently than their predecessors. According to OpenAI's internal benchmarks, the o3 model produces hallucinated responses 33% of the time on the PersonQA benchmark, compared to 16% for the earlier o1 model. The o4-mini variant shows a 48% hallucination rate on the same benchmark.

The disclosure affects developers and enterprises building applications on OpenAI's API, as well as researchers studying the reliability of large language models. Organizations deploying these models for tasks requiring factual accuracy face new considerations about verification and guardrails.

OpenAI attributed the increased hallucination rates to the models' extended reasoning processes. The company stated that longer chains of thought, while improving performance on complex reasoning tasks, can introduce more opportunities for the model to generate plausible but incorrect information. The system card notes that the models perform better on certain benchmarks measuring reasoning capability while simultaneously showing degraded performance on factual accuracy metrics.

The release comes as the AI industry grapples with the fundamental tension between model capability and reliability. OpenAI's transparency in publishing these metrics represents a departure from typical product launches, where limitations are often downplayed. The company has recommended that developers implement additional verification layers when using these models for applications where factual accuracy is critical.
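
As an illustration of that recommendation, the sketch below wraps a reasoning-model call in an explicit verification step using the official openai Python package. The "o3" model identifier and the verify_against_sources() helper are assumptions for illustration; the helper stands in for whatever retrieval-based or human fact-checking step an organization actually uses.

```python
# Minimal sketch of a verification layer around a reasoning-model call.
# Assumes the official `openai` Python package and API access to a model
# exposed as "o3"; verify_against_sources() is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def verify_against_sources(answer: str) -> bool:
    """Hypothetical stub: replace with retrieval-based or human verification."""
    return False  # default to "unverified" so downstream code must handle it


def ask_with_verification(question: str) -> dict:
    """Query the reasoning model, then flag the answer for external checking."""
    response = client.chat.completions.create(
        model="o3",  # assumed model identifier; use whatever your account exposes
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content

    # Do not trust the model's own confidence; run an external check instead.
    return {"answer": answer, "verified": verify_against_sources(answer)}
```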

What Happened

On April 16, 2025, OpenAI made the o3 and o4-mini models available through its API. The company simultaneously published a system card detailing the models' capabilities and limitations.

The system card included benchmark results from PersonQA, a dataset designed to measure how often models generate false information about real people. OpenAI reported that o3 achieved a 33% hallucination rate on this benchmark, while o4-mini reached 48%. For comparison, the o1 model released in late 2024 showed a 16% hallucination rate on the same test.

OpenAI's documentation explained that the reasoning models use an extended "chain of thought" process, spending more computational cycles working through problems before generating responses. The company stated that this approach improves performance on mathematical reasoning, coding tasks, and complex multi-step problems.

The system card acknowledged that the extended reasoning process creates additional surface area for hallucinations. When models generate longer internal reasoning chains, each step introduces potential for error propagation.
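
A toy calculation illustrates why longer chains can hurt: if each reasoning step carried some independent chance of going wrong, the probability that a chain contains at least one error compounds with its length. The 2% per-step figure below is an arbitrary assumption for illustration, not a measurement of OpenAI's models or its evaluation methodology.

```python
# Toy illustration (not OpenAI's methodology): if each reasoning step has an
# independent per-step error probability p, the chance that a chain of n steps
# contains at least one error is 1 - (1 - p)**n, which grows with chain length.
def chain_error_probability(p_step: float, n_steps: int) -> float:
    """Probability that at least one of n independent steps goes wrong."""
    return 1 - (1 - p_step) ** n_steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps, 2% per-step error -> "
              f"{chain_error_probability(0.02, n):.1%} chance of >=1 error")
```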

Key Claims and Evidence

OpenAI's system card presents several technical claims supported by benchmark data:

The o3 model achieves state-of-the-art performance on ARC-AGI, a benchmark designed to measure general reasoning capability. OpenAI reported that o3 scored 87.5% in a high-compute configuration and 75.7% at lower compute, well above o1's results on the same test.

On coding benchmarks, o3 demonstrated improvements over previous models. OpenAI reported a 71.7% pass rate on the SWE-bench Verified dataset, which tests the ability to resolve real GitHub issues.

The PersonQA hallucination metrics show a clear regression. OpenAI's documentation states that o3's 33% hallucination rate is roughly double o1's 16%, and o4-mini's 48% rate is roughly three times o1's baseline.

OpenAI noted that the models show improved performance on "hard" reasoning tasks while showing degraded performance on tasks requiring factual recall. The company characterized this as a tradeoff inherent to the current reasoning architecture.

Pros and Opportunities

The o3 and o4-mini models offer significant improvements for specific use cases. Developers working on mathematical reasoning applications can leverage the models' enhanced problem-solving capabilities. The improved ARC-AGI scores suggest better performance on novel reasoning tasks that require generalization.

Software development teams using AI-assisted coding tools stand to benefit from the improved SWE-bench performance. The models' ability to understand and resolve complex code issues could accelerate development workflows.

Researchers studying AI reasoning have new models to analyze. The extended chain-of-thought architecture provides more interpretable intermediate steps, potentially offering insights into how the models approach problems.

Organizations with robust verification pipelines can deploy these models for reasoning-heavy tasks while implementing checks for factual accuracy. The models' strengths in logical reasoning can complement human review processes.

Cons, Risks, and Limitations

The elevated hallucination rates present significant risks for applications requiring factual accuracy. Customer service systems, research assistants, and information retrieval applications face increased exposure to incorrect outputs.

The PersonQA benchmark specifically measures hallucinations about real people, a category with legal and reputational implications. Applications that generate biographical information or make claims about individuals face heightened risk.

The o4-mini model's 48% hallucination rate on PersonQA suggests that nearly half of its responses about people contain fabricated information. Developers cannot rely on the model's confidence signals to identify these errors.

The tradeoff between reasoning capability and factual accuracy creates deployment challenges. Organizations must choose between models optimized for different tasks or implement complex routing logic to direct queries to appropriate models.
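
A minimal sketch of such routing logic is shown below. The model identifiers and the keyword heuristic are illustrative assumptions, not a recommended production design; real deployments would likely use a classifier or per-task configuration rather than string matching.

```python
# Illustrative routing sketch: send reasoning-heavy queries to a reasoning
# model and factual-recall queries to a model with a lower measured
# hallucination rate. Model names and keywords are assumptions.
REASONING_MODEL = "o3"     # assumed identifier for the reasoning model
FACTUAL_MODEL = "gpt-4o"   # assumed identifier for a lower-hallucination model

REASONING_HINTS = ("prove", "derive", "debug", "step by step", "optimize")


def route_query(query: str) -> str:
    """Pick a model based on a crude keyword heuristic."""
    text = query.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return REASONING_MODEL
    return FACTUAL_MODEL


print(route_query("Derive the closed-form solution step by step"))  # -> o3
print(route_query("When was Ada Lovelace born?"))                   # -> gpt-4o
```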

Cost considerations compound these challenges. The extended reasoning process requires more computational resources, increasing API costs for high-volume applications.

How the Technology Works

OpenAI's reasoning models employ an extended chain-of-thought architecture that differs from standard language model inference. When processing a query, the model generates an internal reasoning trace before producing its final response.

The reasoning trace consists of intermediate steps that break down complex problems into smaller components. For mathematical problems, this might include identifying relevant formulas, substituting values, and checking intermediate results. For coding tasks, the trace might include analyzing requirements, identifying relevant code patterns, and planning implementation steps.

The extended reasoning process increases the number of tokens generated internally, even when the final response is concise. OpenAI's documentation indicates that o3 can generate thousands of internal tokens for complex queries.
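
Because those internal reasoning tokens are billed alongside the visible output, a short answer can still carry a meaningful cost. The sketch below is a back-of-the-envelope estimate with placeholder per-token prices, not OpenAI's actual rates; substitute current figures from the pricing page.

```python
# Back-of-the-envelope cost sketch: hidden reasoning tokens are counted as
# output tokens, so a concise visible answer can still be expensive.
# The prices used here are placeholders, not OpenAI's published pricing.
def estimate_cost(prompt_tokens: int, reasoning_tokens: int,
                  visible_tokens: int, price_per_1k_output: float,
                  price_per_1k_input: float) -> float:
    """Rough dollar cost for a single reasoning-model call."""
    output_tokens = reasoning_tokens + visible_tokens
    return (prompt_tokens / 1000) * price_per_1k_input + \
           (output_tokens / 1000) * price_per_1k_output


# Example: 500-token prompt, 4,000 hidden reasoning tokens, 300-token answer,
# with placeholder prices of $0.01/1K input and $0.04/1K output tokens.
print(f"${estimate_cost(500, 4000, 300, 0.04, 0.01):.3f}")  # -> $0.177
```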

The hallucination increase appears connected to this extended generation process. Each step in the reasoning chain represents an opportunity for the model to introduce errors. When the model generates plausible but incorrect intermediate conclusions, these errors can propagate through subsequent reasoning steps.

Technical context: The PersonQA benchmark presents the model with questions about real people and evaluates whether responses contain verifiable facts or fabricated information. A 33% hallucination rate means that roughly one-third of responses contain at least one false claim presented as fact.
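
For concreteness, the sketch below shows how a PersonQA-style rate could be tallied once responses have already been graded for fabricated claims. It illustrates the metric's definition only; it is not OpenAI's evaluation code, and the sample grades are invented.

```python
# Minimal sketch of tallying a PersonQA-style hallucination rate from
# responses already graded (by humans or an automated checker) for whether
# they contain at least one fabricated claim. Illustrative only.
from dataclasses import dataclass


@dataclass
class GradedResponse:
    question: str
    answer: str
    contains_fabrication: bool  # grader's verdict


def hallucination_rate(graded: list[GradedResponse]) -> float:
    """Fraction of responses containing at least one fabricated claim."""
    if not graded:
        return 0.0
    return sum(r.contains_fabrication for r in graded) / len(graded)


# Toy example with made-up grades: 1 of 3 responses flagged -> ~33%.
sample = [
    GradedResponse("Where was person X born?", "Lisbon", False),
    GradedResponse("What did person Y study?", "Astrophysics at MIT", True),
    GradedResponse("Who employs person Z?", "A public university", False),
]
print(f"{hallucination_rate(sample):.0%}")  # -> 33%
```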

Broader Industry Implications

OpenAI's disclosure reflects an emerging tension in AI development between capability and reliability. As models become more sophisticated at complex reasoning, they may simultaneously become less reliable for straightforward factual queries.

The transparency of publishing detailed hallucination metrics sets a precedent for the industry. Competitors including Anthropic, Google, and Meta face implicit pressure to provide comparable disclosures about their reasoning models.

Enterprise adoption of AI systems depends on predictable behavior. The documented tradeoffs in OpenAI's reasoning models complicate procurement decisions for organizations evaluating AI vendors.

The findings have implications for AI safety research. Understanding why extended reasoning increases hallucination rates could inform the development of more reliable architectures.

Regulatory discussions around AI transparency may reference OpenAI's system card approach as a model for disclosure requirements. The detailed benchmark reporting provides a template for what meaningful AI transparency might look like.

What Is Confirmed vs. What Remains Unclear

Confirmed:

  • OpenAI's o3 model shows a 33% hallucination rate on PersonQA, compared to 16% for o1
  • The o4-mini model shows a 48% hallucination rate on the same benchmark
  • Both models demonstrate improved performance on reasoning benchmarks including ARC-AGI
  • OpenAI attributes the increased hallucinations to the extended chain-of-thought architecture

Unclear:

  • Whether the hallucination increase is fundamental to reasoning architectures or specific to OpenAI's implementation
  • How the models perform on hallucination benchmarks beyond PersonQA
  • Whether future iterations can reduce hallucinations while maintaining reasoning improvements
  • The specific mechanisms by which extended reasoning chains introduce errors

What to Watch Next

OpenAI's competitors are developing their own reasoning models. Benchmark comparisons between these systems will reveal whether the hallucination tradeoff is universal or specific to OpenAI's approach.

Developer adoption patterns will indicate how the market weighs reasoning capability against factual reliability. API usage data and customer feedback will shape future model development priorities.

Academic research on the relationship between chain-of-thought reasoning and hallucination rates may produce insights applicable across model architectures.

OpenAI's response to the hallucination findings in subsequent model releases will signal the company's priorities. Improvements to factual accuracy in future reasoning models would suggest the tradeoff is addressable.

Enterprise deployment case studies will demonstrate practical strategies for managing hallucination risk while leveraging reasoning capabilities.

Sources

  1. TechCrunch - "OpenAI's new reasoning AI models hallucinate more" (April 18, 2025) https://techcrunch.com/2025/04/18/openais-new-reasoning-ai-models-hallucinate-more/

  2. The War Room - "OpenAI O3/O4-Mini Models: More Hallucinations, Weird Geolocation Prowess" (April 18, 2025) https://thewarroom.news/cluster/2859

  3. OpenAI System Card - o3 and o4-mini (April 2025) https://openai.com/index/o3-o4-mini-system-card/

Related Topics

artificial-intelligence Β· openai Β· reasoning-models Β· hallucination Β· ai-safety