OpenAI's o3 Is Here: A Reasoning Model That Thinks Before It Speaks

OpenAI's o3 model introduces variable compute at inference time — letting it 'think longer' on hard problems. Early results suggest a step-change in complex reasoning.

AI Newspaper Today··2 min read
OpenAI's o3 Is Here: A Reasoning Model That Thinks Before It Speaks

OpenAI has released o3, the latest in its reasoning-first model family, and it represents a fundamentally different approach to AI problem-solving. Unlike traditional LLMs that generate tokens left-to-right in a single pass, o3 can allocate additional compute at inference time to work through problems step-by-step before producing a final answer.

The Core Idea: Compute-Scaled Reasoning

OpenAI describes o3 as a model that can "think at variable depth." For simple questions, it responds instantly. For problems requiring multi-step deduction — a math proof, a debugging session, a legal analysis — it can spend seconds or minutes reasoning internally before responding.

This is achieved through what the company calls Extended Chain-of-Thought training, where the model was trained on verified reasoning traces rather than just final answers.

What Makes o3 Different From o1

  • Longer internal reasoning chains — o3 can sustain reasoning across 10,000+ tokens internally
  • Self-correction — the model backtracks when it detects contradictions in its own chain
  • Tool-augmented reasoning — o3 can call code execution, search, and calculators mid-thought
  • Configurable compute budget — the API exposes a reasoning_effort parameter: low, medium, high

Key Benchmark Scores

| Task | o3 | o1 | GPT-4o | |---|---|---|---| | AIME 2025 | 96.7% | 83.3% | 9.3% | | SWE-bench Verified | 71.7% | 48.9% | 38.8% | | GPQA Diamond | 87.7% | 78.3% | 53.6% | | ARC-AGI | 87.5% | 32.0% | 5.0% |

Pricing and Access

o3 is available via the OpenAI API today. Pricing is $15 per million input tokens and $60 per million output tokens at high reasoning effort — reflecting the additional compute. A lighter o3-mini variant targeting coding and math is expected later this quarter at significantly lower cost.


The ARC-AGI result — a benchmark specifically designed to resist memorization — is the number that has the research community talking most. Whether o3 represents a genuine step toward general reasoning or an impressive but narrow capability remains the central debate.

Discussion

Comments are not configured yet.

Set up Giscus and add your environment variables to enable discussions.

Related Articles