Reasoning model
| Reasoning model | |
|---|---|
| Developers | OpenAI, Anthropic, Google DeepMind, Mistral AI |
| Initial release | 2024 |
| Available in | Multilingual |
| Type | Large language model |
| License | Proprietary and open weights |
A reasoning model, also known as a reasoning language model (RLM) or a large reasoning model (LRM), is a type of large language model (LLM) trained to solve complex tasks that require multiple steps of logical reasoning.[1] These models perform better on logic, mathematics, and programming tasks than standard LLMs. They can revisit and revise earlier reasoning steps and use additional computation during inference as a way to scale performance, complementing traditional scaling based on training data size, model parameters, and training compute.[2]
Overview
Unlike traditional language models that generate responses immediately, reasoning models allocate additional computation, or "thinking" time, before producing an answer to a multi-step problem. OpenAI introduced this terminology in September 2024 when it released the o1 series, describing the models as designed to "spend more time thinking" before responding. The company framed o1 as a reset in model naming that targets complex tasks in science, coding, and mathematics, and it contrasted o1's performance with GPT-4o on benchmarks such as AIME and Codeforces. Independent reporting the same week summarized the launch and highlighted OpenAI's claim that o1 automates chain-of-thought-style reasoning to achieve large gains on difficult exams.[3][4][5]
In operation, reasoning models generate internal chains of intermediate steps, then select and refine a final answer. OpenAI reported that o1's accuracy improves as the model is given more reinforcement learning during training and more test-time compute at inference. The company initially chose to hide raw chains and instead return a model-written summary, stating that it "decided not to show" the underlying thoughts so researchers could monitor them without exposing unaligned content to end users. Commercial deployments document separate "reasoning tokens" that meter hidden thinking and a control for "reasoning effort" that tunes how much compute the model uses. These features make the models slower than ordinary chat systems while enabling stronger performance on difficult problems.[4][6]
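A minimal sketch of how these controls appear to a developer, assuming the OpenAI Python SDK's `reasoning_effort` parameter and reasoning-token usage fields (parameter and field names differ across providers and SDK versions):

```python
# Sketch of calling a reasoning model with an explicit "reasoning effort" setting
# and reading back the metered reasoning tokens. Assumes the OpenAI Python SDK;
# exact names may vary by provider and version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",                 # a reasoning model
    reasoning_effort="high",         # how much hidden "thinking" compute to allocate
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

print(response.choices[0].message.content)          # user-visible answer
# The hidden chain of thought is metered separately as "reasoning tokens":
print(response.usage.completion_tokens_details.reasoning_tokens)
```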
History
The research trajectory toward reasoning models combined advances in supervision, prompting, and search-style inference.
Early alignment work on reinforcement learning from human feedback showed that models can be fine-tuned to follow instructions using human feedback and preference-based rewards.[7][8] In 2022, Google Research scientists Jason Wei and Denny Zhou showed that chain-of-thought prompting "significantly improves the ability" of large models on complex reasoning tasks.[9]
A companion result demonstrated that the simple instruction "Let's think step by step" can elicit zero-shot reasoning.[10] Follow-up work introduced self-consistency decoding, which "boosts the performance" of chain-of-thought by sampling diverse solution paths and choosing the consensus, and tool-augmented methods such as ReAct, a portmanteau of Reason and Act, that prompt models to "generate both reasoning traces" and actions.[11][12] Research then generalized chain-of-thought into search over multiple candidate plans. The Tree-of-Thoughts framework from Princeton computer scientist Shunyu Yao and colleagues proposes that models "perform deliberate decision making" by exploring and backtracking over a tree of intermediate thoughts.[13]
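As an illustration, self-consistency decoding can be sketched as follows; the sampler and answer parser are passed in as placeholder callables rather than tied to any particular model:

```python
# Sketch of self-consistency decoding: sample several chain-of-thought completions
# and return the most common final answer. `sample_fn` (one completion at nonzero
# temperature) and `parse_answer` (extract the final answer) are placeholder
# callables, not a specific model API.
from collections import Counter
from typing import Callable

def self_consistency(prompt: str,
                     sample_fn: Callable[[str], str],
                     parse_answer: Callable[[str], str],
                     n_samples: int = 20) -> str:
    answers = [parse_answer(sample_fn(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # consensus answer
```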
OpenAI's reported breakthrough focused on supervising reasoning processes rather than only outcomes, with Lightman et al.'s Let's Verify Step by Step reporting that rewarding each correct step "significantly outperforms outcome supervision" on challenging math problems and improves interpretability by aligning the chain-of-thought with human judgment.[14][15] OpenAI's o1 announcement ties these strands together with a large-scale reinforcement learning algorithm that trains the model to refine its own chain of thought, and it reports that accuracy rises with more training compute and more time spent thinking at inference.[4]
Together, these developments define the core of reasoning models. They use supervision signals that evaluate the quality of intermediate steps, they exploit inference-time exploration such as consensus or tree search, and they expose controls for how much internal thinking compute to allocate. OpenAI's o1 family made this approach available at scale in September 2024 and popularized the label "reasoning model" for LLMs that deliberately think before they answer.[3][6]
The development of reasoning models illustrates Richard S. Sutton's "bitter lesson" that scaling compute typically outperforms methods based on human-designed insights.[16] This principle was demonstrated by researchers at the Generative AI Research Lab (GAIR), who initially attempted to replicate o1's capabilities using sophisticated methods including tree search and reinforcement learning in late 2024. Their findings, published in the o1 Replication Journey series, revealed that knowledge distillation, a comparatively straightforward technique that trains a smaller model to mimic o1's outputs, produced unexpectedly strong performance. This outcome illustrated how direct scaling approaches can, at times, outperform more complex engineering solutions.[17][18]
Drawbacks
Reasoning models require significantly more computational resources during inference compared to non-reasoning models. Research on the American Invitational Mathematics Examination (AIME) benchmark found that reasoning models were 10 to 74 times more expensive to operate than their non-reasoning counterparts.[19] The extended inference time is attributed to the detailed, step-by-step reasoning outputs that these models generate, which are typically much longer than responses from standard large language models that provide direct answers without showing their reasoning process.
One researcher argued in early 2025 that these models may be vulnerable to an additional denial-of-service risk in the form of "overthinking attacks."[20]
Releases
2024
In September 2024, OpenAI released o1-preview, a large language model with enhanced reasoning capabilities.[21] The full version, o1, was released in December 2024. OpenAI initially shared preliminary results on its successor model, o3, in December 2024,[22][23][24] with the full o3 model becoming available in 2025.[25]
Alibaba released reasoning versions of its Qwen large language models in November 2024.[26] In December 2024, the company introduced QvQ-72B-Preview, an experimental visual reasoning model.[27]
In December 2024, Google introduced Deep Research in Gemini, a feature designed to conduct multi-step research tasks.[28][29]
On December 16, 2024, researchers demonstrated that by scaling test-time compute, a relatively small Llama 3B model could outperform a much larger Llama 70B model on challenging reasoning tasks. This experiment suggested that improved inference strategies can unlock reasoning capabilities even in smaller models.[30][31]
2025
In January 2025, DeepSeek released R1, a reasoning model that achieved performance comparable to OpenAI's o1 at significantly lower computational cost. The release demonstrated the effectiveness of Group Relative Policy Optimization (GRPO), a reinforcement learning technique used to train the model.[32][33]
On January 25, 2025, DeepSeek enhanced R1 with web search capabilities, allowing the model to retrieve information from the internet while performing reasoning tasks.[34]
Research during this period further validated the effectiveness of knowledge distillation for creating reasoning models. The s1-32B model achieved strong performance through budget forcing and scaling methods, reinforcing findings that simpler training approaches can be highly effective for reasoning capabilities.[35][18]
On February 2, 2025, OpenAI released Deep Research, a feature powered by their o3 model that enables users to conduct comprehensive research tasks.[36] The system generates detailed reports by automatically gathering and synthesizing information from multiple web sources.[36]
Training
Reasoning models follow the familiar large-scale pretraining used for frontier language models, then diverge in post-training and optimization. OpenAI reports that o1 is trained with a large-scale reinforcement learning algorithm that teaches the model to use and refine a chain of thought before answering. The company emphasizes two coupled levers, more reinforcement learning during training and more time spent thinking at inference, and it documents smooth gains as each increases. OpenAI also states that it decided not to show raw chains to end users and instead returns a model-written summary, a product choice tied to safety monitoring and competitive concerns.[4]
A central ingredient is process supervision, which rewards intermediate steps rather than only the final answer. OpenAI's study introduced a process reward model trained on step-level labels and found that process supervision significantly outperforms outcome-only supervision on challenging math problems. The project also released the PRM800K step-level feedback dataset and argued that process-level rewards improve interpretability because humans can check each step. These results supplied a practical recipe for supervising chains of thought that was later scaled into production training.[15]
This training differs in important ways from traditional frontier models that do not target reasoning. Standard systems are pretrained on internet-scale corpora with a next-token prediction objective, then aligned through instruction tuning and preference optimization. The canonical InstructGPT recipe first uses supervised fine-tuning on human demonstrations, then trains a reward model from pairwise preferences, and finally optimizes the policy with reinforcement learning, typically PPO with a KL penalty.[8][37] Variants such as direct preference optimization remove the explicit RL step and optimize the model directly on preference data, but the supervision target is still the final outcome judged by raters rather than the quality of internal steps.[38] Technical reports for GPT-4 summarize this conventional pipeline as next-token pretraining followed by RLHF-style post-training to shape behavior.[39]
In contrast, reasoning models are optimized to produce, critique, and revise multi-step chains during training. OpenAI states that reinforcement learning is applied to the chain itself, which teaches the model to recognize mistakes, break problems into simpler steps, and switch strategies when the current approach fails. OpenAI also documents that it hides chains at inference and returns an answer that summarizes useful ideas from the internal trace. These design choices reflect the model's training objective and its intended monitoring.[4]
Zelikman et al. introduced STaR (Self-Taught Reasoner), which bootstraps rationales by generating and filtering chains and then fine-tuning on the retained traces; they reported gains over fine-tuning on final answers alone. These methods supplied additional mechanisms for producing training signals that address intermediate reasoning, not only final answers.[40]
DeepSeek reported R1 and R1-Zero systems trained with pure RL to elicit long chains, self-verification, and reflection, arguing that explicit chain-level rewards can induce general reasoning behaviors. These results indicate that post-training focused on chain quality has become a distinct regime separate from outcome-only alignment.[41]
Supervised fine-tuning
A large language model (LLM) can be fine-tuned on datasets of reasoning tasks paired with step-by-step solution traces. The fine-tuned model learns to produce its own reasoning chains for new problems.[42][43]
Since human-written traces are expensive to collect, researchers use rejection sampling fine-tuning (RFT) to build datasets automatically. This method generates multiple reasoning traces for each prompt, then filters out traces with incorrect final answers using a verifier.[44]
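A sketch of RFT-style dataset construction, with the trace sampler and verifier passed in as placeholder callables:

```python
# Sketch of rejection sampling fine-tuning (RFT) data construction: sample several
# reasoning traces per prompt and keep only those whose final answer the verifier
# accepts. `sample_trace` and `verify` are placeholder callables for the generator
# model and the answer checker.
from typing import Callable, Iterable

def build_rft_dataset(prompts: Iterable[str],
                      sample_trace: Callable[[str], str],
                      verify: Callable[[str, str], bool],
                      k: int = 8) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            trace = sample_trace(prompt)        # one sampled reasoning chain
            if verify(prompt, trace):           # keep only traces with a correct final answer
                dataset.append((prompt, trace))
    return dataset                              # used as supervised fine-tuning data
```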
Reinforcement learning
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a policy $\pi$. A task prompt is an environmental state $x$, and the model's response to the prompt is an action $y$. The probability that the model responds to prompt $x$ with response $y$ is $\pi(y \mid x)$.
Training a reasoning language model with RL means constructing a reward model to guide the RL process. Intuitively, the reward says how good a response is for a prompt. For a reasoning task, the reward is high if the response solves the task and low if it does not.
A response $y$ may be broken down into multiple steps, written $y = (y_1, y_2, \dots, y_n)$.
Most recent systems use policy-gradient methods such as Proximal Policy Optimization (PPO), because PPO constrains each policy update with a clipped objective, which stabilizes training for very large policies.[45]
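For reference, the standard PPO clipped surrogate (a generic formulation, not any particular lab's exact objective) can be written as:

```latex
% PPO clipped surrogate commonly used for RLHF-style post-training. Here
% r_t(\theta) is the ratio of new to old policy probabilities for token y_t,
% \hat{A}_t is an advantage estimate, and \epsilon is the clipping range.
% In RLHF variants, a KL penalty toward the pretrained reference policy is
% typically added to the reward.
\[
  L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\Bigl( r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\bigl(r_t(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,\hat{A}_t
      \Bigr)
    \right],
  \qquad
  r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}
                     {\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}.
\]
```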
Outcome reward model
An outcome reward model, or outcome-supervised RM (ORM),[42] gives the reward for a step based on the final answer: $r(x, y_{1:i}) = r(x, y)$. Such models are often called "verifiers".
For tasks with answers that are easy to verify, such as math word problems, the outcome reward can be binary: 1 if the final answer is correct, 0 otherwise.[42] If automatic verification is hard, humans can label answers as correct or not, and those labels can be used to finetune a base model that predicts the human label.[43] For tasks like creative writing, where quality is not simply true or false, one can train a reward model on human ranked preference data, as in reinforcement learning from human feedback.[19] A base model can also be fine-tuned to predict, from a partial thinking trace $(x, y_{1:i})$, whether the final answer will be correct, and this prediction can serve as a binary reward.[42]
The ORM is usually trained with logistic regression, i.e. by minimizing cross-entropy loss.[46]
Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,[19] by taking their minimum,[46] or by other ways of aggregating process rewards. DeepSeek used a simple ORM to train the R1 model.[33]
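A minimal sketch of such aggregation, assuming step rewards in [0, 1]:

```python
# Sketch of building an outcome-level score from per-step process rewards by
# aggregating over a trace, either multiplying the step rewards or taking their
# minimum. Step rewards are assumed to lie in [0, 1].
import math

def orm_from_prm(step_rewards: list[float], method: str = "product") -> float:
    if method == "product":
        return math.prod(step_rewards)   # penalizes any weak step multiplicatively
    if method == "min":
        return min(step_rewards)         # the trace is only as strong as its worst step
    raise ValueError(f"unknown aggregation method: {method}")

# A trace with one dubious step scores low under either aggregation:
print(orm_from_prm([0.9, 0.95, 0.4]))            # ~0.34
print(orm_from_prm([0.9, 0.95, 0.4], "min"))     # 0.4
```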
Process reward model
A process reward model, or process-supervised RM (PRM),[42] gives the reward for a step based only on the steps so far: the reward $r(x, y_{1:i})$ depends on the prompt $x$ and the partial trace $y_{1:i}$, not on the final answer.
Given a partial thinking trace $(x, y_{1:i})$, a human can judge whether the steps so far are correct, without looking at the final answer. This yields a binary reward. Because human labels are costly, a base model can be fine-tuned to predict them.[42] The PRM is usually trained with logistic regression on the human labels, i.e. by minimizing the cross-entropy loss between true and predicted labels.[46]
As an example, a 2023 OpenAI paper collected 800K process labels for 75K thinking traces. A labeler saw a trace and marked each step as "positive" if it moved toward a solution, "neutral" if it was not wrong but did not help, and "negative" if it was a mistake. After the first "negative" label, the labeler stopped on that trace and moved to another. The authors argued that labeling up to the first error was enough to train a capable PRM, even though labeling later steps could give richer signals.[19][47]
To avoid human labels, researchers have proposed methods to create a PRM without human annotation of the intermediate steps. Inspired by Monte Carlo tree search (MCTS), the Math-Shepherd method samples multiple continuations until the end, starting at each reasoning step $y_i$, and sets the reward at that step to either $\frac{\#\{\text{correct answers}\}}{\#\{\text{answers}\}}$ in the case of "soft estimation", or $\mathbb{1}\{\text{at least one answer is correct}\}$ in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.[46] Some work has tried a fully MCTS approach.[48]
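A sketch of the Math-Shepherd-style estimation, with the rollout sampler and answer checker as placeholder callables:

```python
# Sketch of Math-Shepherd-style process labels: from a trace prefix ending at step i,
# sample several completions to a final answer and score that step by how often the
# completions are correct ("soft") or by whether any completion is correct ("hard").
# `rollout_fn` and `is_correct` are placeholder callables for the sampler and checker.
from typing import Callable

def shepherd_step_reward(prefix: str,
                         rollout_fn: Callable[[str], str],
                         is_correct: Callable[[str], bool],
                         n_rollouts: int = 8,
                         mode: str = "soft") -> float:
    outcomes = [is_correct(rollout_fn(prefix)) for _ in range(n_rollouts)]
    if mode == "soft":
        return sum(outcomes) / n_rollouts      # fraction of rollouts reaching a correct answer
    return float(any(outcomes))                # 1.0 if at least one rollout is correct
```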
One can also use an ORM to implicitly construct a PRM, similar to direct preference optimization.[49]
Guided sampling
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of test-time compute scaling ("best-of-N").[43][50]
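A minimal best-of-N sketch, with the generator and ORM scorer passed in as placeholder callables:

```python
# Sketch of best-of-N sampling with an outcome reward model: draw N candidate
# responses and return the one the ORM scores highest. `generate` and `orm_score`
# are placeholder callables for the policy and the trained verifier.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              orm_score: Callable[[str, str], float],
              n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]   # test-time compute scales with n
    return max(candidates, key=lambda resp: orm_score(prompt, resp))
```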
A trained PRM can guide reasoning by a greedy tree search: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response.[51] Beam search, which keeps several of the highest-scoring partial traces at each step instead of only one, performs better than greedy search.
Lookahead search is another tree search method. The policy proposes several next steps, then makes a short rollout for each. If a solution is found during rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.[31]
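A sketch of lookahead search under the same placeholder-callable convention:

```python
# Sketch of PRM-guided lookahead search: at each step, propose several candidate
# next steps, roll each out a short distance, stop early if a rollout reaches a
# solution, and otherwise commit the step whose rollout the PRM scores highest.
# `propose_steps`, `rollout`, `is_solved`, and `prm_score` are placeholder callables.
from typing import Callable

def lookahead_search(prompt: str,
                     propose_steps: Callable[[str], list[str]],
                     rollout: Callable[[str], str],
                     is_solved: Callable[[str], bool],
                     prm_score: Callable[[str], float],
                     max_depth: int = 10) -> str:
    trace = prompt
    for _ in range(max_depth):
        best_step, best_score = None, float("-inf")
        for step in propose_steps(trace):
            continuation = rollout(trace + step)   # short rollout from this candidate step
            if is_solved(continuation):
                return continuation                # stop early if a rollout reaches a solution
            score = prm_score(continuation)
            if score > best_score:
                best_step, best_score = step, score
        if best_step is None:                      # no candidate steps were proposed
            break
        trace += best_step                         # commit the highest-scoring step
    return trace
```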
Self-consistency can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, scores in each cluster are summed, and the answer from the highest-scoring cluster is returned.[46]
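A sketch of this ORM-weighted voting, again with placeholder callables for the generator, answer parser, and verifier:

```python
# Sketch of combining self-consistency with an ORM: cluster sampled responses by
# final answer, sum ORM scores within each cluster, and return the answer of the
# highest-scoring cluster. `generate`, `parse_answer`, and `orm_score` are
# placeholder callables for the policy, answer parser, and verifier.
from collections import defaultdict
from typing import Callable

def weighted_self_consistency(prompt: str,
                              generate: Callable[[str], str],
                              parse_answer: Callable[[str], str],
                              orm_score: Callable[[str, str], float],
                              n: int = 16) -> str:
    cluster_scores: dict[str, float] = defaultdict(float)
    for _ in range(n):
        response = generate(prompt)
        cluster_scores[parse_answer(response)] += orm_score(prompt, response)
    return max(cluster_scores, key=cluster_scores.get)
```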
Benchmarks
Reasoning models generally achieve higher scores than non-reasoning models on many benchmarks, particularly on tasks requiring multi-step reasoning.[52][53][54][55][56][57][58]
The Humanity's Last Exam (HLE) benchmark evaluates expert-level reasoning across mathematics, humanities, and natural sciences, revealing significant performance gaps between models. Current state-of-the-art reasoning models achieve relatively low scores on HLE, indicating substantial room for improvement. For example, the full reasoning model o3 achieved 26.6%,[36] while the lighter o3-mini-high (on text-only questions) achieved 13%.[59]
On the American Invitational Mathematics Examination (AIME), a challenging mathematics competition, non-reasoning models typically solve fewer than 30% of problems. In contrast, models employing reasoning methods achieve success rates between 50% and 80%.[2][33][35] While OpenAI's o1 maintained or slightly improved its accuracy from reported 2024 results to 2025 AIME results, o3-mini-high achieved 80% accuracy at approximately 12 times lower cost.[60]
Some independent benchmarks exclude reasoning models because of their longer response times and higher inference costs; these include benchmarks for online complex event detection in cyber-physical systems, general inference-time compute evaluation, Verilog engineering tasks, and network security assessments.[61][62][63][64]
Models
| Company | Model | Release date |
|---|---|---|
| OpenAI | GPT-5 | August 2025 |
| OpenAI | o3 and o4-mini | April 2025 |
| OpenAI | o3-mini | January 2025 |
| OpenAI | o1 | December 2024 |
| OpenAI | o1-preview | September 2024 |
| Google DeepMind | Gemini 2.5 Computer Use | October 2025 |
| Google DeepMind | Gemini 2.5 Pro | March 2025 |
| Google DeepMind | Gemini 2.5 Pro and Flash | December 2024 |
| Google DeepMind | Gemini 2.0 Flash Thinking | December 2024 |
| DeepSeek | V3.2-Exp | September 2025 |
| DeepSeek | V3.1 | August 2025 |
| DeepSeek | R1-0528 | May 2025 |
| DeepSeek | V3-0324 | March 2025 |
| DeepSeek | R1 and R1-Lite-Preview | January 2025 |
| Alibaba | QwQ-32B | March 2025 |
| Alibaba | QvQ-72B-Preview | December 2024 |
| Alibaba | QwQ-32B-Preview | November 2024 |
| Anthropic | Haiku 4.5 | October 2025 |
| Anthropic | Sonnet 4.5 | September 2025 |
| Anthropic | Sonnet 3.7 | February 2025 |
| Mistral AI | Magistral (medium & small) | June 2025 |
| xAI | Grok 4 | July 2025 |
| xAI | Grok 3 | February 2025 |
| Hugging Face | OlympicCoder-7B & 32B | February 2025 |
| NVIDIA | Llama Nemotron | March 2025 |
| Tencent | T1 | March 2025 |
References
- ^ Besta, Maciej; Barth, Julia; Schreiber, Eric; Kubicek, Ales; Catarino, Afonso; Gerstenberger, Robert; Nyczyk, Piotr; Iff, Patrick; Li, Yueling (2025-01-23). "Reasoning Language Models: A Blueprint". arXiv:2501.11223 [cs.CL].
- ^ a b "Learning to reason with LLMs". OpenAI. 2024-09-12. Retrieved 2025-07-26.
- ^ a b Introducing OpenAI o1-preview, OpenAI, 2024-09-12
- ^ a b c d e Learning to reason with LLMs, OpenAI, 2024-09-12
- ^ OpenAI launches new series of AI models with reasoning abilities, Reuters, 2024-09-12
- ^ a b Azure OpenAI reasoning models, Microsoft Learn, 2025-10-11
- ^ Christiano, Paul; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017-06-12), "Deep reinforcement learning from human preferences", arXiv, doi:10.48550/arXiv.1706.03741
- ^ a b Ouyang, Long; Wu, Jeff; Jiang, Xu; Dinan, Emily; Bansal, Prafulla; Wainwright, Sam; Xu, Chong; Schulman, John (2022-03-04), "Training language models to follow instructions with human feedback", arXiv
- ^ Wei, Jason; Wang, Xuezhi; Schuurmans, Dale; Saxton, David; Prenger, Ryan; Ren, Shuohui; Liu, Yang; Zhou, Denny (2022-01-28), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", arXiv, doi:10.48550/arXiv.2201.11903
- ^ Kojima, Takeshi; Gu, Shixiang; Reid, Machel; Matsuo, Yutaka; Iwasawa, Yusuke (2022-05-24), "Large Language Models are Zero-Shot Reasoners", arXiv, doi:10.48550/arXiv.2205.11916
- ^ Wang, Xuezhi; Wei, Jason; Schuurmans, Dale; Le, Quoc; Chi, Ed; Zhou, Denny (2022-03-21), "Self-Consistency Improves Chain of Thought Reasoning in Language Models", arXiv
- ^ Yao, Shunyu; Zhao, Jeffrey; Yu, Dian; Du, Nan; Shafran, Izhak; Narasimhan, Karthik; Cao, Yuan (2022-10-06), "ReAct: Synergizing Reasoning and Acting in Language Models", arXiv
- ^ Yao, Shunyu; Yu, Dian; Zhao, Jeffrey; Shafran, Izhak; Griffiths, Thomas L.; Cao, Yuan; Narasimhan, Karthik (2023-05-17), "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", arXiv
- ^ Lightman, Hunter; Kosaraju, Vineet; Burda, Yura; Edwards, Harri; Baker, Bowen; Lee, Teddy; Leike, Jan; Schulman, John; Sutskever, Ilya (2023-05-31), "Let's Verify Step by Step", arXiv
- ^ a b Improving mathematical reasoning with process supervision, OpenAI, 2023-05-31
- ^ Sutton, Richard S. "The Bitter Lesson". Incomplete Ideas. Retrieved 2025-02-27.
- ^ Huang, Zhen; Zou, Haoyang; Li, Xuefeng; Liu, Yixiu; Zheng, Yuxiang; Chern, Ethan; Xia, Shijie; Qin, Yiwei; Yuan, Weizhe (2024-11-25). "O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?". arXiv:2411.16489 [cs.CL].
- ^ a b Zeff, Maxwell (2025-02-05). "Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50". TechCrunch. Retrieved 2025-07-26.
- ^ a b c d Lightman, Hunter; Kosaraju, Vineet; Burda, Yura; Edwards, Harri; Baker, Bowen; Lee, Teddy; Leike, Jan; Schulman, John; Sutskever, Ilya (2024). "Let's Verify Step by Step". International Conference on Learning Representations (ICLR 2024). arXiv:2305.20050. Retrieved 2025-07-26.
- ^ Abhinav Kumar (2025). "OverThink: Slowdown Attacks on Reasoning LLMs".
- ^ Edwards, Benj (2024-09-12). "OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini". Ars Technica. Retrieved 2025-02-06.
- ^ "OpenAI o1 System Card" (PDF). OpenAI. 2024-12-05. Retrieved 2025-07-26.
- ^ Robison, Kylie (2024-12-05). "OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more". The Verge. Retrieved 2025-07-26.
- ^ Singh, Jaspreet (2024-12-20). "OpenAI unveils 'o3' model, touting advances in reasoning". Reuters. Retrieved 2025-07-26.
- ^ "Introducing OpenAI o3 and o4-mini". OpenAI. 2025-04-16. Retrieved 2025-07-26.
- ^ "QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown". Qwen (Alibaba Cloud). 2024-11-28. Retrieved 2025-07-26.
- ^ "QVQ: To See the World with Wisdom". Qwen. Alibaba Cloud. 2024-12-25. Retrieved 2025-07-26.
- ^ "Try Deep Research and our new experimental model in Gemini, your AI assistant". Google. 2024-12-11. Retrieved 2025-02-05.
- ^ Roth, Emma (2024-12-11). "Google built an AI tool that can do research for you". The Verge. Retrieved 2025-07-26.
- ^ "Scaling test-time compute". Hugging Face. 2024-12-16. Retrieved 2025-07-26.
- ^ a b Snell, Charlie; Lee, Jaehoon; Xu, Kelvin; Kumar, Aviral (2025). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters". International Conference on Learning Representations (ICLR 2025). arXiv:2408.03314. Retrieved 2025-07-26.
- ^ Orland, Kyle (2025-01-28). "How does DeepSeek R1 really fare against OpenAI's best reasoning models?". Ars Technica. Retrieved 2025-02-06.
- ^ a b c DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv:2501.12948 [cs.CL].
- ^ DeepSeek 支持"深度思考+联网检索"能力 [DeepSeek adds a search feature supporting simultaneous deep thinking and web search]. People's Daily Online (in Chinese). 2025-01-29. Retrieved 2025-07-26.
- ^ a b Muennighoff, Niklas; Yang, Zitong; Shi, Weijia; Li, Xiang Lisa; Fei-Fei, Li; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Liang, Percy; Candès, Emmanuel (2025-02-03). "s1: Simple test-time scaling". arXiv:2501.19393 [cs.CL].
- ^ a b c "Introducing deep research". OpenAI. 2025-02-02. Retrieved 2025-02-05.
- ^ Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019-09-18), "Fine-Tuning Language Models from Human Preferences", arXiv
- ^ Rafailov, Rafael; Sharma, Kushal; Mitchell, Eric; Manning, Christopher D.; Ermon, Stefano; Finn, Chelsea (2023-05-29), "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", arXiv
- ^ Achiam, Josh; Adler, Steven; Agarwal, Sandhini (2023-03-15), "GPT-4 Technical Report", arXiv
- ^ Zelikman, Eric; Wu, Yuhuai; Mu, Jesse; Goodman, Noah D. (2022-03-28), "STaR: Bootstrapping Reasoning With Reasoning", arXiv
- ^ Guo, Dan (2025-01-25), "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv
- ^ a b c d e f Uesato, Jonathan; Kushman, Nate; Kumar, Ramana; Song, Francis; Siegel, Noah; Wang, Lisa; Creswell, Antonia; Irving, Geoffrey; Higgins, Irina (2022-11-25). "Solving math word problems with process- and outcome-based feedback". arXiv:2211.14275 [cs.LG].
- ^ a b c Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18). "Training Verifiers to Solve Math Word Problems". arXiv:2110.14168 [cs.LG].
- ^ Yuan, Zheng; Yuan, Hongyi; Li, Chengpeng; Dong, Guanting; Lu, Keming; Tan, Chuanqi; Zhou, Chang; Zhou, Jingren (2023-09-13). "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models". arXiv:2308.01825 [cs.CL].
- ^ "Aligning language models to follow instructions". OpenAI Blog. 2022-01-27. Retrieved 2025-05-04.
- ^ a b c d e Wang, Peiyi; Li, Lei; Shao, Zhihong; Xu, Runxin; Dai, Damai; Li, Yifei; Chen, Deli; Wu, Yu; Sui, Zhifang (August 2024). Ku, Lun-Wei; Martins, Andre; Srikumar, Vivek (eds.). "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics: 9426–9439. arXiv:2312.08935. doi:10.18653/v1/2024.acl-long.510.
- ^ "prm800k". GitHub. OpenAI. 2025-01-27. Retrieved 2025-01-27.
- ^ Chen, Guoxin; Liao, Minpeng; Li, Chengxi; Fan, Kai (2024-09-27). "AlphaMath Almost Zero: Process Supervision without Process". arXiv:2405.03553 [cs.LG].
- ^ Yuan, Lifan; Li, Wendi; Chen, Huayu; Cui, Ganqu; Ding, Ning; Zhang, Kaiyan; Zhou, Bowen; Liu, Zhiyuan; Peng, Hao (2024-12-02). "Free Process Rewards without Process Labels". arXiv:2412.01981 [cs.CL].
- ^ Zhang, Di; Wu, Jianbo; Lei, Jingdi; Che, Tong; Li, Jiatong; Xie, Tong; Huang, Xiaoshui; Zhang, Shufei; Pavone, Marco (2024-11-21). "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning". arXiv:2410.02884 [cs.CL].
- ^ Ma, Qianli; Zhou, Haotian; Liu, Tingkai; Yuan, Jianbo; Liu, Pengfei; You, Yang; Yang, Hongxia (2023-10-16). "Let's reward step by step: Step-Level reward model as the Navigators for Reasoning". arXiv:2310.10080 [cs.CL].
- ^ Wei, Jason; Wang, Xuezhi; Schuurmans, Dale; Bosma, Maarten; Ichter, Brian; Xia, Fei; Chi, Ed; Le, Quoc; Zhou, Denny (2023-01-10). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". arXiv:2201.11903 [cs.CL].
- ^ Wang, Xuezhi; Wei, Jason; Schuurmans, Dale; Le, Quoc; Chi, Ed; Narang, Sharan; Chowdhery, Aakanksha; Zhou, Denny (2023-03-07). "Self-Consistency Improves Chain of Thought Reasoning in Language Models". arXiv:2203.11171 [cs.CL].
- ^ Yao, Shunyu; Yu, Dian; Zhao, Jeffrey; Shafran, Izhak; Griffiths, Thomas L.; Cao, Yuan; Narasimhan, Karthik (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models". arXiv:2305.10601 [cs.CL].
- ^ Cui, Dong-Xu; Long, Shi-Yu; Tang, Yi-Xuan; Zhao, Yue; Li, Qiao (2025-08-25). "Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry?─Based on Conversations with LLMs". Journal of Chemical Information and Modeling acs.jcim.5c01265. doi:10.1021/acs.jcim.5c01265. ISSN 1549-9596. PMID 40854079.
- ^ Qwen; Yang, An; Yang, Baosong; Zhang, Beichen; Hui, Binyuan; Zheng, Bo; Yu, Bowen; Li, Chengyuan; Liu, Dayiheng (2024). "Qwen2.5 Technical Report". arXiv:2412.15115 [cs.CL].
- ^ Comanici, Gheorghe; Bieber, Eric; Schaekermann, Mike; Pasupat, Ice; Sachdeva, Noveen; Dhillon, Inderjit; Blistein, Marcel; Ram, Ori; Zhang, Dan (2025-07-22). "Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities". arXiv:2507.06261 [cs.CL].
- ^ Mirza, Adrian; Alampara, Nawaf; Kunchapu, Sreekanth; Ríos-García, Martiño; Emoekabu, Benedict; Krishnan, Aswanth; Gupta, Tanya; Schilling-Wilhelmi, Mara; Okereke, Macjonathan; Aneesh, Anagha; Asgari, Mehrdad; Eberhardt, Juliane; Elahi, Amir Mohammad; Elbeheiry, Hani M.; Gil, María Victoria (July 2025). "A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists". Nature Chemistry. 17 (7): 1027–1034. Bibcode:2025NatCh..17.1027M. doi:10.1038/s41557-025-01815-x. ISSN 1755-4349. PMC 12226332. PMID 40394186.
- ^ "Humanity's Last Exam leaderboard". Safe.ai. Center for AI Safety. Retrieved 2025-07-26.
- ^ "OpenAI o3-mini". OpenAI. 2025-01-31. Retrieved 2025-02-09.
- ^ Huang, Yuting; Zois, Christos; Wang, Yue; Zhang, Yue; Mavromatis, Christos; Zeng, Jiachen; Yin, Shihao; Voulkidis, Antonios; Shepard, Daniel (2025). "Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study". Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things. ACM. pp. 1–6. arXiv:2503.12282. doi:10.1145/3722565.3727198. ISBN 979-8-4007-1608-9.
Although we did not evaluate o1 and o3 models ... their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.
- ^ Hu, Zihao; Wang, Yuqing; Sun, Rui; Lu, Haoran; Gong, Qian; Wang, Jinshuai; Gong, Yunlong; Huang, Yiming; He, Peng (2025-02-13). "Inference-Time Compute: More Faithful? A Research Note". arXiv:2502.09673 [cs.CL].
we were unable to evaluate O1 and R1 …
- ^ Chen, Guoliang; Zhu, Zhiyao; Meng, Qinxiang; Liang, Weilin; Ji, Zijie; Liu, Jiangning; Zeng, Jie (2025-03-07). "RealBench: Evaluating LLMs as Verilog Engineers". arXiv:2503.04914 [cs.AI].
For O1-preview, we sample only once due to high cost.
- ^ Gupta, Arpit; Schapira, Michael; Gill, Phillipa; Seetharaman, Srinivasan (2025-01-30). "On the Feasibility of Using LLMs to Execute Multistage Network Attacks". arXiv:2501.16466 [cs.CR].
We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.
External links
- Fortes, Armando (2025-01-27). "atfortes/Awesome-LLM-Reasoning". GitHub. Retrieved 2025-01-27.
- Huang, Jie; Chang, Kevin Chen-Chuan (2023-05-26). "Towards Reasoning in Large Language Models: A Survey". arXiv:2212.10403 [cs.CL].
- Besta, Maciej; Barth, Julia; Schreiber, Eric; Kubicek, Ales; Catarino, Afonso; Gerstenberger, Robert; Nyczyk, Piotr; Iff, Patrick; Li, Yueling (2025-01-23). "Reasoning Language Models: A Blueprint". arXiv:2501.11223 [cs.AI].
