Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision (2024)

Zhiheng Xi1 , Dingwen Yang1, Jixuan Huang1, Jiafu Tang1, Guanyu Li1, Yiwen Ding1,
Wei He1, Boyang Hong1, Shihan Dou1, Wenyu Zhan1, Xiao Wang1, Rui Zheng1, Tao Ji1,
Xiaowei Shi2, Yitao Zhai2, Rongxiang Weng2, Jingang Wang2, Xunliang Cai2,
Tao Gui1, Zuxuan Wu1, Qi Zhang1, Xipeng Qiu1, Xuanjing Huang1, Yu-Gang Jiang1

1Fudan University 2Meituan

Equal contribution. Correspondence to: zhxi22@m.fudan.edu.cn, tgui@fudan.edu.cn

Abstract

Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and training-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of 76,321 responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce critique-based supervision into the actor's self-training process and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take a preliminary step toward training self-talk reasoning models via critique supervision and showcase their potential. Our code and datasets are at https://mathcritique.github.io/.

1 Introduction

With the rapid advancement of large language models (LLMs) [1, 2, 3, 4, 5], significant progress has been made in enhancing their reasoning capabilities [6, 7, 8, 9, 10, 11]. By prompting or training language models to reason step-by-step like humans (i.e., chain-of-thought, CoT), these models have demonstrated impressive reasoning abilities [6, 9, 12]. Recently, OpenAI's o1 model has introduced a paradigm shift, exploring how to increase inference-time computation in language models and explicitly generate longer chains of thought [13]. This enables them to tackle more complex reasoning tasks that even humans find challenging, such as problems in the domains of science, coding, and mathematics [14, 15, 16, 17].

At the same time, many studies have explored test-time scaling by employing mechanisms like self-reflection, self-correction, and self-critique to generate longer thinking chains [18, 14, 19, 12, 20, 21], similar to OpenAI’s o1. However, the effectiveness of these mechanisms depends on the models’ ability to accurately evaluate their own performance. This ability can be limited by factors such as initial accuracy, problem complexity, and the lack of external feedback [17, 22, 23, 18]. As a result, their performance remains constrained, even with increased inference-time computation [24].

In light of this, to reliably improve reasoning models' performance with increased inference-time computation, we delve into a two-player paradigm, where the actor model engages in reasoning while the critique model provides supervisory feedback on the thought chains [18, 25, 26, 27]. This approach represents a scalable oversight technique aimed at providing reliable and effective supervision for the continued development of LLMs [22, 28, 29]. The goal is to help the actor model identify errors and refine its outputs, ultimately leading to higher-quality results. In this paper, we explore how to develop effective and reliable critique models, and how to enhance the actor's reasoning performance through collaboration with the critique model at test-time. Additionally, we explore incorporating supervision from critique models into the actor's training process to build more capable reasoning models.

[Figure 1]

We first propose an automated and scalable framework called AutoMathCritique to collect diverse and high-quality step-level critique data without additional human supervision (Section 3). The framework consists of three main stages: flawed reasoning path construction, critique generation, and data filtering. In the first stage, we leverage several approaches for controlled error synthesis, each targeting different aspects of reasoning errors, such as their location or specific content. This controlled process ensures the diversity and comprehensiveness of the reasoning paths and provides informative and precise hints to guide the subsequent critique generation. In the second stage, annotator models are provided with the original reasoning path and possible hints about the mistakes, and asked to label step-level correctness and offer constructive feedback. In the third stage, the reasoning model revises the response according to the critiques, and Monte Carlo sampling [30, 31] is used to eliminate low-quality or non-informative critique data while preventing high-quality data from being accidentally discarded. A case of the resulting data is illustrated in Figure 2.

[Figure 2]

Next, using AutoMathCritique, we create a critique dataset containing 76,321 samples, named MathCritique-76k, which is subsequently used to fine-tune a language model to obtain the critique model. We demonstrate that the critique models can assist the actor model in improving exploration efficiency and reasoning quality during test time, leading to a significant enhancement in its reasoning performance (Section 4). Through in-depth analysis, we find that the critique models are particularly effective in helping the actor achieve better results on difficult queries. Additionally, by scaling inference-time computation [15, 32, 33], the performance gains brought by the critique models continue to grow.

Motivated by these test-time findings, we introduce the critique model into the actor model's exploration and learning process, proposing a critique-in-the-loop self-improvement method (Section 5). With the supervision of critique models and by scaling exploration computation for difficult queries, our method improves the actor's exploration efficiency and solution diversity, alleviating the issue of tail narrowing [34] in reasoning models during iterative exploration and learning. We perform extensive experiments to demonstrate the effectiveness of our method. Additionally, we conduct further analysis of the critique models (Section 6), e.g., their scaling properties and whether test-time computation should be scaled sequentially or in parallel.

Finally, we take a step further and conduct preliminary explorations on how to leverage critique data to construct step-level self-talk data (Section 7). We propose the self-talk-via-critique method, and train a single language model to reflect and self-correct at each step, demonstrating the potential of this approach.

In summary, our main contributions are:

  • We introduce AutoMathCritique, an automated and scalable framework for collecting step-level critique data without additional human supervision, which we use to build the large-scale critique dataset MathCritique-76k.

  • We fine-tune the critique model with MathCritique-76k to offer constructive feedback on reasoning paths. We demonstrate and analyze the performance gains of the trained critique models in enhancing the actor’s reasoning during test time, particularly when scaling test-time computation.

  • Motivated by the insights from test-time analysis, we introduce the critique model to the actor’s self-training process, and propose the critique-in-the-loop self-improvement method to enhance exploration efficiency and solution diversity, ultimately training better reasoning models.

  • We conduct extensive experiments to validate the effectiveness of our method and perform in-depth analysis of critique models, e.g., their scaling properties and whether test-time computation should be scaled sequentially or in parallel.

  • We propose the self-talk-via-critique method, and take the preliminary step to train models that can perform step-level reasoning, reflection and correction, and demonstrate their potential. We hope our work offers valuable insights for future research on LLM reasoning and scalable supervision.

2 Preliminaries

In the two-player setting studied in this paper, there are two roles: the actor model and the critique model. Also, there are three primary tasks [22]: reasoning, critique, and refinement.

In the reasoning task, the actor model $\pi_\theta$, parameterized by $\theta$, is given a reasoning problem $x$ and is expected to generate a response $y = \pi_\theta(x)$. This response includes both the answer to the problem and the reasoning trajectory. The accuracy of this response can be evaluated using a reward function $r(x, y)$.

Next, the critique model $\pi_\phi$, parameterized by $\phi$, performs the critique task: given the problem and response, it generates critical feedback $c = \pi_\phi(x, y)$. Notably, if the oracle reward of the response is not given, the critique task consists of two subtasks: the discriminative task and the feedback generation task. The former determines whether the response contains flaws, while the latter generates constructive natural language feedback.

Finally, we define the refinement task, in which, given the problem, response, and critique, the actor generates a new response $y' = \pi_\theta(x, y, c)$; this is also known as conditional refinement. Alternatively, we can define direct refinement $y' = \pi_\theta(x, y)$, where the actor provides an improved answer based on an existing answer without conditioning on a critique, which is also referred to as "self-correction" [18].

This process can proceed in multiple rounds. We define that in the initial round (round 0) only the actor operates, generating a response based on the problem. In round $i$, the critique model first generates a new critique based on the interaction history, which is represented as:

$$c_i = \pi_\phi(x, y_0, c_1, y_1, \ldots, c_{i-1}, y_{i-1}).$$

Then, the actor generates a new refinement based on the previous interaction history, represented as:

$$y_i = \pi_\theta(x, y_0, c_1, y_1, \ldots, y_{i-1}, c_i).$$
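To make the round structure concrete, the following is a minimal Python sketch of the two-player loop, assuming hypothetical `actor` and `critic` callables that wrap the respective models and consume the full interaction history; it illustrates the notation above rather than reproducing the authors' implementation.

```python
from typing import Callable, List, Tuple

def interact(x: str,
             actor: Callable[[List[str]], str],
             critic: Callable[[List[str]], str],
             num_rounds: int) -> List[Tuple[str, str]]:
    """Round 0: the actor answers the problem. Round i >= 1: the critic produces
    c_i from the history, then the actor produces a refinement y_i."""
    history: List[str] = [x]                 # history starts with the problem x
    y = actor(history)                       # y_0
    history.append(y)
    transcript: List[Tuple[str, str]] = [("response", y)]
    for _ in range(num_rounds):
        c = critic(history)                  # c_i conditioned on (x, y_0, c_1, ..., y_{i-1})
        history.append(c)
        y = actor(history)                   # y_i conditioned on the history including c_i
        history.append(y)
        transcript += [("critique", c), ("response", y)]
    return transcript

# Toy usage with stub models that only look at the history length.
if __name__ == "__main__":
    actor_stub = lambda h: f"answer v{len(h)}"
    critic_stub = lambda h: f"feedback v{len(h)}"
    for role, text in interact("What is 2 + 3?", actor_stub, critic_stub, num_rounds=2):
        print(f"{role}: {text}")
```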

3 AutoMathCritique: An Automated and Scalable Framework to Collect Step-level Critique Data

To train critique models capable of delivering step-level supervision and constructive feedback for reasoning, we introduce AutoMathCritique, an automated and scalable framework for collecting critique data (see Figure 3 for an overview). The framework consists of three main stages: flawed reasoning path construction, critique generation, and data filtering. Using AutoMathCritique, we create a dataset containing 76,321 samples named MathCritique-76k. The statistics are listed in Table 1.

[Figure 3]

We focus on the field of mathematical reasoning, so we utilize two of the most widely used datasets: GSM8K [35] and MATH [36]. The queries used for our subsequent data construction primarily come from their training sets, and we also leverage their original annotated responses to train the actor reasoning models. Our in-domain test set is composed of their test sets.

3.1 Construction of Flawed Reasoning Paths

To create high-quality critique data, we first need to construct a dataset of reasoning paths that includes some flaws. To better control the quality and diversity of the generated flawed reasoning paths, and to facilitate the subsequent construction of critique data, we leverage several distinct response generation (RG) approaches. These strategies encompass different aspects of the errors, such as their location or specific details. We mainly use Llama3-8B [5] as our actor model for sampling.

RG1: sampling from scratch.

In this approach, the actor is provided with a query and tasked with generating a response. Given that the actor we used has already achieved high accuracy on the GSM8K and MATH training sets, we use repeated sampling to obtain flawed responses. However, this method has the limitation of not offering detailed information about the location or content of the mistakes, which means that the subsequent critique labeling heavily depends on the expertise of annotators.

RG2: generating error-location-aware response.

In this approach, given a query, we first sample a correct response from the actor model. Then, starting from a specific step of the response, we modify the model’s hyperparameters for flawed response sampling, such as increasing the temperature of the final softmax function. This ensures that the steps preceding the selected step remain consistent with the original correct response, while the subsequent steps are more likely to contain errors. If the sampled response remains correct, we select a different step and further increase the randomness of the generation process. This method strikes a balance between generating flawed responses and maintaining the coherence of the reasoning process. The correct responses we sample are later used to construct critiques, while for the flawed responses, we collect information about the error locations (e.g., identifying from which step the errors originate), thereby facilitating the annotation of high-quality critiques.
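A minimal sketch of this resampling idea is given below, assuming hypothetical `resample_from` (continue a partial solution at a given temperature) and `is_correct` (answer checker) helpers; the temperature schedule and retry budget are illustrative, not taken from the paper.

```python
import random
from typing import Callable, List, Optional, Tuple

def rg2_flawed_response(correct_steps: List[str],
                        resample_from: Callable[[List[str], float], List[str]],
                        is_correct: Callable[[List[str]], bool],
                        base_temperature: float = 0.7,
                        max_tries: int = 8) -> Optional[Tuple[List[str], int]]:
    """Keep a prefix of a correct solution fixed and resample the remaining steps
    with increasing temperature until the completed solution becomes incorrect.
    Returns the flawed solution and the index of the first possibly-wrong step.
    Assumes the correct solution has at least two steps."""
    temperature = base_temperature
    for _ in range(max_tries):
        split = random.randrange(1, len(correct_steps))       # keep steps[:split] intact
        candidate = correct_steps[:split] + resample_from(correct_steps[:split], temperature)
        if not is_correct(candidate):
            return candidate, split                           # error-location hint for annotation
        temperature += 0.2                                    # more randomness, then retry
    return None                                               # no flawed sample found
```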

RG3: adding detailed mistakes.

In this approach, given a query, the actor model is instructed to sample a correct reasoning path first. We then instruct the model to introduce mistakes into the correct response. Inspired by previous work [37, 38], we enumerate various common reasoning errors in the instructions and include few-shot examples in the prompt. Each example consists of five components: the query, the correct reference response, the step where the error is introduced, the type of error, and the generated flawed response. After the error is inserted, we direct the model to continue reasoning from the erroneous step until it reaches a final answer. If a flawed response is not generated, we repeat the sampling process up to a maximum of 16 attempts. As in RG2, the correct answers obtained during this process can also be used to construct critiques. This approach allows us to easily capture information about the location of the first mistake and its specific details, thereby significantly reducing the complexity of subsequent critique construction.
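A rough sketch of this error-injection loop, assuming a hypothetical `generate` call that wraps the actor with the error-injection prompt and an `is_correct` checker; the error-type list and prompt wording are placeholders, and the paper's few-shot examples are omitted. The 16-attempt cap follows the text above.

```python
from typing import Callable, Optional

# Illustrative error types only; the paper enumerates common reasoning errors in its
# prompts, but the exact list is not reproduced here.
ERROR_TYPES = ["calculation error", "misread condition", "wrong formula", "unit confusion"]

def rg3_inject_mistake(query: str,
                       correct_solution: str,
                       generate: Callable[[str], str],
                       is_correct: Callable[[str, str], bool],
                       max_attempts: int = 16) -> Optional[str]:
    """Ask the model to insert a specific kind of error into a correct solution and
    continue reasoning from the corrupted step; retry until the result is genuinely flawed."""
    for attempt in range(max_attempts):
        error_type = ERROR_TYPES[attempt % len(ERROR_TYPES)]
        prompt = (
            f"Query: {query}\n"
            f"Correct solution: {correct_solution}\n"
            f"Introduce a {error_type} at some step of the solution, then continue "
            f"reasoning from that step until you reach a final answer."
        )
        flawed = generate(prompt)
        if not is_correct(query, flawed):
            return flawed          # flawed response with a known injected error type
    return None
```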

Dataset | Query  | Golden reasoning path | Critique
GSM8K   | 7,473  | 7,473                 | 37,873
MATH    | 7,500  | 7,498                 | 38,448
Total   | 15,973 | 15,971                | 76,321

3.2 Generation of Critiques

Step-level critique generation.

When generating critique data, we enhance quality by checking each step to identify the first error in the solution, which in turn facilitates the refinement process. Specifically, given a query and response, we employ two methods to generate step-level critique data: (1) We instruct the critique annotator (in our work, GPT-4o [2]) to directly identify the location of the first error and provide corresponding feedback. This method requires the annotator to assess the entire solution holistically, making it relatively more challenging. (2) We instruct the annotator model to check the solution step by step, stopping the process once the first error is detected, at which point it provides the corresponding feedback. This strategy effectively decomposes the entire solution, reducing the difficulty of providing comments.
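The second strategy can be sketched as a simple loop over steps, assuming a hypothetical `judge_step` wrapper around the annotator model that returns a correctness verdict and a comment for one step given its preceding context.

```python
from typing import Callable, List, Optional, Tuple

def stepwise_critique(query: str,
                      steps: List[str],
                      judge_step: Callable[[str, List[str], str], Tuple[bool, str]]
                      ) -> Tuple[Optional[int], List[str]]:
    """Check a solution one step at a time; stop at the first step judged incorrect
    and return its index together with the per-step comments collected so far."""
    comments: List[str] = []
    for i, step in enumerate(steps):
        ok, comment = judge_step(query, steps[:i], step)   # judge step i given its prefix
        comments.append(comment)
        if not ok:
            return i, comments      # comments[i] is the feedback on the first error
    return None, comments           # every step judged correct
```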

Critique generation based on varying information about errors.

When constructing responses, we employ different strategies that provide various types of information, helping annotators identify and analyze flaws. Such information plays a crucial role in generating critiques.

For responses that are correct, we do not provide any additional information but instead ask the annotator to critique step by step. Only when the critique annotator correctly labels every step is this critique data collected. If the annotator labels any step as flawed, this indicates that either the response is a false positive (i.e., the answer is correct but the reasoning process is flawed) or the annotator's labeling is incorrect; in either case, the data is discarded.

For flawed responses, we design critique prompts based on the generation strategy used (RG1, RG2, RG3). For responses generated by RG1, we provide a correct reference response to directly assist the annotator in labeling. For flawed responses from RG2, we offer both the reference response and highlight the likely starting point of the error, helping the annotator identify the first critical mistake. For RG3-generated flawed responses, we not only specify the exact location of the error but also provide detailed information about the mistake, enabling a more precise critique.

3.3 Data Filtering

Although we have constructed a large amount of critique data paired with flawed responses, the quality of this data is not guaranteed, and low-quality data could weaken the performance of the critique model. To address this, we apply a filtering process. Specifically, we use Monte Carlo sampling: each (query, response, critique) tuple is fed into the actor model for refinement. The refinement process is repeated 10 times, and the critique data is retained only when the accuracy exceeds a predefined threshold τ = 0.3. We refer to this process as soft filtering. In contrast, under hard filtering, a critique is considered valid if at least one of the k refinements produces a correct result. In practice, we adopt soft filtering because it prevents high-quality critique data from being discarded due to occasional model errors. Furthermore, it minimizes the risk of including low-quality critiques that the actor model does not follow but instead refines based on its own knowledge, still arriving at a correct response. Note that our method does not completely eliminate low-quality data; rather, we strive to strike a balance between quality and quantity. Additionally, we randomly sampled 100 data points 5 times and had crowdsourced annotators check them, finding a low-quality rate of 1.2%.
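A minimal sketch of both filtering rules, assuming hypothetical `refine` (actor refinement call) and `is_correct` (answer checker) helpers; the 10 refinements and the threshold τ = 0.3 follow the text above.

```python
from typing import Callable

def soft_filter(query: str, response: str, critique: str,
                refine: Callable[[str, str, str], str],
                is_correct: Callable[[str, str], bool],
                num_samples: int = 10, threshold: float = 0.3) -> bool:
    """Soft filtering: refine the flawed response under the critique num_samples times
    and keep the critique only if the fraction of correct refinements exceeds threshold."""
    n_correct = sum(is_correct(query, refine(query, response, critique))
                    for _ in range(num_samples))
    return n_correct / num_samples > threshold

def hard_filter(query: str, response: str, critique: str,
                refine: Callable[[str, str, str], str],
                is_correct: Callable[[str, str], bool],
                num_samples: int = 10) -> bool:
    """Hard filtering: keep the critique if at least one refinement is correct."""
    return any(is_correct(query, refine(query, response, critique))
               for _ in range(num_samples))
```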

[Figure 4]

4 Critique Models Improve LLM Reasoning through Test-time Supervision

In this section, we begin by training critique models to provide step-level supervisory signals and useful feedback on reasoning paths, along with actor reasoning models that possess basic reasoning and refinement abilities (Section 4.1). We then explore the role of critique models in supporting the actor reasoning model at test-time (Section 4.2), showing that they significantly enhance the actor's performance on difficult problems. Furthermore, as we scale up inference-time computation, we observe that the critique model continues to raise the performance ceiling of the reasoning models.

4.1 Fine-tuning Critique Models and Actor Reasoning Models

Training critique models with MathCritique-76k.

We train the critique models through supervised fine-tuning on the collected MathCritique-76k, using the standard language modeling loss. Given a dataset $\mathcal{D}_{\text{critique}} = \{(x, y, c)\}_{j=1}^{N}$, the loss for the critique model $\pi_\phi$ is:

$$\mathcal{L}_{\text{critique}}(\phi) = \mathbb{E}_{(x,y,c)\sim\mathcal{D}_{\text{critique}}}\big[\log \pi_\phi(c \mid x, y)\big]. \tag{1}$$

In this way, we can obtain a critique model that provides step-level supervision and constructive feedback on reasoning paths for actor models.

Training actor models with basic reasoning and refinement ability.

We then train reasoning models in our two-player setting. The models are trained using the training sets of GSM8K and MATH, containing 7,473 and 7,500 samples, respectively. We denote the mixed response training set as $\mathcal{D}_{\text{reason}} = \{(x, y)\}_{j=1}^{|\mathcal{D}_{\text{reason}}|}$. Additionally, to equip the models with the ability to perform refinement according to critique feedback, we utilize GPT-4 to annotate 8k refinement samples (half from MATH and the other half from GSM8K), denoted as $\mathcal{D}_{\text{refine}} = \{(x, y, c, y')\}_{j=1}^{|\mathcal{D}_{\text{refine}}|}$, where $y'$ represents the refined reasoning path generated based on the critique $c$. Each refinement sample is verified to ensure the correctness of its final answer. The loss for training the actor reasoning model $\pi_\theta$ is:

$$\mathcal{L}_{\text{actor}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{reason}}}\big[\log \pi_\theta(y \mid x)\big] + \beta \cdot \mathbb{E}_{(x,y,c,y')\sim\mathcal{D}_{\text{refine}}}\big[\log \pi_\theta(y' \mid x, y, c)\big], \tag{2}$$

where $\beta$ is a hyperparameter that balances the learning of reasoning and refinement.
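For illustration, below is a sketch of the combined objective of Equation (2), written as a negative log-likelihood to be minimized (equivalent to maximizing the stated expectations). Here `log_prob(prompt, target)` is a hypothetical wrapper returning the summed log-probability of the target tokens under the actor, and the prompt formatting is a placeholder; the critique loss of Equation (1) is the analogous computation with critiques as targets.

```python
from typing import Callable, List, Tuple

def actor_loss(reason_batch: List[Tuple[str, str]],
               refine_batch: List[Tuple[str, str, str, str]],
               log_prob: Callable[[str, str], float],
               beta: float = 1.0) -> float:
    """Average NLL of the response given the query on reasoning data, plus beta times
    the average NLL of the refined response given (query, response, critique)."""
    reason_nll = -sum(log_prob(x, y) for x, y in reason_batch) / max(len(reason_batch), 1)
    refine_nll = -sum(log_prob(f"{x}\n{y}\n{c}", y_new)
                      for x, y, c, y_new in refine_batch) / max(len(refine_batch), 1)
    return reason_nll + beta * refine_nll
```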

4.2 Critique-based Supervision Improves Test-time Reasoning Performance

In this section, we investigate the impact of trained critique models in supporting the reasoning model at test-time (illustrated on the left of Figure 4). Specifically, we examine their effectiveness in enhancing the actor’s reasoning performance, identify the types of problems where performance improvements are observed, and assess whether scaling up test-time computation further elevates the actor’s performance ceiling.

4.2.1 Experimental Setups

Backbone models.

In our main experiments, we fine-tune the actor models from Llama3-8B-Base, following previous work [16, 39, 17]. This model demonstrates non-trivial performance on mathematical reasoning tasks while leaving room for improvement, making it an ideal testbed for our study. For the critique models, we fine-tune the instruction-tuned Llama3-8B and Llama3-70B models, whose instruction-following ability makes them suitable critique backbones. Note that most of our experiments are performed with the 8B model.

Evaluation metrics.

In mathematical reasoning tasks, we primarily evaluate the accuracy, which measures whether a solution matches the ground truth with an oracle reward function. When critique models are not employed, we directly evaluate the accuracy of the actor’s responses. In contrast, when critique models are used, we evaluate the accuracy of the actor’s responses after refinement based on feedback provided by the critique model.

Additionally, to comprehensively assess a critique model, we evaluate its discriminability, i.e., its ability to determine whether a solution contains errors [22]. We also evaluate its helpfulness, i.e., whether it can provide constructive feedback that enables the actor to correct erroneous responses.
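Since exact formulas for these two metrics are not spelled out here, the sketch below shows one plausible way to compute them from per-example records of (oracle correctness, critic verdict, post-refinement correctness).

```python
from typing import Iterable, Tuple

def critic_metrics(records: Iterable[Tuple[bool, bool, bool]]) -> Tuple[float, float]:
    """Each record is (response_correct, critic_judged_correct, refined_correct).
    Discriminability: agreement rate between the critic's verdict and the oracle label.
    Helpfulness: among initially incorrect responses, the fraction whose critique-guided
    refinement arrives at a correct answer."""
    records = list(records)
    agree = sum(resp_ok == judged_ok for resp_ok, judged_ok, _ in records)
    wrong = [r for r in records if not r[0]]
    helped = sum(refined_ok for _, _, refined_ok in wrong)
    discriminability = agree / max(len(records), 1)
    helpfulness = helped / max(len(wrong), 1)
    return discriminability, helpfulness
```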

Implementation details.

The experiments are conducted on NVIDIA A100 GPUs and Ascend 910 processors. When fine-tuning the critique models and actor reasoning models, we set the learning rate to 2e-5. During decoding, we set the model's temperature to 0, which means the decoding is done greedily. When we scale up inference-time computation, we set the temperature to 0.7. We evaluate the accuracy of the actor models, and the discriminability and helpfulness of the critique models.

4.2.2 Empirical Results and Findings

Critique Model     | GSM8K Acc. | Discrim. | Helpfulness | MATH Acc. | Discrim. | Helpfulness
No Critic          | 54.81      | -        | -           | 17.22     | -        | -
GPT-3.5-Turbo      | 58.38      | 62.9%    | 13.3%       | 25.56     | 51.3%    | 14.3%
GPT-4-Turbo        | 77.86      | 91.6%    | 57.5%       | 36.00     | 87.6%    | 26.2%
GPT-4o             | 79.52      | 91.5%    | 59.7%       | 39.98     | 85.4%    | 30.9%
Critique Model-8B  | 63.31      | 79.4%    | 31.0%       | 24.26     | 75.7%    | 16.2%
Critique Model-70B | 76.88      | 92.3%    | 55.3%       | 33.94     | 82.3%    | 23.9%

Critique models are highly effective at identifying the correctness of reasoning, offering constructive feedback for erroneous responses, and improving the overall accuracy of the actor.

We compare our critique models with state-of-the-art (SOTA) models used as critics; the results are presented in Table 2. Compared to these SOTA models, our 8B critique model significantly outperforms GPT-3.5-Turbo, while our 70B critique model achieves performance comparable to GPT-4-series models.

Specifically, the reasoning path judgment accuracy of our 8B critique model reaches 79.37% on GSM8K and 75.74% on MATH, exceeding GPT-3.5-Turbo by 16.52 and 24.46 percentage points, respectively. Additionally, in terms of helpfulness, it outperforms GPT-3.5-Turbo by 17.70% and 1.93% on GSM8K and MATH, respectively. Moreover, our 70B critique model demonstrates even stronger performance. In terms of discriminability, it surpasses GPT-4-Turbo and GPT-4o on GSM8K and achieves results close to these SOTA models on MATH. Its correction accuracy on both datasets approaches that of GPT-4-series models, ultimately leading to comparable actor accuracy under its guidance.

[Figure 5]
Critique models assist the actor in better handling challenging queries.

Next, we investigate the distribution of performance gains brought by the critique model across different difficulty levels. We generate 100 responses from the actor model for each query and categorize the queries into 5 difficulty levels based on the number of correct responses per query [36]. The results are illustrated in Figure 5. On both the training and test sets of GSM8K and MATH, critique models provide minimal benefit for simpler queries, where the actor model already performs well on its own. For more challenging problems, however, critique models offer significant support, resulting in overall improved performance. Furthermore, this phenomenon is even more pronounced on the training set, offering valuable insights for incorporating critique-model supervision during training (Section 5) [40].
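The exact bucketing rule is not specified here; the sketch below shows one plausible equal-width scheme that maps a query's number of correct responses (out of 100 samples) to one of 5 difficulty levels.

```python
from typing import Dict, List

def difficulty_level(num_correct: int, num_samples: int = 100, num_levels: int = 5) -> int:
    """Map a query to a difficulty level in {1, ..., num_levels}: level 1 means the actor
    solves it almost always, level num_levels means it almost never does."""
    solve_rate = num_correct / num_samples
    level = num_levels - int(solve_rate * num_levels)
    return max(1, min(num_levels, level))

def bucket_queries(correct_counts: Dict[str, int]) -> Dict[int, List[str]]:
    """Group query ids into difficulty buckets given their correct-response counts."""
    buckets: Dict[int, List[str]] = {}
    for qid, n_correct in correct_counts.items():
        buckets.setdefault(difficulty_level(n_correct), []).append(qid)
    return buckets

# Example: a query solved 97/100 times is easy (level 1); 3/100 is hard (level 5).
print(difficulty_level(97), difficulty_level(3))  # 1 5
```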

Scaling up inference-time computation consistently improves reasoning performance.

Recent studies have highlighted that scaling up inference-time computation can significantly enhance model performance [15, 32, 33]. Here, we investigate whether incorporating critique models can further raise the reasoning performance ceiling as test-time computation scales. A widely used test-time scaling technique is majority voting [41], denoted Maj@K, which measures whether the most frequent answer among K parallel samples is correct. This metric reflects the model's consistency in generating high-quality responses across multiple samples, which is critical for interactive exploration and learning paradigms such as reinforcement learning and self-improvement.
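A minimal implementation of the Maj@K check as defined above (tie-breaking among equally frequent answers is arbitrary):

```python
from collections import Counter
from typing import List

def maj_at_k(answers: List[str], ground_truth: str, k: int) -> bool:
    """Maj@K: among the first k sampled answers, check whether the most frequent
    answer matches the ground truth."""
    top_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return top_answer == ground_truth

# Example: the majority answer among 5 samples is "42", which is correct.
print(maj_at_k(["42", "41", "42", "42", "40"], ground_truth="42", k=5))  # True
```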

As shown in Figure 1, without critique models, Maj@K performance improves with increased computation but quickly plateaus, even at higher levels of computation (e.g., Maj@2K, Maj@3K). In contrast, when critique models are utilized at test-time, performance surpasses the baseline by a significant margin under the same computation budget, showing a 12.4% improvement on GSM8K and a 14.8% improvement on MATH. These findings indicate that critique models effectively improve the actor's exploration efficiency and solution quality, extending the performance ceiling when more inference-time computation is allocated.

5 Critique-in-the-loop Self-Improvement for Better Reasoning Models

Motivated by the test-time findings in Section 4.2, namely that critique models significantly aid in solving challenging problems and substantially raise the reasoning performance ceiling when computation is scaled up, we integrate critique-based supervision into the actor model's iterative exploration and learning process. We present a critique-in-the-loop self-improvement method, which scales up exploration computation on challenging queries and leads to stronger reasoning models (illustrated in Figure 4).

5.1 Vanilla Self-Improvement Method

Self-improvement is an exploration-and-learning method [42, 43, 16, 44, 45]. It iteratively leverages the actor reasoning model's correct responses to gradually enhance its problem-solving abilities. The process involves $T$ iterations, where each iteration consists of two steps: exploration and learning.

In the exploration step of iteration $t$, we sample $N$ responses for each query $x_j \in \mathcal{D}_{\text{reason}}$ from the previous model $\pi_\theta^{t-1}$, i.e., $\hat{y}_j = \pi_\theta^{t-1}(x_j)$. Each data point is then filtered using the reward function $r(x, y)$, and only correct solutions are retained to form a new dataset $\mathcal{D}^{t} = \{(x, y)\}_{j=1}^{|\mathcal{D}^{t}|}$.

In the learning step of iteration $t$, the new dataset from the exploration step is used to fine-tune the actor reasoning model $\pi_\theta$. To mitigate overfitting, we follow recent work [46] and always fine-tune the original model $\pi_\theta^{0}$ rather than the model from the previous step, $\pi_\theta^{t-1}$. The training loss is as in Equation 2, and we also include the original reasoning set $\mathcal{D}_{\text{reason}}$ and refinement set $\mathcal{D}_{\text{refine}}$ [8]. After the learning step, a new dataset of higher-quality samples can be created once again [47].

Limitations of vanilla self-improvement.

In self-improvement, the key challenge lies in identifying diverse correct responses for each query during the exploration step [42, 44]. However, previous studies have highlighted a problem known as tail narrowing [34]: models tend to over-sample solutions for simpler queries while under-sampling solutions for harder queries. This results in a training set for the next iteration that contains many solutions for simple problems but few for more challenging ones, introducing sampling bias. As iterations progress, this bias deepens, leading to a long-tail distribution in which solutions for harder queries are almost entirely absent. This ultimately causes the model to reach a performance plateau or even degrade [34].

5.2 Critique-in-the-loop Self-improvement

Algorithm 1: Critique-in-the-loop Self-Improvement

Input: Initialized actor reasoning model $\pi_\theta$, initialized critique model $\pi_\phi$, reasoning dataset $\mathcal{D}_{\text{reason}}$, refinement dataset $\mathcal{D}_{\text{refine}}$, critique dataset $\mathcal{D}_{\text{critique}}$, oracle reward function $r$, number of self-improvement iterations $T$, sampling number for exploration $N$, sampling number for critique generation $L$.

Procedure: Fine-tune the critique model and the actor reasoning model
  Minimize the following loss to obtain the critique model $\pi_\phi$:
    $\mathcal{L}_{\text{critique}}(\phi) = \mathbb{E}_{(x,y,c)\sim\mathcal{D}_{\text{critique}}}\big[\log \pi_\phi(c \mid x, y)\big]$;
  Minimize the following loss to obtain the actor model $\pi_\theta^{\text{base}}$:
    $\mathcal{L}_{\text{actor}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{reason}}}\big[\log \pi_\theta(y \mid x)\big] + \beta \cdot \mathbb{E}_{(x,y,c,y')\sim\mathcal{D}_{\text{refine}}}\big[\log \pi_\theta(y' \mid x, y, c)\big]$;

Procedure: Exploration and Learning with Critique Supervision
  $\pi_\theta^{0} \leftarrow \pi_\theta^{\text{base}}$;
  for iteration $t = 1$ to $T$ do
    Procedure: Exploration Step
      $\mathcal{D}^{t} \leftarrow \varnothing$;
      // Sample $N$ solutions from the reasoning model and collect correct responses.
      for sample num $n = 1$ to $N$ do
        $\mathcal{D}^{t,n} = \{(x_i, y_i) \mid x_i \sim \mathcal{D}_{\text{reason}},\ y_i \sim \pi_\theta^{t-1}(y \mid x_i)\}$;
        $\mathcal{D}^{t} \leftarrow \mathcal{D}^{t} \cup \mathcal{D}^{t,n}$;
      end for
      Apply $r$ to $\mathcal{D}^{t}$ to obtain correct responses $\mathcal{D}^{t}_{\text{correct}}$ and incorrect responses $\mathcal{D}^{t}_{\text{incorrect}}$;
      // Generate critiques and refinements for incorrect responses.
      $\mathcal{D}^{t}_{\text{critique}}, \mathcal{D}^{t}_{\text{refine}} \leftarrow \varnothing$;
      for critique generation num $l = 1$ to $L$ do
        $\mathcal{D}^{t,l}_{\text{critique}} = \{(x_i, y_i, c_i) \mid (x_i, y_i) \sim \mathcal{D}^{t}_{\text{incorrect}},\ c_i \sim \pi_\phi(c \mid x_i, y_i)\}$;
        $\mathcal{D}^{t}_{\text{critique}} \leftarrow \mathcal{D}^{t}_{\text{critique}} \cup \mathcal{D}^{t,l}_{\text{critique}}$;
        $\mathcal{D}^{t,l}_{\text{refine}} = \{(x_i, y_i, c_i, y'_i) \mid (x_i, y_i, c_i) \sim \mathcal{D}^{t,l}_{\text{critique}},\ y'_i \sim \pi_\theta^{t-1}(y \mid x_i, y_i, c_i)\}$;
        $\mathcal{D}^{t}_{\text{refine}} \leftarrow \mathcal{D}^{t}_{\text{refine}} \cup \mathcal{D}^{t,l}_{\text{refine}}$;
      end for
      // Combine correct original solutions and correct refined solutions.
      Apply $r$ to $\mathcal{D}^{t}_{\text{refine}}$ to obtain the correct refinement set $\mathcal{D}^{t}_{\text{correct\_refine}}$;
      $\mathcal{D}^{t}_{\text{correct}} \leftarrow \mathcal{D}^{t}_{\text{correct}} \cup \{(x_i, y'_i) \mid (x_i, y'_i) \sim \mathcal{D}^{t}_{\text{correct\_refine}}\}$;
    Procedure: Learning Step
      $\mathcal{D}^{t}_{\text{train}} = \mathcal{D}_{\text{reason}} \cup \mathcal{D}^{t}_{\text{correct}}$;
      Minimize the following loss to obtain $\pi_\theta^{t}$:
        $\mathcal{L}_{\text{actor}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}^{t}_{\text{train}}}\big[\log \pi_\theta(y \mid x)\big] + \beta \cdot \mathbb{E}_{(x,y,c,y')\sim\mathcal{D}_{\text{refine}}}\big[\log \pi_\theta(y' \mid x, y, c)\big]$;
  end for

Introducing critique models for high-coverage exploration.

Motivated by our prior findings in Section 4.2 that critique models enable actors to achieve greater performance gains on harder queries, we introduce critique models to the self-improvement process and propose a critique-in-the-loop self-improvement approach.

This method is built upon self-improvement. During the exploration step of iteration $t$, the critique model $\pi_\phi$ is instructed to provide feedback on the responses of the actor $\pi_\theta^{t-1}$, and the actor is then prompted to perform refinements accordingly. Correct refinements are added to the training set. Since we can assume the availability of an oracle reward function for the training set, critiques are only applied to incorrect responses, while correct responses are directly included in the dataset. This design minimizes the risk of low-quality critiques negatively affecting originally correct responses. In this way, we increase the coverage of solutions for harder queries and significantly reduce the tail-narrowing problem [34].

Difficulty-aware computation allocation for exploration.

Furthermore, building on our previous findings in Section 4.2 that scaling inference-time computation can improve the efficiency and quality of exploration, we allocate more computation for exploration and critique to harder problems. This involves performing additional response generation, critique, and refinement to obtain high-quality and diverse solutions.

In practice, we employ a simple difficulty-based computation allocation strategy, as it has proven sufficiently effective. (Note that more complex strategies, such as distinguishing difficulty based on the accuracy observed after multiple samples, are expected to yield even better results.) Initial responses that are incorrect are classified as difficult and allocated $L$ rounds of critique and refinement. Correct initial responses are considered simple and are directly added to the training set without further critique or refinement. This approach further mitigates the long-tail issue of self-improvement, enhances sampling quality, and improves overall performance [34].

We summarize the critique-in-the-loop self-improvement method in Algorithm 1.
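To complement the pseudocode, below is a compressed Python sketch of the exploration step with difficulty-aware allocation, assuming hypothetical `actor`, `critic`, `refine`, and `reward` callables; for brevity it samples one initial response per query (the algorithm samples N) and omits the learning step. The `num_critique_rounds` parameter plays the role of L.

```python
from typing import Callable, List, Tuple

def explore_with_allocation(queries: List[str],
                            actor: Callable[[str], str],
                            critic: Callable[[str, str], str],
                            refine: Callable[[str, str, str], str],
                            reward: Callable[[str, str], bool],
                            num_critique_rounds: int = 4) -> List[Tuple[str, str]]:
    """One exploration pass: queries answered correctly on the first try are kept as-is;
    queries answered incorrectly receive several rounds of critique and refinement, and
    only refinements marked correct by the reward function join the training set."""
    train_set: List[Tuple[str, str]] = []
    for x in queries:
        y = actor(x)
        if reward(x, y):                         # easy query: no extra compute
            train_set.append((x, y))
            continue
        for _ in range(num_critique_rounds):     # hard query: spend L rounds of compute
            c = critic(x, y)
            y_new = refine(x, y, c)
            if reward(x, y_new):
                train_set.append((x, y_new))
    return train_set
```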

5.3 Experimental Results and Findings

Implementation details.

We run the self-improvement process for 3 iterations since, as in previous works [48], the model's performance tends to saturate after 3 iterations of exploration and learning. During the exploration stage, we set the temperature to 0.7 and the number of samples to 5 or 10. During the learning stage, we set the learning rate to 2e-5 and the number of epochs to 1.

Critique-in-the-loop self-improvement consistently improves reasoning performance.
[Figure 6]

The evaluation results of our method are shown in Figure 6. We observe that: (1) Increasing the number of samples during exploration improves performance, with the performance upper bound rising accordingly, underscoring the benefits of additional exploration computation. (2) Our method consistently outperforms vanilla self-improvement with stable and significant performance gains, especially when the sample number $N$ is larger. For example, when $N = 10$, our method achieves a performance advantage of 11.1% on both GSM8K and MATH. (3) While the vanilla method initially shows performance improvements during self-improvement, it quickly reaches a bottleneck or even starts to decline, which may be attributed to the tail-narrowing issue [34]. In contrast, our method demonstrates consistent improvement, with performance saturation occurring much later, indicating its effectiveness.

Critique-in-the-loop self-improvement balances the solution distribution across difficulty levels, and enhances performance on challenging queries in the test set.

Since our motivation for introducing critique-based supervision into training is to improve the efficiency and quality of exploration, we examine the distribution of solutions sampled by our method compared to vanilla self-improvement. As shown in Figure 7, we find that our approach samples a higher proportion of solutions for challenging queries during the exploration stage. This significantly balances the training data distribution for the learning stage, effectively mitigating the tail-narrowing issue. In Figure 8, we also present the model's performance on the test set across different difficulty levels, and we observe that our method performs significantly better than the vanilla approach on harder problems, further demonstrating the potential of our approach.

[Figure 7]
[Figure 8]
Combining test-time supervision with training-time supervision yields further performance gains.

Table 3: Results of combining training-time and test-time critique supervision on GSM8K and MATH (best results in bold).

| Training-time | Test-time | GSM8K Acc | GSM8K Pass@5 | GSM8K MV@5 | MATH Acc | MATH Pass@5 | MATH MV@5 |
|---|---|---|---|---|---|---|---|
| Supervised Fine-tuning | response only | 54.8 | 75.2 | 54.5 | 17.2 | 35.0 | 15.6 |
| Supervised Fine-tuning | w/ critique model | 63.3 | 87.6 | 75.4 | 24.3 | 47.4 | 30.7 |
| Self-Correction Fine-tuning | response only | 54.2 | 73.1 | 53.4 | 18.1 | 32.4 | 16.6 |
| Self-Correction Fine-tuning | self-correction | 60.1 | 81.5 | 67.2 | 24.2 | 41.7 | 26.1 |
| Vanilla Self-Improve | response only | 64.6 | 83.4 | 70.6 | 20.2 | 38.5 | 23.0 |
| Vanilla Self-Improve | w/ critique model | 70.2 | 90.8 | 78.2 | 27.0 | 48.8 | 31.4 |
| Critique-in-the-loop Self-Improve | response only | 75.5 | 89.1 | 80.1 | 31.3 | 51.0 | 35.1 |
| Critique-in-the-loop Self-Improve | w/ critique model | **75.8** | **91.8** | **82.8** | **31.4** | **53.1** | **36.8** |

Previously, we evaluated the impact of incorporating critique model supervision at training time and test time separately. Here, we combine them and evaluate the performance. Additionally, we include a self-correction baseline in which the model refines its reasoning by itself. The training data for this baseline consists of the original reasoning datasets (GSM8K and MATH) plus correction data derived from our refinement data by removing the critique elements and reformatting each example into a (query, original response, new response) triplet.
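As a rough illustration of this reformatting, the correction data can be derived from refinement records as follows; the field names are assumptions for illustration, not the dataset's actual schema.

```python
# A minimal sketch: drop the critique text and keep (query, original response, new response) triplets
# for the self-correction baseline. The record field names are illustrative assumptions.
def to_self_correction_triplets(refinement_records):
    """refinement_records: dicts with 'query', 'response', 'critique', and 'refinement' fields."""
    return [
        {
            "query": rec["query"],
            "original_response": rec["response"],   # the flawed first attempt
            "new_response": rec["refinement"],      # the corrected solution, critique removed
        }
        for rec in refinement_records
    ]
```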

Evaluation results shown in Table 3 reveal that: (1) Integrating critique models at test-time consistently enhances performance under identical training conditions, particularly when critique supervision is not used during training. For example, applying critique models at test-time increases the MV@5 performance of SFT on GSM8K and MATH by 10.9 and 15.1 points, respectively. (2) When critique models are used during training, the additional benefit of test-time critique supervision becomes marginal, suggesting a successful "distillation" of the critique model into the actor during training. (3) The self-correction baseline underperforms the use of a separate critique model, aligning with findings in prior work that models struggle to accurately evaluate and refine their outputs without external feedback [24, 19]. Moreover, training a single model to handle both reasoning and correction may introduce conflicts, leading to performance degradation [24]. (4) Compared with the traditional strategy of vanilla self-improvement + response-only, which increases computation during training, the approach of supervised fine-tuning + test-time critique supervision reduces training computation while increasing test-time computation, and achieves better performance, particularly on the more challenging MATH dataset. This aligns with prior work highlighting the benefits of enhancing test-time computation [15, 32, 33].

Ablation study.

Table 4: Ablation results on difficulty-aware computation allocation and critique diversity (best results in bold).

| Method | GSM8K | MATH |
|---|---|---|
| **N = 3** | | |
| Ours | **69.1** | **25.5** |
| w/o Difficulty Aware | 65.8 | 23.9 |
| More Refinement | 68.8 | 25.5 |
| **N = 5** | | |
| Ours | **72.3** | **28.0** |
| w/o Difficulty Aware | 69.4 | 26.6 |
| More Refinement | 71.2 | 27.7 |
| **N = 10** | | |
| Ours | **75.4** | 31.3 |
| w/o Difficulty Aware | 70.1 | 27.9 |
| More Refinement | 73.8 | **31.6** |

We evaluate the impact of the difficulty-aware computation allocation strategy for exploration, as well as the performance difference when allocating more computation to critique generation vs. refinement generation. The experimental results are presented in Table 4, and we observe that: (1) removing the difficulty-aware allocation of exploration computation causes performance to drop significantly; (2) using the extra computation to generate multiple refinements for the same critique, instead of generating diverse critiques, also decreases performance. Both difficulty-aware computation allocation and diversified critiques are therefore crucial components for achieving the final performance.
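The two exploration variants compared here can be sketched as follows (the interfaces are illustrative assumptions): "Ours" samples several diverse critiques with one refinement each, whereas "More Refinement" samples one critique and several refinements of it.

```python
# Illustrative sketch of the two allocation variants in Table 4 (assumed interfaces).
def diverse_critiques(query, response, critique_model, actor, budget):
    # One refinement per freshly sampled critique (our default).
    return [actor.refine(query, response, critique_model.critique(query, response))
            for _ in range(budget)]

def more_refinements(query, response, critique_model, actor, budget):
    # A single critique, refined multiple times (the "More Refinement" variant).
    critique = critique_model.critique(query, response)
    return [actor.refine(query, response, critique) for _ in range(budget)]
```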

6 Discussion and Analysis

6.1 Scaling Properties of Critique Models

[Figure 9]

As in previous work [35], we study the scaling properties of critique models, investigating whether they can supervise models of different scales, particularly models larger and stronger than themselves. In this study, we conduct experiments using the Qwen-2.5 series of models [49], which spans a wide range of scales (1.5B, 3B, 7B, and 14B). We train a 3B critique model and use it to supervise trained actor reasoning models of all sizes. Other experimental settings are consistent with Section 4.2.

The evaluation results are shown in Figure 9. In the figure, "oracle" indicates whether an oracle reward function is available to assist the critique model in making judgments. With an oracle reward function, only incorrect responses are passed to the critique model; otherwise, all responses are passed to it. From the results, we can observe that: (1) Regardless of scale, the 3B critique model provides effective supervision, indicating that smaller critique models can help supervise larger actors to a certain extent. (2) With the oracle reward function, the critique model does not need to perform the discriminative task and only needs to provide useful feedback, resulting in greater performance improvements. (3) As the actor scale increases, the performance improvement provided by the critique model on simpler datasets like GSM8K becomes marginal. However, on the more challenging MATH dataset, the critique model continues to deliver significant performance gains even for the largest model.
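The "oracle" gating described above can be summarized by a small routing sketch; the `oracle_check` callable is an assumption used only for illustration.

```python
# With an oracle reward function, only incorrect responses are routed to the critique model;
# without one, every response is critiqued.
def select_for_critique(samples, oracle_check=None):
    """samples: list of (query, response) pairs; oracle_check(query, response) -> bool, or None."""
    if oracle_check is None:
        return samples                                               # no oracle: critique everything
    return [(q, r) for q, r in samples if not oracle_check(q, r)]    # oracle: critique only failures
```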

6.2 How do Critique Models improve Majority Voting?

[Figure 10]

Majority voting is one of the most commonly used techniques for scaling test-time computation. Following previous work [32], we study the relationship between the correct-answer frequency among multiple samples and the performance of majority voting, while also examining the impact of critique models. Specifically, consistent with the settings in Section 4.2, we use an actor reasoning model and a critique model trained with supervised fine-tuning. For each query, we sample 1,000 responses in parallel. The experimental results are shown in Figure 10.

We observe that critique models improve both the overall correct-answer frequency and the performance of majority voting. Delving deeper, we find a significant failure mode when critique models are not used: the correct answer appears with a relatively high frequency (e.g., 40%), but a specific incorrect answer dominates in frequency (e.g., 45%), causing majority voting to select the incorrect result. By incorporating critique models, this failure mode is effectively addressed through critique and refinement. Specifically, the discriminative ability of critique models helps suppress the occurrence of high-frequency incorrect answers, while the feedback provided by these models increases the relative frequency of correct answers. Together, these mechanisms contribute to a substantial improvement in the performance of majority voting.
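A toy example makes this failure mode concrete; the vote counts below are invented purely for illustration and are not taken from our experiments.

```python
from collections import Counter

# Before critique: the correct answer "42" is frequent (40/100), but the wrong answer "48"
# is slightly more frequent (45/100), so plain majority voting fails.
votes = Counter({"42": 40, "48": 45, "44": 15})
print(votes.most_common(1)[0][0])          # -> "48" (incorrect)

# After critique and refinement, some "48" samples are revised to "42", flipping the vote.
refined_votes = Counter({"42": 61, "48": 24, "44": 15})
print(refined_votes.most_common(1)[0][0])  # -> "42" (correct)
```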

6.3 Should test-time computation be scaled sequentially or in parallel?

[Figure 11]
[Figure 12]

In the two-player paradigm, test-time computation can be scaled either in parallel, by sampling multiple (response, critique, refinement) triplets [22, 50, 15], or sequentially, by generating critiques and refinements iteratively after an initial response [14, 16, 15]. Here, we explore the performance of these two approaches. Notably, in our implementation of the sequential approach, to avoid potential issues with context-window length limits, we use the following strategy: given a query x, the actor first generates a response y_0. For the i-th critique task (i > 0), only the query and the (i-1)-th response or refinement are provided. Similarly, for the i-th refinement task (i > 0), only the query, the (i-1)-th response or refinement, and the i-th critique are provided.
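A hedged sketch of this sequential strategy is given below; the `actor` and `critique_model` interfaces are illustrative assumptions.

```python
# Sequential test-time scaling with truncated context: each critique sees only the query and the
# previous response/refinement; each refinement sees only the query, the previous output, and the
# latest critique.
def sequential_scale(query, actor, critique_model, num_steps):
    response = actor.generate(query)            # y_0: the only from-scratch attempt
    trajectory = [response]
    for _ in range(num_steps):
        critique = critique_model.critique(query, response)
        response = actor.refine(query, response, critique)
        trajectory.append(response)
    return trajectory
```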

Figure 11 illustrates the Pass@K performance trends. Pass@K is a metric that measures whether at least one correct answer exists among K samples. As computation increases, Pass@K performance improves, though the gains become progressively more marginal. The performance of sequential computation scaling is slightly worse than that of parallel computation scaling. This may stem from the fact that in the parallel approach, the actor has more opportunities to generate original solutions directly from the queries, leading to greater sampling diversity and a higher chance of obtaining at least one correct answer. In contrast, the sequential approach gives the actor only one opportunity for original reasoning, with subsequent revisions relying on critical feedback, which may reduce diversity and limit the chances of finding correct solutions.
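For clarity, Pass@K can be computed empirically as follows; the per-query correctness lists are an assumed input format rather than our evaluation code.

```python
# Empirical Pass@K: a query counts as solved if at least one of its first K samples is correct.
def pass_at_k(correctness_by_query, k):
    """correctness_by_query: dict mapping query -> list of booleans (one per sampled answer)."""
    solved = sum(1 for flags in correctness_by_query.values() if any(flags[:k]))
    return solved / len(correctness_by_query)
```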

In Figure 12, we present the performance trends of three test-time techniques as computation increases: parallel majority voting, sequential majority voting, and sequential final. Here, sequential majority voting selects the most frequent answer among all the sequentially generated responses and refinements, while sequential final selects the answer produced after the K-th refinement. From the experimental results, we observe that: (1) Overall, selecting the final answer of a sequence of revisions performs worse than majority voting. This may be because the sequential critique-and-refinement process occasionally modifies a previously correct answer, resulting in an incorrect final result. (2) For smaller K, parallel majority voting outperforms sequential majority voting. However, as computation scales up, sequential majority voting surpasses its parallel counterpart, especially on the more challenging MATH dataset. This indicates a trade-off between the parallel and sequential approaches [15] and provides inspiration for our critique-in-the-loop self-improvement: depending on the computation budget, the exploration strategy can be adapted to balance solution quality and diversity, ultimately leading to stronger actor reasoning models. We leave this study to future work.
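The three selection rules can be sketched as follows, assuming parallel runs yield K independent final answers and a sequential run yields an ordered chain of answers (the initial response plus each refinement).

```python
from collections import Counter

def parallel_majority_vote(answers):        # K independently sampled answers
    return Counter(answers).most_common(1)[0][0]

def sequential_majority_vote(chain):        # all answers produced along one critique-refine chain
    return Counter(chain).most_common(1)[0][0]

def sequential_final(chain):                # simply keep the answer after the last refinement
    return chain[-1]
```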

7 A Step Further: Training Step-level Self-Talk Reasoning Models via Critique Data

Motivation and method.

In this work, we focus on the two-player paradigm, leveraging critique models to provide step-level supervision and feedback for actor models. Recently, OpenAI's o1 model [13] has pushed the boundaries of large reasoning models' capabilities. With its self-talk output format, it can autonomously plan, reflect, critique, correct, backtrack, and more during the thinking process, marked by phrases such as "wait" and "alternatively". We therefore investigate whether it is possible to construct self-talk data with step-level critique supervision, and propose a preliminary self-talk-via-critique method. Specifically, it has three main steps:

  1. Construct an initial thinking chain with step-level reflection. Given a query and a reasoning path, we first use AutoMathCritique to generate critique data. Feedback on each reasoning step provided in the critique is then inserted into the reasoning path, constructing a thinking chain that includes step-level reflections. At this stage, the thought process may lack smoothness but has an initial structure.

  2. Iteratively refine and critique the thinking chain. Reasoning paths without errors are passed directly to the next stage. For paths containing errors, the actor performs refinement from the first identified erroneous step, continuing the reasoning from that point onward. Starting from the first refined step, we use the critique model to re-critique the partial reasoning process, spanning from the refined step to the final step. Step-level feedback from this critique is again integrated into the thought process. If the critique model identifies new errors in the refined reasoning steps, the process is repeated iteratively; the reasoning path is continuously optimized until all errors are resolved by reflection and refinement. Only then is the thinking chain passed to the next stage.

  3. Smooth the thinking chain into self-talk data. Next, we prompt the LLMs to smooth out the previously rigid thinking chain, adding transitional phrases and connectors so that the reasoning and reflection flow more naturally. Finally, we verify the correctness of the final answer, ensuring that only accurate data is stored.

An overview of the method is shown in Figure 13. An illustrative example of the resulting self-talk data can be found in Figure 14.
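A hedged sketch of the three-stage pipeline is given below; every interface (`actor`, `critique_model`, `smoother`, `interleave`, `answer_is_correct`) is an illustrative assumption rather than the released implementation.

```python
def build_self_talk_example(query, path, actor, critique_model, smoother,
                            interleave, answer_is_correct, max_rounds=5):
    # Stage 1: critique the reasoning path and insert step-level feedback into it.
    critique = critique_model.critique(query, path)
    chain = interleave(path, critique)

    # Stage 2: while errors remain, refine from the first erroneous step, re-critique the partial
    # path from that step onward, and fold the new feedback back into the chain.
    rounds = 0
    while critique.first_error is not None and rounds < max_rounds:
        path = actor.refine_from(query, path, critique, step=critique.first_error)
        critique = critique_model.critique(query, path, start_step=critique.first_error)
        chain = interleave(path, critique)
        rounds += 1

    # Stage 3: smooth the rigid chain into natural self-talk; keep it only if the final answer is correct.
    self_talk = smoother(query, chain)
    return self_talk if answer_is_correct(query, self_talk) else None
```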

[Figure 13]
Evaluation and findings.

Based on this approach, we construct a dataset of 4k self-talk examples from the MATH training set and fine-tune the model for evaluation. As in the previous section, we use the Llama3-8B base model as the backbone for our experiments. We compare our method with the self-correction baseline and with SFT plus test-time critique supervision; these two baselines fall under the one-player and two-player settings, respectively. Note that for the two baselines, we use only the MATH dataset for training, without GSM8K data. The experimental results are shown in Table 5. We observe that, in the one-player setting, the step-level self-talk approach outperforms trajectory-level self-correction by a significant margin, demonstrating its potential. However, it still lags behind the two-player setting, indicating that this direction requires further exploration, which we leave to future work.

[Figure 14]

Table 5: Evaluation of step-level self-talk against one-player and two-player baselines.

| Method | Acc | Pass@5 | MV@5 | Pass@10 | MV@10 |
|---|---|---|---|---|---|
| *One-player setting* | | | | | |
| Trajectory-level Self-Correction | 21.08 | 38.12 | 20.56 | 46.64 | 30.90 |
| Step-level Self-talk | 24.90 | 45.48 | 28.44 | 54.48 | 31.28 |
| *Two-player setting* | | | | | |
| SFT w/ test-time critic | 27.22 | 49.16 | 32.24 | 58.86 | 35.92 |

8 Related Work

Training LLMs for reasoning through exploration and learning.

Multi-step reasoning, such as mathematical and logical reasoning, is a challenging task for large language models (LLMs). Researchers have proposed prompting methods, exemplified by Chain-of-Thought (CoT), to enable LLMs to think and reason step by step like humans and then produce answers based on the reasoning process, significantly improving reasoning performance [6, 41]. To enhance the reasoning ability of models, previous work has focused on collecting large amounts of expert-labeled reasoning trajectories, allowing models to mimic step-by-step reasoning [51]. However, these methods are often difficult to scale up, as annotation is highly expensive, especially for very challenging and complex problems [52].

Another category of methods, exploration and learning, seeks to address this issue by using model-generated data to train the model itself. Specifically, given a query, the model generates its own reasoning paths, and external supervision signals are used to filter out high-quality solutions, which are then used to train the model [46, 42, 3, 53, 54]. This approach, also known as self-improvement or rejection sampling, often encounters the tail-narrowing problem, which can lead to performance bottlenecks [55, 56, 34]. Some researchers have proposed reinforcement learning-based approaches, where reward models are trained or oracle reward functions are used to provide supervision signals, enabling the model to explore and learn, thereby significantly improving reasoning performance [57, 10, 58, 11, 30]. However, reinforcement learning typically converges slowly, is costly, and poses challenges in providing reliable and dense process-level supervision signals.

In this work, we fine-tune critique models to provide reliable step-level supervision signals and helpful feedback during both training and test time. This approach improves sampling efficiency and quality, ultimately enhancing the actor’s reasoning performance.

Developing models for critique, reflection and correction.

Developing models with the ability to critique, reflect, and correct is an important route toward scalable supervision and has been explored in various domains, such as summarization, mathematical reasoning, sequential decision-making, and coding [23, 59, 18, 7, 60]. Most previous work has used prompting techniques to have models generate critical comments or corrections about their own outputs [28, 23, 61]. However, these methods typically perform poorly unless an oracle reward function is assumed, because models struggle to assess their outputs correctly in the absence of external feedback, especially when the problems are more challenging [24, 19]. As a result, many fine-tuning or RL approaches have been proposed to train models to develop these capabilities [62, 63, 64, 27, 65, 66]; the former often requires extensive human annotation, while the latter necessitates engineered and cumbersome reward designs. Another line of work leverages self-training to develop self-correction capabilities, e.g., [18, 67, 17, 16]. In contrast to these methods, in this paper we delve into a two-player framework (e.g., [25, 26, 68]), distinguishing the roles of the critique model and the actor model. We propose a scalable data synthesis framework, AutoMathCritique, to generate critique data and train critique models. The trained critique models provide supervision that yields stable performance gains for the actor model, both at test time and training time.

Scaling test-time computation for LLM Reasoning.

Recent studies have shown that scaling up computation during test-time/inference-time can effectively improve a model's reasoning performance, as exemplified by OpenAI's o1 [13, 32, 15, 33]. These studies typically increase inference computation by extending the model's thinking chains or employing other techniques such as majority voting [41], Best-of-N with reward models [35, 65, 69], and tree search [70, 71, 39]. Some other works train correction models to allocate more computation toward sequential corrections during test-time, thereby enhancing the model's final performance [20, 60, 72]. In this paper, we take a different perspective by training additional critique models and delegating the responsibility for refinement to the actor itself. We investigate the significant performance gains achieved by leveraging critique models' supervision during test-time, especially for challenging problems. Motivated by these key findings during test-time, we incorporate this supervision into the exploration phase of the self-improvement process, leading to the training of stronger models.

9 Conclusion and Future Work

In this work, we take a preliminary step toward exploring how to construct high-quality and diverse training data to train critique models capable of providing step-level supervision and effective feedback without additional human annotation. By introducing critique-based supervision at test time, we demonstrate that critique models can significantly enhance the performance of reasoning models, particularly on challenging tasks. When inference computation is scaled up, critique models also yield continuous improvements and raise the capability ceiling of reasoning models. Building on these test-time findings, we integrate critique model supervision into the self-training process of reasoning models, proposing critique-in-the-loop self-improvement and validating its effectiveness through extensive experiments. Lastly, we propose constructing step-level self-talk data based on critique data and showcase its potential.

Nevertheless, limitations remain and several directions merit future work. Specifically: (1) Our method of building critique models primarily involves data construction and fine-tuning; future work should optimize critique models from more perspectives, such as employing more advanced algorithms. Additionally, due to resource constraints, we are unable to conduct extensive experiments with larger-scale critique models, which may deliver superior performance and provide more reliable supervision signals. (2) More advanced test-time scaling techniques are needed to improve efficiency and reliability and to reduce hallucinations in long thinking chains. (3) In Section 7, we show the potential of self-talk models; in the future, we expect to further optimize them to enhance their performance. (4) Our work can be extended or applied to other application domains, enhancing model capabilities through reliable supervision and helpful feedback, including scientific research [73, 74], software engineering [75, 76], and agentic tasks [77, 78, 47]. However, as we broaden the scope of applications, substantial efforts will be required to ensure safety and robustness.

Acknowledgements

The authors would like to thank Huawei Ascend Cloud Ecological Development Project for the support of Ascend 910 processors.

References

  • [1]Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, PaulF. Christiano, Jan Leike, and Ryan Lowe.Training language models to follow instructions with human feedback.In Sanmi Koyejo, S.Mohamed, A.Agarwal, Danielle Belgrave, K.Cho, and A.Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • [2]OpenAI.GPT-4 technical report.CoRR, abs/2303.08774, 2023.
  • [3]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov,and Thomas Scialom.Llama 2: Open foundation and fine-tuned chat models.CoRR, abs/2307.09288, 2023.
  • [4]AlbertQ. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego deLasCasas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, LélioRenard Lavaud, Marie-Anne Lachaux, Pierre Stock, TevenLe Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed.Mistral 7b.CoRR, abs/2310.06825, 2023.
  • [5]Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, CristianCanton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, EricMichael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, GeorgiaLewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, HuXu, HugoTouvron, Iliyan Zarov, ImanolArrieta Ibarra, IsabelM. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer vander Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, KalyanVasuden Alwala, Kartikeya Upasani, Kate Plawiak, KeLi, Kenneth Heafield, Kevin Stone, and etal.The llama 3 herd of models.CoRR, abs/2407.21783, 2024.
  • [6]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, EdH. Chi, QuocV. Le, and Denny Zhou.Chain-of-thought prompting elicits reasoning in large language models.In Sanmi Koyejo, S.Mohamed, A.Agarwal, Danielle Belgrave, K.Cho, and A.Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • [7]Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Jia Liu, Tao Gui, QiZhang, and Xuanjing Huang.Self-polish: Enhance reasoning in large language models via problem refinement.In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 11383–11406. Association for Computational Linguistics, 2023.
  • [8]Çaglar Gülçehre, TomLe Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando deFreitas.Reinforced self-training (rest) for language modeling.CoRR, abs/2308.08998, 2023.
  • [9]Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan.Tree of thoughts: Deliberate problem solving with large language models.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [10]Haipeng Luo, Qingfeng Sun, Can Xu, PuZhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang.Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.CoRR, abs/2308.09583, 2023.
  • [11]Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, QiZhang, and Xuanjing Huang.Training large language models for reasoning through reverse curriculum reinforcement learning.CoRR, abs/2402.05808, 2024.
  • [12]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, KarthikR. Narasimhan, and Yuan Cao.React: Synergizing reasoning and acting in language models.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  • [13]OpenAI.Learning to reason with llms, 9 2024.
  • [14]Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao.Reflexion: language agents with verbal reinforcement learning.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [15]Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar.Scaling LLM test-time compute optimally can be more effective than scaling model parameters.CoRR, abs/2408.03314, 2024.
  • [16]Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar.Recursive introspection: Teaching language model agents how to self-improve.CoRR, abs/2407.18219, 2024.
  • [17]Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, YiSu, JohnD. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, LeiM. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal M.P. Behbahani, and Aleksandra Faust.Training language models to self-correct via reinforcement learning.CoRR, abs/2409.12917, 2024.
  • [18]Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi.Generating sequences by learning to self-correct.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  • [19]Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Wang.Pride and prejudice: LLM amplifies self-bias in self-refinement.In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15474–15492. Association for Computational Linguistics, 2024.
  • [20]Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, and Minjoon Seo.Selfee: Iterative self-revising llm empowered by self-feedback generation.Blog post, May 2023.
  • [21]Geunwoo Kim, Pierre Baldi, and Stephen McAleer.Language models can solve computer tasks.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [22]William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike.Self-critiquing models for assisting human evaluators.CoRR, abs/2206.05802, 2022.
  • [23]Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, BodhisattwaPrasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark.Self-refine: Iterative refinement with self-feedback.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [24]Jie Huang, Xinyun Chen, Swaroop Mishra, HuaixiuSteven Zheng, AdamsWei Yu, Xinying Song, and Denny Zhou.Large language models cannot self-correct reasoning yet.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
  • [25]AfraFeyza Akyürek, Ekin Akyürek, Ashwin Kalyan, Peter Clark, DerryTanti Wijaya, and Niket Tandon.RL4F: generating natural language feedback with reinforcement learning for repairing model outputs.In Anna Rogers, JordanL. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 7716–7733. Association for Computational Linguistics, 2023.
  • [26]Weiran Yao, Shelby Heinecke, JuanCarlos Niebles, Zhiwei Liu, Yihao Feng, LeXue, RitheshR. N., Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese.Retroformer: Retrospective large language agents with policy gradient optimization.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
  • [27]Alexander Havrilla, SharathChandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu.Glore: When, where, and how to improve LLM reasoning via global and local refinements.In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
  • [28]Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, etal.Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022.
  • [29]SamuelR. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamile Lukosiute, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, SheerEl Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, and Jared Kaplan.Measuring progress on scalable oversight for large language models.CoRR, abs/2211.03540, 2022.
  • [30]Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, YuWu, and Zhifang Sui.Math-shepherd: Verify and reinforce llms step-by-step without human annotations.In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 9426–9439. Association for Computational Linguistics, 2024.
  • [31]Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar.Rewarding progress: Scaling automated process verifiers for LLM reasoning.CoRR, abs/2410.08146, 2024.
  • [32]Bradley C.A. Brown, Jordan Juravsky, RyanSaul Ehrlich, Ronald Clark, QuocV. Le, Christopher Ré, and Azalia Mirhoseini.Large language monkeys: Scaling inference compute with repeated sampling.CoRR, abs/2407.21787, 2024.
  • [33]Hritik Bansal, Arian Hosseini, Rishabh Agarwal, VinhQ. Tran, and Mehran Kazemi.Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling.CoRR, abs/2408.16737, 2024.
  • [34]Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, QiZhang, and Xuanjing Huang.Mitigating tail narrowing in llm self-improvement via socratic-guided sampling.arXiv preprint arXiv:2411.00750, 2024.
  • [35]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021.
  • [36]Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt.Measuring mathematical problem solving with the MATH dataset.In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
  • [37]Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamile Lukosiute, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, SamuelR. Bowman, and Ethan Perez.Measuring faithfulness in chain-of-thought reasoning.CoRR, abs/2307.13702, 2023.
  • [38]Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng.Evaluating mathematical reasoning of large language models: A focus on error identification and correction.In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11316–11360. Association for Computational Linguistics, 2024.
  • [39]Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang.Rest-mcts*: LLM self-training via process reward guided tree search.CoRR, abs/2406.03816, 2024.
  • [40]Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He.Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving.CoRR, abs/2407.13690, 2024.
  • [41]Xuezhi Wang, Jason Wei, Dale Schuurmans, QuocV. Le, EdH. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou.Self-consistency improves chain of thought reasoning in language models.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  • [42]Jiaxin Huang, Shixiang Gu, LeHou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han.Large language models can self-improve.In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1051–1068. Association for Computational Linguistics, 2023.
  • [43]Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, FelixX. Yu, and Sanjiv Kumar.Rest meets react: Self-improvement for multi-step reasoning LLM agent.CoRR, abs/2312.10003, 2023.
  • [44]YeTian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu.Toward self-improvement of llms via imagination, searching, and criticizing.arXiv preprint arXiv:2404.12253, 2024.
  • [45]Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng.Re-rest: Reflection-reinforced self-training for language agents.In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 15394–15411. Association for Computational Linguistics, 2024.
  • [46]Eric Zelikman, Yuhuai Wu, Jesse Mu, and NoahD. Goodman.Star: Bootstrapping reasoning with reasoning.In Sanmi Koyejo, S.Mohamed, A.Agarwal, Danielle Belgrave, K.Cho, and A.Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • [47]Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, LuChen, Rui Zheng, Yicheng Zou, Tao Gui, QiZhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang.Agentgym: Evolving large language model-based agents across diverse environments.CoRR, abs/2406.04151, 2024.
  • [48]Ting Wu, Xuefeng Li, and Pengfei Liu.Progress or regress? self-improvement reversal in post-training.CoRR, abs/2407.05013, 2024.
  • [49]Qwen Team.Qwen2.5: A party of foundation models, September 2024.
  • [50]Ning Miao, YeeWhye Teh, and Tom Rainforth.Selfcheck: Using llms to zero-shot check their own step-by-step reasoning.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
  • [51]Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer.Rethinking the role of demonstrations: What makes in-context learning work?In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 11048–11064. Association for Computational Linguistics, 2022.
  • [52]Yisheng Song, Ting Wang, Puyu Cai, SubrotaK. Mondal, and JyotiPrakash Sahoo.A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities.ACM Comput. Surv., 55(13s):271:1–271:40, 2023.
  • [53]Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, NoahA. Smith, Daniel Khashabi, and Hannaneh Hajishirzi.Self-instruct: Aligning language models with self-generated instructions.In Anna Rogers, JordanL. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13484–13508. Association for Computational Linguistics, 2023.
  • [54]Avi Singh, JohnD. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, PeterJ. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, GamaleldinF. Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, MaxwellL. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, and Noah Fiedel.Beyond human data: Scaling self-training for problem-solving with language models.CoRR, abs/2312.06585, 2023.
  • [55]Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and RossJ. Anderson.The curse of recursion: Training on generated data makes models forget.CoRR, abs/2305.17493, 2023.
  • [56]Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, AhmedImtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and RichardG. Baraniuk.Self-consuming generative models go MAD.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
  • [57]Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe.Let’s verify step by step.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
  • [58]LuChen, Rui Zheng, Binghai Wang, Senjie Jin, Caishuang Huang, Junjie Ye, Zhihao Zhang, Yuhao Zhou, Zhiheng Xi, Tao Gui, QiZhang, and Xuanjing Huang.Improving discriminative capability of reward models in RLHF using contrastive learning.In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 15270–15283. Association for Computational Linguistics, 2024.
  • [59]Angelica Chen, Jérémy Scheurer, Tomasz Korbak, JonAnder Campos, JunShern Chan, SamuelR. Bowman, Kyunghyun Cho, and Ethan Perez.Improving code generation by training with natural language feedback.CoRR, abs/2303.16749, 2023.
  • [60]Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings.REFINER: reasoning feedback on intermediate representations.In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024, pages 1100–1126. Association for Computational Linguistics, 2024.
  • [61]Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston.Chain-of-verification reduces hallucination in large language models.In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 3563–3578. Association for Computational Linguistics, 2024.
  • [62]Tianlu Wang, Ping Yu, XiaoqingEllen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz.Shepherd: A critic for language model generation.CoRR, abs/2308.04592, 2023.
  • [63]Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, CeZheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang.LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback.CoRR, abs/2406.14024, 2024.
  • [64]Aojun Zhou, KeWang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li.Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
  • [65]Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal.Generative verifiers: Reward modeling as next-token prediction.CoRR, abs/2408.15240, 2024.
  • [66]Zachary Ankner, Mansheej Paul, Brandon Cui, JonathanD. Chang, and Prithviraj Ammanabrolu.Critique-out-loud reward models.CoRR, abs/2408.11791, 2024.
  • [67]Xin Zheng, Jie Lou, Boxi Cao, Xueru Wen, Yuqiu Ji, Hongyu Lin, Yaojie Lu, Xianpei Han, Debing Zhang, and LeSun.Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, 2024.
  • [68]Runlong Zhou, SimonS. Du, and Beibin Li.Reflect-rl: Two-player online RL fine-tuning for lms.In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 995–1015. Association for Computational Linguistics, 2024.
  • [69]Fei Yu, Anningzhe Gao, and Benyou Wang.Ovm, outcome-supervised value models for planning in mathematical reasoning.In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 858–875. Association for Computational Linguistics, 2024.
  • [70]Shibo Hao, YiGu, Haodi Ma, JoshuaJiahua Hong, Zhen Wang, DaisyZhe Wang, and Zhiting Hu.Reasoning with language model is planning with world model.In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 8154–8173. Association for Computational Linguistics, 2023.
  • [71]Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan.Alphamath almost zero: process supervision without process.CoRR, abs/2405.03553, 2024.
  • [72]DiZhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou.Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning.CoRR, abs/2410.02884, 2024.
  • [73]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  • [74]David Rein, BettyLi Hou, AsaCooper Stickland, Jackson Petty, RichardYuanzhe Pang, Julien Dirani, Julian Michael, and SamuelR. Bowman.GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023.
  • [75]Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, HenriquePondé deOliveiraPinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, FelipePetroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, WilliamHebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, AndrewN. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.Evaluating large language models trained on code.CoRR, abs/2107.03374, 2021.
  • [76]Jacob Austin, Augustus Odena, MaxwellI. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, CarrieJ. Cai, Michael Terry, QuocV. Le, and Charles Sutton.Program synthesis with large language models.CoRR, abs/2108.07732, 2021.
  • [77]Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan.Webshop: Towards scalable real-world web interaction with grounded language agents.In Sanmi Koyejo, S.Mohamed, A.Agarwal, Danielle Belgrave, K.Cho, and A.Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • [78]Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, QiZhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huan, and Tao Gui.The rise and potential of large language model based agents: A survey.CoRR, abs/2309.07864, 2023.