Zhiheng Xi1, Dingwen Yang1∗, Jixuan Huang1, Jiafu Tang1, Guanyu Li1, Yiwen Ding1,
Wei He1, Boyang Hong1, Shihan Dou1, Wenyu Zhan1, Xiao Wang1, Rui Zheng1, Tao Ji1,
Xiaowei Shi2, Yitao Zhai2, Rongxiang Weng2, Jingang Wang2, Xunliang Cai2,
Tao Gui1†, Zuxuan Wu1, Qi Zhang1, Xipeng Qiu1, Xuanjing Huang1, Yu-Gang Jiang1
1Fudan University  2Meituan
Abstract
Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model’s capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test time and training time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor’s performance on difficult queries at test time, especially when scaling up inference-time computation. Motivated by these findings, we introduce critique-based supervision into the actor’s self-training process and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor’s exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take a preliminary step toward training self-talk reasoning models via critique supervision and showcase their potential. Our code and datasets are at https://mathcritique.github.io/.
1 Introduction
With the rapid advancement of large language models (LLMs) [1, 2, 3, 4, 5], significant progress has been made in enhancing their reasoning capabilities [6, 7, 8, 9, 10, 11]. By prompting or training language models to reason step-by-step like humans (i.e., chain-of-thought, CoT), these models have demonstrated impressive reasoning abilities [6, 9, 12]. Recently, OpenAI’s o1 model has introduced a paradigm shift, exploring how to increase inference-time computation in language models and explicitly generate longer chains of thought [13]. This enables them to tackle more complex reasoning tasks that even humans find challenging, such as problems in the domains of science, coding, and mathematics [14, 15, 16, 17].
At the same time, many studies have explored test-time scaling by employing mechanisms like self-reflection, self-correction, and self-critique to generate longer thinking chains [18, 14, 19, 12, 20, 21], similar to OpenAI’s o1. However, the effectiveness of these mechanisms depends on the models’ ability to accurately evaluate their own performance. This ability can be limited by factors such as initial accuracy, problem complexity, and the lack of external feedback [17, 22, 23, 18]. As a result, their performance remains constrained, even with increased inference-time computation [24].
In light of this, to reliably improve reasoning models’ performance with increased inference-time computation, we delve into a two-player paradigm, where the actor model engages in reasoning while the critique model provides supervisory feedback on the thought chains [18, 25, 26, 27]. This approach represents a scalable oversight technique aimed at providing reliable and effective supervision for the continued development of LLMs [22, 28, 29]. The goal is to help the actor model identify errors and refine its outputs, ultimately leading to higher-quality results. In this paper, we aim to explore how to develop effective and reliable critique models, and how to enhance the actor’s reasoning performance through collaboration with the critique model at test-time. Additionally, we explore incorporating supervision from critique models into the actor’s training process to build more capable reasoning models.
We first propose an automated and scalable framework called AutoMathCritique to collect diverse and high-quality step-level critique data without additional human supervision (Section 3). The framework consists of three main stages: flawed reasoning path construction, critique generation, and data filtering. In the first stage, we leverage several approaches for controlled error synthesis, each targeting different aspects of reasoning errors, such as their location or specific content. This controlled process ensures the diversity and comprehensiveness of the reasoning paths and provides informative and precise hints to guide the subsequent critique generation. In the second stage, annotator models are provided with the original reasoning path and possible hints about the mistakes, and are asked to label step-level correctness and offer constructive feedback. In the third stage, the reasoning model revises the response according to the critiques, and Monte Carlo sampling [30, 31] is used to eliminate low-quality or non-informative critique data while preventing high-quality data from being accidentally discarded. A case of the resulting data is illustrated in Figure 2.
Next, using AutoMathCritique, we create a critique dataset of 76k samples, MathCritique-76k, which is subsequently used to fine-tune a language model to obtain the critique model. We demonstrate that the critique models can assist the actor model in improving exploration efficiency and reasoning quality during test time, leading to a significant enhancement in its reasoning performance (Section 4). Through in-depth analysis, we find that the critique models are particularly effective in helping the actor achieve better results on difficult queries. Additionally, when scaling inference-time computation [15, 32, 33], the performance gains brought by the critique models continue to grow.
Motivated by these test-time insights, we introduce the critique model into the actor model’s exploration and learning process and propose a critique-in-the-loop self-improvement method (Section 5). With the supervision of critique models and by scaling exploration computation for difficult queries, our method improves the actor’s exploration efficiency and solution diversity, alleviating the issue of tail narrowing [34] in reasoning models during iterative exploration and learning. We perform extensive experiments to demonstrate the effectiveness of our method. Additionally, we conduct further analysis of the critique models (Section 6), e.g., their scaling properties and whether test-time computation should be scaled sequentially or in parallel.
Finally, we take a step further and conduct preliminary explorations on how to leverage critique data to construct step-level self-talk data (Section 7). We propose the self-talk-via-critique method, and train a single language model to reflect and self-correct at each step, demonstrating the potential of this approach.
In summary, our main contributions are:
- •
We introduce AutoMathCritique, an automated and scalable framework for collecting step-level critique data without additional human supervision, which we use to build the large-scale critique dataset MathCritique-76k.
- •
We fine-tune the critique model with MathCritique-76k to offer constructive feedback on reasoning paths. We demonstrate and analyze the performance gains of the trained critique models in enhancing the actor’s reasoning during test time, particularly when scaling test-time computation.
- •
Motivated by the insights from test-time analysis, we introduce the critique model to the actor’s self-training process, and propose the critique-in-the-loop self-improvement method to enhance exploration efficiency and solution diversity, ultimately training better reasoning models.
- •
We conduct extensive experiments to validate the effectiveness of our method and perform in-depth analysis of critique models, e.g., their scaling properties and whether test-time computation should be scaled sequentially or in parallel.
- •
We propose the self-talk-via-critique method, and take the preliminary step to train models that can perform step-level reasoning, reflection and correction, and demonstrate their potential. We hope our work offers valuable insights for future research on LLM reasoning and scalable supervision.
2 Preliminaries
In the two-player setting studied in this paper, there are two roles: the actor model and the critique model. Also, there are three primary tasks [22]: reasoning, critique, and refinement.
In the reasoning task, the actor model $\pi_\theta$, parameterized by $\theta$, is given a reasoning problem $x$ and is expected to generate a response $y$. This response includes both the answer to the problem and the reasoning trajectory. The accuracy of this response can be evaluated using a reward function $r(x, y)$.

Next, the critique model $\pi_\phi$, parameterized by $\phi$, performs the critique task, where, given the problem and response, it generates critical feedback $c \sim \pi_\phi(\cdot \mid x, y)$. Notably, if the oracle reward function of the response is not given, the critique task consists of two subtasks: the discriminative task and the feedback generation task. The former determines whether the response contains flaws, while the latter generates constructive natural language feedback.

Finally, we define the refinement task, in which, given the problem, response, and critique, the actor generates a new response $y' \sim \pi_\theta(\cdot \mid x, y, c)$; this is also known as conditional refinement. Alternatively, we can define direct refinement $y' \sim \pi_\theta(\cdot \mid x, y)$, where the actor provides an improved answer based on an existing answer without conditioning on a critique, which is also referred to as “self-correction” [18].

This process can proceed in multiple rounds. We define that in the initial round (round $0$) only the actor operates, generating a response $y_0$ based on the problem. In round $t \geq 1$, the critique model first generates a new critique based on the interaction history, which is represented as:

$$c_t \sim \pi_\phi\left(\cdot \mid x, y_0, c_1, y_1, \dots, c_{t-1}, y_{t-1}\right)$$

Then, the actor generates a new refinement based on the previous interaction history, represented as:

$$y_t \sim \pi_\theta\left(\cdot \mid x, y_0, c_1, y_1, \dots, y_{t-1}, c_t\right)$$
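Using the notation above, the multi-round interaction can be sketched as a simple loop. In the sketch below, `actor` and `critic` are hypothetical wrappers around $\pi_\theta$ and $\pi_\phi$; the prompt formats and the history representation are illustrative assumptions, not the paper’s exact templates.

```python
def two_player_reasoning(problem: str, actor, critic, num_rounds: int = 2) -> str:
    """Multi-round loop: the actor answers, the critic gives feedback, the actor refines."""
    response = actor.generate(problem)                  # round 0: initial response y_0
    history = [("response", response)]
    for _ in range(num_rounds):
        critique = critic.generate(problem, history)    # c_t ~ pi_phi(. | x, history)
        history.append(("critique", critique))
        response = actor.generate(problem, history)     # y_t ~ pi_theta(. | x, history)
        history.append(("response", response))
    return response
```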
3 AutoMathCritique: An Automated and Scalable Framework to Collect Step-level Critique Data
To train critique models capable of delivering step-level supervision and constructive feedback for reasoning, we introduce AutoMathCritique, an automated and scalable framework for collecting critique data (see Figure 3 for an overview). The framework consists of three main stages: flawed reasoning path construction, critique generation, and data filtering. Using AutoMathCritique, we create a dataset of 76k samples named MathCritique-76k. The statistics are listed in Table 1.
We focus on the field of mathematical reasoning, so we utilize two of the most widely used datasets: GSM8K [35] and MATH [36]. The queries used for our subsequent data construction primarily come from their training sets, and we also leverage their original annotated responses to train the actor reasoning models. Our in-domain test set is composed of their test sets.
3.1 Construction of Flawed Reasoning Paths
To create high-quality critique data, we first need to construct a dataset of reasoning paths that includes some flaws. To better control the quality and diversity of the generated flawed reasoning paths, and to facilitate the subsequent construction of critique data, we leverage several distinct response generation (RG) approaches. These strategies encompass different aspects of the errors, such as their location or specific details. We mainly use Llama3-8B [5] as our actor model for sampling.
RG1: sampling from scratch.
In this approach, the actor is provided with a query and tasked with generating a response. Given that the actor we used has already achieved high accuracy on the GSM8K and MATH training sets, we use repeated sampling to obtain flawed responses. However, this method has the limitation of not offering detailed information about the location or content of the mistakes, which means that the subsequent critique labeling heavily depends on the expertise of annotators.
RG2: generating error-location-aware response.
In this approach, given a query, we first sample a correct response from the actor model. Then, starting from a specific step of the response, we modify the model’s hyperparameters for flawed response sampling, such as increasing the temperature of the final softmax function. This ensures that the steps preceding the selected step remain consistent with the original correct response, while the subsequent steps are more likely to contain errors. If the sampled response remains correct, we select a different step and further increase the randomness of the generation process. This method strikes a balance between generating flawed responses and maintaining the coherence of the reasoning process. The correct responses we sample are later used to construct critiques, while for the flawed responses, we collect information about the error locations (e.g., identifying from which step the errors originate), thereby facilitating the annotation of high-quality critiques.
RG3: adding detailed mistakes.
In this approach, given a query, the actor model is instructed to sample a correct reasoning path first. We then instruct the model to introduce mistakes into the correct response. Inspired by previous work [37, 38], we enumerate various common reasoning errors in the instructions and include few-shot examples in the prompt. Each example consists of five components: the query, the correct reference response, the step where the error is introduced, the type of error, and the generated flawed response. After the error is inserted, we direct the model to continue reasoning from the erroneous step until it reaches a final answer. If a flawed response is not generated, we repeat the sampling process up to a maximum of 16 attempts. As in RG2, the correct answers obtained during this process can also be used to construct critiques. This approach allows us to easily capture information about the location of the first mistake and its specific details, thereby significantly reducing the complexity of subsequent critique construction.
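To illustrate how RG2-style controlled error synthesis might be implemented, the sketch below assumes an `actor.continue_from` API that resumes generation from a step-level prefix at a chosen temperature, plus an `is_correct` oracle on the training set; the temperature schedule and retry budget are illustrative choices, not the paper’s settings.

```python
def sample_flawed_response(query, correct_steps, actor, is_correct,
                           base_temperature=0.7, max_tries=4):
    """RG2: keep a prefix of a known-correct solution and resample the remaining steps
    with increasing randomness until a flawed continuation appears."""
    for start in range(1, len(correct_steps)):
        temperature = base_temperature
        for _ in range(max_tries):
            prefix = correct_steps[:start]                       # steps kept verbatim
            continuation = actor.continue_from(query, prefix, temperature=temperature)
            candidate = prefix + continuation                    # both are lists of steps
            if not is_correct(query, candidate):
                # record the error-location hint used later for critique annotation
                return {"response": candidate, "first_flawed_step": start + 1}
            temperature += 0.3                                   # increase randomness and retry
    return None  # the actor kept answering correctly; fall back to RG1 or RG3
```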
Table 1: Statistics of MathCritique-76k, listing the number of queries, golden reasoning paths, and critiques for GSM8K, MATH, and in total.
3.2 Generation of Critiques
Step-level critique generation.
When generating critique data, we enhance quality by checking each step to identify the first error in the solution, which in turn facilitates the refinement process. Specifically, given a query and response, we employ two methods to generate step-level critique data: (1) We instruct the critique annotator (in our work, GPT-4o [2]) to directly identify the location of the first error and provide corresponding feedback. This method requires the annotator to assess the entire solution holistically, making it relatively more challenging. (2) We instruct the annotator model to label the solution step by step, stopping the process once the first error is detected, at which point it provides the corresponding feedback. This strategy effectively decomposes the entire solution, reducing the difficulty of providing comments.
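A minimal sketch of the second, step-by-step strategy is shown below, assuming a hypothetical `annotator.judge_step` call that returns a (correctness, comment) pair for the latest step given the steps seen so far.

```python
def stepwise_critique(query, steps, annotator):
    """Label steps one at a time; stop at the first error and attach feedback."""
    labels, feedback = [], None
    for i in range(len(steps)):
        ok, comment = annotator.judge_step(query, steps[: i + 1])  # only steps so far
        labels.append(ok)
        if not ok:
            feedback = f"Step {i + 1} is the first error: {comment}"
            break  # decomposing the solution keeps each judgment easy
    return {"step_labels": labels, "feedback": feedback or "All steps are correct."}
```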
Critique generation based on varying information about errors.
When constructing responses, we employ different strategies that provide various types of information, helping annotators identify and analyze flaws. Such information plays a crucial role in generating critiques.
For responses that are correct, we do not provide any additional information but instead ask the annotator to critique step by step. Only when the critique annotator correctly labels every step will this critique data be collected. If the annotator makes an error in labeling, it indicates either the response is a false positive (i.e., the answer is correct but the reasoning process is flawed) or the annotator’s labeling is incorrect. In either case, the data is discarded.
For flawed responses, we design critique prompts based on the generation strategy used (RG1, RG2, RG3). For responses generated by RG1, we provide a correct reference response to directly assist the annotator in labeling. For flawed responses from RG2, we offer both the reference response and highlight the likely starting point of the error, helping the annotator identify the first critical mistake. For RG3-generated flawed responses, we not only specify the exact location of the error but also provide detailed information about the mistake, enabling a more precise critique.
3.3 Data Filtering
Although we have constructed a large amount of critique data paired with flawed responses, the quality of this data is not guaranteed, and low-quality data could weaken the performance of the critique model. To address this, we apply a filtering process. Specifically, we use Monte Carlo sampling: each (query, response, critique) tuple is fed into the actor model for refinement. The refinement process is repeated $k$ times, and the critique data is retained only when the resulting accuracy exceeds a predefined threshold. We refer to this process as soft filtering. In contrast, under hard filtering, a critique is considered valid if at least one of the $k$ refinements produces a correct result. In practice, we adopt soft filtering because it prevents high-quality critique data from being omitted due to occasional model errors. Furthermore, it minimizes the risk of including low-quality critiques that the actor model does not actually follow but instead refines based on its own knowledge, still arriving at a correct response. Note that our method does not completely eliminate low-quality data, but we strive to strike a balance between quality and quantity. Additionally, we randomly sampled data points and had crowdsourced annotators check them, finding the rate of low-quality data to be low.
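The soft-filtering rule can be sketched as follows; `actor.refine` and `is_correct` are assumed wrappers, and the threshold value is an illustrative choice rather than the paper’s actual setting.

```python
def keep_critique(query, response, critique, actor, is_correct, k=8, threshold=0.5):
    """Soft filtering: keep a critique only if refinements guided by it succeed often
    enough, rather than merely at least once (hard filtering)."""
    successes = sum(
        is_correct(query, actor.refine(query, response, critique, temperature=1.0))
        for _ in range(k)
    )
    return successes / k >= threshold
```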
4 Critique Models Improve LLM Reasoning through Test-time Supervision
In this section, we begin by training critique models to provide step-level supervisory signals and useful feedback on reasoning paths, along with actor reasoning models that possess both reasoning and refinement abilities (Section 4.1). We then explore the role of critique models in supporting the actor reasoning model at test-time (Section 4.2), showing that they significantly enhance the actor’s performance in tackling difficult problems. Furthermore, as we scale up inference-time computation, we observe that the critique model continues to raise the performance ceiling of the reasoning models.
4.1 Fine-tuning Critique Models and Actor Reasoning Models
Training critique models with MathCritique-76k.
We train the critique models through supervised fine-tuning on the collected MathCritique-76k, using the standard language modeling loss. Given a critique dataset $\mathcal{D}_{\text{critique}} = \{(x, y, c)\}$ of queries, responses, and critiques, the loss for the critique model is:

$$\mathcal{L}_{\text{critique}}(\phi) = -\,\mathbb{E}_{(x, y, c) \sim \mathcal{D}_{\text{critique}}}\left[\log \pi_\phi(c \mid x, y)\right] \quad (1)$$
In this way, we can obtain a critique model that provides step-level supervision and constructive feedback on reasoning paths for actor models.
Training actor models with basic reasoning and refinement ability.
We then train reasoning models in our two-player setting. The models are trained using the training sets of GSM8K and MATH; we denote the mixed response training set as $\mathcal{D}_{\text{reason}} = \{(x, y)\}$. Additionally, to equip the models with the ability to perform refinement according to critique feedback, we utilize GPT-4 to annotate refinement samples (half of which are from MATH and the other half from GSM8K), denoted as $\mathcal{D}_{\text{refine}} = \{(x, y, c, y')\}$, where $y'$ represents the refined reasoning path generated based on the critique $c$. Each refinement sample is verified to ensure the correctness of its final answer. The loss for training the actor reasoning model is:

$$\mathcal{L}_{\text{actor}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{reason}}}\left[\log \pi_\theta(y \mid x)\right] - \lambda\,\mathbb{E}_{(x, y, c, y') \sim \mathcal{D}_{\text{refine}}}\left[\log \pi_\theta(y' \mid x, y, c)\right] \quad (2)$$

where $\lambda$ is a hyper-parameter that balances the learning of reasoning and refining.
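To make the two-term objective concrete, here is a minimal sketch in a Hugging Face-style interface; the prompt-masking helper, the way $(x, y, c)$ are concatenated into a prompt, and the single-example batching are simplified assumptions rather than the paper’s actual implementation.

```python
import torch

def masked_lm_loss(model, tokenizer, prompt: str, target: str) -> torch.Tensor:
    """Token-level NLL of `target` given `prompt`; prompt tokens are masked with -100."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # do not train on the prompt portion
    return model(input_ids=full_ids, labels=labels).loss

def actor_loss(model, tokenizer, reason_example, refine_example, lam: float = 1.0):
    """Eq. (2): NLL(y | x) on D_reason plus lam * NLL(y' | x, y, c) on D_refine."""
    x, y = reason_example                      # (query, response)
    xr, yr, c, y_new = refine_example          # (query, response, critique, refinement)
    return (masked_lm_loss(model, tokenizer, x, y)
            + lam * masked_lm_loss(model, tokenizer, xr + yr + c, y_new))
```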
4.2 Critique-based Supervision Improves Test-time Reasoning Performance
In this section, we investigate the impact of trained critique models in supporting the reasoning model at test-time (illustrated on the left of Figure 4). Specifically, we examine their effectiveness in enhancing the actor’s reasoning performance, identify the types of problems where performance improvements are observed, and assess whether scaling up test-time computation further elevates the actor’s performance ceiling.
4.2.1 Experimental Setups
Backbone models.
In our main experiments, we fine-tune the actor models from Llama3-8B-Base, following previous work [16, 39, 17]. This model demonstrates non-trivial performance on mathematical reasoning tasks while leaving room for improvement, making it an ideal testbed for our study. We fine-tune the critique models from the instruction-tuned Llama3-8B and Llama3-70B models, which possess the instruction-following ability needed to serve as our critique backbones. Note that most of our experiments are performed with the 8B model.
Evaluation metrics.
In mathematical reasoning tasks, we primarily evaluate the accuracy, which measures whether a solution matches the ground truth with an oracle reward function. When critique models are not employed, we directly evaluate the accuracy of the actor’s responses. In contrast, when critique models are used, we evaluate the accuracy of the actor’s responses after refinement based on feedback provided by the critique model.
Additionally, to comprehensively assess a critique model, we evaluate its discriminability, i.e., the ability to determine whether a solution contains errors [22]. We also evaluate its helpfulness, i.e., whether it can provide constructive feedback that enables the actor to correct erroneous responses.
Implementation details.
The experiments are conducted on NVIDIA A100 GPUs and Ascend 910 processors. When fine-tuning the critique models and actor reasoning models, we use a fixed learning rate. During decoding, we set the model’s temperature to 0, i.e., decoding is performed greedily. When we scale up inference-time computation, we use a non-zero temperature to enable diverse sampling. We evaluate the accuracy of the actor models, and the discriminability and helpfulness of the critique models.
4.2.2 Empirical Results and Findings
Table 2: Test-time supervision results on GSM8K and MATH, reporting the actor’s accuracy together with the critic’s discriminability and helpfulness, for the actor without a critic and with GPT-3.5-Turbo, GPT-4-Turbo, GPT-4o, Critique Model-8B, and Critique Model-70B as critics.
Critique models are highly effective at identifying the correctness of reasoning, offering constructive feedback for erroneous responses, and improving the overall accuracy of the actor.
We compare our critique models with state-of-the-art (SOTA) models used as critics, and the results are presented in Table 2. We observe that our 8B critique model significantly outperforms GPT-3.5, while our 70B critique model achieves performance comparable to GPT-4 series models.
Specifically, the reasoning-path judgment accuracy of our 8B critique model exceeds that of GPT-3.5-Turbo by a clear margin on both GSM8K and MATH, and it also outperforms GPT-3.5-Turbo in terms of helpfulness on both datasets. Moreover, our 70B critique model demonstrates even stronger performance: in terms of discriminability, it surpasses GPT-4-Turbo and GPT-4o on the GSM8K dataset and achieves results close to these SOTA models on MATH. Its correction accuracy on both datasets approaches that of GPT-4 series models, ultimately leading to comparable actor accuracy under its guidance.
Critique models assist the actor in better handling challenging queries.
Next, we investigate the distribution of performance gains brought by the critique model across different difficulty levels. The process involves sampling multiple responses from the actor model for each query and categorizing the queries into 5 difficulty levels based on the number of correct responses associated with each query [36]. The results are illustrated in Figure 5. It is evident that on both the training and test sets of GSM8K and MATH, critique models provide minimal benefit for simpler queries, as the actor model can independently perform well in these cases. However, for more challenging problems, critique models offer significant support, resulting in overall improved performance. Furthermore, this phenomenon is even more pronounced in the training set, offering valuable insights for incorporating critique model supervision during training (Section 5) [40].
Scaling up inference-time computation consistently improves reasoning performance.
Recent studies have highlighted that scaling up inference-time computation can significantly enhance model performance [15, 32, 33]. Here, we investigate whether incorporating critique models can further elevate the reasoning performance ceiling as test-time computation scales. A widely used technique for test-time computation scaling is majority voting [41], denoted as Maj@K, which measures whether the most frequent answer among $K$ parallel samples is correct. This metric reflects the model’s consistency in generating high-quality responses across multiple samples, which is a critical aspect of interactive exploration and learning paradigms such as reinforcement learning and self-improvement.
As shown in Figure 1, without critique models, Maj@K performance improves with increased computation but quickly plateaus, even at higher levels of computation (e.g., Maj@2K, Maj@3K). In contrast, when critique models are utilized during test-time, performance surpasses the baseline by a significant margin under the same computation budget, showing clear improvements on both GSM8K and MATH. These findings indicate that critique models effectively improve the actor’s exploration efficiency and sampling quality, extending the performance ceiling when more inference-time computation is allocated.
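As a concrete illustration of the Maj@K metric and of how a critic can be slotted into it, the following sketch assumes hypothetical `actor`/`critic` wrappers and an `extract_final_answer` helper; only the `Counter`-based voting itself is standard.

```python
from collections import Counter

def maj_at_k(answers: list[str], gold: str) -> bool:
    """Maj@K: is the most frequent answer among the K samples correct?"""
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer == gold

def maj_at_k_with_critique(query, gold, actor, critic, extract_final_answer, k=64):
    """Critique-and-refine each of the K parallel samples before voting."""
    answers = []
    for _ in range(k):
        response = actor.generate(query, temperature=1.0)
        critique = critic.generate(query, response)
        refined = actor.refine(query, response, critique)
        answers.append(extract_final_answer(refined))
    return maj_at_k(answers, gold)
```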
5 Critique-in-the-loop Self-Improvement for Better Reasoning Models
Motivated by the test-time findings in Section 4.2 that critique models significantly aid in solving challenging problems, and that they substantially raise the reasoning performance ceiling when scaling up computation, we integrate the critique-based supervision into the actor model’s iterative exploration and learning process. We present a critique-in-the-loop self-improvement method, which scales up exploration computation on challenging queries and leads to the development of stronger reasoning models (illustrated in Figure 4).
5.1 Vanilla Self-Improvement Method
Self-improvement is an exploration and learning method [42, 43, 16, 44, 45]. It iteratively leverages the actor reasoning model’s correct responses to gradually enhance its problem-solving abilities. The process involves $T$ iterations, where each iteration consists of two steps: exploration and learning.
In the exploration step of iteration $t$, we sample $N$ responses for each query $x$ from the model of the previous iteration $\pi_\theta^{t-1}$, i.e., $y \sim \pi_\theta^{t-1}(\cdot \mid x)$. Each data point is then filtered using the reward function $r$, and only correct solutions are retained to form a new dataset $\mathcal{D}_t$.
In the learning step of iteration $t$, the new dataset $\mathcal{D}_t$ from the exploration step is used to fine-tune the actor reasoning model. To mitigate overfitting, we follow recent work [46] and always fine-tune the original model $\pi_\theta^{0}$ instead of the model from the previous step $\pi_\theta^{t-1}$. The training loss is as in Equation 2, and we also include the original reasoning set $\mathcal{D}_{\text{reason}}$ and refinement set $\mathcal{D}_{\text{refine}}$ [8]. After the learning step is performed, a new dataset of higher-quality samples can be created once again [47].
Limitations of vanilla self-improvement.
In self-improvement, the key challenge lies in identifying correct responses with high diversity for each query during the exploration step [42, 44]. However, previous studies have highlighted a problem known as tail narrowing [34]: models tend to over-sample solutions for simpler queries while under-sampling solutions for harder queries. This results in a training set for the next iteration that contains a large number of solutions for simple problems but lacks solutions for more challenging problems, introducing sampling bias. As iterations progress, this bias deepens, leading to a long-tail distribution in which solutions for harder queries are almost entirely absent. This ultimately causes the model to reach a performance plateau or even degrade [34].
5.2 Critique-in-the-loop Self-improvement
Introduce critique models for high-coverage exploration.
Motivated by our prior findings in Section 4.2 that critique models enable actors to achieve greater performance gains on harder queries, we introduce critique models to the self-improvement process and propose a critique-in-the-loop self-improvement approach.
This method is built upon self-improvement, and during the exploration step of iteration $t$, critique models are instructed to provide feedback on the responses of the actor, and the actor is then prompted to perform refinements accordingly. Correct refinements are then added to the training set. Since we can assume the availability of an oracle reward function for the training set, critiques are only applied to incorrect responses, while correct responses are directly included in the dataset. This design minimizes the risk of low-quality critiques negatively affecting originally correct responses. In this way, we increase the coverage of solutions for harder queries and significantly reduce the tail-narrowing problem [34].
Difficulty-aware computation allocation for exploration.
Furthermore, building on our previous findings in Section 4.2 that scaling inference-time computation can improve the efficiency and quality of exploration, we allocate more computation for exploration and critique to harder problems. This involves performing additional response generation, critique, and refinement to obtain high-quality and diverse solutions.
In practice, we employ a simple difficulty-based computation allocation strategy, as it has proven sufficiently effective. (Note that more complex strategies, such as distinguishing difficulty based on the accuracy observed after multiple samples, are expected to yield even better results.) Incorrect initial responses are classified as difficult and receive multiple rounds of critique and refinement. Correct initial responses are considered simple and are directly added to the training set without further critique or refinement. This approach further mitigates the long-tail issue of self-improvement, enhances sampling quality, and improves overall performance [34].
We summarize the critique-in-the-loop self-improvement method in Algorithm 1.
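Since Algorithm 1 is not reproduced here, the following is a hedged Python sketch of its exploration step under our reading of the text; `actor`, `critic`, and `is_correct` are assumed wrappers, and `n_samples`/`n_critiques` stand for the per-query sampling budget and the extra budget given to difficult (initially incorrect) responses.

```python
def explore_with_critique(queries, actor, critic, is_correct,
                          n_samples=5, n_critiques=4):
    """Exploration step of critique-in-the-loop self-improvement (cf. Algorithm 1)."""
    dataset = []
    for x in queries:
        for _ in range(n_samples):
            y = actor.generate(x, temperature=1.0)
            if is_correct(x, y):              # easy case: keep the response directly
                dataset.append((x, y))
                continue
            # difficult case: allocate extra critique-and-refine attempts
            for _ in range(n_critiques):
                c = critic.generate(x, y)
                y_refined = actor.refine(x, y, c)
                if is_correct(x, y_refined):
                    dataset.append((x, y_refined))
                    break
    return dataset  # used to fine-tune the original actor in the learning step
```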
5.3 Experimental Results and Findings
Implementation details.
We run the self-improvement process for several iterations; as in previous work [48], the model’s performance tends to saturate after a few iterations of exploration and learning. During the exploration stage, we use a non-zero sampling temperature and vary the number of samples $N$ per query. During the learning stage, we use a fixed learning rate and number of epochs.
Critique-in-the-loop self-improvement consistently improves reasoning performance.
The evaluation results of our method are shown in Figure 6. We observe that: (1) Increasing the number of samples during exploration improves performance, with the performance upper bound rising accordingly, underscoring the benefits of additional exploration computation. (2) Our method consistently outperforms vanilla self-improvement with stable and significant gains, especially when the number of samples is larger, where it achieves a clear performance advantage on both GSM8K and MATH. (3) While the vanilla method initially shows performance improvements during self-improvement, it quickly reaches a bottleneck or even starts to decline, which may be attributed to the tail-narrowing issue [34]. In contrast, our method demonstrates consistent improvement, with performance saturation occurring much later, indicating its effectiveness.
Critique-in-the-loop self-improvement balances the solution distribution across difficulty levels, and enhances performance on challenging queries in the test set.
Since our motivation for introducing critique-based supervision into training is to improve the efficiency and quality of exploration, we examine the distribution of solutions sampled by our method compared to vanilla self-improvement. As shown in Figure 7, we find that our approach samples a higher proportion of solutions for challenging queries during the exploration stage. This significantly balances the training data distribution for the learning stage, effectively mitigating the tail-narrowing issue. In Figure 8, we also present the model’s performance on the test set across different difficulty levels, and we observe that our method performs significantly better than the vanilla approach on harder problems, further demonstrating the potential of our approach.
Combining test-time supervision with training-time supervision yields more performance gains.
Table 3: Combining training-time and test-time supervision. For each training strategy (supervised fine-tuning, self-correction fine-tuning, vanilla self-improvement, and critique-in-the-loop self-improvement), we report accuracy, Pass@5, and MV@5 on GSM8K and MATH when the actor responds alone versus with test-time critique supervision (or self-correction).
Previously, we evaluated the impact of incorporating critique model supervision at training time and at test time separately. Here, we combine them and evaluate the performance. Additionally, we include a self-correction baseline in which the model refines its reasoning by itself. The training data for this baseline consists of the original reasoning datasets (GSM8K and MATH) and correction data derived from our refinement data by removing the critique elements and reformatting it into (query, original response, new response) triplets.
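For concreteness, the self-correction baseline’s training triplets could be derived from the refinement data roughly as follows; the field names are assumptions about the data layout, not a released schema.

```python
def to_self_correction_triplets(refinement_data):
    """Drop the critique field, keeping (query, original response, new response)."""
    return [
        {"query": ex["query"],
         "original_response": ex["response"],
         "new_response": ex["refinement"]}   # the critique text is discarded
        for ex in refinement_data
    ]
```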
Evaluation results shown in Table 3 reveal that: (1) Integrating critique models during test-time consistently enhances performance under identical training conditions, particularly when critique supervision is not used during training. For example, applying critique models at test-time increases the MV@5 performance of SFT on both GSM8K and MATH. (2) When critique models are used during training, the additional benefit of test-time critique supervision becomes marginal, suggesting successful “distillation” of critique models into the actor during training. (3) The self-correction baseline underperforms compared to utilizing separate critique models, aligning with findings in prior work that models struggle to accurately evaluate and refine their outputs without external feedback [24, 19]. Moreover, training a single model to handle both reasoning and correction may introduce conflicts, leading to performance degradation [24]. (4) Compared to the traditional strategy of vanilla self-improvement + response-only, which increases computation during training, the approach of supervised fine-tuning + test-time critique supervision reduces training computation while increasing test-time computation and achieves better performance, particularly on the more challenging MATH dataset. This aligns with prior work highlighting the benefits of enhancing test-time computation [15, 32, 33].
Ablation study.
Table 4: Ablation results on GSM8K and MATH for N = 3, 5, and 10, comparing our full method with variants that remove difficulty-aware allocation or spend the extra budget on more refinements per critique instead of more diverse critiques.
We evaluate the impact of the difficulty-aware computation allocation strategy for exploration, as well as the performance differences when allocating more computation to critique generation vs. refinement generation. The experimental results are presented in Table 4, and we observe that: (1) If the difficulty-aware allocation of exploration computation is removed, performance drops significantly. (2) If the extra allocation is used to generate multiple refinements for the same critique instead of generating diverse critiques, performance also decreases. Therefore, both difficulty-aware computation allocation and diversified critiques are crucial for the final performance.
6 Discussion and Analysis
6.1 Scaling Properties of Critique Models
As in previous work [35], we study the scaling properties of critique models, investigating whether they can supervise models of different scales, particularly those larger and stronger than themselves. In this study, we conduct experiments using the Qwen-2.5 series of models [49], which spans a wide range of scales (1.5B, 3B, 7B, and 14B). We train a 3B critique model and use it to supervise trained actor reasoning models of all sizes. Other experimental settings are consistent with Section 4.2.
The evaluation results are shown in Figure 9. In the figure, “oracle” indicates whether an oracle reward function is available to assist the critique model in making judgments. With an oracle reward function, only incorrect responses are passed to the critique model; otherwise, all responses are passed to it. From the results, we observe that: (1) Regardless of scale, the 3B critique model can provide effective supervision, indicating that smaller critique models can help supervise larger actors to a certain extent. (2) With the oracle reward function, the critique model does not need to perform the discriminative task and only needs to provide useful feedback, resulting in greater performance improvements. (3) As the model scale increases, the performance improvement provided by the critique model on simpler datasets like GSM8K becomes marginal. However, on the more challenging MATH dataset, the critique model continues to deliver significant performance gains even for the largest model.
6.2 How do Critique Models improve Majority Voting?
Majority voting is one of the most commonly used techniques for scaling test-time computation. Following previous work [32], we study the relationship between the frequency of correct answers among multiple samples and the performance of majority voting, while also examining the impact of critique models. Specifically, consistent with the settings in Section 4.2, we use an actor reasoning model and a critique model trained with supervised fine-tuning. For each query, we sample multiple responses in parallel. The experimental results are shown in Figure 10.
We observe that critique models improve both the overall frequency of correct answers and the performance of majority voting. Delving deeper, we find a significant failure mode when critique models are not used: the correct answer appears with a relatively high frequency, but a specific incorrect answer dominates, causing majority voting to select the incorrect result. By incorporating critique models, this failure mode is effectively addressed through critique and refinement. Specifically, the discriminative ability of critique models helps suppress the occurrence of high-frequency incorrect answers, while the feedback provided by these models increases the relative frequency of correct answers. Together, these mechanisms contribute to a substantial improvement in the performance of majority voting.
6.3 Should test-time computation be scaled sequentially or in parallel?
In the two-player paradigm, test-time computation can be scaled either in parallel by sampling multiple (response, critique, refinement) triplets [22, 50, 15], or sequentially by generating critiques and refinements iteratively after an initial response [14, 16, 15]. Here, we explore the performance of these two approaches. Notably, in our implementation of the sequential approach, to avoid potential issues with context window length limits, we use the following strategy: given a query $x$, the actor first generates a response $y_0$. For the $i$-th critique task ($i \geq 1$), only the query and the $(i-1)$-th response or refinement are provided. Similarly, for the $i$-th refinement task, only the query, the $(i-1)$-th response or refinement, and the $i$-th critique are provided.
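A small sketch of this truncated-context sequential strategy, again with assumed `actor`/`critic` wrappers:

```python
def sequential_scaling(query, actor, critic, num_revisions=4):
    """Sequential test-time scaling with truncated context: each critique sees only the
    query and the latest response; each refinement sees only the latest critique too."""
    responses = [actor.generate(query)]            # y_0: the only from-scratch sample
    for _ in range(num_revisions):
        critique = critic.generate(query, responses[-1])   # no earlier history
        responses.append(actor.refine(query, responses[-1], critique))
    return responses  # take the last element ("sequential final") or vote over all of them
```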
Figure 11 illustrates the Pass@K performance trends. Pass@K measures whether at least one correct answer exists among $K$ samples. As computation increases, Pass@K performance improves, though the gains become progressively marginal. The performance of sequential computation scaling is slightly worse than that of parallel computation scaling. This may stem from the fact that in the parallel approach, the actor has more opportunities to generate original solutions directly from the queries, leading to greater sampling diversity and a higher chance of obtaining at least one correct answer. In contrast, the sequential approach allows the actor only one opportunity for original reasoning, with subsequent revisions relying on critical feedback, which may reduce diversity and limit the chances of finding correct solutions.
In Figure 12, we present the performance trends of three test-time techniques as computation increases: parallel majority voting, sequential majority voting, and sequential final. Here, sequential majority voting refers to selecting the most frequent answer from all the sequentially generated responses and refinements, while sequential final selects the answer produced by the last refinement in the sequence. From the experimental results, we observe that: (1) Overall, compared to majority voting, selecting the final answer of a sequence of revisions performs worse. This may be because the sequential critique and refinement process occasionally modifies a previously correct answer, resulting in an incorrect final result. (2) For smaller computation budgets, parallel majority voting outperforms sequential majority voting. However, as computation scales up, sequential majority voting surpasses its parallel counterpart, especially on the more challenging MATH dataset. This indicates a trade-off between the parallel and sequential approaches [15], providing inspiration for our critique-in-the-loop self-improvement. Specifically, depending on the computation budget, the exploration strategy can be adapted to balance solution quality and diversity, ultimately leading to stronger actor reasoning models. We leave this study to future work.
7 A Step Further: Training Step-level Self-Talk Reasoning Models via Critique Data
Motivation and method.
In this work, we focus on the two-player paradigm, leveraging critique models to provide step-level supervision and feedback for actor models. Recently, OpenAI’s o1 model [13] has pushed the boundaries of large reasoning models’ capabilities. With its self-talk output format, it can autonomously plan, reflect, critique, correct, backtrack, and more during the thinking process, marked by phrases such as “wait” and “alternatively”. Therefore, we investigate whether it is possible to construct self-talk data with step-level critique supervision, and propose the preliminary self-talk-via-critique method. Specifically, it has three main steps:
- 1.
Construct an initial thinking chain that includes step-level reflection. Given a query and a reasoning path, we first use AutoMathCritique to generate critique data. Feedback on each reasoning step provided in the critique is then inserted into the reasoning path, constructing a thinking chain that includes step-level reflections. At this stage, the thought process may lack smoothness but achieves an initial structure.
- 2.
Iteratively refine and critique the thinking chain. Reasoning paths without errors are directly passed to the next stage. For paths containing errors, the actor performs refinement from the first identified erroneous step, continuing the reasoning from that point onward. Starting from the first refined step, we utilize the critique model to re-critique the partial reasoning process, spanning from the refined step to the final step. Step-level feedback from this critique is again integrated into the thought process. If the critique model identifies new errors in the refined reasoning steps, the process is repeated iteratively. The reasoning path is continuously optimized until all errors are resolved through reflection and refinement. Only then is the thinking chain passed to the next stage.
- 3.
Smooth the thinking chain into self-talk data. Next, we prompt the LLMs to smooth out the previously rigid thinking chain. This involves adding transitional phrases and connectors so that the reasoning and reflection flow more naturally. Finally, we verify the correctness of the final answer, ensuring that only accurate data is stored.
An overview of the method is shown in Figure 13. An illustrating example of the resulting self-talk data can be found in Figure 14.
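To make the three stages concrete, here is a hedged sketch of the construction loop; `critic.step_level`, `actor.continue_from`, and `smoother.rewrite` are assumed interfaces, not the paper’s actual tooling, and the termination budget is illustrative.

```python
def interleave(steps, step_feedback):
    """Stage 1: weave step-level reflections into the reasoning path."""
    return "\n".join(s + "\n" + f for s, f in zip(steps, step_feedback))

def build_self_talk_example(query, steps, actor, critic, smoother, max_rounds=3):
    critique = critic.step_level(query, steps)           # step labels + feedback + first error
    for _ in range(max_rounds):                          # stage 2: refine until no errors remain
        if critique.first_error is None:
            break
        kept = steps[: critique.first_error]             # steps before the first mistake
        steps = kept + actor.continue_from(query, kept)  # refine from the erroneous step onward
        critique = critic.step_level(query, steps, start=critique.first_error)
    thinking_chain = interleave(steps, critique.step_feedback)
    return smoother.rewrite(query, thinking_chain)       # stage 3: smooth into self-talk text
```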
Evaluation and findings.
Based on this approach, we construct a dataset of self-talk examples from the MATH training set and fine-tune the model for evaluation. As in the previous section, we use the Llama3-8B base model as the backbone for our experiments. We compare our method with the self-correction baseline and the baseline of SFT with test-time critique supervision; these two baselines fall under the one-player and two-player settings, respectively. Note that for the two baselines, we only use the MATH dataset for training, without GSM8K data. The experimental results are shown in Table 5. We observe that, in the one-player setting, the step-level self-talk approach outperforms trajectory-level self-correction by a significant margin, demonstrating its potential. However, it still lags behind the two-player setting, indicating that this direction requires further exploration, which we leave to future work.
Table 5: Accuracy, Pass@5, MV@5, Pass@10, and MV@10 for the one-player setting (trajectory-level self-correction and step-level self-talk) and the two-player setting (SFT with a test-time critic).
8 Related Work
Training LLMs for reasoning through exploration and learning.
Multi-step reasoning, such as mathematical reasoning and logical reasoning, is a challenging task for large language models (LLMs). Researchers have proposed prompting methods represented by Chain-of-Thought (CoT) to enable LLMs to think and reason step by step like humans, and then produce answers based on the reasoning process, significantly improving the model’s reasoning performance [6, 41]. To enhance the reasoning ability of models, previous work has focused on collecting large amounts of expert-labeled reasoning trajectories, allowing models to mimic step-by-step reasoning [51]. However, these methods are often difficult to scale up, as annotation is highly expensive, especially for very challenging and complex problems [52].
Another category of methods, exploration and learning, seeks to address this issue by using model-generated data to train the model itself. Specifically, given a query, the model generates its own reasoning paths, and external supervision signals are used to filter out high-quality solutions, which are then used to train the model [46, 42, 3, 53, 54]. This approach, also known as self-improvement or rejection sampling, often encounters the tail-narrowing problem, which can lead to performance bottlenecks [55, 56, 34]. Some researchers have proposed reinforcement learning-based approaches, where reward models are trained or oracle reward functions are used to provide supervision signals, enabling the model to explore and learn, thereby significantly improving reasoning performance [57, 10, 58, 11, 30]. However, reinforcement learning typically converges slowly, is costly, and poses challenges in providing reliable and dense process-level supervision signals.
In this work, we fine-tune critique models to provide reliable step-level supervision signals and helpful feedback during both training and test time. This approach improves sampling efficiency and quality, ultimately enhancing the actor’s reasoning performance.
Developing models for critique, reflection and correction.
Developing models with the ability to critique, reflect, and correct is an important route toward scalable supervision and has been explored in various domains, such as summarization, mathematical reasoning, sequential decision-making, and coding [23, 59, 18, 7, 60]. Most previous work has used prompting techniques to have models generate critical comments or corrections about their own outputs [28, 23, 61]. However, these methods typically perform poorly without assuming an oracle reward function, because they struggle to assess their outputs correctly in the absence of external feedback, especially when the problems are more challenging [24, 19]. As a result, many fine-tuning or RL approaches have been proposed to train models to develop these capabilities [62, 63, 64, 27, 65, 66]. The former often requires extensive human annotation, while the latter necessitates engineered and cumbersome reward designs. Another line of work leverages self-training to develop self-correction capabilities, e.g., [18, 67, 17, 16]. In contrast to these methods, in this paper, we delve into a two-player framework (e.g., [25, 26, 68]), distinguishing the roles of the critique model and the actor model. We propose a scalable data synthesis framework, AutoMathCritique, to generate critique data and train critique models. The trained critique models can provide supervision to yield stable performance gains for the actor model, both at test time and training time.
Scaling test-time computation for LLM reasoning.
Recent studies have shown that scaling up computation during test-time/inference-time can effectively improve a model’s reasoning performance, as exemplified by OpenAI’s o1 [13, 32, 15, 33]. These studies typically increase inference computation by extending the model’s thinking chains or employing other techniques such as majority voting [41], Best-of-N with reward models [35, 65, 69], and tree search [70, 71, 39]. Some other works train correction models to allocate more computation toward sequential corrections during test-time, thereby enhancing the model’s final performance [20, 60, 72]. In this paper, we take a different perspective by training additional critique models and delegating the responsibility for refinement to the actor itself. We investigate the significant performance gains achieved by leveraging critique models’ supervision during test-time, especially for challenging problems. Motivated by these key findings during test-time, we incorporate this supervision into the exploration phase of the self-improvement process, leading to the training of stronger models.
9 Conclusion and Future Work
In this work, we take a preliminary step toward exploring how to construct high-quality and diverse training data to train critique models capable of providing step-level supervision and effective feedback without additional human annotations. By introducing critique-based supervision at test time, we demonstrate that critique models can significantly enhance the performance of reasoning models, particularly on challenging tasks. When inference computation is scaled up, critique models also yield continuous improvements and raise the capability ceiling of reasoning models. Building on the findings from test-time, we integrate critique model supervision into the self-training process of reasoning models, proposing critique-in-the-loop self-improvement and validating its effectiveness through extensive experiments. Lastly, we propose constructing step-level self-talk data based on critique data and showcase its potential.
Despite this, there are still limitations and future directions to work on. Specifically, (1) our method of building critique models primarily involves data construction and fine-tuning. Future work should include optimizing critique models from more perspectives, such as employing more advanced algorithms. Additionally, due to resource constraints, we are unable to conduct extensive experiments with larger-scale critique models, which may deliver superior performance and provide more reliable supervision signals. (2) We need to develop more advanced test-time scaling techniques to improve efficiency and reliability, reducing hallucinations in long thinking chains. (3) In Section 7, we show the potential of self-talk models, and in the future, we expect to further optimize them to enhance their performance. (4) Our work can be extended or applied to other application domains, enhancing model capabilities through reliable supervision and helpful feedback, including scientific research [73, 74], software engineering [75, 76], and agentic tasks [77, 78, 47]. However, as we broaden the scope of applications, substantial efforts will be required to ensure safety and robustness.
Acknowledgements
The authors would like to thank Huawei Ascend Cloud Ecological Development Project for the support of Ascend 910 processors.
References
- [1] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- [2] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- [3] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023.
- [4] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.06825, 2023.
- [5] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, et al. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024.
- [6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- [7]Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Jia Liu, Tao Gui, QiZhang, and Xuanjing Huang.Self-polish: Enhance reasoning in large language models via problem refinement.In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 11383–11406. Association for Computational Linguistics, 2023.
- [8]Çaglar Gülçehre, TomLe Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando deFreitas.Reinforced self-training (rest) for language modeling.CoRR, abs/2308.08998, 2023.
- [9]Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan.Tree of thoughts: Deliberate problem solving with large language models.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [10]Haipeng Luo, Qingfeng Sun, Can Xu, PuZhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang.Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.CoRR, abs/2308.09583, 2023.
- [11]Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, QiZhang, and Xuanjing Huang.Training large language models for reasoning through reverse curriculum reinforcement learning.CoRR, abs/2402.05808, 2024.
- [12]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, KarthikR. Narasimhan, and Yuan Cao.React: Synergizing reasoning and acting in language models.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [13] OpenAI. Learning to reason with LLMs, September 2024.
- [14] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [15] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, abs/2408.03314, 2024.
- [16] Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. CoRR, abs/2407.18219, 2024.
- [17] Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal M. P. Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning. CoRR, abs/2409.12917, 2024.
- [18] Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [19] Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Wang. Pride and prejudice: LLM amplifies self-bias in self-refinement. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15474–15492. Association for Computational Linguistics, 2024.
- [20] Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, and Minjoon Seo. Selfee: Iterative self-revising LLM empowered by self-feedback generation. Blog post, May 2023.
- [21] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [22] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. CoRR, abs/2206.05802, 2022.
- [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [24] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [25] Afra Feyza Akyürek, Ekin Akyürek, Ashwin Kalyan, Peter Clark, Derry Tanti Wijaya, and Niket Tandon. RL4F: Generating natural language feedback with reinforcement learning for repairing model outputs. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 7716–7733. Association for Computational Linguistics, 2023.
- [26] Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh R. N., Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Retroformer: Retrospective large language agents with policy gradient optimization. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [27] Alexander Havrilla, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu. GLoRe: When, where, and how to improve LLM reasoning via global and local refinements. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
- [28] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [29] Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamile Lukosiute, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, and Jared Kaplan. Measuring progress on scalable oversight for large language models. CoRR, abs/2211.03540, 2022.
- [30] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 9426–9439. Association for Computational Linguistics, 2024.
- [31] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. CoRR, abs/2410.08146, 2024.
- [32] Bradley C. A. Brown, Jordan Juravsky, Ryan Saul Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. CoRR, abs/2407.21787, 2024.
- [33] Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, and Mehran Kazemi. Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling. CoRR, abs/2408.16737, 2024.
- [34] Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, and Xuanjing Huang. Mitigating tail narrowing in LLM self-improvement via Socratic-guided sampling. arXiv preprint arXiv:2411.00750, 2024.
- [35] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.
- [36] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
- [37] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamile Lukosiute, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, and Ethan Perez. Measuring faithfulness in chain-of-thought reasoning. CoRR, abs/2307.13702, 2023.
- [38] Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11316–11360. Association for Computational Linguistics, 2024.
- [39] Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang. ReST-MCTS*: LLM self-training via process reward guided tree search. CoRR, abs/2406.03816, 2024.
- [40] Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. DART-Math: Difficulty-aware rejection tuning for mathematical problem-solving. CoRR, abs/2407.13690, 2024.
- [41] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [42] Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1051–1068. Association for Computational Linguistics, 2023.
- [43] Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix X. Yu, and Sanjiv Kumar. ReST meets ReAct: Self-improvement for multi-step reasoning LLM agent. CoRR, abs/2312.10003, 2023.
- [44] Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of LLMs via imagination, searching, and criticizing. arXiv preprint arXiv:2404.12253, 2024.
- [45] Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng. Re-ReST: Reflection-reinforced self-training for language agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 15394–15411. Association for Computational Linguistics, 2024.
- [46] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- [47] Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. AgentGym: Evolving large language model-based agents across diverse environments. CoRR, abs/2406.04151, 2024.
- [48] Ting Wu, Xuefeng Li, and Pengfei Liu. Progress or regress? Self-improvement reversal in post-training. CoRR, abs/2407.05013, 2024.
- [49] Qwen Team. Qwen2.5: A party of foundation models, September 2024.
- [50] Ning Miao, Yee Whye Teh, and Tom Rainforth. SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [51] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 11048–11064. Association for Computational Linguistics, 2022.
- [52] Yisheng Song, Ting Wang, Puyu Cai, Subrota K. Mondal, and Jyoti Prakash Sahoo. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv., 55(13s):271:1–271:40, 2023.
- [53] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13484–13508. Association for Computational Linguistics, 2023.
- [54] Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin F. Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, and Noah Fiedel. Beyond human data: Scaling self-training for problem-solving with language models. CoRR, abs/2312.06585, 2023.
- [55] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross J. Anderson. The curse of recursion: Training on generated data makes models forget. CoRR, abs/2305.17493, 2023.
- [56] Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [57] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [58] Lu Chen, Rui Zheng, Binghai Wang, Senjie Jin, Caishuang Huang, Junjie Ye, Zhihao Zhang, Yuhao Zhou, Zhiheng Xi, Tao Gui, Qi Zhang, and Xuanjing Huang. Improving discriminative capability of reward models in RLHF using contrastive learning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 15270–15283. Association for Computational Linguistics, 2024.
- [59] Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, and Ethan Perez. Improving code generation by training with natural language feedback. CoRR, abs/2303.16749, 2023.
- [60] Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. REFINER: Reasoning feedback on intermediate representations. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024, pages 1100–1126. Association for Computational Linguistics, 2024.
- [61] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 3563–3578. Association for Computational Linguistics, 2024.
- [62] Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. CoRR, abs/2308.04592, 2023.
- [63] Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang. LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. CoRR, abs/2406.14024, 2024.
- [64] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [65] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. CoRR, abs/2408.15240, 2024.
- [66] Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. CoRR, abs/2408.11791, 2024.
- [67] Xin Zheng, Jie Lou, Boxi Cao, Xueru Wen, Yuqiu Ji, Hongyu Lin, Yaojie Lu, Xianpei Han, Debing Zhang, and Le Sun. Critic-CoT: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, 2024.
- [68] Runlong Zhou, Simon S. Du, and Beibin Li. Reflect-RL: Two-player online RL fine-tuning for LMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 995–1015. Association for Computational Linguistics, 2024.
- [69] Fei Yu, Anningzhe Gao, and Benyou Wang. OVM, outcome-supervised value models for planning in mathematical reasoning. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 858–875. Association for Computational Linguistics, 2024.
- [70] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 8154–8173. Association for Computational Linguistics, 2023.
- [71] Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. AlphaMath almost zero: Process supervision without process. CoRR, abs/2405.03553, 2024.
- [72] Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. LLaMA-Berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. CoRR, abs/2410.02884, 2024.
- [73] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [74] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022, 2023.
- [75] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
- [76] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.
- [77] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- [78] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui. The rise and potential of large language model based agents: A survey. CoRR, abs/2309.07864, 2023.