Recursive Self-Improvement
In recursive self-improvement (RSI) AI keeps building more and more powerful versions of itself. Where does RSI already work or will soon work, and where is it still out of reach?
On June 8 2026 Anthropic called for a pause on AI development. A few days later they released their new Fable model. Was it just hype, or did they see something intriguing internally: did so-called recursive self-improvement (RSI), where AI keeps building more and more powerful versions of itself, actually work?
I wrote this post to clarify for myself on which tasks RSI already works or could reasonably be expected to work soon, and where it’s still out of reach.
Share your thoughts with me!
Recursive Self-Improvement
The definition I use is simple and close to how one would like a human to improve over time: the model gets a task, generates an output, obtains feedback from an environment, reflects on this feedback and learns from it (through weight updates, or in other ways). For this to ‘take off’, meaning we can significantly improve model capability, we’d want to automate the source of feedback (if we rely on feedback from human labellers, we’ll always be bottlenecked there) and be able to both generate outputs and learn from the feedback fast (to explore as much as possible).
Reminder on training AI models
Models need a lot of data and a lot of compute. The pre-training phase consists of maximising the log-likelihood that the model assigns to trillions of tokens of diverse data (from the web, synthetically generated, distilled from other models: whatever you can find that is of reasonable quality). Once the model has some base capabilities, you’d want to inject more specialised, complex knowledge: coding challenges, mathematical reasoning, general reasoning, instruction following; this is sometimes called mid-training. And finally, once you’ve run out of ‘labelled’ data, or labelled data becomes very expensive because you need to pay human experts to create it, you move into the post-training stage, where reinforcement learning (RL) is used to improve model capabilities through minimal feedback and/or cheaply available environmental feedback. Most of my focus will be on this last stage, and specifically on how one could make it work with as little human intervention as possible; let’s explore the tasks on which this is possible.
RSI in the game of Go
Already in 2017, David Silver and colleagues from DeepMind released AlphaZero, an AI model that would beat top human Go players. AlphaZero showed RSI can work. The AI simulated games of self-play in which the moves were the output of a neural network (the network also estimates the value of a position, but I’ll leave that aside here): starting from randomly initialised parameters, games are played until the end. The final outcome of each game is used as a score to update the network parameters, improving its sense of which moves are worth considering.
Many researchers at DeepMind experienced first-hand how a game that was considered very difficult was mastered by learning from environmental feedback. Unsurprisingly, many of the top researchers working on these ideas left to start their own companies to keep progress going in this domain: Recursive SuperIntelligence, Ineffable Intelligence, and Inherent Labs.
RL in LLMs
But if the game of Go was deemed challenging due to the large number of possible moves, language and reasoning are even more challenging: the number of possible token sequences you could generate as an answer to some maths or coding challenge, or frankly any other prompt, is huge. Randomly generating tokens would never produce a correct answer, hence would never receive a reward, and thus would never result in improved performance. The paper “Front-Loading Reasoning” says that for post-training to improve reasoning, including reasoning capabilities already in pre-training is critical. Hence: for RL to work on LLMs, the base model had to have a certain amount of base knowledge to guide the search and to unlock recursive improvement through RL.
In 2024 this stage was reached: OpenAI released o1, showcasing that RL could indeed work to improve base model abilities. That same year, DeepSeek’s team released an openly available paper shedding more light on how RL may be successfully applied to LLMs. DeepSeekMath introduced Group-Relative Policy Optimisation (GRPO). The method worked as follows: take an LLM, pass in a prompt, generate several answers (a group), compare each against some ground truth to get a reward per answer, and then use the reward of each answer relative to the others in the group as a learning signal. DeepSeekMath reported results on high-school and college maths (GSM8k, MATH, SAT, OCW Courses, MMLU-STEM), formal maths (miniF2F), reasoning over diverse tasks (MMLU, BIG-Bench Hard) and code evals (HumanEval, MBPP).
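To make the "relative to the others in the group" step concrete, here is a minimal sketch of the group-relative advantage only. It assumes a binary correctness reward and population standard deviation; the function name and the `eps` constant are my own, and the full GRPO objective (clipped policy ratio, KL penalty) is deliberately left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and std of its own group.

    `eps` is an assumed small constant that avoids division by zero
    when every answer in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: 1.0 = matches ground truth, 0.0 = does not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers get a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Note that a group in which all answers are right (or all wrong) yields zero advantage everywhere, so the model learns nothing from it; this is one way to see why the base model needs to get at least some answers right before RL can help.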
Evals like GSM8k contain questions such as “Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?”, together with the calculations and the final answer. The reasoning required is minor and the exact answer is given, likely labelled by humans. HumanEval consists of 164 hand-written Python programming problems requiring knowledge of Python and algorithms; the correctness of the generated Python code is checked with unit tests. RL was thus shown to be well able to improve base model capability when short reasoning is required and a hard verifier (exact maths answer, unit test) exists.
Easily verifiable tasks
On which tasks can we obtain this reward in an ‘easy’ manner?
Simple mathematical reasoning: simple maths problems, like the clips problem above, have a single exact answer: 72. Asking the model to provide the final answer in a specified format (e.g. a box) allows us to extract the answer and apply a rule-based check of correctness.
Code: similar to maths, we can ask the model to output the code in between ‘code’ tags and extract the code. If we have access to test cases, we can then run those and generate objective feedback on correctness.
To conclude this part: base models have successfully learned with RL on problems requiring simple reasoning and where an exact verifier for the reward existed. The exact answers or unit tests still require construction by humans (or stronger base models, whose own training at some point likely relied on humans).
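The rule-based maths check can be sketched in a few lines. This assumes the model was asked to put its final answer in `\boxed{...}` and that answers are plain numbers or fractions; the function names are hypothetical, and real graders handle far more formats than this.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    # Take the last boxed expression, since models often restate the answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def math_reward(model_output, reference):
    """Return 1.0 if the boxed answer equals the reference, else 0.0."""
    answer = extract_boxed(model_output)
    if answer is None:
        return 0.0  # no answer in the requested format
    try:
        # Compare as exact numbers so that "72", "72.0" and "144/2" all match.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Fall back to a plain string match for non-numeric answers.
        return 1.0 if answer == reference.strip() else 0.0


output = r"She sold 48 in April and 24 in May, so \boxed{72} clips in total."
print(math_reward(output, "72"))
```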
Scaling up to more complex tasks
Technical reports of open-source models released in the years thereafter showed that this idea can be scaled up to more complex tasks; still in these verifiable domains, but tasks that require more and longer reasoning.
Magistral (Section 2.2.2 in the report) RL’s the base model over verifiable maths and coding tasks; for maths an exact ground truth exists (potentially obtained from a stronger model, or a human) and for coding the code is compiled and run through a number of unit tests (again, someone defined these at some point, a human or a stronger model). DeepSeek-R1 works on similar tasks in which precise feedback can be given (code, maths, logical reasoning).
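A unit-test reward for code can be sketched as follows. This is not the pipeline from any of the reports: the `<code>` tag format, the function names and the timeout are assumptions, and a real setup would run untrusted model output inside a sandbox rather than directly in a subprocess.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def extract_code(model_output):
    # Assumed output format: the solution sits between <code> and </code>.
    match = re.search(r"<code>(.*?)</code>", model_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def code_reward(model_output, test_source, timeout=10):
    """Return 1.0 if the extracted code passes the tests, else 0.0."""
    code = extract_code(model_output)
    if code is None:
        return 0.0
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        # Append the tests to the candidate so a failing assert exits non-zero.
        script.write_text(code + "\n\n" + test_source + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # treat infinite loops as failures
    return 1.0 if result.returncode == 0 else 0.0


good = "<code>\ndef add(a, b):\n    return a + b\n</code>"
bad = "<code>\ndef add(a, b):\n    return a - b\n</code>"
tests = "assert add(2, 3) == 5"
print(code_reward(good, tests), code_reward(bad, tests))
```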
Qwen-3 provides details on how they actually created the RL dataset. Their data spans maths, code, logical reasoning and general STEM problems, and each problem is paired with a verified reference answer or code-based test cases. Where do these correct answers come from? They mention using larger models to curate answers, and human annotators manually assessing the accuracy of the responses.
Olmo-3 released their RL dataset: looking at it on HuggingFace, it contains prompts in the domains of maths, coding, instruction following, and general conversation. In the paper they provide much detail on how this was created (Section 4.4.2 in their report). For maths, they used open-source maths problems; for coding, they took available coding questions and, to construct the test cases, used another AI model to generate (variation of the original problem, solution, test case) triplets, keeping the examples where the solution passed the test case. They also used instruction-following data (where you can create functions that check whether all instructions in the prompt are satisfied, e.g. ‘use maximum 300 characters’), and for RL’ing on general chat they used some synthetically constructed data (more on non-verifiable things in a bit).
DeepSeek-R1 included the observation that the model outputted “Wait, wait. Wait. That’s an aha moment I can flag here.” when solving a specific maths question. Hence a big distinction from the simple coding and maths tasks I mentioned before is that these more complex tasks require the model to reason over its own outputs. This is very interesting, as it means that we’ve taught the models the right tools to reflect on their own reasoning process. This was not magic: the model likely learned to do this because humans crafted reasoning chains, which the model then re-used (see again the paper “Front-Loading Reasoning”, which shows how reasoning chains must be included somewhere in the training data for reasoning to be usable in the RL stage). But still. The ability to reason and reflect on its own output was a critical component in enabling RL to work on longer-horizon tasks too.
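The instruction-following verifiers mentioned above are easy to picture as small functions, one per constraint. The sketch below is illustrative only: the helper names and the keyword constraint are my own and not taken from the Olmo-3 report.

```python
def max_characters(limit):
    """Build a checker for constraints like 'use maximum 300 characters'."""
    def check(response):
        return len(response) <= limit
    return check


def must_include(keyword):
    """Build a checker for constraints like 'mention the word refund'."""
    def check(response):
        return keyword.lower() in response.lower()
    return check


def instruction_reward(response, checks):
    # All-or-nothing: the reward is 1.0 only if every constraint is satisfied.
    return 1.0 if all(check(response) for check in checks) else 0.0


checks = [max_characters(300), must_include("refund")]
print(instruction_reward("We will process your refund today.", checks))
```

Because each constraint is checked by code rather than by a person, such prompts can be generated and scored at scale; the human effort sits in deciding which constraints are worth checking.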
RL’s inefficiency
Toby Ord’s blog post “The extreme inefficiency of RL for frontier models” notes that RL only gives a reward at the end of a generation, and that some tasks require very long reasoning chains (10,000 tokens or more), so that only 1 bit of information is obtained per 10,000 tokens. Compare this to supervised fine-tuning, which provides feedback per token. This inefficiency of feedback could be a limiting factor for RSI.
Self-Distillation Policy Optimization (SDPO), released in early 2026, replaces the single reward at the end of the episode with a logit-level distillation loss. Specifically, on e.g. coding, we can obtain additional feedback such as runtime errors and failed tests, and condition the model on this feedback; the learning then attempts to match these feedback-conditioned logits. SDPO was shown to work on tasks that were again an iteration more complex: scientific reasoning from SciKnowEval, tool use from ToolAlpaca, and coding from LiveCodeBench. SciKnowEval has exact answers against which RL is performed; ToolAlpaca tests tool calls by comparing them to reference answers from human annotators.
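As a rough back-of-the-envelope comparison (my own illustration, not a calculation taken from the blog post): a pass/fail reward carries at most 1 bit per episode, while a supervised target token picks one entry out of a vocabulary of size $V$ and so carries at most $\log_2 V$ bits. The vocabulary size $V = 10^5$ below is an assumed round number, and both figures are upper bounds.

```latex
\[
\underbrace{\frac{1\ \text{bit}}{10{,}000\ \text{tokens}} = 10^{-4}\ \text{bits/token}}_{\text{outcome reward}}
\qquad \text{vs.} \qquad
\underbrace{\log_2 V = \log_2 10^{5} \approx 16.6\ \text{bits/token}}_{\text{supervised target}}
\]
```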
RL’ing over yet again more complex and long tasks
One goal the community has is to make models good at longer and longer tasks; in other words, we want to train agents: LLMs that can reason over long horizons, write code and interact with execution environments, observe execution feedback, reflect on their progress and recover from failure. Now we’re really moving away from the static coding tasks we looked at above.
METR keeps an eye on this ability to complete long tasks, and many of the recent benchmarks introduced in the community test exactly these abilities. If we want to get good at these kinds of tasks, we need the model to interact with real codebases and environments, and this means we have to obtain these high-quality coding environments from somewhere. The reports of the most recent open-source models describe how these environments are crafted.
Qwen3-Coder-Next built executable environments that reflected real-world bug-fixing tasks from GitHub PRs, and leveraged works such as SWE-Smith, SWE-Flow and Multi-SWE-RL, which provide executable repositories, test suites and evaluation scripts. The point here is: to get good at the longer, more realistic tasks we’d want agents to perform, we need to create environments that provide the LLM we’re RL’ing with the right feedback. The more complex the task, the more complex the environment we have to build around it. This is also where the speed and parallelisation of execution environments become a big deal.
IMAI-Thinking-1 (Section 3.3.1 in the report) starts with 102 million public GitHub PRs and filters these to PRs that have been merged and that contain both code and test changes. These PRs are then passed to another LLM agent (again, we rely on stronger models, while the PRs, tests and code are human-generated!) that reads the repo state and creates Dockerfiles to build executable container images. Nemotron-3 (Section 3.1.1 in the report) follows a similar procedure.
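The filtering step described above (merged, with both code and test changes) could look roughly like this. The `PullRequest` record and the file-name heuristics are hypothetical stand-ins; the reports do not spell out their exact rules, and this sketch only considers Python files.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    number: int
    merged: bool
    changed_files: list = field(default_factory=list)


def is_test_file(path):
    # Assumed heuristic: pytest-style file names or a tests/ directory.
    name = path.rsplit("/", 1)[-1]
    return (
        name.startswith("test_")
        or name.endswith("_test.py")
        or "/tests/" in f"/{path}"
    )


def is_code_file(path):
    return path.endswith(".py") and not is_test_file(path)


def is_candidate(pr):
    """Keep merged PRs that touch both source code and tests."""
    has_tests = any(is_test_file(p) for p in pr.changed_files)
    has_code = any(is_code_file(p) for p in pr.changed_files)
    return pr.merged and has_tests and has_code


prs = [
    PullRequest(1, True, ["src/app.py", "tests/test_app.py"]),
    PullRequest(2, False, ["src/app.py", "tests/test_app.py"]),  # not merged
    PullRequest(3, True, ["README.md"]),                         # docs only
    PullRequest(4, True, ["tests/test_only.py"]),                # tests only
]
kept = [pr.number for pr in prs if is_candidate(pr)]
print(kept)
```

The test changes matter because they become the reward: the tests added in the PR fail before the fix and pass after it.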
For more complex agentic tasks, we need to construct executable environments to endow the AI with feedback that can be used in the RL reward. To construct these tasks and environments, we still rely on human-generated code (sourced from GitHub by the open-source community, perhaps sourced from human experts by the AI companies with deeper pockets).
Other fun & complex tasks
One example is KellyBench by General Reasoning, which tests models in a long-horizon, non-stationary environment: it evaluates sequential decision-making in sports betting markets, where the model is given historic data on English Premier League seasons and tasked with building machine learning models to make trading decisions. The reward comes directly from the market. Hedge funds must be targeting similar setups (Jane Street, Two Sigma).
And finally, there’s a cohort of startups building AI that interacts with verifiable environments for mathematical reasoning (Logos Research, Harmonic AI). Lean 4 is one such environment: more complex mathematical questions, whose answers cannot be checked in a rule-based manner, can be checked by compiling their proof.
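To give a flavour of what "checked by compiling" means, here is a deliberately tiny Lean 4 example (the theorem name is made up for illustration). There is no reference answer to compare against: if the file compiles, the proof is accepted, and if the proof were wrong the compiler would reject it.

```lean
-- The statement is the task; the proof after `:=` is the model's output.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```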
So which tasks have already been shown to be RL-able?
Why am I making such a fuss about the exact benchmarks? Because one can observe a clear trend of increasing complexity: the tasks on which RL succeeds are getting longer, the required reasoning is more extensive, intermediate code execution and tool calls happen, and reasoning over the received outputs is required. The best way to conclude on which tasks today’s models manage to get RSI is to look at the benchmarks on which the open-source (and closed-source) models and newly proposed methodologies report improvements. While I mentioned them a bit above too, let’s summarise them here. I’ll also add the year each benchmark was introduced; then the pattern of increasing complexity becomes even clearer.
HotpotQA, 2018, multi-hop question answering given supporting information, checked against exact answer
HoVer, 2020, extends HotpotQA by turning it into a fact-verification task, again checked against a gold answer (fact is supported or not)
GSM8k, 2021, has high school math questions with the exact answer in the data
MiniF2F, 2021, formal olympiad-level maths, scored by Lean verifier
FiNER, 2023, financial named entity recognition in financial news, checked against manual gold annotations
MATH-500, 2023, competition maths, exact answer, graded by a SymPy-style maths checker to allow for equivalent formulations
Symptom2Disease, ~2023, patient symptoms with exact answer from 22 possible disease categories
SciKnowEval, 2024, scientific knowledge across different domains, scores against exact answer
LiveCodeBench, 2024, LeetCode-style coding tests, scored by running against tests or known outputs
MMLU-Pro, 2024, multiple-choice questions spanning different domains, exact option labelled in data
Aider Polyglot, 2024, code editing across various languages, scored against unit tests
AppWorld, 2024, interactive coding agents operating simulated apps, scored against unit tests
SWE-bench Verified, 2024, GitHub repos and issues that need to be fixed, checked against unit tests verified by humans
IFBench, 2025, verifiable instruction-following, verifier functions check whether output satisfies constraints
AIME, 2025, olympiad-style high school math, still scored with exact integer match
Terminal Bench 2.0, 2026, long-horizon tasks in command-line environments, execution-based unit tests
To highlight the increasing complexity in tasks, three examples.
MATH-500:
Problem: Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$
Answer: \left( 3, \frac{\pi}{2} \right)
AppWorld:
The model has API access to apps like Spotify, Venmo, Gmail, Splitwise, Phone, SimpleNote, Todoist, Amazon and FileSystem and needs to solve tasks that interact with these apps.
Problem: I have compiled a list of invitees for our upcoming baby shower. You can find it in “~/documents/personal” in my file system. The email template for the invitations is saved in my Gmail drafts. Replace the placeholders in it marked by curly braces with the relevant details and send invitation emails, individually to each person.
Answer: There is a piece of evaluation code that will be run to check whether the required modification has been made.
Terminal Bench 2.0:
Problem: Implement an adaptive-rejection sampler as described in Gilks et al. (1992) (Look here for more detail)
Answer: Evaluation code that checks the generated code.
All in all, we can conclude: we see real gains of RL improving model performance on longer and longer tasks, if a reliable reward exists. This reward (the solution or test case) was often written by a human annotator or created by a stronger model (that in turn relied on a human initially).
It’s hard not to be impressed with the progress the community has made in two years.
RSI for AI research
In recent years, starting with the “Attention Is All You Need” paper, researchers have found many more architectural advances: mixtures of experts, state space models, multitudes of attention variants. The labs building AI models have built evaluation harnesses that encompass everything from benchmark performance to train- and inference-time statistics, to test out various kinds of ideas in ablation studies. If we could have AI automate AI research, this in itself could be a big deal, even if it requires formulating relatively specific, verifiable problems inside the vast domain of AI research.
Compute is currently a huge bottleneck. In recent months, usage of AI models exploded. Anthropic introduced peak usage hour restrictions and consequently partnered with xAI to use Colossus, the 200k GPU cluster. When writing this article, I set off about 5 Pro searches and hit my GPT Pro membership limits: on June 9th, I got “limited” until July 5th :( And improved architectures are not only valuable for serving more user demand: ingesting more data, faster, will also lead to faster ablations, and thus to faster improvements of the model.
But automating the search for these advances has been limited to very narrow successes. AlphaEvolve showed that LLMs can indeed optimise critical pieces of computational infrastructure, but it relies on making the problem to be optimised quite precise (relying on a human to do so). ShinkaEvolve by Sakana AI aims to enable more open-ended automated exploration. The algorithm managed to discover e.g. better load-balancing loss functions for MoEs, but the creators themselves note that for now it also requires a human in the loop. So far, the success of automating AI research is confined to narrow tasks.
Non-verifiable tasks
Many economically valuable tasks have less clear rewards. Think of tasks in finance, law, medicine, consulting, sales or marketing. Multiple answers may be valid, and understanding validity is not as simple as executing a piece of code, or a rule-based check of consistency. How could we still make progress on these tasks in an automated loop?
Feedback from AI judges/verifiers
We could have AI provide feedback on itself, in the form of a reward; then we could use the same RL loop as before. This is known as RLAIF, where we use an LLM-as-a-Judge for the reward. It works like this: given an output from another LLM, we’d prompt a judge AI model to output a score token and use this as the reward in the RL loop. But this approach suffers from coarse-grained scoring. For a complex task that involves long reasoning and agent trajectories, judges aren’t always able to reliably distinguish between good and bad answers. As long as we don’t have a reliable judge, we thus cannot RL.
Rubrics and fine-grained feedback
One approach to mitigate the coarse reward is to use rubric-conditioned rewards. The judge LLM is not just asked to score the output as a whole; instead, it’s provided with a series of rubrics that specify sub-components that the answer must satisfy. By breaking down what a good answer means, and scoring each rubric by the LLM judge, one can potentially get a more accurate reward signal.
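A rubric-conditioned reward can be sketched as a weighted sum of per-criterion verdicts. Everything here is illustrative: the rubric, the weights and the function names are made up, and `toy_judge` is a keyword-matching placeholder standing in for a call to a judge LLM.

```python
def rubric_reward(answer, rubric, judge):
    """Weighted fraction of rubric criteria that the judge marks as satisfied.

    `rubric` is a list of (criterion, weight) pairs; `judge` is any callable
    returning True or False for an (answer, criterion) pair.
    """
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for criterion, weight in rubric if judge(answer, criterion))
    return earned / total if total else 0.0


def toy_judge(answer, criterion):
    # Placeholder for a judge-model call: a criterion "passes" when its
    # last word appears in the answer. A real judge would be an LLM.
    return criterion.split()[-1].lower() in answer.lower()


rubric = [
    ("States the recommended action", 2.0),
    ("Cites the relevant clause", 1.0),
    ("Mentions the filing deadline", 1.0),
]
answer = "Recommended action: renew the lease under clause 4."
print(rubric_reward(answer, rubric, toy_judge))
```

The reward is now graded rather than all-or-nothing, but it is only as trustworthy as the rubric and the judge behind it.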
But rubrics face their own challenges. Writing good rubrics is hard (proof #1, proof #2) and even though rubrics provide a more structured reward specification, we still optimise to pass a rubric instead of the underlying true objective: we’ll only ever get as good as our rubrics are, and as good as our LLM is at judging whether a rubric is satisfied. And of course, the rubrics need to be written by someone, bottlenecking us again on human data labelling. Attempts at having AI write rubrics are under way, but still struggle (proof #1, proof #2).
Related to this idea of more fine-grained feedback: a recent paper called LLM-as-a-verifier has a base model generate many outputs for a problem, and the LLM-as-a-verifier provides fine-grained feedback on them. They show that this approach makes it possible to select strong candidate solutions on verifiable tasks (Terminal-Bench and SWE-Bench).
Recursive harnesses
The above form of RL directly updates the weights of the model, based on a loss function that incorporates the reward we discussed. An alternative approach I very much like is recursive context engineering. This is the topic of papers like GEPA, ACE and MCE. The loop is as follows: starting with a base task instruction, the LLM generates an answer to the task and receives some form of environmental feedback. Using this feedback and its own output, the LLM reflects and rewrites the base task instruction, appending instructions or skills in such a way that future similar tasks will be done better. It is an easy-to-implement approach, close to how I’d like a human to learn on the job: a new analyst comes in with an empty notebook, I give them a task, they get some sparse feedback from colleagues or clients, and they write down some ideas on how to do better in their notebook.
GEPA reports performance on tasks such as HotpotQA (Q&A), IFBench (instruction following), HoVer (retrieval and fact checking) and AIME (maths). ACE also looks at AppWorld (performing tasks inside apps such as messaging, shopping and notes). MCE reports on FiNER (entity recognition), Symptom2Disease (disease classification) and LawBench. When testing this on more complex tasks ourselves, we noticed that the instructions were not reliably improving; the model wasn’t forming hypotheses, or properly incorporating past information to learn for the future. The authors of GEPA recently wrote a follow-up where they mix GEPA with weight updates; perhaps context alone is not enough.
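The notebook loop can be sketched with the LLM calls replaced by toy functions. This is only the bare loop as described above, not the GEPA, ACE or MCE algorithms; `generate`, `get_feedback` and `reflect` are hypothetical stand-ins for a model call, an environment and a reflection prompt.

```python
def improve_instructions(instructions, tasks, generate, get_feedback, reflect, rounds=1):
    """Append lessons learned from feedback to the task instructions."""
    for _ in range(rounds):
        for task in tasks:
            output = generate(instructions, task)
            feedback = get_feedback(task, output)
            if feedback is None:  # success: nothing new to write down
                continue
            lesson = reflect(instructions, task, output, feedback)
            if lesson and lesson not in instructions:
                instructions = instructions + "\n- " + lesson
    return instructions


# Toy stand-ins: the "model" only adds a unit once the instructions mention one.
def generate(instructions, task):
    answer = str(task["value"])
    return answer + " km" if "unit" in instructions.lower() else answer


def get_feedback(task, output):
    return None if output.endswith("km") else "The answer is missing its unit."


def reflect(instructions, task, output, feedback):
    return "Always include the unit in the final answer."


tasks = [{"value": 5}, {"value": 12}]
final = improve_instructions("Answer the question.", tasks, generate, get_feedback, reflect)
print(final)
```

No weights change anywhere in this loop: all the learning lives in the growing instruction text, which is what makes the approach cheap and also what may limit it.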
An intriguing form of RSI: we train on verifiable domains, we generalise on non-verifiable?
So, all in all, RL’ing remains hard. Only if we have an exact verifier and a sufficiently specific task can we obtain reliable improvement. Lots of work is underway to get better, fine-grained feedback on tasks beyond just the verifiable ones, but success is so far limited. There is, however, one intriguing form of generalisation: what if, by training on verifiable domains, we see performance increase on non-verifiable tasks too? Perhaps when Anthropic describes “eerie”, “sort of sci-fi” improvements in AI, this is the mechanism at play.
A side-note on why it may be hard to trace back exactly where improvement in models comes from, and hence why one may also over-attribute improvements to RSI alone. OpenAI, for example, counts ~5000 employees, spread over departments such as data acquisition, pre-training, post-training, RL, agents, safety, human data and infrastructure. When observing an eerie improvement in model capability, where does it truly come from? AI has no secret sauce: many elements need to work well together. But the separation of those elements into disparate departments also makes the credit assignment problem harder: which change led to the observed model improvement? Of course, when the post-training team observes huge gains in model capabilities just from RL post-training, my argument doesn’t hold; but in an increasingly complex and large organisation, it does become increasingly hard to keep track of all meaningful changes.
The role of human-labelled data today
Epoch AI reported recently that open-source lags closed-source by about 4 months. What’s in this gap? I think data is in there: while we need to get a lot of things right to train a strong model (data mixtures, distributed training, recovering from node failures, … the list is long), in all the tasks I described above, both to measure performance and to increase performance, somewhere in the loop we rely on a human (or a stronger model, which relied on a human at some point). Some open-source models were criticised for so-called distillation attacks, which allowed them to get good performance without going through the expensive effort of paying experts to label data.
And if one bottleneck is data, the perennial question remains: how much ‘out-of-the-box’ generalisation can we expect to see? This is relevant here, because we’d like to remove human labelling as much as possible for RSI to take off broadly. If for every sufficiently new task (whatever ‘sufficiently new’ means) we rely on humans curating test cases or solving maths problems, this limits RSI.
In many of the reports, e.g. Olmo-3 and DeepSeek-v4, the teams are explicit about synthetic data creation. Different LLMs are leveraged to create variations of the problems we want to get better at. But in a podcast I listened to, Max Welling, whose company builds AI for material discovery, was explicit that gaining out-of-sample performance requires new data. Hence: while synthetic data may help us get better within-sample (and maybe slightly around it), it’s limited in bringing us to new tasks.
Also Mercor didn’t reach a billion-dollar valuation for no reason. Their platform shows 5k searchable roles, with a total of 256.6k roles created on the Mercor Experts page.
Today, human labelling seems to remain very valuable, in particular on tasks that would unlock economic gains. This could mean that, in the near future, novel tasks will be done by humans a few times, after which the AI takes over.
The future
So, where does this leave us?
Today’s systems don’t seem to be able to autonomously improve across arbitrary domains in a clean, open-ended loop. But we do see AI systems get much better through feedback when the task can be wrapped in a reliable verifier: maths, code, formal proofs, agents in executable environments. In these domains, RSI is happening, and in two years’ time we have made impressive progress. The open question I’m left with is how far this generalises.
If progress is bounded by verifiable domains, then human expertise remains central. Humans will keep writing the tasks and the tests, curating the data, judging edge cases, and building environments in which models can learn. Generalisation to novel tasks is limited, bottlenecking us by always needing a human in the loop, at least to curate the initial datasets. We may get good AIs, but only sufficiently narrow ones. Model inference may remain expensive, making humans an interesting hiring decision again for certain tasks. In that world, AI is still hugely useful, just not fully ‘sci-fi’ self-improving.
But there’s another future. From the above, we do see consistent improvement in model performance. Tasks get longer. Sure, we’re limited by verifiable environments and generalisation still seems limited, but we seem to be collectively building and sharing benchmarks and evaluations that span a wide range of tasks. Highly skilled experts are contributing data on Mercor and the like. We’re getting more efficient at architectures and distributed training, and we’re putting data centers in space: tokens will get cheap. And maybe eventually, through human effort and verifiable feedback, we may reach a point where we teach models enough habits that transfer elsewhere: decomposition, reflection, debugging, tool use, hypothesis formation. Perhaps some of the “eerie” improvements described by frontier labs come from this kind of spillover.
This is exciting, but also worrisome. The bad version is not that AI gets powerful; it is that we become too dependent on it, tied into paying ever-increasing fees for ever-higher tiers of expertise in AI memberships while many people are disincentivised from learning. If expertise is one of the scarce inputs that lets these systems improve, we should be careful not to give it away too easily.
The good version: automation of white-collar work may lead to, as Alex Imas and Phil Trammell argued, human-provided services increasing in value. This may lead to AI tasks taking a decreased share of the economy. And when the tedious is automated, we’ll have more time for the enjoyable. Benn prompts us to keep thinking about which software AI cannot build as a way to decide what to focus on. Individually, we should keep learning and thinking, precisely because that seems to remain valuable either way.
Thank you for reading,
Anastasia
You can dive deeper into reinforcement learning for reasoning by looking at the following papers:
I also suggest looking at technical reports from open source models, such as:

