Loops vs. Prompts: When Iterating an LLM Is Worth the Cost
A controlled experiment on when it pays to run a language model in a loop instead of asking once. Four strategies, three tasks, matched compute, and proper statistics, all reproducible from the code and data.
Primary language: Python

Highlights
- Wrote down the hypotheses and predictions before running anything, so results could not be rationalised after the fact.
- Compared four strategies (single prompt, self-refinement, a verifier loop, and best-of-N) across code, grade-school maths, and structured extraction.
- Matched compute across methods and reported cost per correct answer, not just accuracy, so the comparison is fair.
- Ran five seeds per condition with 95% confidence intervals and Holm-corrected paired significance tests.
- Built a sandboxed execution harness with timeouts and memory limits to grade generated code safely.
- Graded code on held-out tests the loop never saw, which exposes overfitting to the feedback signal.
- Found that self-correction without an external check does not beat a single prompt; a checker plus a few independent samples does, often using fewer tokens than a verifier loop.
- Logged every prompt, response, and token count so all reported numbers can be re-derived.
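As an illustration of the sandboxed grading idea in the highlights above, here is a minimal sketch and not the project's actual harness. It assumes candidate code arrives as a Python source string; the name `run_sandboxed`, its parameters, and the returned dictionary shape are hypothetical. The memory cap relies on the POSIX-only `resource` module and is skipped where that module is unavailable. A child process with limits reduces accidental damage but is not a full security boundary.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

try:
    import resource  # POSIX only; absent on Windows
except ImportError:
    resource = None


def _limit_memory(max_bytes):
    """Build a hook that caps the child's address space before it starts."""
    def hook():
        try:
            resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
        except (ValueError, OSError):
            pass  # some platforms refuse this limit; run without it
    return hook


def run_sandboxed(code, timeout_s=5.0, max_bytes=1024 ** 3):
    """Run `code` in a separate interpreter and report how it ended.

    Hypothetical helper: the status labels and return shape are assumptions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                preexec_fn=_limit_memory(max_bytes) if resource else None,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "passed" if proc.returncode == 0 else "failed"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

A grader built this way would append the held-out tests to the candidate source and treat any non-zero exit code or timeout as a failure.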
Overview
"Put the model in a loop so it can fix its own mistakes" is common advice. I wanted to know when that actually helps, and whether the extra calls are worth paying for, so I built a like-for-like comparison and measured it.
What I compared
Same model and task each time, with only the strategy changing:
- Single prompt: one well-written prompt, one answer.
- Self-refinement: the model critiques and rewrites its own answer, with no outside signal.
- Verifier loop: the answer is checked by a real test (unit tests, an answer checker), and only the genuine error is fed back for another attempt.
- Best-of-N: several independent answers, then keep the best, chosen by the verifier or by majority vote.
These ran across three tasks (HumanEval for code, GSM8K for maths, and a synthetic extraction task) on local Qwen2.5 models at two sizes, plus a code-specialised model.
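To make the difference between the verifier loop and best-of-N concrete, here is a minimal sketch under stated assumptions. The names `verifier_loop`, `best_of_n`, `generate`, and `verify` are hypothetical and not taken from the project code. `generate(feedback)` stands in for one model call, and `verify(answer)` returns a pass flag plus an error message. The scripted "model" at the bottom is a toy that replays fixed guesses, not an LLM.

```python
from collections import Counter


def verifier_loop(generate, verify, max_attempts=3):
    """Retry, feeding back only the verifier's error, until a check passes."""
    feedback = None
    answer = None
    for attempt in range(1, max_attempts + 1):
        answer = generate(feedback)
        ok, feedback = verify(answer)
        if ok:
            return answer, attempt
    return answer, max_attempts  # budget exhausted; return the last attempt


def best_of_n(generate, n=3, verify=None):
    """Draw n independent answers; pick by verifier if given, else by vote."""
    answers = [generate(None) for _ in range(n)]  # no feedback between samples
    if verify is not None:
        for answer in answers:
            if verify(answer)[0]:
                return answer
    return Counter(answers).most_common(1)[0][0]  # majority vote fallback


def check_product(answer):
    """Toy verifier for the question 'what is 6 * 7?'."""
    ok = answer == 42
    return ok, None if ok else f"expected 6 * 7, got {answer}"


# Toy usage: a scripted stand-in for the model that replays fixed guesses.
scripted = iter([40, 41, 42])
print(verifier_loop(lambda feedback: next(scripted), check_product))
```

The only structural difference is whether later attempts see the earlier error: the loop passes `feedback` forward, while best-of-N samples independently and uses the verifier purely as a selector.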
What I found
- Self-refinement with no external check did no better than a single prompt. On maths and extraction the smaller model almost always judged its first answer correct and declined to revise.
- The verifier loop did beat a single prompt, but only because it had a real check to retry against.
- At equal compute, plain best-of-N with the same check matched or beat the loop, often using fewer tokens. The gain comes from having a verifier and taking a few attempts, not from refinement itself.
- Measured as cost per correct answer, the single prompt was the most efficient option in every case.
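Cost per correct answer can be read as total tokens spent divided by the number of correct answers. The sketch below assumes that definition, with per-problem records holding a token count and a correctness flag. The function name, the record layout, and the numbers in the usage lines are illustrative only and are not results from the study.

```python
def cost_per_correct(records):
    """Total tokens spent divided by the number of correct answers."""
    total_tokens = sum(r["tokens"] for r in records)
    n_correct = sum(1 for r in records if r["correct"])
    if n_correct == 0:
        return float("inf")  # nothing correct: the cost is unbounded
    return total_tokens / n_correct


# Made-up numbers, for illustration only (not measurements from this project).
single = [{"tokens": 100, "correct": c} for c in (True, True, False, False)]
looped = [{"tokens": 300, "correct": c} for c in (True, True, True, False)]
print(cost_per_correct(single), cost_per_correct(looped))
```

In this invented example the looped method is more accurate (3 of 4 against 2 of 4) but still costs twice as many tokens per correct answer, which is why accuracy alone can mislead.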
Why the numbers can be trusted
- Predictions were recorded before any run (see DESIGN.md).
- Every method was given the same compute budget, and cost was tracked in tokens.
- Each condition ran over five seeds, reported with confidence intervals and corrected significance tests.
- Code was graded on held-out tests the loop never saw, so a method cannot quietly overfit to the feedback.
- Generated code runs in a sandbox with timeouts and memory limits.
- Every prompt, response, and token count is saved, so results can be reproduced end to end.
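The statistical steps in the list above can be sketched as follows. This is an assumption-laden illustration, not the project's analysis code: it uses a Student's t interval over five per-seed scores and the standard Holm step-down procedure on a list of p-values. Which paired test produced the p-values is not stated here, so the sketch takes them as given. The function names are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom,
# which is the right constant only when there are exactly five seeds.
T_CRIT_DF4 = 2.776


def ci95_five_seeds(scores):
    """Return (mean, half_width) of a 95% t-interval over five seed scores."""
    if len(scores) != 5:
        raise ValueError("T_CRIT_DF4 only applies to exactly five seeds")
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, T_CRIT_DF4 * std_err


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: one reject flag per hypothesis, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject
```

Holm's procedure controls the family-wise error rate like Bonferroni but is never less powerful, which matters when several strategy pairs are compared on each task.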
Limitations I am upfront about
The study uses small local models (3B to 7B). Larger models may refine more effectively, so I would not extend the "self-correction doesn't help" result to frontier systems without testing them. One of the three tasks turned out too easy to separate the methods, which I report rather than leave out.