Loops vs. Prompts: When Iterating an LLM Is Worth the Cost

A controlled experiment on when it pays to run a language model in a loop instead of asking once. Four strategies, three tasks, matched compute, and five seeds per condition with 95% confidence intervals and Holm-corrected paired tests, all reproducible from the code and data.

Primary language: Python

Highlights

Wrote down the hypotheses and predictions before running anything, so results could not be rationalised after the fact.
Compared four strategies (single prompt, self-refinement, a verifier loop, and best-of-N) across code, grade-school maths, and structured extraction.
Matched compute across methods and reported cost per correct answer, not just accuracy, so the comparison is fair.
Ran five seeds per condition with 95% confidence intervals and Holm-corrected paired significance tests.
Built a sandboxed execution harness with timeouts and memory limits to grade generated code safely.
Graded code on held-out tests the loop never saw, which exposes overfitting to the feedback signal.
Found that self-correction without an external check does not beat a single prompt; a checker plus a few independent samples does, often using fewer tokens than the verifier loop.
Logged every prompt, response, and token count so all reported numbers can be re-derived.

Overview

"Put the model in a loop so it can fix its own mistakes" is common advice. I wanted to know when that actually helps, and whether the extra calls are worth paying for, so I built a like-for-like comparison and measured it.

What I compared

Same model and task each time, with only the strategy changing:

Single prompt: one well-written prompt, one answer.
Self-refinement: the model critiques and rewrites its own answer, with no outside signal.
Verifier loop: the answer is checked by a real test (unit tests, an answer checker), and only the genuine error is fed back for another attempt.
Best-of-N: several independent answers, then keep the best, chosen by the verifier or by majority vote.

These ran across three tasks (HumanEval for code, GSM8K for maths, and a synthetic extraction task) on local Qwen2.5 models at two sizes, plus a code-specialised model.
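
To make the difference between the last two strategies concrete, here is a minimal sketch of a verifier loop and best-of-N sharing the same checker. It is not the project's code: `generate`, `verify`, `scripted_model`, and `check_is_four` are hypothetical stand-ins, and the "model" is a scripted stub so the example runs without an LLM.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical interfaces (assumptions, not the project's API):
#   generate(prompt) -> candidate answer
#   verify(answer)   -> (passed, error_message)
Generate = Callable[[str], str]
Verify = Callable[[str], Tuple[bool, str]]


def verifier_loop(task: str, generate: Generate, verify: Verify,
                  max_calls: int) -> Tuple[Optional[str], int]:
    """Retry sequentially, feeding back only the checker's error message."""
    prompt = task
    for calls in range(1, max_calls + 1):
        answer = generate(prompt)
        passed, error = verify(answer)
        if passed:
            return answer, calls  # stop early on the first passing answer
        # The next attempt sees the failed answer and the real error, nothing else.
        prompt = f"{task}\n\nPrevious attempt:\n{answer}\n\nChecker error:\n{error}"
    return None, max_calls


def best_of_n(task: str, generate: Generate, verify: Verify,
              n: int) -> Tuple[Optional[str], int]:
    """Draw n independent answers from the same prompt; keep the first that passes."""
    candidates = [generate(task) for _ in range(n)]
    for answer in candidates:
        if verify(answer)[0]:
            return answer, n
    return None, n


def scripted_model(answers: Iterable[str]) -> Generate:
    """Stub model that replays a fixed list of answers, one per call."""
    it = iter(answers)
    return lambda prompt: next(it)


def check_is_four(answer: str) -> Tuple[bool, str]:
    """Toy verifier: the only correct answer is the string '4'."""
    ok = answer == "4"
    return ok, "" if ok else f"expected 4, got {answer}"


if __name__ == "__main__":
    task = "What is 2 + 2?"
    print(verifier_loop(task, scripted_model(["3", "5", "4"]), check_is_four, max_calls=3))
    print(best_of_n(task, scripted_model(["3", "4", "5"]), check_is_four, n=3))
```

Both functions return the accepted answer (or `None`) together with the number of model calls spent, which is the quantity a matched-compute comparison would need to hold equal.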

What I found

Self-refinement with no external check did no better than a single prompt. On maths and extraction the smaller model almost always judged its first answer correct and declined to revise.
The verifier loop did beat a single prompt, but only because it had a real check to retry against.
At equal compute, plain best-of-N with the same check matched or beat the loop, often using fewer tokens. The gain comes from having a verifier and taking a few attempts, not from refinement itself.
Measured as cost per correct answer, the single prompt was the most efficient option in every case.
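
Cost per correct answer can be read as total tokens spent divided by the number of problems answered correctly. The sketch below assumes that definition; the function name and the token and accuracy figures are illustrative placeholders, not results from the study.

```python
def cost_per_correct(total_tokens: int, n_correct: int) -> float:
    """Tokens spent per correctly answered problem (assumed definition)."""
    if n_correct == 0:
        return float("inf")  # nothing solved: the cost per correct answer is unbounded
    return total_tokens / n_correct


if __name__ == "__main__":
    # Made-up numbers for 100 problems, for illustration only.
    single_prompt = cost_per_correct(total_tokens=20_000, n_correct=60)
    best_of_four = cost_per_correct(total_tokens=80_000, n_correct=75)
    print(f"single prompt: {single_prompt:.2f} tokens per correct answer")
    print(f"best-of-4:     {best_of_four:.2f} tokens per correct answer")
```

With these placeholder figures the multi-sample method is more accurate but still pays more tokens per correct answer, which is the kind of trade-off this metric is meant to surface.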

Why the numbers can be trusted

Predictions were recorded before any run (see DESIGN.md).
Every method was given the same compute budget, and cost was tracked in tokens.
Each condition ran over five seeds, reported with confidence intervals and corrected significance tests.
Code was graded on held-out tests the loop never saw, so a method cannot quietly overfit to the feedback.
Generated code runs in a sandbox with timeouts and memory limits.
Every prompt, response, and token count is saved, so results can be reproduced end to end.
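
The statistics in the list above can be sketched roughly as follows. This is an assumption about one reasonable way to do it, not the project's implementation: it uses a Student-t interval for the five-seed mean (the critical value 2.776 is for 4 degrees of freedom) and the standard Holm step-down adjustment. The function names and the example scores and p-values are made up.

```python
import math
import statistics
from typing import List, Sequence, Tuple

T_CRIT_95_DF4 = 2.776  # two-sided 95% t critical value for n = 5 seeds (4 d.o.f.)


def mean_ci_five_seeds(scores: Sequence[float]) -> Tuple[float, float]:
    """95% confidence interval for the mean of exactly five per-seed scores."""
    if len(scores) != 5:
        raise ValueError("the hard-coded critical value assumes five seeds")
    mean = statistics.mean(scores)
    half_width = T_CRIT_95_DF4 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half_width, mean + half_width


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Illustrative values only.
    print(mean_ci_five_seeds([0.60, 0.62, 0.58, 0.61, 0.59]))
    print(holm_adjust([0.01, 0.04, 0.03]))
```

A comparison would count as significant at the 5% level when its adjusted p-value is below 0.05.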

Limitations I am upfront about

The study uses small local models (3B to 7B). Larger models may refine more effectively, so I would not extend the "self-correction doesn't help" result to frontier systems without testing them. One of the three tasks turned out too easy to separate the methods, which I report rather than leave out.
