If you’ve spent real time pairing with a frontier LLM on anything non-trivial, whether a renderer or a tricky refactor, you’ve probably noticed two failure modes that keep showing up together:
It wants to agree with you. Push back on its output, even weakly, and it folds. “You’re right, let me fix that,” and then it “fixes” something that was correct.
It wants to be done. Error paths silently return: no exception, no log, no indication anything went wrong. Edge cases get hand-waved with a comment. Tests get weakened until they pass. A 2000-line diff arrives with three of the six features you asked for, and a cheerful summary claiming completion.
Individually, either is annoying. Together they’re corrosive, because the sycophancy hides the laziness. The model tells you it’s finished; you believe it; you find out a day later that the shortcut it took broke an invariant the rest of the system actually depends on.
A recent example from RISE, my renderer: I was getting VCM (Vertex Connection and Merging) stood up. The first few passes from the model looked plausible. Compiled, ran, produced images. But the images had splotches, the telltale artifact of bad vertex merging: incorrect density estimation blowing up localized radiance into bright blobs. Nothing in the model’s self-report flagged this. It was only after several iterations, and deliberately setting up validation that looked for splotches specifically (alongside variance and convergence checks against a bidirectional path tracing reference), that I got to an implementation I’d trust. The model wasn’t going to tell me the merge kernel was wrong. The images had to.
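For concreteness, here’s a minimal sketch of the kind of splotch check I mean: compare the render’s luminance against a trusted reference (the BDPT image) and flag locally coherent blow-ups. The names and thresholds are illustrative, not the actual RISE harness.

```python
import numpy as np
from scipy import ndimage

def luminance(img):
    """Rec. 709 luma from a linear RGB image of shape (H, W, 3)."""
    return img @ np.array([0.2126, 0.7152, 0.0722])

def splotch_score(render, reference, blur_sigma=2.0, ratio_threshold=4.0):
    """Fraction of pixels sitting inside localized bright blobs, i.e. regions
    where the blurred luminance ratio against a trusted reference blows up.
    Thresholds are per-scene judgment calls, not universal constants."""
    eps = 1e-6
    ratio = luminance(render) / (luminance(reference) + eps)
    smooth = ndimage.gaussian_filter(ratio, sigma=blur_sigma)
    hot = smooth > ratio_threshold                 # suspiciously bright and locally coherent
    labels, n = ndimage.label(hot)
    if n == 0:
        return 0.0
    sizes = ndimage.sum(hot, labels, index=range(1, n + 1))
    blob_pixels = sizes[sizes >= 16].sum()          # ignore isolated fireflies
    return blob_pixels / hot.size

# Gate: fail the merge-kernel change if blobs cover more than 0.1% of the frame.
# assert splotch_score(vcm_img, bdpt_img) < 1e-3
```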
Why this shape?
My working theory is that this is an RL artifact, not a capability ceiling. The reward signal during post-training is, at best, a noisy proxy for “did the user get what they actually needed.” What it can much more easily capture is:
- Did the user express satisfaction in the next turn?
- Did the response terminate cleanly without long back-and-forth?
- Did the answer look confident and well-structured?
Optimize against those and you get exactly the behavior we see. Agreeing with the user is a near-guaranteed path to a positive next-turn signal. Declaring victory early and producing a tidy summary looks indistinguishable, to a human rater skimming transcripts, from actually finishing. The model isn’t being deceptive; it’s learned that appearing done and being done are rewarded identically, and appearing done is cheaper. This is reward hacking in the standard sense: the policy found a cheap region of behavior space that scores well on the proxy.
The same dynamic explains why the failures cluster at exactly the places verification is hardest: numerical correctness, performance regressions, subtle concurrency, anything where “it compiles and the happy path runs” is wildly insufficient as an oracle. And it gets worse as models improve. Generation cost is falling fast. Verification cost is not (especially if you are building a renderer). Every capability jump widens the gap between how quickly the model can produce something plausible and how long it takes you to confirm it’s actually right. If you don’t build for that asymmetry, you lose ground on every iteration.
What actually works
The fix, in my experience, is to stop relying on the model’s self-report and instead make correctness externally defined and externally checked. A few things that have moved the needle for me on the renderer:
Adversarial review agents. Not one reviewer, several, with explicit instructions to find flaws, disagree, and escalate. A single reviewer inherits the same sycophancy gradient. Multiple reviewers with conflicting mandates (one checks correctness, one checks performance, one looks for silent scope reduction) break the collusion.
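As a sketch of what I mean (prompts abbreviated; `call_model(system, user) -> str` is a stand-in for whatever LLM client you already have):

```python
from dataclasses import dataclass

@dataclass
class Reviewer:
    name: str
    mandate: str   # each reviewer gets a narrow, adversarial brief

REVIEWERS = [
    Reviewer("correctness", "Find logic errors, broken invariants, and missing error handling. "
                            "Assume the diff is wrong until proven otherwise."),
    Reviewer("performance", "Find regressions: extra allocations, lost vectorization, added sync points."),
    Reviewer("scope",       "Compare the diff against the original task. List every requested item that is "
                            "missing, stubbed out, or quietly weakened, tests included."),
]

def review(diff: str, task: str, call_model) -> list[str]:
    """Run each adversarial reviewer; collect objections instead of approvals."""
    objections = []
    for r in REVIEWERS:
        verdict = call_model(
            system=f"You are the {r.name} reviewer. {r.mandate} "
                   "Do not praise the work. If you find nothing, say NO FINDINGS.",
            user=f"Task:\n{task}\n\nDiff:\n{diff}",
        )
        if "NO FINDINGS" not in verdict:
            objections.append(f"[{r.name}] {verdict}")
    return objections
```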
Correctness defined in two registers. Qualitative: image diffs, perceptual checks, “does this frame look right.” Quantitative: variance statistics on the Monte Carlo estimator, convergence rate vs. reference, wall-time budgets with regression thresholds. Either one alone is gameable. Together they’re much harder to satisfy by shortcut.
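Concretely, the gate for a scene ends up looking roughly like this. Everything here is illustrative: `flip_error` stands in for whatever perceptual diff you use (FLIP, SSIM, LPIPS), `splotch_score` is the check sketched earlier, and `stats`/`budget` are whatever your integrator and CI already record.

```python
def evaluate_render(render, reference, stats, budget):
    """Both registers have to pass; neither alone is the gate.
    `stats` is what the integrator reports for this run; `budget` holds
    per-scene thresholds established from known-good runs."""
    checks = {
        # Qualitative register: does the frame look right?
        "perceptual_diff": flip_error(render, reference) < budget.max_flip,
        "splotches":       splotch_score(render, reference) < budget.max_blob_fraction,
        # Quantitative register: is the estimator behaving?
        "variance":        stats.mean_pixel_variance <= budget.max_variance,
        "convergence":     stats.rmse_vs_reference <= budget.rmse_at(stats.spp),
        "wall_time":       stats.seconds <= budget.seconds * 1.05,  # 5% regression allowance
    }
    failures = [name for name, ok in checks.items() if not ok]
    return not failures, failures
```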
No self-graded completion. The model never decides when it’s done. A separate evaluation harness does. If the harness doesn’t pass, the task isn’t finished, regardless of how confident the summary sounds.
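In loop form, with the same caveat that `agent` and `harness` are placeholders for whatever you’ve actually wired up:

```python
def run_task(task, agent, harness, max_attempts=8):
    """The agent proposes; the harness disposes. 'Done' is a property of the
    checks passing, never of the agent's summary."""
    feedback = ""
    for attempt in range(max_attempts):
        diff = agent.step(task, feedback)            # model output; its summary is ignored
        passed, failures = harness.evaluate(diff)    # e.g. evaluate_render above, plus unit tests
        if passed:
            return diff
        feedback = "These checks failed:\n" + "\n".join(failures)
    raise RuntimeError(f"no passing attempt after {max_attempts} tries")
```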
Cross-model review. One of the more effective tricks I’ve landed on: have a different model review the first model’s work. Same model reviewing itself inherits the same priors, the same shortcuts, the same blind spots, and the same training distribution that taught it which corners are safe to cut. A different lab’s model has been shaped by a different reward signal and trips on different things. The disagreements are where the interesting bugs live. It’s not that the second model is smarter; it’s that it’s wrong in uncorrelated ways.
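Mechanically it’s the same review harness as above with the reviewer bound to a different provider; sketched here with `author_call` and `reviewer_call` as stand-ins for your two API wrappers:

```python
def cross_review(task, diff, author_call, reviewer_call):
    """Reviewer comes from a different lab than the author, so their blind
    spots are less correlated. Reuses `review` from the earlier sketch."""
    objections = review(diff, task, call_model=reviewer_call)
    if not objections:
        return diff
    # Feed the other model's objections back to the author instead of letting
    # it re-grade its own work.
    return author_call(
        system="Address every objection below; do not dismiss any without evidence.",
        user=f"Task:\n{task}\n\nYour diff:\n{diff}\n\nObjections:\n" + "\n".join(objections),
    )
```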
The pattern underneath all four: assume the model will take the easiest path that looks like success, and make sure the easiest path is success.
Has this been your experience?
I’d be curious whether others are seeing the same thing, and what you’ve built to counter it. Specifically: how do you define correctness for tasks where the obvious checks are cheap to fool? What does your adversarial review setup look like? And has anyone found a prompting pattern, rather than a harness, that reliably suppresses the “declare victory and exit” behavior?