# Both sides of LLM peer review at ICML 2026
This was the first cycle where ICML's reviewer policy explicitly carved out room for LLM assistance (ICML 2026 LLM Policy). I reviewed under Policy B. On the author side, a paper of mine was hit by what was almost certainly an LLM-amplified hostile review. Both experiences arrived in the same cycle, and they've left me with a more uncomfortable view of where this is going than the one I started with.
I'm writing this partly because I was just listed as a Gold Reviewer for ICML 2026, which felt like an odd time to be quietly furious about how broken the other half of the system can be.
# Why I went in pro-LLM
When I talked to peers about this earlier in the year, the dominant take was "LLM-assisted reviewing is bad and shouldn't be allowed." My take was the opposite. My own recent interests have been in LLMs and the various applications people are building on top of them; it would be intellectually dishonest for me to argue that tools I work with shouldn't touch the workflow I'm part of. So I declared Policy B (the permissive option: LLM use is allowed on all my assigned papers, as long as it stays inside the policy's allowed list) and didn't think much more about it.
# What LLMs were actually useful for
The honest answer is: the boring parts of being a careful reviewer.
When a paper claims a 4-point improvement over five baselines, somewhere in the appendix there's a table that says whether the baselines were tuned, what the sequence length was, and whether the eval is comparable. Tracking that across the paper, the cited baseline paper, and the supplementary material used to be the work I'd skim past when I was tired, and I think most reviewers do, in practice, skim past it. With LLM assistance to broadly verify "are these numbers actually comparable across what's cited," I caught fairness issues I would not otherwise have flagged. Multiple times.
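To make that concrete, here is a minimal sketch of the kind of comparability check I mean, written against the Anthropic Python SDK. The model name, the prompt, and `compare_baselines` itself are all illustrative, not the exact workflow I ran:

```python
# Hypothetical sketch: cross-check whether reported numbers are
# actually comparable across the paper, its cited baseline paper,
# and the appendix. Requires `pip install anthropic` and an
# ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

PROMPT = """You are checking experimental comparability, not paper quality.
Using only the excerpts below, answer each question, citing the exact
table or sentence you relied on, or say "not stated":
1. Were the baselines re-tuned, or were numbers copied from prior work?
2. Do sequence length, training data, and eval protocol match across rows?
3. Is any headline improvement computed against a mismatched setup?

<paper>{paper}</paper>
<baseline_paper>{baseline}</baseline_paper>
<appendix>{appendix}</appendix>"""

def compare_baselines(paper: str, baseline: str, appendix: str) -> str:
    """Returns the model's comparability report: a starting point to
    verify by hand against the cited tables, not a verdict."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model choice
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": PROMPT.format(paper=paper, baseline=baseline,
                                     appendix=appendix),
        }],
    )
    return response.content[0].text
```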
Language polish on my own drafted review, also fine. Same thing I'd ask a colleague to do, faster.
# Then I got a paper rejected by what was clearly the other use case
One of my own submissions came back with a review that, in retrospect, I can describe pattern-by-pattern:
- Twenty-plus enumerated weaknesses, in a clean ordered list
- Mixed throughout: a few legitimate observations, several factually wrong claims that confidently misread our setup, and a long tail of unfalsifiable objections ("the authors should have explored…" with no specific demand)
- An evaluation-framework attack: the baselines and benchmarks we used were declared insufficient, with a list of alternatives that were either inappropriate for the setting or didn't exist in usable form
- Confidence: 5/5. Score: lowest possible.
- After rebuttal: the reviewer's response was, almost verbatim, "the authors' clarifications reinforce my concerns," taking our concessions on minor points and using them to confirm the rejection.
I don't have proof an LLM wrote that review, and I don't need to. The structural signature is what matters: this is what happens when somebody is too tired, too hostile, or too indifferent to write a real review, and uses a tool that lets them produce something that looks legitimate at scale. The marginal cost of generating one more plausible objection collapsed to nearly zero. The asymmetric burden of responding to twenty-plus weaknesses, half of which require a paragraph each to refute, fell entirely on the authors.
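The signature is concrete enough to caricature in code. Here is a toy heuristic for it; the field names and thresholds are invented for illustration, and I'm not suggesting ACs automate this, only that the pattern is mechanically describable:

```python
# Toy sketch of the structural signature; thresholds are made up,
# not calibrated against real review data.
from dataclasses import dataclass

@dataclass
class Review:
    num_weaknesses: int      # enumerated weakness items
    num_unfalsifiable: int   # "should have explored..." with no concrete ask
    score: int               # 1 = lowest possible
    confidence: int          # 5 = maximum
    rebuttal_engaged: bool   # did the response address any specific point?

def looks_adversarial(r: Review) -> bool:
    """Flags the pattern: wall of weaknesses, heavy unfalsifiable tail,
    locked-in reject at max confidence, no real rebuttal engagement."""
    unfalsifiable_ratio = r.num_unfalsifiable / max(r.num_weaknesses, 1)
    return (r.num_weaknesses >= 15
            and unfalsifiable_ratio > 0.4
            and r.score == 1
            and r.confidence == 5
            and not r.rebuttal_engaged)
```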
# The asymmetry I keep coming back to
A careful reviewer gets somewhat better with LLM assistance. They catch more details, write more clearly, spend less time on bookkeeping.
A careless or hostile reviewer gets radically worse, in the sense of producing much more output of much lower epistemic quality, with much higher surface plausibility. The cost of appearing thorough collapsed; the cost of being correct did not.
I think most pro-LLM-in-review arguments (including the one I made to my peers earlier this year) implicitly model the upgrade as uniform. It isn't. It's a multiplier on whatever the reviewer is already doing, and it's the bottom-quartile reviewer whose output actually becomes unbounded.
Which means the disqualifying property is not the LLM, and not LLM-use. It's the reviewer the LLM is multiplying. The bad-faith reviewer was always going to produce a destructive review; the LLM just lowered their cost. The fight about LLM-in-review is, at the level of policy, the wrong fight.
# Mean Reviewer
I responded to this experience by writing a Claude Code skill: mean-reviewer-skill. It's a deliberately destructive reviewer agent, the "Armchair Executioner," that I can run against my own drafts before submission. It generates exactly the kind of review I just described: the wall of weaknesses, the unfalsifiable objections, the eval-framework attack, the locked-in reject with max confidence, the weaponized rebuttal response.
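In Claude Code, a skill is a folder with a SKILL.md file: YAML frontmatter naming and describing the skill, then instructions in plain markdown. The sketch below is a condensed paraphrase of the shape mean-reviewer-skill takes, not the actual file:

```markdown
---
name: mean-reviewer-skill
description: Review a paper draft as the "Armchair Executioner",
  a maximally hostile reviewer, to stress-test it before submission.
---

Review the provided paper as a hostile, bad-faith reviewer:

- Enumerate 20+ weaknesses in a clean ordered list, mixing a few
  legitimate observations with confident misreadings and a long tail
  of unfalsifiable "the authors should have explored..." items.
- Attack the evaluation framework: declare the baselines and
  benchmarks insufficient and demand alternatives.
- Score: lowest possible. Confidence: 5/5.
- In rebuttal mode, respond that the authors' clarifications
  reinforce your concerns, using any concession to confirm rejection.
```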
The point is not to enable abuse. That's available without me. The point is to make the pattern legible to authors, so they can:
- Stress-test their own paper against the worst-case review before submission and pre-empt the most dangerous objections
- Recognize the pattern when it lands on them post-decision, and not internalize a hostile-LLM review as a real signal about their work
- Discuss this concretely at the policy level, with examples that exist, rather than in the abstract
When I tested it against a NeurIPS 2025 oral paper, it produced a review that, on the surface, could plausibly have gotten that paper rejected. That outcome is the warning.
# Where I land now
The policy debate keeps framing the question as "should LLMs be allowed in review." I now think that's the wrong question. The thing that needs to be disqualified isn't the tool, and it isn't the use of the tool. It's the bad-faith, careless, or hostile reviewer who was always going to produce a destructive review and is now able to produce it faster. Banning LLMs lowers good-reviewer throughput without raising the floor; allowing them raises good-reviewer throughput without raising the floor either. The floor is the actual problem.
So the lever, to me, sits with ACs and PCs:
- Weight reviewer disagreement more aggressively when one review shows the structural pattern above (long unfalsifiable weakness list, locked-in low score with max confidence, weaponized rebuttal response).
- Disqualify or demote reviewers whose pattern recurs across papers and cycles, regardless of whether they used an LLM. The behavior is the disqualifier, not the tool.
- Don't moralize about LLM-use in review. It's a distraction from the people who would write the same review without one.
For authors, my advice is more defensive: assume the adversarial pattern will land on you at least once, and design your paper's framing to survive it. That's what mean-reviewer-skill is for: making the worst-case review legible enough that you can pre-empt it, and recognizable enough that you don't internalize it as feedback when it arrives.
I'll keep reviewing under Policy B. I'll keep using LLMs to do it carefully. But I no longer think the LLM question is the question. The question is who gets to keep reviewing.