Rubber-Duck Critic Agent for Code Review Workflows
Bottom Line
A rubber-duck critic agent is most useful when it does not write the patch first. Its job is to interrogate assumptions, expose missing tests, and force an implementation plan to survive structured objections.
Key Takeaways
- Separate critic behavior from implementation behavior to avoid rubber-stamp reviews.
- Ask for risks, missing tests, edge cases, and simpler alternatives before coding.
- Use structured JSON output so the critic can gate plans in CI or local scripts.
- Verify the agent with intentionally weak plans before trusting it on real work.
A rubber-duck critic agent is a small review partner that reads your proposed implementation before you touch the code. Unlike a general coding assistant, it is not trying to be agreeable or productive in the usual sense. It asks what will break, which tests are missing, whether the design is more complex than necessary, and where your assumptions are thin. This tutorial builds one as a local planning gate you can run before code review or implementation.
Prerequisites
Bottom Line
Do not make the agent both author and judge in the same pass. Give it one narrow job: find reasons your plan is incomplete, risky, or overbuilt.
- A repository with a repeatable test command.
- Python available locally for the wrapper script.
- An LLM client function you already trust, whether hosted or local.
- A habit of writing a short implementation plan before editing code.
The critic agent does not need deep framework magic. The durable design is a plain script that accepts three inputs: the task, the proposed plan, and the relevant code context. It returns a structured decision that a human can read and a workflow can enforce.
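The "workflow can enforce" half of that contract can be as small as mapping the decision to a process exit code. The sketch below assumes the three decision values this tutorial uses (approve, revise, block). The names exit_code_for and summarize and the specific exit codes are illustrative choices, not part of any existing tool.

```python
# Hypothetical mapping: 0 lets the workflow continue, non-zero stops it.
EXIT_CODES = {"approve": 0, "revise": 1, "block": 2}


def exit_code_for(decision: dict) -> int:
    """Translate a critic decision into a process exit code.

    Unknown or missing decisions fail closed (exit code 2) so a
    malformed critic response never lets a plan through silently.
    """
    value = decision.get("decision")
    if not isinstance(value, str):
        return 2
    return EXIT_CODES.get(value, 2)


def summarize(decision: dict) -> str:
    """Render one line per finding for the human reader."""
    lines = [f"decision: {decision.get('decision', 'unknown')}"]
    for finding in decision.get("findings", []):
        lines.append(
            f"  [{finding.get('severity', '?')}] "
            f"{finding.get('area', '?')}: {finding.get('issue', '')}"
        )
    return "\n".join(lines)
```

A wrapper script could print summarize(decision) and then call sys.exit(exit_code_for(decision)), which is enough for a pre-commit hook or CI step to stop on anything other than approve.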
Keep formatting clean because the critic will quote and classify your own plan. Before pasting long snippets into prompts, run them through TechBytes' Code Formatter so indentation and block boundaries are not part of the problem.
1. Build the Critic Loop
Start with the behavior contract. The critic should not implement, brainstorm unrelated features, or praise the plan. It should classify findings by severity and ask for concrete fixes.
CRITIC_SYSTEM_PROMPT = """
You are a rubber-duck critic for software implementation plans.
Your job is to challenge the plan before code is written.

Rules:
- Do not write the implementation.
- Do not approve vague plans.
- Identify correctness risks, missing tests, edge cases, migration issues, and simpler alternatives.
- Return JSON only.

JSON shape:
{
  "decision": "approve" | "revise" | "block",
  "findings": [
    {
      "severity": "blocker" | "major" | "minor",
      "area": "correctness" | "tests" | "design" | "security" | "operations",
      "issue": "short description",
      "fix": "specific next action"
    }
  ],
  "missing_tests": ["test that should exist"],
  "simpler_alternative": "short option or empty string"
}
"""

That prompt is intentionally constrained. The important functions are critique_plan, which prepares the request, and parse_decision, which refuses unstructured output.
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriticInput:
    task: str
    plan: str
    context: str


class CriticError(Exception):
    """Raised when the critic's reply cannot be used as a decision."""


def critique_plan(payload: CriticInput, call_llm: Callable[[str, str], str]) -> dict:
    # call_llm takes (system_prompt, user_prompt) and returns the raw reply text.
    user_prompt = f"""
Task:
{payload.task}

Proposed implementation plan:
{payload.plan}

Relevant code context:
{payload.context}
"""
    raw = call_llm(CRITIC_SYSTEM_PROMPT, user_prompt)
    return parse_decision(raw)


def parse_decision(raw: str) -> dict:
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise CriticError(f"Critic returned invalid JSON: {exc}") from exc
    # Valid JSON can still be a list or a string; only an object supports .get().
    if not isinstance(decision, dict):
        raise CriticError("Critic output must be a JSON object")
    if decision.get("decision") not in {"approve", "revise", "block"}:
        raise CriticError("Critic decision must be approve, revise, or block")
    if not isinstance(decision.get("findings"), list):
        raise CriticError("Critic findings must be a list")
    return decision

2. Add Code Context Without Flooding the Prompt
The easiest way to weaken this agent is to paste the whole repository into it. Give it enough code to reason, not enough to drown. For implementation planning, useful context usually fits into four buckets.
- Changed files or files likely to change.
- Existing tests around the same behavior.
- Public interfaces the plan touches.
- Operational constraints such as migrations, background jobs, or feature flags.
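One way to keep those four buckets from flooding the prompt is to give each a share of a fixed character budget. This is a minimal sketch: ContextBucket, build_context, the 8000-character default, and the "[truncated]" marker are all assumptions made for illustration. The budget applies to bucket bodies only, so headings and markers add a little on top.

```python
from dataclasses import dataclass


@dataclass
class ContextBucket:
    title: str  # e.g. "Changed files" or "Existing tests"
    text: str   # raw snippet pasted or read from disk


def build_context(buckets: list[ContextBucket], max_chars: int = 8000) -> str:
    """Join non-empty buckets under headings, splitting the budget evenly."""
    filled = [b for b in buckets if b.text.strip()]
    if not filled:
        return ""
    per_bucket = max_chars // len(filled)
    sections = []
    for bucket in filled:
        body = bucket.text.strip()
        if len(body) > per_bucket:
            # Cut the body and say so, rather than silently dropping the tail.
            body = body[:per_bucket].rstrip() + "\n[truncated]"
        sections.append(f"## {bucket.title}\n{body}")
    return "\n\n".join(sections)
```

An even split is the simplest policy. If tests matter more than production code for a given task, weighting the buckets is a reasonable next step.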
A minimal command wrapper can read a plan file and a context file. The custom flags --task, --plan, and --context belong to this local script, so they are safe to adapt.
import argparse
import json
from pathlib import Path

# CriticInput and critique_plan come from the listing in section 1.


class FakeLLM:
    """Stand-in for a real client; always returns the same canned critique."""

    def __call__(self, system: str, user: str) -> str:
        return json.dumps({
            "decision": "revise",
            "findings": [
                {
                    "severity": "major",
                    "area": "tests",
                    "issue": "The plan does not name the regression test that will fail before the fix.",
                    "fix": "Add one failing test for the reported behavior before editing production code."
                }
            ],
            "missing_tests": ["Regression test covering the original bug report"],
            "simpler_alternative": "Limit the first patch to the failing path, then refactor after tests pass."
        })


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", required=True)     # task description, passed inline
    parser.add_argument("--plan", required=True)     # path to the plan file
    parser.add_argument("--context", required=True)  # path to the context file
    args = parser.parse_args()
    payload = CriticInput(
        task=args.task,
        plan=Path(args.plan).read_text(encoding="utf-8"),
        context=Path(args.context).read_text(encoding="utf-8"),
    )
    # Swap FakeLLM() for your real client once the plumbing works.
    decision = critique_plan(payload, FakeLLM())
    print(json.dumps(decision, indent=2))


if __name__ == "__main__":
    main()

3. Verify the Agent and Expected Output
Do not verify the critic with a polished plan. Give it a plan that is plausible but incomplete. A good critic should push back on ambiguity and missing tests.
Task: Fix duplicate invoice emails after payment retry.

Plan:
Update the payment retry handler to check whether an invoice email was already sent.
If it was sent, skip the email step.

Context:
retry_payment(order_id) currently calls charge_card(order_id), then send_invoice(order_id).

Expected output should look like this shape, even if the exact wording differs:
{
  "decision": "revise",
  "findings": [
    {
      "severity": "major",
      "area": "correctness",
      "issue": "The plan does not define the durable source of truth for whether an invoice was sent.",
      "fix": "Use an idempotency record or persisted email status instead of an in-memory check."
    },
    {
      "severity": "major",
      "area": "tests",
      "issue": "The plan does not include a retry regression test.",
      "fix": "Add a test where the first payment attempt sends an invoice and the retry does not."
    }
  ],
  "missing_tests": [
    "Retry after successful invoice send does not send a second invoice",
    "Retry after failed invoice send can still send one invoice"
  ],
  "simpler_alternative": "Add idempotency around send_invoice before restructuring the retry flow."
}

Use three pass/fail checks before putting the critic into daily use.
- It returns valid JSON for normal plans.
- It marks incomplete plans as revise or block.
- It identifies at least one missing test for intentionally weak plans.
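Those three checks are mechanical enough to script. The sketch below is one possible shape, assuming you label each sample plan yourself as weak or not. The function name check_critic_output and the failure messages are hypothetical.

```python
import json


def check_critic_output(raw: str, plan_is_weak: bool) -> list[str]:
    """Return the failed checks; an empty list means the critic passed."""
    # Check 1: the reply must be valid JSON, and an object rather than a list.
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(decision, dict):
        return ["output is not a JSON object"]

    failures = []
    if plan_is_weak:
        # Check 2: incomplete plans must not be approved.
        if decision.get("decision") not in {"revise", "block"}:
            failures.append("weak plan was not marked revise or block")
        # Check 3: at least one missing test must be named.
        if not decision.get("missing_tests"):
            failures.append("no missing test identified for a weak plan")
    return failures
```

Run it over a handful of deliberately weak plans and a few solid ones. Any non-empty result is a reason to tighten the prompt before relying on the critic.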
Troubleshooting: Top 3 Failures
1. The critic approves everything
This usually means the prompt is too polite or the examples are too clean. Tighten the role and add explicit rejection criteria.
- Require at least one risk analysis item for non-trivial changes.
- Reject plans that do not name tests.
- Reject plans that do not identify affected files or interfaces.
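The last two criteria can also be enforced before the model is called at all. The sketch below uses deliberately crude regular expressions as a stand-in. The patterns, the file-extension list, and the name precheck_plan are assumptions to tune for your own repository, and a heuristic like this will produce both false alarms and misses.

```python
import re

# Matches "test", "tests", or an identifier such as test_retry_skips_email.
TEST_PATTERN = re.compile(r"\btests?\b|\btest_\w+", re.IGNORECASE)
# Matches a path with a known extension, or a call-like name such as retry_payment(.
FILE_PATTERN = re.compile(r"\b[\w./-]+\.(py|js|ts|go|java|rb|sql)\b|\b\w+\(")


def precheck_plan(plan: str) -> list[str]:
    """Return reasons to reject a plan before it is sent to the critic."""
    reasons = []
    if not TEST_PATTERN.search(plan):
        reasons.append("plan does not name any tests")
    if not FILE_PATTERN.search(plan):
        reasons.append("plan does not identify affected files or interfaces")
    return reasons
```

A plan that fails the pre-check can be sent back to its author immediately, which also keeps the critic from approving something the script could have rejected for free.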
2. The critic rewrites the implementation
The model is drifting into assistant mode. Repeat the boundary in the system prompt and schema. The output should contain objections and next actions, not replacement code.
- Remove fields such as patch, implementation, or new_code.
- Add a validation check that fails when large code blocks appear in findings.
- Ask for smaller alternatives, not complete rewrites.
3. The findings are too generic
Generic feedback usually means generic context. Add the relevant test names, function signatures, and failure mode. The critic cannot reason about edge cases it cannot see.
- Include the current behavior and desired behavior.
- Include nearby tests, not just production code.
- Include constraints such as backward compatibility or data privacy.
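If your context files follow a labeled layout, a small check can catch thin context before it produces generic findings. The section labels below ("Current behavior:", "Desired behavior:", "Nearby tests:", "Constraints:") are an assumed convention for this sketch, not something the critic requires.

```python
# Assumed convention: each section of the context file starts with "<Label>:".
REQUIRED_SECTIONS = (
    "Current behavior",
    "Desired behavior",
    "Nearby tests",
    "Constraints",
)


def missing_context_sections(context: str) -> list[str]:
    """List the required section labels that do not appear in the context text."""
    lowered = context.lower()
    return [
        name for name in REQUIRED_SECTIONS
        if f"{name.lower()}:" not in lowered
    ]
```

The check only confirms that a label is present, not that the section says anything useful, so it is a reminder for the author rather than a quality gate.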
What's Next
Once the critic is useful locally, wire it into the places where plans already exist. The best insertion points are lightweight and reversible.
- Run it on pull request descriptions before review starts.
- Use block only for missing tests, data loss risks, or security issues.
- Store critic output as review notes so humans can accept or reject each finding.
- Track false positives so the prompt improves from real engineering feedback.
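The last two items can start as plain text and a single number. In this sketch, render_review_notes turns a decision into a checklist a reviewer can mark up, and false_positive_rate summarizes how often humans rejected the critic's findings. Both names and the checklist layout are illustrative assumptions.

```python
def render_review_notes(decision: dict) -> str:
    """Render findings as a plain-text checklist a reviewer can accept or reject."""
    lines = [f"Critic decision: {decision.get('decision', 'unknown')}"]
    for number, finding in enumerate(decision.get("findings", []), start=1):
        severity = finding.get("severity", "?")
        area = finding.get("area", "?")
        lines.append(
            f"{number}. [ ] accept [ ] reject ({severity}/{area}) "
            f"{finding.get('issue', '')}"
        )
        lines.append(f"   Suggested fix: {finding.get('fix', '')}")
    for test in decision.get("missing_tests", []):
        lines.append(f"- Missing test: {test}")
    return "\n".join(lines)


def false_positive_rate(verdicts: list[bool]) -> float:
    """Share of findings a human rejected.

    Each verdict is True when the finding was accepted and False when it
    was rejected; an empty history yields 0.0.
    """
    if not verdicts:
        return 0.0
    return verdicts.count(False) / len(verdicts)
```

A rising rejection rate for one area, such as design, is a hint about which part of the prompt to revise first.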
The mature version is not an autonomous reviewer that replaces engineers. It is a planning pressure test. It makes vague work visible earlier, when changing direction is still cheap.