Frequently Asked Questions

What is a rubber-duck critic agent for code review?

It is an agent that reviews an implementation plan before code is written. Instead of producing code, it challenges assumptions, identifies missing tests, and flags design or correctness risks.

Should a critic agent be allowed to write code?

Usually no. Keep the critic separate from the implementation agent so it does not justify its own patch. Let it produce structured findings, then have a human or separate coding workflow act on them.

What should I include in the critic agent prompt?

Include the task, proposed plan, relevant code context, affected interfaces, and nearby tests. Ask for a structured decision such as approve, revise, or block.

How do I know if my code review agent is working?

Test it with intentionally weak plans. A useful critic should catch missing regression tests, unclear data ownership, edge cases, and overcomplicated designs before approving the plan.

Rubber-Duck Critic Agent for Code Review Workflows

A rubber-duck critic agent is a small review partner that reads your proposed implementation before you touch the code. Unlike a general coding assistant, it is not trying to be agreeable or productive in the usual sense. It asks what will break, which tests are missing, whether the design is more complex than necessary, and where your assumptions are thin. This tutorial builds one as a local planning gate you can run before code review or implementation.

Bottom Line

Do not make the agent both author and judge in the same pass. Give it one narrow job: find reasons your plan is incomplete, risky, or overbuilt.

Prerequisites

A repository with a repeatable test command.
Python available locally for the wrapper script.
An LLM client function you already trust, whether hosted or local.
A habit of writing a short implementation plan before editing code.

The critic agent does not need deep framework magic. The durable design is a plain script that accepts three inputs: the task, the proposed plan, and the relevant code context. It returns a structured decision that a human can read and a workflow can enforce.

Keep formatting clean because the critic will quote and classify your own plan. Before pasting long snippets into prompts, run them through TechBytes' Code Formatter so indentation and block boundaries are not part of the problem.

1. Build the Critic Loop

Start with the behavior contract. The critic should not implement, brainstorm unrelated features, or praise the plan. It should classify findings by severity and ask for concrete fixes.

CRITIC_SYSTEM_PROMPT = """
You are a rubber-duck critic for software implementation plans.
Your job is to challenge the plan before code is written.

Rules:
- Do not write the implementation.
- Do not approve vague plans.
- Identify correctness risks, missing tests, edge cases, migration issues, and simpler alternatives.
- Return JSON only.

JSON shape:
{
  "decision": "approve" | "revise" | "block",
  "findings": [
    {
      "severity": "blocker" | "major" | "minor",
      "area": "correctness" | "tests" | "design" | "security" | "operations",
      "issue": "short description",
      "fix": "specific next action"
    }
  ],
  "missing_tests": ["test that should exist"],
  "simpler_alternative": "short option or empty string"
}
"""

That prompt is intentionally constrained. The wrapper code needs two functions: critique_plan, which prepares the request, and parse_decision, which rejects unstructured output.

import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class CriticInput:
    task: str
    plan: str
    context: str

class CriticError(Exception):
    pass

def critique_plan(payload: CriticInput, call_llm: Callable[[str, str], str]) -> dict:
    user_prompt = f"""
Task:
{payload.task}

Proposed implementation plan:
{payload.plan}

Relevant code context:
{payload.context}
"""
    raw = call_llm(CRITIC_SYSTEM_PROMPT, user_prompt)
    return parse_decision(raw)

def parse_decision(raw: str) -> dict:
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise CriticError(f"Critic returned invalid JSON: {exc}") from exc

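    # json.loads can also return a list or a scalar; only an object has the fields we need.
    if not isinstance(decision, dict):
        raise CriticError("Critic output must be a JSON object")

    # A tuple avoids a TypeError if the model returns an unhashable value here.
    if decision.get("decision") not in ("approve", "revise", "block"):
        raise CriticError("Critic decision must be approve, revise, or block")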

    if not isinstance(decision.get("findings"), list):
        raise CriticError("Critic findings must be a list")

    return decision
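
parse_decision checks only the top-level decision and that findings is a list. If you want the gate to be stricter, a per-finding pass along the following lines may help. This is a minimal sketch: validate_findings is a hypothetical helper that is not part of the script above, and the allowed values are simply copied from the JSON shape in the prompt.

```python
# Hypothetical helper: checks each finding against the enumerations in the prompt.
ALLOWED_SEVERITIES = ("blocker", "major", "minor")
ALLOWED_AREAS = ("correctness", "tests", "design", "security", "operations")
REQUIRED_TEXT_FIELDS = ("issue", "fix")


def validate_findings(findings: list) -> list:
    """Return human-readable problems; an empty list means every finding looks valid."""
    problems = []
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            problems.append(f"finding {index}: expected a JSON object")
            continue
        if finding.get("severity") not in ALLOWED_SEVERITIES:
            problems.append(f"finding {index}: unknown severity {finding.get('severity')!r}")
        if finding.get("area") not in ALLOWED_AREAS:
            problems.append(f"finding {index}: unknown area {finding.get('area')!r}")
        for field in REQUIRED_TEXT_FIELDS:
            value = finding.get(field)
            # Empty or whitespace-only text is treated the same as a missing field.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"finding {index}: missing or empty {field!r}")
    return problems
```

One way to use it is to call validate_findings(decision["findings"]) at the end of parse_decision and raise CriticError when the returned list is not empty.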

2. Add Code Context Without Flooding the Prompt

The easiest way to weaken this agent is to paste the whole repository into it. Give it enough code to reason, not enough to drown. For implementation planning, useful context usually fits into four buckets.

Changed files or files likely to change.
Existing tests around the same behavior.
Public interfaces the plan touches.
Operational constraints such as migrations, background jobs, or feature flags.
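
The four buckets above can be assembled into one context string with a per-bucket size cap, so a single large file cannot crowd out the tests or constraints. The sketch below is one possible approach: build_context, the bucket labels, and the 4000-character default are illustrative assumptions, not values from the original workflow.

```python
# Assumed budget; tune it to the context window of the model you use.
MAX_CHARS_PER_BUCKET = 4000


def build_context(buckets: dict, max_chars: int = MAX_CHARS_PER_BUCKET) -> str:
    """Join labeled context buckets, truncating each one to max_chars characters."""
    sections = []
    for label, text in buckets.items():
        body = text.strip()
        if len(body) > max_chars:
            # Mark the cut so the critic knows it is not seeing the whole bucket.
            body = body[:max_chars] + "\n[truncated]"
        sections.append(f"=== {label} ===\n{body}")
    return "\n\n".join(sections)


# Example with placeholder content for each bucket.
example_context = build_context({
    "Files likely to change": "retry_payment(order_id) calls charge_card, then send_invoice.",
    "Existing tests": "No retry tests were found for invoice emails.",
    "Public interfaces": "send_invoice(order_id)",
    "Operational constraints": "Retries run in a background job.",
})
```

The result can be written to the file passed as the context input, or assigned directly to CriticInput.context.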

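A minimal command wrapper can take the task as text and read the plan and context from files. The flags --task, --plan, and --context are defined by this local script, not by any external tool, so they are safe to rename or extend.

# Assumes CriticInput and critique_plan from step 1 are defined in, or imported into, this file.
import argparse
import json
from pathlib import Path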

class FakeLLM:
    def __call__(self, system: str, user: str) -> str:
        return json.dumps({
            "decision": "revise",
            "findings": [
                {
                    "severity": "major",
                    "area": "tests",
                    "issue": "The plan does not name the regression test that will fail before the fix.",
                    "fix": "Add one failing test for the reported behavior before editing production code."
                }
            ],
            "missing_tests": ["Regression test covering the original bug report"],
            "simpler_alternative": "Limit the first patch to the failing path, then refactor after tests pass."
        })

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", required=True)
    parser.add_argument("--plan", required=True)
    parser.add_argument("--context", required=True)
    args = parser.parse_args()

    payload = CriticInput(
        task=args.task,
        plan=Path(args.plan).read_text(),
        context=Path(args.context).read_text(),
    )
    decision = critique_plan(payload, FakeLLM())
    print(json.dumps(decision, indent=2))

if __name__ == "__main__":
    main()

Pro tip: Keep the fake client in tests. It lets you verify parsing, gating, and error handling without depending on a live model call.
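
The script above prints the decision but does not yet act on it. One simple way to add the gating mentioned in the tip is to map the decision to a process exit code that a shell script or CI step can check. This is a sketch under assumptions: exit_code_for and the specific codes are hypothetical choices, and anything unrecognized fails closed.

```python
# Assumed convention: 0 lets the workflow continue, any other code stops it.
EXIT_CODES = {"approve": 0, "revise": 1, "block": 2}
FAIL_CLOSED = 2


def exit_code_for(decision: dict) -> int:
    """Map a parsed critic decision to a process exit code (hypothetical helper)."""
    value = decision.get("decision")
    if not isinstance(value, str):
        return FAIL_CLOSED
    # Unknown decision strings also fail closed rather than slipping through.
    return EXIT_CODES.get(value, FAIL_CLOSED)
```

In main(), this could be wired in by importing sys and calling sys.exit(exit_code_for(decision)) after printing. Because the function takes a plain dict, it can be tested with the fake client's output and no model call.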

3. Verify the Agent and Expected Output

Do not verify the critic with a polished plan. Give it a plan that is plausible but incomplete. A good critic should push back on ambiguity and missing tests.

Task: Fix duplicate invoice emails after payment retry.

Plan:
Update the payment retry handler to check whether an invoice email was already sent.
If it was sent, skip the email step.

Context:
retry_payment(order_id) currently calls charge_card(order_id), then send_invoice(order_id).

The output should match this shape, even if the exact wording differs:

{
  "decision": "revise",
  "findings": [
    {
      "severity": "major",
      "area": "correctness",
      "issue": "The plan does not define the durable source of truth for whether an invoice was sent.",
      "fix": "Use an idempotency record or persisted email status instead of an in-memory check."
    },
    {
      "severity": "major",
      "area": "tests",
      "issue": "The plan does not include a retry regression test.",
      "fix": "Add a test where the first payment attempt sends an invoice and the retry does not."
    }
  ],
  "missing_tests": [
    "Retry after successful invoice send does not send a second invoice",
    "Retry after failed invoice send can still send one invoice"
  ],
  "simpler_alternative": "Add idempotency around send_invoice before restructuring the retry flow."
}

Use three pass/fail checks before putting the critic into daily use.

It returns valid JSON for normal plans.
It marks incomplete plans as revise or block.
It identifies at least one missing test for intentionally weak plans.
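
The three checks above are small enough to automate. The following sketch expresses each one as a function; the function names are hypothetical, and the first check takes the raw model output while the other two take the parsed decision.

```python
import json


def check_valid_json(raw: str) -> bool:
    """Check 1: the raw critic output parses as a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def check_pushes_back(decision: dict) -> bool:
    """Check 2: an intentionally incomplete plan is marked revise or block."""
    return decision.get("decision") in ("revise", "block")


def check_names_missing_test(decision: dict) -> bool:
    """Check 3: at least one missing test is named for a weak plan."""
    missing = decision.get("missing_tests")
    return isinstance(missing, list) and len(missing) > 0
```

Run checks 2 and 3 only against plans you have deliberately weakened, such as the invoice example. A strong plan that is approved with no missing tests is not a failure.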

Troubleshooting: Top 3 Failures

1. The critic approves everything

This usually means the prompt is too polite or the examples are too clean. Tighten the role and add explicit rejection criteria.

Require at least one risk analysis item for non-trivial changes.
Reject plans that do not name tests.
Reject plans that do not identify affected files or interfaces.
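
Two of these rejection criteria can also be enforced before the model is called at all. The sketch below is a deliberately crude pre-check: precheck_plan and both regular expressions are assumptions for illustration, and the patterns will need adjusting to your repository's file types and naming conventions.

```python
import re

# Crude, assumed heuristics; they look for the word "test" and for
# file-like or call-like tokens such as tests/test_retry.py or retry_payment(.
TEST_PATTERN = re.compile(r"\btest", re.IGNORECASE)
FILE_OR_INTERFACE_PATTERN = re.compile(r"\b[\w/]+\.(?:py|js|ts|go|rs|java)\b|\b\w+\(")


def precheck_plan(plan: str) -> list:
    """Return rejection reasons found before any model call; empty means the plan may proceed."""
    reasons = []
    if not TEST_PATTERN.search(plan):
        reasons.append("Plan does not name any tests.")
    if not FILE_OR_INTERFACE_PATTERN.search(plan):
        reasons.append("Plan does not identify affected files or interfaces.")
    return reasons
```

A pre-check like this does not replace the critic. It only stops obviously underspecified plans early, and it keeps the rejection criteria from depending on how strictly the model reads the prompt.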

2. The critic rewrites the implementation

The model is drifting into assistant mode. Repeat the boundary in the system prompt and schema. The output should contain objections and next actions, not replacement code.

Remove fields such as patch, implementation, or new_code.
Add a validation check that fails when large code blocks appear in findings.
Ask for smaller alternatives, not complete rewrites.
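
The validation check in the second item could look something like the following. It is a heuristic sketch, not a reliable code detector: finding_contains_code, the three-line limit, and the indentation rule are all assumptions to tune against real critic output.

```python
# Assumed threshold: findings should be short objections, not multi-line patches.
MAX_FINDING_LINES = 3
# Built from single backticks so this example can sit inside a fenced block.
FENCE = "`" * 3


def finding_contains_code(finding: dict, max_lines: int = MAX_FINDING_LINES) -> bool:
    """Heuristically detect replacement code inside a finding's text fields."""
    for field in ("issue", "fix"):
        text = str(finding.get(field, ""))
        lines = text.splitlines()
        # A code fence or a long multi-line answer suggests a pasted patch.
        if FENCE in text or len(lines) > max_lines:
            return True
        # Indented lines are another common sign of embedded code.
        if any(line.startswith(("    ", "\t")) for line in lines):
            return True
    return False
```

If any finding trips the check, the wrapper can raise CriticError and treat the response the same way as invalid JSON.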

3. The findings are too generic

Generic feedback usually means generic context. Add the relevant test names, function signatures, and failure mode. The critic cannot reason about edge cases it cannot see.

Include the current behavior and desired behavior.
Include nearby tests, not just production code.
Include constraints such as backward compatibility or data privacy.

What's Next

Once the critic is useful locally, wire it into the places where plans already exist. The best insertion points are lightweight and reversible.

Run it on pull request descriptions before review starts.
Use block only for missing tests, data loss risks, or security issues.
Store critic output as review notes so humans can accept or reject each finding.
Track false positives so the prompt improves from real engineering feedback.
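
Storing critic output as review notes can be as simple as rendering each finding as a line a reviewer marks as accepted or rejected. The sketch below assumes a plain-text checklist format; render_review_notes and the checkbox layout are hypothetical, so adapt them to whatever your review tool displays well.

```python
def render_review_notes(decision: dict) -> str:
    """Render critic findings as a plain-text checklist a reviewer can accept or reject."""
    checkbox = "[ ] accept  [ ] reject  "
    lines = [f"Critic decision: {decision.get('decision', 'unknown')}"]
    for finding in decision.get("findings", []):
        lines.append(
            checkbox
            + f"({finding.get('severity', '?')}/{finding.get('area', '?')}) "
            + f"{finding.get('issue', '')} -> {finding.get('fix', '')}"
        )
    for test in decision.get("missing_tests", []):
        lines.append(checkbox + f"(missing test) {test}")
    return "\n".join(lines)
```

Counting the rejected boxes over time gives a rough false-positive signal for each severity and area, which is the kind of feedback the prompt can be tuned against.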

The mature version is not an autonomous reviewer that replaces engineers. It is a planning pressure test. It makes vague work visible earlier, when changing direction is still cheap.
