[ PLAYBOOK · 01 ] · MAY 7, 2026 · 7 min

Agentic coding with Claude Code.

A practitioner view of where the gains are, where the rough edges remain, and how an engineering team can adopt the tool without compromising its review process.


What changes when coding becomes agentic

Most teams call their autocomplete an "AI agent." That is marketing.

A real agent reads your codebase, writes code, runs your tests, and iterates on failures without prompting at every step. Both are useful. Only one changes what a developer does in an afternoon.

Claude Code is the most opinionated tool in the agentic category. It is a CLI built by Anthropic that operates inside an existing repository. It edits files, runs shell commands, parses test output, plans before acting, and surfaces every action for a human to review. The relevant change is not that it writes code. It is that the tool sustains attention across multi-step tasks where the developer previously had to hold the working memory.

The gains from this shift are real, and concentrated. Three workflows account for most of the productivity wins we have measured. The rest is either hype or work that still needs human judgment. The rollout plan has to account for both.

A short tour of Claude Code in 2026

The pieces below are the load-bearing ones for team adoption. There are more (custom slash commands, the status line, IDE extensions), but these four are the ones that change how work gets done.

Plan mode

Plan mode separates "decide what to do" from "do it." The user issues a request. The agent reads the relevant files, formulates a step-by-step plan, and waits for explicit approval before touching the codebase. Approval is a single keystroke. Rejection routes back to refine the plan.

This is the single highest-leverage feature for trust-building during rollout. A team can adopt agentic coding under a rule of "always plan first" and build comfort from there.
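
For teams that want plan mode to be a deliberate default rather than a per-session choice, the CLI can be launched straight into it. The flag below reflects the CLI at the time of writing; confirm the exact name against claude --help before encoding it in a team script.

    # Start a session that plans and waits for approval before touching any file.
    # Flag name assumed from current CLI docs; inside a session the permission
    # mode can also be switched by keystroke.
    claude --permission-mode plan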

Skills

A skill is a packaged workflow. It is a directory with a SKILL.md that includes a name, a description, and a body of instructions for the agent to follow when invoked. The user types /skill-name and the agent loads the skill's full instruction set as context.
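
A minimal sketch of what that looks like on disk, using a hypothetical release-cutting skill at .claude/skills/cut-release/SKILL.md. The frontmatter mirrors the name-plus-description shape described above; the file names and steps are placeholders for a team's own process, not a recommended workflow.

    ---
    name: cut-release
    description: Cut a patch release by bumping the version, updating the changelog, tagging, and opening the release PR.
    ---

    1. Read the current version from package.json and bump the patch number.
    2. Move the entries under "Unreleased" in CHANGELOG.md into a new dated section.
    3. Commit on a release/<version> branch and tag it v<version>.
    4. Open a pull request titled "Release <version>" with the new changelog section as the body.

Invoking it is a matter of typing /cut-release; the agent loads the body and works through the steps under the same plan-and-approve loop as any other task.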

Teams use skills for any workflow that recurs: cutting a release, opening a structured pull request, generating a runbook from an incident retrospective, normalizing a feature across the codebase. The skill captures the institutional knowledge once, and every developer on the team executes it the same way.

MCP servers

Model Context Protocol servers are the bridge between Claude Code and the tools your team already uses. There are MCP servers for Linear, GitHub, Postgres, Chrome DevTools, Playwright, and a growing list of paid platforms. The agent can read a Linear ticket, open the corresponding branch, run the failing test through Playwright, and return a draft PR description, all without leaving the terminal session.

The integration cost is low. The discipline cost is real. Each MCP server adds tools the agent can call, which expands its blast radius. Treat MCP server selection as a permissions decision, not a feature decision.
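
A sketch of the project-scoped configuration, assuming the checked-in .mcp.json format (a mcpServers map of server names to launch commands); the package name and token variable are illustrative, not an endorsement of a particular server.

    {
      "mcpServers": {
        "github": {
          "command": "npx",
          "args": ["-y", "@modelcontextprotocol/server-github"],
          "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PERSONAL_ACCESS_TOKEN}" }
        }
      }
    }

Because the file is checked in, adding a server is a reviewable diff, which is exactly how a permissions decision should be handled.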

Hooks and subagents

Hooks run shell commands at predefined points in the agent's loop: before a tool call, after it, when the user submits a prompt, on session start, on session stop. They are the right place to enforce guardrails that the agent should not be able to talk its way around. Three patterns earn their keep on most teams; a configuration sketch follows the list.

  1. PreToolUse on Bash: block destructive commands (rm -rf, git push --force, drop table) before the agent can run them.
  2. PostToolUse on Edit: run the formatter on every file the agent touches, so style stays consistent without negotiation.
  3. UserPromptSubmit: inject project context (current branch, latest deploy hash) so every prompt starts grounded.
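
Wired into .claude/settings.json, those three patterns might look like the sketch below. The event-to-matcher-to-command structure follows the published hooks schema at the time of writing; the script paths are hypothetical placeholders for a team's own checks.

    {
      "hooks": {
        "PreToolUse": [
          {
            "matcher": "Bash",
            "hooks": [
              { "type": "command", "command": "python3 .claude/hooks/deny_destructive.py" }
            ]
          }
        ],
        "PostToolUse": [
          {
            "matcher": "Edit|Write",
            "hooks": [
              { "type": "command", "command": ".claude/hooks/format_changed.sh" }
            ]
          }
        ],
        "UserPromptSubmit": [
          {
            "hooks": [
              { "type": "command", "command": ".claude/hooks/inject_context.sh" }
            ]
          }
        ]
      }
    }

The deny-list script reads the proposed command from the hook's stdin payload and signals a block through its exit status (exit code 2 in the current scheme), so the refusal is enforced outside the model rather than requested of it.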

Subagents are the parallelization story. For tasks that decompose cleanly, Claude Code can spawn subagents that operate in parallel. The pattern that pays off most often is independent research: one subagent searches the codebase for usage patterns, another fetches API documentation, a third scans for security regressions. The orchestrator collects their summaries and proceeds. The pattern that does not pay off is decomposing tightly coupled work. If the subtasks need each other's outputs, sequencing them serially in the orchestrator is faster and cheaper.
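
Subagents can also be defined as checked-in files so the whole team shares them. The sketch below assumes the Markdown-with-frontmatter format used for custom agents under .claude/agents/ at the time of writing; the agent itself (usage-scout) is hypothetical.

    ---
    name: usage-scout
    description: Read-only research agent. Finds every definition and call site of a given symbol and summarizes where and how it is used.
    tools: Read, Grep, Glob
    ---

    You are a read-only research agent. Given a symbol or pattern, search the
    repository for definitions and call sites, then return a short summary
    grouped by directory. Never edit files; report findings only.

Restricting the tool list to read-only operations is what keeps a parallel research fan-out cheap to review.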

The three workflows where the gains are real

Across rollouts to engineering teams over the past twelve months, three workflows have consistently shown measurable productivity gains. They share a property: they are open-ended, they require sustained attention across many files, and the developer's role is more curator than typist.

Long-tail bug investigation

Bugs that span three or more files, that surface only under specific data, and that have no obvious owner. The agent reads the failing test, traces the call graph, hypothesizes a cause, runs the test, iterates. A senior developer remains in the loop to question hypotheses and confirm fixes. In our rollouts, wall-clock time on these tasks routinely drops by 50 percent or more.

Multi-file refactors with a deterministic shape

Examples include "rename this concept across the codebase," "extract this duplicated logic into a shared utility," and "convert this internal API from callbacks to async." These are tasks where the goal is precise and verification is mechanical (the test suite). The agent excels because the work is largely procedural. The developer reviews the diff and runs the suite.

Exploratory codebase questions

Questions like "where do we set this header," "which services consume this queue," and "what was the historical reason for this branch in the auth flow." These are questions whose answers live in the codebase but take a person an hour to assemble. The agent can search, read, and summarize across many files in minutes. The output is a research note, not a code change.

What still needs human judgment

The same rollouts have shown three categories of work where agent output is consistently weaker than a senior developer's, and where leaning on it produces second-order debt.

Architectural decisions

Choosing between two viable patterns. Deciding whether a new service is justified. Deciding what to extract and what to leave inline. The agent can articulate tradeoffs but does not carry the team's history. It does not know which prior decision is load-bearing and which is vestigial. Architectural calls remain a human responsibility, with the agent as a research aide.

Security and access-control review

The agent reads code well. It does not have a threat model for your specific application. It will accept patterns that look idiomatic but leak data through edge cases the team should have flagged. Security review on agent-authored code stays with humans, ideally with a checklist of project-specific failure modes encoded as a hook.

Cross-cutting code style and naming

Style is a team contract. The agent will pick the style most common in its training data, which is rarely the team's. The remedy is a style hook (formatter on PostToolUse), an explicit CLAUDE.md section on naming conventions, and code review that calls out drift early. Without these, agent-authored code starts to look like every other repository on the open internet, a loss of identity that makes a codebase harder to reason about over time.

A 30-day rollout plan for an engineering team

The rollout pattern below has worked across teams of 5 to 25 engineers. It frontloads guardrails, builds trust through plan mode, and only loosens defaults after the team has lived with the tool for two weeks.

Week 1: foundations

Install Claude Code on every engineer's machine. Author the project's first CLAUDE.md with three sections: a one-paragraph summary of what the project is, the directory layout, and the test command. Configure two hooks: a PreToolUse Bash deny-list and a PostToolUse formatter call. Authorize MCP servers for the team's source-of-truth tools (GitHub, Linear, the database). Default everyone to plan mode.
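
A first CLAUDE.md does not need to be longer than the sketch below; the project, paths, and commands are hypothetical, and the three sections are the ones listed above.

    # CLAUDE.md

    billing-api is a Go service that issues and reconciles invoices. It exposes a
    REST API used by the dashboard and a nightly reconciliation job.

    ## Layout
    - cmd/          entry points
    - internal/     business logic, one package per domain concept
    - migrations/   SQL migrations, applied with `make migrate`

    ## Tests
    Run `make test` before calling a change done. Integration tests need a local
    Postgres; `make test-unit` skips them.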

Week 2: skills and patterns

Pick three recurring workflows that the team performs weekly. Author them as skills under .claude/skills/. Examples that show up most often: opening a structured PR, writing a runbook entry, adding a new database migration. Hold a 60-minute session where two engineers walk the team through how they used Claude Code for a real ticket, including a moment where they rejected a plan and re-prompted.

Week 3: measurement and review

Track time-to-merge on tickets where the developer used Claude Code and on tickets where they did not. The goal is not a binary "did agent or did not." It is a per-ticket record so the team can see where the wins and losses concentrate. Add a code review rubric for agent-authored diffs: scope adherence, naming consistency with the rest of the codebase, test coverage on the new code path.

Week 4: open the defaults

If the first three weeks went well, lift plan mode as a default for tasks under a defined risk threshold. For higher-risk work (production migrations, auth changes, anything touching billing), plan mode stays mandatory. Encode that rule in settings.json so it is not negotiated case by case.
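
One way to encode it, assuming the permissions block in current settings.json (a default mode plus allow/ask/deny rule lists); the mode names, rule syntax, and path patterns are assumptions to check against the docs for the version in use.

    {
      "permissions": {
        "defaultMode": "acceptEdits",
        "ask": [
          "Edit(migrations/**)",
          "Edit(internal/auth/**)",
          "Edit(internal/billing/**)"
        ],
        "deny": [
          "Bash(git push --force:*)",
          "Read(./.env)"
        ]
      }
    }

This is an approximation rather than literal per-task plan mode: routine edits go through under the looser default, while anything touching the flagged paths stops for an explicit human approval.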

Measurement: how to know if it is working

A 30-day rollout is not a referendum on the tool. It is a search for the workflows on which the team is faster, the workflows on which it is not, and the workflows on which the agent introduces risk faster than it removes it. Three numbers are useful, in order of how much they matter.

  1. Time-to-merge on agent-assisted tickets, by ticket type. If long-tail bugs and multi-file refactors are not getting faster, the rollout has a problem before the tool does.
  2. Defect rate in the four weeks after a release where agent-authored code shipped, compared to the prior four weeks. A flat or improving trend is the goal. Spikes mean the review rubric needs sharpening.
  3. Self-reported developer focus. A weekly two-question survey: "did the tool save you time this week" and "did it cause friction this week." Both can be true. The answers calibrate the next iteration of skills and hooks.

If those three numbers are moving in the right direction at the end of week four, the rollout is real. If they are not, the failure is in the rollout pattern, not in the tool itself.