Software 2.0

Making AI Real at Scale

Solution architecture, implementation, and deployment of scaled autonomous solutions. From multi-agentic meshes to enterprise-wide enablement, transition your organization to the AI era with us.

See how we work
01 / Solutions

What we build


Solution architecture, implementation, and deployment of scaled autonomous systems — engineered for production reliability and integrated across the organization.

01
Customer Support Resolution Agents

Deploy reliable agents for high-ambiguity requests like returns, billing disputes, and account issues. We design systems integrating custom Model Context Protocol (MCP) tools to achieve 80%+ first-contact resolution while securely escalating to human agents.

MCP Escalation CX
02
Code Generation with Claude Code

Accelerate software development. We integrate Claude Code into development workflows with custom slash commands, CLAUDE.md configurations, and intelligent utilization of plan mode versus direct execution for refactoring, debugging, and documentation.

Claude Code DevEx
03
Multi-Agent Research Systems

Orchestrate complex workflows using coordinator-subagent patterns. We build systems where a coordinator delegates to specialized subagents for web search, document analysis, and data synthesis — producing comprehensive, highly accurate, and cited reports.

Orchestration Research
04
Developer Productivity Suites

Empower teams to navigate unfamiliar, massive codebases. We integrate built-in tools (Read, Write, Bash, Grep, Glob) alongside custom MCP servers to help developers understand legacy systems, generate boilerplate, and securely automate repetitive tasks.

MCP Codebase
05
CI/CD AI Integration

Seamlessly embed Claude into your continuous integration and deployment pipelines. We configure systems for automated code reviews, missing test case generation, and actionable pull request feedback — while mitigating false positives.

CI/CD Code review
06
Structured Data Extraction

Transform unstructured documentation into clean datasets. Our architectures leverage strict JSON schemas, validation retries, and explicit tool-use patterns to maintain near-perfect accuracy and handle edge cases gracefully.

JSON Schema ETL
02 / Capabilities

The architectural spine


A breakdown of foundational architecture expertise and the engineering methodologies we apply to ensure robust, production-ready AI deployments.

Agentic Architecture & Orchestration A.01

  • Design and implement agentic loops for autonomous, reliable task execution.
  • Orchestrate multi-agent systems using advanced coordinator-subagent patterns.
  • Configure secure subagent invocation, context passing, and dynamic spawning.
  • Implement multi-step workflows with strict enforcement and handoff patterns.
  • Apply SDK hooks for tool call interception, telemetry, and data normalization.
  • Design task decomposition strategies to solve highly complex, ambiguous workflows.
  • Manage session state, memory resumption, and intelligent thread forking.

Tool Design & MCP Integration A.02

  • Design highly effective tool interfaces with clear boundaries and descriptions.
  • Implement structured error responses and recovery loops for custom APIs.
  • Distribute tools appropriately across distinct agents to avoid hallucination.
  • Integrate custom Model Context Protocol (MCP) servers into enterprise workflows.
  • Select and apply native tools securely within constrained environments.

Context Management & Reliability A.03

  • Manage context decay to preserve critical information across long interactions.
  • Formulate strict escalation criteria for secure human-agent handoffs.
  • Implement error propagation and mitigation strategies across multi-agent meshes.
  • Maintain context limits through intelligent scratchpad files and state exports.
  • Design human-in-the-loop (HITL) review workflows based on statistical confidence calibration.

Prompt Engineering & Extraction A.04

  • Design metaprompts with explicit criteria to improve precision and reduce false positives.
  • Apply few-shot techniques to handle document variability and unstructured noise.
  • Enforce structured outputs using advanced tool-use configurations and JSON schemas.
  • Implement self-correction validations and schema error recovery loops.
  • Design highly efficient batch processing architectures for large datasets.

Development Workflows A.05

  • Configure internal tooling and repository configurations for scalable AI collaboration.
  • Isolate skill sets using strict context boundaries and allowed-tool matrices.
  • Apply path-specific rules for conditional, dynamic loading of coding conventions.
  • Utilize planning frameworks for large-scale migrations and architectural overhauls.
  • Apply iterative refinement techniques for progressive codebase improvement.
03 / Use Cases

Production scenarios


Real-world production contexts and architectural challenges derived from enterprise deployments. Each scenario reflects patterns we've shipped.

Customer Support Resolution Agent

Designed to handle high-ambiguity requests such as returns, billing disputes, and account issues using the Claude Agent SDK. The system leverages custom MCP tools like get_customer and process_refund to achieve an 80%+ first-contact resolution rate with integrated human escalation paths.

Code Generation with Claude Code

Accelerates software development through refactoring, debugging, and documentation workflows. Implementation utilizes custom slash commands, CLAUDE.md configurations, and strategic toggling between plan mode and direct execution.

Multi-Agent Research System

Employs a coordinator-subagent architecture where specialized agents handle web searches, document analysis, and synthesis. This orchestration allows the system to produce comprehensive, cited reports on complex topics.

Developer Productivity with Claude

Assists engineers in exploring unfamiliar codebases and understanding legacy systems. The agent utilizes built-in tools such as Bash, Grep, and Glob alongside custom MCP server integrations to automate repetitive tasks.

Claude Code for Continuous Integration

Embeds Claude into CI/CD pipelines to provide automated code reviews and test case generation. The focus is on designing prompts that offer actionable pull request feedback while minimizing false positives.

Structured Data Extraction

Extracts information from unstructured documents and validates the results against strict JSON schemas. This system is built to maintain high accuracy across edge cases and facilitate integration with downstream backend systems.

04 / Enablement

Training curriculum


Equip your personnel with hands-on, operational AI mastery. Comprehensive training across the entire official Anthropic course ecosystem.

AI Fluency & Fundamentals

Claude 101 – Use Claude for everyday work tasks, understand core features, and explore advanced topics.
AI Fluency: Framework & Foundations – Collaborate with AI systems effectively, efficiently, ethically, and safely.
AI Capabilities and Limitations – An introductory course breaking down how AI actually works under the hood.
Teaching AI Fluency – Empowers academic faculty and instructional designers to teach and assess AI Fluency.
AI Fluency for Educators – Apply AI Fluency to teaching practices and institutional strategy.
AI Fluency for Students – Develop skills that enhance learning, career planning, and academic success.
AI Fluency for Nonprofits – Increase organizational impact and efficiency while staying true to mission and values.

Claude Code & Agentic Workflows

Claude Code 101 – Use Claude Code effectively in your daily development workflow.
Claude Code in Action – Hands-on integration of Claude Code into real-world software development pipelines.
Introduction to Claude Cowork – Work alongside Claude on your real files. Covers task loops, plugins, and multi-step work.
Introduction to Agent Skills – Build, configure, and share reusable markdown instructions Claude applies contextually.
Introduction to Subagents – Manage context and delegate tasks by creating specialized subagents in Claude Code.

API & Cloud Integrations

Building with the Claude API – A comprehensive deep dive covering the full spectrum of building with the Claude API.
Claude with Amazon Bedrock – Follow the official accreditation program to deploy and configure Claude securely on AWS.
Claude with Google Cloud's Vertex AI – Master working with Anthropic models natively through GCP's Vertex AI infrastructure.

Model Context Protocol

Introduction to MCP – Build MCP servers and clients from scratch using Python. Master tools, resources, and prompts.
MCP: Advanced Topics – Advanced patterns including sampling, notifications, file system access, and transport.
05 / Knowledge

Practical
domain knowledge


Comprehensive task statements detailing the exact knowledge and skills required for architectural certification.

Agentic Architecture & Orchestration

Domain 01
Design and implement agentic loops for autonomous task execution

Knowledge of

  • The agentic loop lifecycle: sending requests to Claude, inspecting stop_reason ("tool_use" vs "end_turn"), executing requested tools, and returning results for the next iteration
  • How tool results are appended to conversation history so the model can reason about the next action
  • The distinction between model-driven decision-making (Claude reasons about which tool to call next based on context) and pre-configured decision trees or tool sequences

Skills in

  • Implementing agentic loop control flow that continues when stop_reason is "tool_use" and terminates when stop_reason is "end_turn"
  • Adding tool results to conversation context between iterations so the model can incorporate new information into its reasoning
  • Avoiding anti-patterns such as parsing natural language signals to determine loop termination, setting arbitrary iteration caps as the primary stopping mechanism, or checking for assistant text content as a completion indicator
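The control flow described above can be sketched as follows. To keep the sketch self-contained, the Anthropic client is replaced by a hypothetical stub that returns one tool call and then finishes, and `get_time` is an invented tool; in real code the `create` call would be `client.messages.create(...)` against the API.

```python
# Minimal agentic loop: continue while stop_reason is "tool_use",
# terminate on "end_turn" — never by parsing natural-language signals.

TOOLS = {"get_time": lambda _args: "2024-01-01T00:00:00Z"}  # hypothetical tool

class StubClient:
    """Stand-in for the Anthropic client: one tool_use turn, then end_turn."""
    def __init__(self):
        self._turn = 0

    def create(self, messages):
        self._turn += 1
        if self._turn == 1:
            return {"stop_reason": "tool_use",
                    "content": [{"type": "tool_use", "id": "tu_1",
                                 "name": "get_time", "input": {}}]}
        return {"stop_reason": "end_turn",
                "content": [{"type": "text", "text": "Done."}]}

def run_agent_loop(client, user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = client.create(messages)
        # Append the assistant turn so the model can see its own tool calls.
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            return response, messages  # stop_reason is "end_turn"
        # Execute every requested tool and return results for the next iteration.
        results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = TOOLS[block["name"]](block["input"])
                results.append({"type": "tool_result",
                                "tool_use_id": block["id"],
                                "content": output})
        messages.append({"role": "user", "content": results})

final, history = run_agent_loop(StubClient(), "What time is it?")
```

Note that the loop terminates only on `stop_reason`, and the tool result is appended to the conversation history keyed by `tool_use_id` so the model can reason about it on the next iteration.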
Orchestrate multi-agent systems with coordinator-subagent patterns

Knowledge of

  • Hub-and-spoke architecture where a coordinator agent manages all inter-subagent communication, error handling, and information routing
  • How subagents operate with isolated context — they do not inherit the coordinator's conversation history automatically
  • The role of the coordinator in task decomposition, delegation, result aggregation, and deciding which subagents to invoke based on query complexity
  • Risks of overly narrow task decomposition by the coordinator, leading to incomplete coverage of broad research topics

Skills in

  • Designing coordinator agents that analyze query requirements and dynamically select which subagents to invoke rather than always routing through the full pipeline
  • Partitioning research scope across subagents to minimize duplication (e.g., assigning distinct subtopics or source types to each agent)
  • Implementing iterative refinement loops where the coordinator evaluates synthesis output for gaps, re-delegates to search and analysis subagents with targeted queries, and re-invokes synthesis until coverage is sufficient
  • Routing all subagent communication through the coordinator for observability, consistent error handling, and controlled information flow
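The routing logic above can be sketched with plain callables standing in for subagents. All names here are invented, and in practice the coordinator would be a model deciding which specialists a query needs rather than a boolean flag; the point is the hub-and-spoke shape.

```python
# Hub-and-spoke coordination sketch: the coordinator selects only the
# subagents a query requires, and all results flow back through it
# before synthesis.

SUBAGENTS = {
    "web_search": lambda q: f"search results for {q!r}",
    "doc_analysis": lambda q: f"document findings for {q!r}",
    "synthesis": lambda findings: " | ".join(findings),
}

def coordinate(query, needs_documents):
    # Dynamic selection instead of always running the full pipeline.
    findings = [SUBAGENTS["web_search"](query)]
    if needs_documents:
        findings.append(SUBAGENTS["doc_analysis"](query))
    # No subagent talks to another directly; the coordinator aggregates,
    # which gives one place for observability and error handling.
    return SUBAGENTS["synthesis"](findings)

report = coordinate("solar panel efficiency", needs_documents=False)
```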
Configure subagent invocation, context passing, and spawning

Knowledge of

  • The Task tool as the mechanism for spawning subagents, and the requirement that allowedTools must include "Task" for a coordinator to invoke subagents
  • That subagent context must be explicitly provided in the prompt — subagents do not automatically inherit parent context or share memory between invocations
  • The Agent Definition configuration including descriptions, system prompts, and tool restrictions for each subagent type
  • Fork-based session management for exploring divergent approaches from a shared analysis baseline

Skills in

  • Including complete findings from prior agents directly in the subagent's prompt (e.g., passing web search results and document analysis outputs to the synthesis subagent)
  • Using structured data formats to separate content from metadata (source URLs, document names, page numbers) when passing context between agents to preserve attribution
  • Spawning parallel subagents by emitting multiple Task tool calls in a single coordinator response rather than across separate turns
  • Designing coordinator prompts that specify research goals and quality criteria rather than step-by-step procedural instructions, to enable subagent adaptability
Implement multi-step workflows with enforcement and handoff patterns

Knowledge of

  • The difference between programmatic enforcement (hooks, prerequisite gates) and prompt-based guidance for workflow ordering
  • That when deterministic compliance is required (e.g., identity verification before financial operations), prompt instructions alone carry a non-zero failure rate
  • Structured handoff protocols for mid-process escalation that include customer details, root cause analysis, and recommended actions

Skills in

  • Implementing programmatic prerequisites that block downstream tool calls until prerequisite steps have completed (e.g., blocking process_refund until get_customer has returned a verified customer ID)
  • Decomposing multi-concern customer requests into distinct items, then investigating each in parallel using shared context before synthesizing a unified resolution
  • Compiling structured handoff summaries (customer ID, root cause, refund amount, recommended action) when escalating to human agents who lack access to the conversation transcript
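A programmatic prerequisite gate for the refund scenario above might look like this sketch. The tool names follow the scenario; the implementations are stand-ins for real CRM and payment calls.

```python
# Prerequisite gate: process_refund is blocked until get_customer has
# returned a verified customer ID. Because the check is code, not a
# prompt instruction, the model cannot skip it.

class PrerequisiteError(Exception):
    pass

class SupportToolGate:
    def __init__(self):
        self.verified_customer_id = None

    def get_customer(self, email):
        # Stand-in lookup; a real tool would query the CRM.
        self.verified_customer_id = f"cust_{hash(email) % 10000}"
        return self.verified_customer_id

    def process_refund(self, amount):
        if self.verified_customer_id is None:
            raise PrerequisiteError(
                "get_customer must succeed before process_refund")
        return {"customer": self.verified_customer_id, "refunded": amount}

gate = SupportToolGate()
try:
    gate.process_refund(20.0)        # blocked: no verified customer yet
    blocked = False
except PrerequisiteError:
    blocked = True
gate.get_customer("kim@example.com")
receipt = gate.process_refund(20.0)  # allowed after verification
```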
Apply Agent SDK hooks for tool call interception and data normalization

Knowledge of

  • Hook patterns (e.g., PostToolUse) that intercept tool results for transformation before the model processes them
  • Hook patterns that intercept outgoing tool calls to enforce compliance rules (e.g., blocking refunds above a threshold)
  • The distinction between using hooks for deterministic guarantees versus relying on prompt instructions for probabilistic compliance

Skills in

  • Implementing PostToolUse hooks to normalize heterogeneous data formats (Unix timestamps, ISO 8601, numeric status codes) from different MCP tools before the agent processes them
  • Implementing tool call interception hooks that block policy-violating actions (e.g., refunds exceeding $500) and redirect to alternative workflows (e.g., human escalation)
  • Choosing hooks over prompt-based enforcement when business rules require guaranteed compliance
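The two hook behaviors above can be sketched as plain functions. The signatures here are illustrative rather than the Agent SDK's actual hook API: one normalizes a Unix-epoch timestamp after a tool runs, the other intercepts a refund call before it runs and redirects over-limit amounts.

```python
# Hook sketches: post-tool-use normalization and pre-tool-use policy
# enforcement. The $500 threshold follows the scenario above.
from datetime import datetime, timezone

REFUND_LIMIT = 500.0

def post_tool_use(tool_name, result):
    # Normalize epoch timestamps so the agent always sees ISO 8601.
    ts = result.get("timestamp")
    if isinstance(ts, (int, float)):
        result["timestamp"] = datetime.fromtimestamp(
            ts, tz=timezone.utc).isoformat()
    return result

def pre_tool_use(tool_name, tool_input):
    # Deterministically block policy-violating refunds and redirect
    # to human escalation instead of relying on prompt compliance.
    if tool_name == "process_refund" and tool_input.get("amount", 0) > REFUND_LIMIT:
        return {"blocked": True, "redirect": "escalate_to_human"}
    return {"blocked": False}

normalized = post_tool_use("get_order", {"timestamp": 0})
decision = pre_tool_use("process_refund", {"amount": 750})
```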
Design task decomposition strategies for complex workflows

Knowledge of

  • When to use fixed sequential pipelines (prompt chaining) versus dynamic adaptive decomposition based on intermediate findings
  • Prompt chaining patterns that break reviews into sequential steps (e.g., analyze each file individually, then run a cross-file integration pass)
  • The value of adaptive investigation plans that generate subtasks based on what is discovered at each step

Skills in

  • Selecting task decomposition patterns appropriate to the workflow: prompt chaining for predictable multi-aspect reviews, dynamic decomposition for open-ended investigation tasks
  • Splitting large code reviews into per-file local analysis passes plus a separate cross-file integration pass to avoid attention dilution
  • Decomposing open-ended tasks (e.g., "add comprehensive tests to a legacy codebase") by first mapping structure, identifying high-impact areas, then creating a prioritized plan that adapts as dependencies are discovered
Manage session state, resumption, and forking

Knowledge of

  • Named session resumption using --resume <session-name> to continue a specific prior conversation
  • fork_session for creating independent branches from a shared analysis baseline to explore divergent approaches
  • The importance of informing the agent about changes to previously analyzed files when resuming sessions after code modifications
  • Why starting a new session with a structured summary is more reliable than resuming with stale tool results

Skills in

  • Using --resume with session names to continue named investigation sessions across work sessions
  • Using fork_session to create parallel exploration branches (e.g., comparing two testing strategies or refactoring approaches from a shared codebase analysis)
  • Choosing between session resumption (when prior context is mostly valid) and starting fresh with injected summaries (when prior tool results are stale)
  • Informing a resumed session about specific file changes for targeted re-analysis rather than requiring full re-exploration

Tool Design & MCP Integration

Domain 02
Design effective tool interfaces with clear descriptions and boundaries

Knowledge of

  • Tool descriptions as the primary mechanism LLMs use for tool selection; minimal descriptions lead to unreliable selection among similar tools
  • The importance of including input formats, example queries, edge cases, and boundary explanations in tool descriptions
  • How ambiguous or overlapping tool descriptions cause misrouting (e.g., analyze_content vs analyze_document with near-identical descriptions)
  • The impact of system prompt wording on tool selection: keyword-sensitive instructions can create unintended tool associations

Skills in

  • Writing tool descriptions that clearly differentiate each tool's purpose, expected inputs, outputs, and when to use it versus similar alternatives
  • Renaming tools and updating descriptions to eliminate functional overlap (e.g., renaming analyze_content to extract_web_results with a web-specific description)
  • Splitting generic tools into purpose-specific tools with defined input/output contracts (e.g., splitting a generic analyze_document into extract_data_points, summarize_content, and verify_claim_against_source)
  • Reviewing system prompts for keyword-sensitive instructions that might override well-written tool descriptions
Implement structured error responses for MCP tools

Knowledge of

  • The MCP isError flag pattern for communicating tool failures back to the agent
  • The distinction between transient errors (timeouts, service unavailability), validation errors (invalid input), business errors (policy violations), and permission errors
  • Why uniform error responses (generic "Operation failed") prevent the agent from making appropriate recovery decisions
  • The difference between retryable and non-retryable errors, and how returning structured metadata prevents wasted retry attempts

Skills in

  • Returning structured error metadata including errorCategory (transient/validation/permission), isRetryable boolean, and human-readable descriptions
  • Including retriable: false flags and customer-friendly explanations for business rule violations so the agent can communicate appropriately
  • Implementing local error recovery within subagents for transient failures, propagating to the coordinator only errors that cannot be resolved locally along with partial results and what was attempted
  • Distinguishing between access failures (needing retry decisions) and valid empty results (representing successful queries with no matches)
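As a sketch, structured error metadata of the kind described above could be built like this. `isError` is the MCP flag pattern; the other field names (`errorCategory`, `isRetryable`) follow the bullets above rather than any fixed specification.

```python
# Structured tool errors: categorized, with an explicit retry signal,
# instead of a uniform "Operation failed".

def make_error(category, retryable, message):
    return {
        "isError": True,
        "errorCategory": category,  # transient | validation | permission
        "isRetryable": retryable,
        "description": message,
    }

def classify_failure(exc):
    # Map exception types to categories the agent can act on.
    if isinstance(exc, TimeoutError):
        return make_error("transient", True,
                          "Upstream service timed out; retry is safe.")
    if isinstance(exc, ValueError):
        return make_error("validation", False,
                          "Invalid input; fix the arguments, do not retry.")
    if isinstance(exc, PermissionError):
        return make_error("permission", False,
                          "Caller lacks access to this resource.")
    return make_error("transient", True, str(exc))

err = classify_failure(ValueError("order_id must be numeric"))
```

With this shape, the agent can retry transient failures locally and propagate only non-retryable errors (with the description) to the coordinator.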
Distribute tools appropriately across agents and configure tool choice

Knowledge of

  • The principle that giving an agent access to too many tools (e.g., 18 instead of 4-5) degrades tool selection reliability by increasing decision complexity
  • Why agents with tools outside their specialization tend to misuse them (e.g., a synthesis agent attempting web searches)
  • Scoped tool access: giving agents only the tools needed for their role, with limited cross-role tools for specific high-frequency needs
  • tool_choice configuration options: "auto", "any", and forced tool selection ({"type": "tool", "name": "..."})

Skills in

  • Restricting each subagent's tool set to those relevant to its role, preventing cross-specialization misuse
  • Replacing generic tools with constrained alternatives (e.g., replacing fetch_url with load_document that validates document URLs)
  • Providing scoped cross-role tools for high-frequency needs (e.g., a verify_fact tool for the synthesis agent) while routing complex cases through the coordinator
  • Using tool_choice forced selection to ensure a specific tool is called first (e.g., forcing extract_metadata before enrichment tools), then processing subsequent steps in follow-up turns
  • Setting tool_choice: "any" to guarantee the model calls a tool rather than returning conversational text
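The three `tool_choice` modes above map to request parameters as follows. The `extract_metadata` tool definition is a hypothetical example used only to show the shape.

```python
# tool_choice configurations and the tool definition they reference.

extract_metadata_tool = {
    "name": "extract_metadata",
    "description": "Extract title, author, and date from a document.",
    "input_schema": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}

auto_choice = {"type": "auto"}    # model may answer in plain text instead
any_choice = {"type": "any"}      # model must call some tool, its choice which
forced_choice = {"type": "tool", "name": "extract_metadata"}  # must call this one

# Forcing extract_metadata first; enrichment tools would then be offered
# in follow-up turns once the metadata is in context.
request = {"tools": [extract_metadata_tool], "tool_choice": forced_choice}
```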
Integrate MCP servers into Claude Code and agent workflows

Knowledge of

  • MCP server scoping: project-level (.mcp.json) for shared team tooling vs user-level (~/.claude.json) for personal/experimental servers
  • Environment variable expansion in .mcp.json (e.g., ${GITHUB_TOKEN}) for credential management without committing secrets
  • That tools from all configured MCP servers are discovered at connection time and available simultaneously to the agent
  • MCP resources as a mechanism for exposing content catalogs (e.g., issue summaries, documentation hierarchies, database schemas) to reduce exploratory tool calls

Skills in

  • Configuring shared MCP servers in project-scoped .mcp.json with environment variable expansion for authentication tokens
  • Configuring personal/experimental MCP servers in user-scoped ~/.claude.json
  • Enhancing MCP tool descriptions to explain capabilities and outputs in detail, preventing the agent from preferring built-in tools (like Grep) over more capable MCP tools
  • Choosing existing community MCP servers over custom implementations for standard integrations (e.g., Jira), reserving custom servers for team-specific workflows
  • Exposing content catalogs as MCP resources to give agents visibility into available data without requiring exploratory tool calls
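As a sketch, a project-scoped `.mcp.json` with environment variable expansion might look like the following; the server package and variable names are illustrative, not a recommendation.

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}
```

Because `${GITHUB_TOKEN}` is expanded from each developer's environment at load time, the file can be committed for the whole team without committing any secrets.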
Select and apply built-in tools (Read, Write, Edit, Bash, Grep, Glob) effectively

Knowledge of

  • Grep for content search (searching file contents for patterns like function names, error messages, or import statements)
  • Glob for file path pattern matching (finding files by name or extension patterns)
  • Read/Write for full file operations; Edit for targeted modifications using unique text matching
  • When Edit fails due to non-unique text matches, using Read + Write as a fallback for reliable file modifications

Skills in

  • Selecting Grep for searching code content across a codebase (e.g., finding all callers of a function, locating error messages)
  • Selecting Glob for finding files matching naming patterns (e.g., **/*.test.tsx)
  • Using Read to load full file contents followed by Write when Edit cannot find unique anchor text
  • Building codebase understanding incrementally: starting with Grep to find entry points, then using Read to follow imports and trace flows, rather than reading all files upfront
  • Tracing function usage across wrapper modules by first identifying all exported names, then searching for each name across the codebase

Claude Code Configuration & Workflows

Domain 03
Configure CLAUDE.md files with appropriate hierarchy, scoping, and modular organization

Knowledge of

  • The CLAUDE.md configuration hierarchy: user-level (~/.claude/CLAUDE.md), project-level (.claude/CLAUDE.md or root CLAUDE.md), and directory-level (subdirectory CLAUDE.md files)
  • That user-level settings apply only to that user — instructions in ~/.claude/CLAUDE.md are not shared with teammates via version control
  • The @import syntax for referencing external files to keep CLAUDE.md modular (e.g., importing specific standards files relevant to each package)
  • .claude/rules/ directory for organizing topic-specific rule files as an alternative to a monolithic CLAUDE.md

Skills in

  • Diagnosing configuration hierarchy issues (e.g., a new team member not receiving instructions because they're in user-level rather than project-level configuration)
  • Using @import to selectively include relevant standards files in each package's CLAUDE.md based on maintainer domain knowledge
  • Splitting large CLAUDE.md files into focused topic-specific files in .claude/rules/ (e.g., testing.md, api-conventions.md, deployment.md)
  • Using the /memory command to verify which memory files are loaded and diagnose inconsistent behavior across sessions
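A package-level CLAUDE.md using `@import` might look like the following sketch; the paths and standards files are invented for illustration.

```markdown
# CLAUDE.md (packages/payments)

@../../standards/typescript.md
@../../standards/api-conventions.md

## Package-specific notes
- All money amounts are integer cents; never use floats.
- Webhook handlers must be idempotent.
```

Each package imports only the standards files relevant to it, so maintainers curate context per domain instead of loading one monolithic file everywhere.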
Create and configure custom slash commands and skills

Knowledge of

  • Project-scoped commands in .claude/commands/ (shared via version control) vs user-scoped commands in ~/.claude/commands/ (personal)
  • Skills in .claude/skills/ with SKILL.md files that support frontmatter configuration including context: fork, allowed-tools, and argument-hint
  • The context: fork frontmatter option for running skills in an isolated sub-agent context, preventing skill outputs from polluting the main conversation
  • Personal skill customization: creating personal variants in ~/.claude/skills/ with different names to avoid affecting teammates

Skills in

  • Creating project-scoped slash commands in .claude/commands/ for team-wide availability via version control
  • Using context: fork to isolate skills that produce verbose output (e.g., codebase analysis) or exploratory context (e.g., brainstorming alternatives) from the main session
  • Configuring allowed-tools in skill frontmatter to restrict tool access during skill execution (e.g., limiting to file write operations to prevent destructive actions)
  • Using argument-hint frontmatter to prompt developers for required parameters when they invoke the skill without arguments
  • Choosing between skills (on-demand invocation for task-specific workflows) and CLAUDE.md (always-loaded universal standards)
Apply path-specific rules for conditional convention loading

Knowledge of

  • .claude/rules/ files with YAML frontmatter paths fields containing glob patterns for conditional rule activation
  • How path-scoped rules load only when editing matching files, reducing irrelevant context and token usage
  • The advantage of glob-pattern rules over directory-level CLAUDE.md files for conventions that span multiple directories (e.g., test files spread throughout a codebase)

Skills in

  • Creating .claude/rules/ files with YAML frontmatter path scoping (e.g., paths: ["terraform/**/*"]) so rules load only when editing matching files
  • Using glob patterns in path-specific rules to apply conventions to files by type regardless of directory location (e.g., **/*.test.tsx for all test files)
  • Choosing path-specific rules over subdirectory CLAUDE.md files when conventions must apply to files spread across the codebase
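A path-scoped rule file following the bullets above might look like this sketch (e.g., saved as `.claude/rules/testing.md`); the glob and the conventions themselves are illustrative.

```markdown
---
paths: ["**/*.test.tsx"]
---

# Testing conventions
- Use React Testing Library queries; avoid snapshot tests.
- Each test file mirrors the path of the component it covers.
```

The rule loads only when a matching test file is being edited, so it never spends tokens in sessions that touch no tests, regardless of which directory the tests live in.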
Determine when to use plan mode vs direct execution

Knowledge of

  • Plan mode is designed for complex tasks involving large-scale changes, multiple valid approaches, architectural decisions, and multi-file modifications
  • Direct execution is appropriate for simple, well-scoped changes (e.g., adding a single validation check to one function)
  • Plan mode enables safe codebase exploration and design before committing to changes, preventing costly rework
  • The Explore subagent for isolating verbose discovery output and returning summaries to preserve main conversation context

Skills in

  • Selecting plan mode for tasks with architectural implications (e.g., microservice restructuring, library migrations affecting 45+ files, choosing between integration approaches with different infrastructure requirements)
  • Selecting direct execution for well-understood changes with clear scope (e.g., a single-file bug fix with a clear stack trace, adding a date validation conditional)
  • Using the Explore subagent for verbose discovery phases to prevent context window exhaustion during multi-phase tasks
  • Combining plan mode for investigation with direct execution for implementation (e.g., planning a library migration, then executing the planned approach)
Apply iterative refinement techniques for progressive improvement

Knowledge of

  • Concrete input/output examples as the most effective way to communicate expected transformations when prose descriptions are interpreted inconsistently
  • Test-driven iteration: writing test suites first, then iterating by sharing test failures to guide progressive improvement
  • The interview pattern: having Claude ask questions to surface considerations the developer may not have anticipated before implementing
  • When to provide all issues in a single message (interacting problems) versus fixing them sequentially (independent problems)

Skills in

  • Providing 2-3 concrete input/output examples to clarify transformation requirements when natural language descriptions produce inconsistent results
  • Writing test suites covering expected behavior, edge cases, and performance requirements before implementation, then iterating by sharing test failures
  • Using the interview pattern to surface design considerations (e.g., cache invalidation strategies, failure modes) before implementing solutions in unfamiliar domains
  • Providing specific test cases with example input and expected output to fix edge case handling (e.g., null values in migration scripts)
  • Addressing multiple interacting issues in a single detailed message when fixes interact, versus sequential iteration for independent issues
Integrate Claude Code into CI/CD pipelines

Knowledge of

  • The -p (or --print) flag for running Claude Code in non-interactive mode in automated pipelines
  • --output-format json and --json-schema CLI flags for enforcing structured output in CI contexts
  • CLAUDE.md as the mechanism for providing project context (testing standards, fixture conventions, review criteria) to CI-invoked Claude Code
  • Session context isolation: why the same Claude session that generated code is less effective at reviewing its own changes compared to an independent review instance

Skills in

  • Running Claude Code in CI with the -p flag to prevent interactive input hangs
  • Using --output-format json with --json-schema to produce machine-parseable structured findings for automated posting as inline PR comments
  • Including prior review findings in context when re-running reviews after new commits, instructing Claude to report only new or still-unaddressed issues to avoid duplicate comments
  • Providing existing test files in context so test generation avoids suggesting duplicate scenarios already covered by the test suite
  • Documenting testing standards, valuable test criteria, and available fixtures in CLAUDE.md to improve test generation quality and reduce low-value test output
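A CI step wiring these flags together could look like the following sketch, using GitHub Actions syntax as an assumed host; the schema filename and prompt are invented.

```yaml
# Illustrative pipeline fragment: non-interactive Claude Code producing
# machine-parseable review findings.
- name: Automated review
  run: |
    claude -p "Review the changes in this PR against our review criteria" \
      --output-format json \
      --json-schema review-schema.json > findings.json
```

The `-p` flag prevents the run from hanging on interactive input, and the JSON output can then be parsed by a follow-up step that posts inline PR comments.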

Prompt Engineering & Structured Output

Domain 04
Design prompts with explicit criteria to improve precision and reduce false positives

Knowledge of

  • The importance of explicit criteria over vague instructions (e.g., "flag comments only when claimed behavior contradicts actual code behavior" vs "check that comments are accurate")
  • How general instructions like "be conservative" or "only report high-confidence findings" fail to improve precision compared to specific categorical criteria
  • The impact of false positive rates on developer trust: high false positive categories undermine confidence in accurate categories

Skills in

  • Writing specific review criteria that define which issues to report (bugs, security) versus skip (minor style, local patterns) rather than relying on confidence-based filtering
  • Temporarily disabling high false-positive categories to restore developer trust while improving prompts for those categories
  • Defining explicit severity criteria with concrete code examples for each severity level to achieve consistent classification
Apply few-shot prompting to improve output consistency and quality

Knowledge of

  • Few-shot examples as the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results
  • The role of few-shot examples in demonstrating ambiguous-case handling (e.g., tool selection for ambiguous requests, branch-level test coverage gaps)
  • How few-shot examples enable the model to generalize judgment to novel patterns rather than matching only pre-specified cases
  • The effectiveness of few-shot examples for reducing hallucination in extraction tasks (e.g., handling informal measurements, varied document structures)

Skills in

  • Creating 2-4 targeted few-shot examples for ambiguous scenarios that show reasoning for why one action was chosen over plausible alternatives
  • Including few-shot examples that demonstrate specific desired output format (location, issue, severity, suggested fix) to achieve consistency
  • Providing few-shot examples distinguishing acceptable code patterns from genuine issues to reduce false positives while enabling generalization
  • Using few-shot examples to demonstrate correct handling of varied document structures (inline citations vs bibliographies, methodology sections vs embedded details)
  • Adding few-shot examples showing correct extraction from documents with varied formats to address empty/null extraction of required fields
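
The few-shot pattern above can be sketched as alternating user/assistant turns prepended to the real request. The example findings and the output format (location, issue, severity, suggested fix) are illustrative placeholders, not a fixed contract.

```python
# Hypothetical few-shot examples demonstrating both the desired output format
# and the distinction between a genuine issue and an acceptable pattern.
FEW_SHOT = [
    {"role": "user", "content": "Review: `if user = admin: grant()`"},
    {"role": "assistant", "content":
        "location: auth.py:12\n"
        "issue: assignment used where comparison was intended\n"
        "severity: high\n"
        "suggested fix: use `==` instead of `=`"},
    {"role": "user", "content": "Review: `total = sum(prices)  # compute total`"},
    {"role": "assistant", "content":
        "no issues: the comment matches the actual behavior"},
]

def build_messages(code_to_review: str) -> list[dict]:
    """Prepend few-shot turns so the real request inherits their format."""
    return FEW_SHOT + [{"role": "user", "content": f"Review: {code_to_review}"}]

messages = build_messages("def f(x): return x / 0")
```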
Enforce structured output using tool use and JSON schemas

Knowledge of

  • Tool use (tool_use) with JSON schemas as the most reliable approach for guaranteed schema-compliant structured output, eliminating JSON syntax errors
  • The distinction between tool_choice: "auto" (model may return text instead of calling a tool), "any" (model must call a tool but can choose which), and forced tool selection (model must call a specific named tool)
  • That strict JSON schemas via tool use eliminate syntax errors but do not prevent semantic errors (e.g., line items that don't sum to total, values in wrong fields)
  • Schema design considerations: required vs optional fields, enum fields with "other" + detail string patterns for extensible categories

Skills in

  • Defining extraction tools with JSON schemas as input parameters and extracting structured data from the tool_use response
  • Setting tool_choice: "any" to guarantee structured output when multiple extraction schemas exist and the document type is unknown
  • Forcing a specific tool with tool_choice: {"type": "tool", "name": "extract_metadata"} to ensure a particular extraction runs before enrichment steps
  • Designing schema fields as optional (nullable) when source documents may not contain the information, preventing the model from fabricating values to satisfy required fields
  • Adding enum values like "unclear" for ambiguous cases and "other" + detail fields for extensible categorization
  • Including format normalization rules in prompts alongside strict output schemas to handle inconsistent source formatting
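
The schema-enforced extraction pattern above can be sketched as a tool definition plus a forced tool_choice. The tool name, fields, and document types are illustrative; the parsing helper works on the list of content blocks a Messages API response returns.

```python
# Hypothetical extraction tool: nullable fields prevent fabricated values,
# and the enum includes "unclear" and "other" + detail for extensibility.
EXTRACT_TOOL = {
    "name": "extract_metadata",
    "description": "Record structured metadata extracted from a document.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "invoice_total": {"type": ["number", "null"]},  # may be absent from source
            "category": {
                "type": "string",
                "enum": ["invoice", "receipt", "contract", "unclear", "other"],
            },
            "category_detail": {"type": "string"},  # free text when category is "other"
        },
        "required": ["title", "category"],
    },
}

def extract_tool_input(response_content: list[dict]) -> dict:
    """Pull the structured arguments out of the first tool_use block."""
    for block in response_content:
        if block["type"] == "tool_use":
            return block["input"]
    raise ValueError("model returned no tool_use block")

# The live call (requires the anthropic SDK and an API key) would pass
# tools=[EXTRACT_TOOL] and tool_choice={"type": "tool", "name": "extract_metadata"}
# to guarantee this specific extraction runs. Here we parse a stand-in response:
fake_response = [{"type": "tool_use", "name": "extract_metadata",
                  "input": {"title": "Q3 invoice", "category": "invoice"}}]
data = extract_tool_input(fake_response)
```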
Implement validation, retry, and feedback loops for extraction quality

Knowledge of

  • Retry-with-error-feedback: appending specific validation errors to the prompt on retry to guide the model toward correction
  • The limits of retry: retries are ineffective when the required information is simply absent from the source document (vs format or structural errors)
  • Feedback loop design: tracking which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns
  • The difference between semantic validation errors (values don't sum, wrong field placement) and schema syntax errors (eliminated by tool use)

Skills in

  • Implementing follow-up requests that include the original document, the failed extraction, and specific validation errors for model self-correction
  • Identifying when retries will be ineffective (e.g., information exists only in an external document not provided) versus when they will succeed (format mismatches, structural output errors)
  • Adding detected_pattern fields to structured findings to enable analysis of false positive patterns when developers dismiss findings
  • Designing self-correction validation flows: extracting calculated_total alongside stated_total to flag discrepancies, adding conflict_detected booleans for inconsistent source data
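
The retry-with-error-feedback loop described above can be sketched as follows, assuming a hypothetical invoice schema (line_items, stated_total) and a caller-supplied call_model function standing in for the API call.

```python
def validate(extraction: dict) -> list[str]:
    """Semantic checks a strict JSON schema cannot express (values must sum)."""
    errors = []
    line_sum = sum(item["amount"] for item in extraction.get("line_items", []))
    if abs(line_sum - extraction.get("stated_total", 0)) > 0.01:
        errors.append(f"line items sum to {line_sum} but stated_total is "
                      f"{extraction.get('stated_total')}")
    return errors

def extract_with_retry(document: str, call_model, max_retries: int = 2) -> dict:
    prompt = f"Extract the invoice fields from:\n{document}"
    extraction = call_model(prompt)
    for _ in range(max_retries):
        errors = validate(extraction)
        if not errors:
            break
        # Include the document, the failed output, and the specific errors
        # so the model can self-correct rather than guess.
        prompt = (f"Extract the invoice fields from:\n{document}\n\n"
                  f"Your previous extraction was:\n{extraction}\n\n"
                  f"It failed validation:\n- " + "\n- ".join(errors) +
                  "\nPlease correct it.")
        extraction = call_model(prompt)
    return extraction
```

Note the loop only helps with format and structural errors; if the information is absent from the document, no number of retries will recover it.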
Design efficient batch processing strategies

Knowledge of

  • The Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA
  • Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation) and inappropriate for blocking workflows (pre-merge checks)
  • The batch API does not support multi-turn tool calling within a single request (cannot execute tools mid-request and return results)
  • custom_id fields for correlating batch request/response pairs

Skills in

  • Matching API approach to workflow latency requirements: synchronous API for blocking pre-merge checks, batch API for overnight/weekly analysis
  • Calculating batch submission frequency from SLA constraints (e.g., 4-hour submission windows plus 24-hour batch processing stays within a 30-hour SLA)
  • Handling batch failures: resubmitting only failed documents (identified by custom_id) with appropriate modifications (e.g., chunking documents that exceeded context limits)
  • Using prompt refinement on a sample set before batch-processing large volumes to maximize first-pass success rates and reduce iterative resubmission costs
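
Request construction and failure handling keyed on custom_id can be sketched as follows. The document IDs and model string are illustrative; the actual submission would pass the request list to the Message Batches API (client.messages.batches.create in the Python SDK).

```python
def build_requests(documents: dict[str, str]) -> list[dict]:
    """One batch request per document, correlated by custom_id."""
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": "claude-sonnet-4-5",  # illustrative model id
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": f"Extract fields from:\n{text}"}
                ],
            },
        }
        for doc_id, text in documents.items()
    ]

def failed_ids(results: list[dict]) -> set[str]:
    """Select only failed documents for resubmission (possibly after chunking)."""
    return {r["custom_id"] for r in results if r["result"]["type"] != "succeeded"}

requests = build_requests({"doc-1": "...", "doc-2": "..."})
retry = failed_ids([
    {"custom_id": "doc-1", "result": {"type": "succeeded"}},
    {"custom_id": "doc-2", "result": {"type": "errored"}},
])
```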
Design multi-instance and multi-pass review architectures

Knowledge of

  • Self-review limitations: a model retains reasoning context from generation, making it less likely to question its own decisions in the same session
  • Independent review instances (without prior reasoning context) are more effective at catching subtle issues than self-review instructions or extended thinking
  • Multi-pass review: splitting large reviews into per-file local analysis passes plus cross-file integration passes to avoid attention dilution and contradictory findings

Skills in

  • Using a second independent Claude instance to review generated code without the generator's reasoning context
  • Splitting large multi-file reviews into focused per-file passes for local issues plus separate integration passes for cross-file data flow analysis
  • Running verification passes where the model self-reports confidence alongside each finding to enable calibrated review routing

Context Management & Reliability

Domain 05
Manage conversation context to preserve critical information across long interactions

Knowledge of

  • Progressive summarization risks: condensing numerical values, percentages, dates, and customer-stated expectations into vague summaries
  • The "lost in the middle" effect: models reliably process information at the beginning and end of long inputs but may omit findings from middle sections
  • How tool results accumulate in context and consume tokens disproportionately to their relevance (e.g., 40+ fields per order lookup when only 5 are relevant)
  • The importance of passing complete conversation history in subsequent API requests to maintain conversational coherence

Skills in

  • Extracting transactional facts (amounts, dates, order numbers, statuses) into a persistent "case facts" block included in each prompt, outside summarized history
  • Extracting and persisting structured issue data (order IDs, amounts, statuses) into a separate context layer for multi-issue sessions
  • Trimming verbose tool outputs to only relevant fields before they accumulate in context (e.g., keeping only return-relevant fields from order lookups)
  • Placing key findings summaries at the beginning of aggregated inputs and organizing detailed results with explicit section headers to mitigate position effects
  • Requiring subagents to include metadata (dates, source locations, methodological context) in structured outputs to support accurate downstream synthesis
  • Modifying upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains when downstream agents have limited context budgets
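
Two of the patterns above can be sketched together: trimming verbose tool results before they enter context, and carrying a persistent case-facts block outside the summarized history. The field names are hypothetical.

```python
# Hypothetical set of return-relevant fields from a 40+ field order lookup.
RETURN_RELEVANT = {"order_id", "status", "total", "purchase_date",
                   "return_window_ends"}

def trim_order_lookup(raw: dict) -> dict:
    """Drop irrelevant fields so tool output stops consuming context tokens."""
    return {k: v for k, v in raw.items() if k in RETURN_RELEVANT}

def build_prompt(case_facts: dict, summarized_history: str, user_msg: str) -> str:
    """Keep transactional facts verbatim, outside the lossy summary."""
    facts = "\n".join(f"- {k}: {v}" for k, v in case_facts.items())
    return (f"Case facts (authoritative, do not paraphrase):\n{facts}\n\n"
            f"Conversation summary:\n{summarized_history}\n\n"
            f"Customer: {user_msg}")

trimmed = trim_order_lookup({"order_id": "A-1001", "status": "delivered",
                             "total": 84.50, "warehouse_bin": "14C",
                             "purchase_date": "2024-11-02"})
prompt = build_prompt(trimmed, "Customer asked about returning a jacket.",
                      "Can I still return it?")
```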
Design effective escalation and ambiguity resolution patterns

Knowledge of

  • Appropriate escalation triggers: customer requests for a human, policy exceptions/gaps (not just complex cases), and inability to make meaningful progress
  • The distinction between escalating immediately when a customer explicitly demands it versus offering to resolve when the issue is straightforward
  • Why sentiment-based escalation and self-reported confidence scores are unreliable proxies for actual case complexity
  • How multiple customer matches require clarification (requesting additional identifiers) rather than heuristic selection

Skills in

  • Adding explicit escalation criteria with few-shot examples to the system prompt demonstrating when to escalate versus resolve autonomously
  • Honoring explicit customer requests for human agents immediately without first attempting investigation
  • Acknowledging frustration while offering resolution when the issue is within the agent's capability, escalating only if the customer reiterates their preference
  • Escalating when policy is ambiguous or silent on the customer's specific request (e.g., competitor price matching when policy only addresses own-site adjustments)
  • Instructing the agent to ask for additional identifiers when tool results return multiple matches, rather than selecting based on heuristics
Implement error propagation strategies across multi-agent systems

Knowledge of

  • Structured error context (failure type, attempted query, partial results, alternative approaches) as enabling intelligent coordinator recovery decisions
  • The distinction between access failures (timeouts needing retry decisions) and valid empty results (successful queries with no matches)
  • Why generic error statuses ("search unavailable") hide valuable context from the coordinator
  • Why silently suppressing errors (returning empty results as success) or terminating entire workflows on single failures are both anti-patterns

Skills in

  • Returning structured error context including failure type, what was attempted, partial results, and potential alternatives to enable coordinator recovery
  • Distinguishing access failures from valid empty results in error reporting so the coordinator can make appropriate decisions
  • Having subagents implement local recovery for transient failures and only propagate errors they cannot resolve, including what was attempted and partial results
  • Structuring synthesis output with coverage annotations indicating which findings are well-supported versus which topic areas have gaps due to unavailable sources
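
The structured error context above can be sketched as a small result type whose fields mirror those bullets; the names and status values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class SubagentResult:
    status: str                 # "ok", "empty", or "failed"
    failure_type: str = ""      # e.g. "timeout"; empty for valid empty results
    attempted: str = ""         # what query or approach was tried
    partial_results: list = field(default_factory=list)
    alternatives: list = field(default_factory=list)  # approaches to try next

def classify(result: SubagentResult) -> str:
    """Distinguish access failures (retry candidates) from valid empty results."""
    if result.status == "failed":
        return "retry-or-reroute"   # coordinator may retry or use alternatives
    if result.status == "empty":
        return "accept-no-matches"  # successful query, genuinely no matches
    return "accept"

r = SubagentResult(status="failed", failure_type="timeout",
                   attempted="site-restricted search",
                   alternatives=["general web search", "cached index"])
decision = classify(r)
```

Returning a generic "search unavailable" instead of this structure would hide the attempted query and alternatives the coordinator needs for recovery.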
Manage context effectively in large codebase exploration

Knowledge of

  • Context degradation in extended sessions: models start giving inconsistent answers and referencing "typical patterns" rather than specific classes discovered earlier
  • The role of scratchpad files for persisting key findings across context boundaries
  • Subagent delegation for isolating verbose exploration output while the main agent coordinates high-level understanding
  • Structured state persistence for crash recovery: each agent exports state to a known location, and the coordinator loads a manifest on resume

Skills in

  • Spawning subagents to investigate specific questions (e.g., "find all test files," "trace refund flow dependencies") while the main agent preserves high-level coordination
  • Having agents maintain scratchpad files recording key findings, referencing them for subsequent questions to counteract context degradation
  • Summarizing key findings from one exploration phase before spawning subagents for the next phase, injecting summaries into initial context
  • Designing crash recovery using structured agent state exports (manifests) that the coordinator loads on resume and injects into agent prompts
  • Using /compact to reduce context usage during extended exploration sessions when context fills with verbose discovery output
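
The crash-recovery pattern above can be sketched as state exports plus a manifest the coordinator loads on resume. The directory layout and field names are hypothetical.

```python
import json
from pathlib import Path

STATE_DIR = Path("agent_state")  # hypothetical known location

def export_state(agent_id: str, findings: dict) -> None:
    """Each agent writes its findings and refreshes the shared manifest."""
    STATE_DIR.mkdir(exist_ok=True)
    (STATE_DIR / f"{agent_id}.json").write_text(json.dumps(findings))
    manifest = {p.stem: str(p)
                for p in STATE_DIR.glob("*.json") if p.stem != "manifest"}
    (STATE_DIR / "manifest.json").write_text(json.dumps(manifest))

def resume() -> dict:
    """On restart, load the manifest and rehydrate each agent's findings
    for injection into agent prompts."""
    manifest = json.loads((STATE_DIR / "manifest.json").read_text())
    return {agent: json.loads(Path(path).read_text())
            for agent, path in manifest.items()}

export_state("test-mapper", {"test_files": ["tests/test_refunds.py"]})
state = resume()
```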
Design human review workflows and confidence calibration

Knowledge of

  • The risk that aggregate accuracy metrics (e.g., 97% overall) may mask poor performance on specific document types or fields
  • Stratified random sampling for measuring error rates in high-confidence extractions and detecting novel error patterns
  • Field-level confidence scores calibrated using labeled validation sets for routing review attention
  • The importance of validating accuracy by document type and field segment before automating high-confidence extractions

Skills in

  • Implementing stratified random sampling of high-confidence extractions for ongoing error rate measurement and novel pattern detection
  • Analyzing accuracy by document type and field to verify consistent performance across all segments before reducing human review
  • Having models output field-level confidence scores, then calibrating review thresholds using labeled validation sets
  • Routing extractions with low model confidence or ambiguous/contradictory source documents to human review, prioritizing limited reviewer capacity
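
Stratified sampling and confidence-based routing can be sketched as follows; the strata (document types), thresholds, and field names are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(extractions: list[dict], per_stratum: int,
                      seed: int = 0) -> list[dict]:
    """Sample evenly across document types so no segment goes unmeasured,
    even when aggregate accuracy looks high."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in extractions:
        strata[e["doc_type"]].append(e)
    sample = []
    for docs in strata.values():
        sample.extend(rng.sample(docs, min(per_stratum, len(docs))))
    return sample

def needs_review(extraction: dict, threshold: float = 0.9) -> bool:
    """Route low-confidence or flagged-ambiguous extractions to humans."""
    return (extraction["confidence"] < threshold
            or extraction.get("conflict_detected", False))

pool = [{"doc_type": "invoice", "confidence": 0.97},
        {"doc_type": "invoice", "confidence": 0.99},
        {"doc_type": "contract", "confidence": 0.95}]
audit = stratified_sample(pool, per_stratum=1)
```

In practice the threshold would be calibrated against a labeled validation set rather than fixed up front.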
Preserve information provenance and handle uncertainty in multi-source synthesis

Knowledge of

  • How source attribution is lost during summarization steps when findings are compressed without preserving claim-source mappings
  • The importance of structured claim-source mappings that the synthesis agent must preserve and merge when combining findings
  • How to handle conflicting statistics from credible sources: annotating conflicts with source attribution rather than arbitrarily selecting one value
  • Temporal data: requiring publication/collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions

Skills in

  • Requiring subagents to output structured claim-source mappings (source URLs, document names, relevant excerpts) that downstream agents preserve through synthesis
  • Structuring reports with explicit sections distinguishing well-established findings from contested ones, preserving original source characterizations and methodological context
  • Completing document analysis with conflicting values included and explicitly annotated, letting the coordinator decide how to reconcile before passing to synthesis
  • Requiring subagents to include publication or data collection dates in structured outputs to enable correct temporal interpretation
  • Rendering different content types appropriately in synthesis outputs — financial data as tables, news as prose, technical findings as structured lists — rather than converting everything to a uniform format
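
The claim-source mapping pattern above can be sketched as a merge step that preserves every source and annotates conflicts instead of picking a winner. The claim IDs, values, and URLs are illustrative.

```python
def merge_claims(subagent_outputs: list[list[dict]]) -> list[dict]:
    """Merge findings across subagents, preserving claim-source mappings
    (URL, date) and flagging conflicting values for the coordinator."""
    by_claim: dict[str, dict] = {}
    for output in subagent_outputs:
        for claim in output:
            key = claim["claim_id"]
            if key not in by_claim:
                by_claim[key] = {**claim, "sources": [claim["source"]]}
            else:
                merged = by_claim[key]
                merged["sources"].append(claim["source"])
                if claim["value"] != merged["value"]:
                    merged["conflict_detected"] = True  # annotate, don't pick
    return list(by_claim.values())

merged = merge_claims([
    [{"claim_id": "market-size", "value": "12B",
      "source": {"url": "https://example.com/a", "date": "2023-05"}}],
    [{"claim_id": "market-size", "value": "15B",
      "source": {"url": "https://example.com/b", "date": "2024-02"}}],
])
```

Because each source carries a date, downstream synthesis can also check whether a flagged "conflict" is really just a temporal difference between publications.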