Building Code Factories: The New Dawn of Software Engineering
24/7 Autonomous Development Cycles

The Challenge
We aimed to create a system capable of generating high-quality software without constant human intervention. However, we quickly discovered that naive loop solutions, where you simply cycle an LLM until a task is done, fall apart on complex software engineering.
LLM Drift & Context Loss
The primary obstacle was LLM drift. As the model worked, its limited context window filled with new requests and tool calls; earlier content was compressed away and the model forgot its original instructions. The result was an insidious kind of "graceful degradation": rather than failing loudly, the model ran out of valid solutions and began to hallucinate, producing broken code or simply lying to satisfy the prompt.
The Human Bottleneck
AI pair programming worked during business hours, but developers had to babysit it: approving each step, providing feedback, guiding the AI through complex tasks. Nobody was letting it run on its own, so progress stopped when everyone went home.
Coordination Chaos
Large features need architecture, implementation, testing, and documentation. Without a way to coordinate these different concerns, AI assistance stayed stuck on single tasks. We needed a system that maintained context and ensured code integrity.
Quality at Scale
As AI-generated code volume increased, the review burden threatened to overwhelm senior engineers. Manual code review couldn't keep pace with the output of AI-assisted development.
How We Solved It
We flipped the model: engineers should build the factories that create code, not the code itself. To bridge the gap between simple automation and true autonomy, we moved beyond basic Retrieval-Augmented Generation (RAG) and implemented advanced context engineering using intelligent agent harnesses powered by Claude Opus.
Claude's architecture proved ideal for this system. Claude Projects provide persistent, shared context across agent sessions—architectural decisions, coding standards, and domain knowledge that every agent inherits. Claude Skills encode reusable behaviors: how to write tests, how to structure PRs, how to handle specific frameworks. Together, they solve the context management problem that breaks naive agent loops.
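To make that concrete, here is a minimal sketch of how shared context can reach every session: project-level files (standards, domain notes) and the skill files a task needs are concatenated into the system prompt. The `project/` and `skills/` layout and file names are illustrative conventions of ours, not an Anthropic API.

```python
from pathlib import Path

def build_system_prompt(skills: list[str]) -> str:
    """Assemble a session's system prompt from shared project files
    plus task-specific skill documents (the layout is our convention)."""
    project = "\n\n".join(p.read_text() for p in sorted(Path("project").glob("*.md")))
    skill_docs = "\n\n".join(Path(f"skills/{name}.md").read_text() for name in skills)
    return f"{project}\n\n# Active skills\n{skill_docs}"

# Every agent inherits the same project context; only the skills vary.
prompt = build_system_prompt(["write_tests", "structure_prs"])
```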
Our solution relied on three core architectural principles:
Multi-Agent Decomposition
Instead of one agent doing everything, we use specialized sub-agents to handle distinct parts of the lifecycle. No "god-mode" agents: each has a specific toolbelt and clear boundaries.
Spec Coding & Context Management
Detailed planning sessions decompose features into small, focused tasks. Each task spawns a fresh agent session with only the relevant context loaded. This prevents the context bloat that plagues single-loop systems.
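A minimal sketch of the per-task session spawn, assuming a `file:`-prefixed convention in the spec for listing relevant files and a placeholder Opus model id; the real harness adds tool use and retries.

```python
from pathlib import Path
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY

client = anthropic.Anthropic()

def run_task(spec_path: str, model: str = "claude-opus-4-20250514") -> str:
    """Spawn a fresh, one-shot agent session scoped to a single task spec."""
    spec = Path(spec_path).read_text()
    # The spec names its own relevant files; nothing else enters the context.
    files = [line.split(":", 1)[1].strip()
             for line in spec.splitlines() if line.startswith("file:")]
    context = "\n\n".join(f"### {p}\n{Path(p).read_text()}" for p in files)
    response = client.messages.create(
        model=model,  # placeholder id; use your deployed Opus model
        max_tokens=4096,
        system="You are an engineer agent. Complete only the task in this spec.",
        messages=[{"role": "user", "content": f"{spec}\n\n{context}"}],
    )
    return response.content[0].text
```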
Validation Layers
Structured decision trees gate agent output. Each step must pass explicit validation (compile checks, test runs, linter rules) before proceeding. If validation fails, the system reports failure rather than allowing the model to hallucinate past it.
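In code, a gate is nothing exotic: run each check in order and raise on the first nonzero exit instead of letting the loop continue. The commands below are illustrative; ours vary per language.

```python
import subprocess

GATES = [
    ("compile", ["bazel", "build", "//..."]),
    ("lint",    ["ruff", "check", "."]),
    ("test",    ["bazel", "test", "//..."]),
]

def validate(workdir: str) -> None:
    """Run every gate in order; fail loudly on the first broken one."""
    for name, cmd in GATES:
        result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if result.returncode != 0:
            # Explicit failure with captured output; never silently proceed.
            raise RuntimeError(f"gate '{name}' failed:\n{result.stderr or result.stdout}")
```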
The Code Factory Model
Decompose
Break features into isolated tasks with their own context windows via DAG orchestration
Execute
Run parallel engineer agents in sandboxed Git worktrees until success criteria are met
Validate
Gate every output with compile checks, tests, and linting. Fail explicitly or pass cleanly.
What We Built
We constructed a "factory" orchestrated by a high-level agent that manages the entire software development lifecycle through a Directed Acyclic Graph (DAG).
The Orchestrator & Product Manager
An orchestrator agent receives a feature request and passes it to a Product Manager agent. This sub-agent decomposes the request into granular tasks populated on a Kanban board. Each task spawns a fresh agent session loaded only with relevant files and specs. Task outputs are persisted to disk, not carried in context.
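A sketch of the task artifact this stage emits (field names are our convention): each task is written to disk so downstream agents load it cold instead of inheriting a bloated context.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class Task:
    id: str
    title: str
    spec: str                                       # acceptance criteria + verification steps
    files: list[str] = field(default_factory=list)  # only these enter the agent's context
    depends_on: list[str] = field(default_factory=list)
    status: str = "todo"                            # todo -> in_progress -> review -> done

def persist(task: Task, board_dir: str = "tasks") -> Path:
    """Write the task to the board directory; agents read it from disk."""
    path = Path(board_dir) / f"{task.id}.json"
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(asdict(task), indent=2))
    return path
```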
Intelligent Dependency Management (DAG)
The system uses the DAG to visualize and manage task flow, identifying which tasks must be completed sequentially and which can run in parallel. Frontend UI and backend shopping-cart logic, for example, can be built simultaneously.
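The scheduling logic is standard topological sorting; here is a sketch using Python's built-in graphlib (task names are illustrative):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "spec":         set(),
    "frontend_ui":  {"spec"},
    "cart_backend": {"spec"},
    "integration":  {"frontend_ui", "cart_backend"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    batch = list(ts.get_ready())   # everything in a batch can run in parallel
    print("dispatch in parallel:", batch)
    for task in batch:             # in the factory, agents complete these concurrently
        ts.done(task)
```

Here `frontend_ui` and `cart_backend` land in the same batch, which is exactly the frontend/backend parallelism described above.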
Parallel Engineer Agents with Sandboxing
Specialized engineer sub-agents powered by Claude Opus pick up tasks from the queue. To prevent conflicts, we use sandboxing and Git worktrees, allowing agents to code in parallel without overwriting each other's work. Each agent session loads from a shared Claude Project containing the codebase's architectural standards, then applies task-specific Claude Skills for the work at hand—whether that's writing React components, implementing API endpoints, or expanding test coverage. Agents interact with external tools through MCP (Model Context Protocol)—the open standard for AI-to-system communication. MCP servers expose Git operations, file system access, Bazel builds, and issue tracker integration through a unified interface.
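Per-agent sandboxing reduces to one Git feature; a hedged sketch (paths and branch naming are ours):

```python
import subprocess
from pathlib import Path

def create_sandbox(repo: str, task_id: str) -> str:
    """Give an agent its own branch and working directory via git worktree."""
    branch = f"agent/{task_id}"
    workdir = f"/tmp/worktrees/{task_id}"
    Path(workdir).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, workdir],
        check=True,
    )
    return workdir  # agent runs with cwd=workdir; cleanup via `git worktree remove`
```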
Architecture Review Agent
Before code reaches QA, an Architecture Review Agent evaluates it against codified company standards. This agent runs on Claude Opus with a dedicated Claude Project containing the organization's architectural principles, naming conventions, security patterns, and code style. Claude Skills encode the review checklist: what to flag, what to approve, when to escalate. The same documentation that onboards humans now guides agents. It catches architectural drift early, ensuring the factory outputs code that fits the organization, not just code that compiles.
Automated Quality Assurance
A dedicated QA agent validates code quality. It runs the test suite and browser automation to verify the application actually works. When tests fail, it routes the task back to an engineer agent with failure context. Only when all validations pass does it mark the task ready for human review and open a PR.
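The routing logic, sketched with the agent and tooling calls injected as plain callables (all hypothetical names; the real system runs these as separate agent sessions):

```python
MAX_ATTEMPTS = 3  # illustrative escalation threshold

def qa_loop(task, run_engineer, run_tests, open_pr, escalate) -> None:
    """Route failures back to an engineer agent with context; PR only on green."""
    failure_context = None
    for _ in range(MAX_ATTEMPTS):
        run_engineer(task, failure_context)
        passed, output = run_tests(task)   # test suite + browser automation
        if passed:
            open_pr(task)                  # ready for human review
            return
        failure_context = output           # route back WITH the failing output
    escalate(task, failure_context)        # fail loudly; never fabricate a pass
```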
Structured Requirements: The Factory Input
Autonomous agents need clear success criteria to know when they're done
Spec Generation Agent
A dedicated agent uses a reusable prompt to generate structured specs from feature requests. It produces acceptance criteria and verification steps that downstream agents treat as their "contract."
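The shape of a generated spec, as an illustration (field names and values are our convention, not a standard):

```python
spec = {
    "task": "Add shopping-cart total endpoint",
    "acceptance_criteria": [
        "GET /cart/{id}/total returns the sum of line items",
        "An empty cart returns 0 with HTTP 200",
    ],
    "verification": [
        "bazel test //services/cart/...",
        "browser smoke test against the dev environment",
    ],
    "files": ["services/cart/api.py", "services/cart/models.py"],
}
```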
Test-Driven Specifications
For well-defined features, engineers write failing tests first. The agent's success criterion is simple: make all tests pass. This provides an unambiguous "done" signal for the autonomous loop.
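For example, a contract for a hypothetical `cart` module might be nothing more than this failing test file; the agent iterates until it is green:

```python
# test_cart.py -- the agent's only success criterion is making these pass.
from cart import Cart  # module the agent is asked to implement

def test_add_item_updates_total():
    cart = Cart()
    cart.add_item("sku-123", price=9.99, qty=2)
    assert cart.total() == 19.98

def test_empty_cart_total_is_zero():
    assert Cart().total() == 0
```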
Issue Tracker Integration
Linear and GitHub Issues feed directly into the factory. Structured templates capture requirements. Agents get assigned issues like human developers, pull context, and deliver PRs.
Architecture Decision Records
For complex features, ADRs provide the "why" alongside the "what." Agents reference these to make consistent architectural choices without human guidance.
Agent-Friendly Codebase Architecture
The infrastructure that makes autonomous development possible
Monorepo Structure
A unified monorepo gives agents complete visibility into the codebase. No cross-repo coordination, no version mismatches, no hunting for dependencies. Agents see the same source of truth humans do—every service, library, and config in one place with consistent tooling.
Bazel Build System
Google's open-source build tool provides fast, reproducible builds across languages. Agents get deterministic feedback: the same inputs always produce the same outputs. Hermetic builds mean "it works on my machine" is never an excuse—if a build passes in Bazel, it passes everywhere.
Stacked Diffs
Layered code changes where each diff builds on the previous one. Agents decompose large features into small, reviewable chunks that land independently. No more 2,000-line PRs. Each stacked diff is atomic, testable, and easy to reason about.
Merge Queues
Automated systems that serialize PR merges to prevent conflicts in high-traffic repos. When multiple agents complete tasks simultaneously, the merge queue tests each PR against the latest main, rebases automatically, and merges in order. No manual conflict resolution.
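Conceptually the queue is a serialized loop; a sketch under simplified assumptions (real systems like GitHub's merge queue also handle conflict recovery and batching):

```python
import subprocess

def process_queue(repo: str, pr_branches: list[str]) -> None:
    """Rebase each PR onto the latest main, re-test, and merge in order."""
    for branch in pr_branches:                    # strict FIFO
        steps = [
            ["git", "checkout", branch],
            ["git", "rebase", "main"],            # replay onto the latest main
            ["bazel", "test", "//..."],           # re-validate after the rebase
        ]
        if any(subprocess.run(cmd, cwd=repo).returncode != 0 for cmd in steps):
            # A real queue would also `git rebase --abort` before moving on.
            print(f"{branch}: rebase or tests failed, evicted from queue")
            continue
        subprocess.run(["git", "checkout", "main"], cwd=repo, check=True)
        subprocess.run(["git", "merge", "--ff-only", branch], cwd=repo, check=True)
```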
Code Factory Architecture
Multi-agent orchestration with validation gates at every layer

Spec + Tests (TASK.md) · Issues (Linear, GitHub) · ADRs (architecture)
→ Orchestrator (routes requests)
→ PM Agent (decomposes into a DAG)
→ Task Queue (Kanban board)
→ Parallel Engineer Agents on Claude Opus (Agent 1, Agent 2, Agent 3, … Agent N)
→ Architecture Review (standards, patterns)
→ QA Agent (tests, browser)
→ Stacked Diff (atomic PR)
→ Merge Queue (auto-rebase)
→ Main Branch (production)
Technology Stack
Claude Opus · Claude Projects & Skills · MCP · Git worktrees · Bazel · stacked diffs · merge queues · Linear · GitHub
Measurable Results
24/7 Autonomous Operation
3x More Features Shipped
100% Failures Made Explicit
70% Less Time in Review
True Autonomy
The system runs 24/7 without human hand-holding. Engineers queue tasks with clear specs and test criteria before leaving, and return to completed, tested implementations the next morning. For tasks with unambiguous acceptance criteria (bug fixes, feature additions with existing patterns, test coverage expansion), the autonomous loop eliminated the human bottleneck entirely.
Parallel Execution via DAG
Because the DAG identifies independent tasks, agents build frontend and backend simultaneously while humans sleep. Compared to pre-factory velocity (AI-assisted but human-supervised), the team ships roughly 3x more features per sprint. The architecture scales by adding specialized agents; no redesign needed.
Explicit Failures Over Hallucinations
Validation gates at every step (compile, lint, test) catch broken code before it propagates. When agents hit uncertainty or repeated failures, they escalate to humans with full context rather than fabricating solutions. The system is designed to fail loudly, not gracefully degrade into nonsense.
Production-Ready Output
The factory autonomously pushes code to GitHub, creating pull requests pre-verified by QA and architecture review agents. Human reviewers no longer spend time catching style issues, missing tests, or pattern violations; agents handle that. Review cycles dropped ~70% because seniors focus only on substantive questions the agents flag for human judgment.
Engineers as Architects
The team stopped writing code and started designing systems. They shifted from execution to orchestration: defining quality criteria, describing desired outcomes in natural language, and making architectural decisions. The skill isn't syntax anymore. It's clearly specifying what "done" looks like.
Feedback Flywheels
When engineers review agent PRs, their edits and comments are logged. Rejected patterns trigger updates to the Claude Projects and Skills that agents load. Repeated issues lead to prompt refinements stored in version control alongside the code. The system improves because humans review outputs and those reviews update the guidance agents receive. Claude Skills evolve with the codebase.
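The plumbing can be as simple as appending lessons to a versioned guidance file that future sessions load (the path and format below are our convention):

```python
from datetime import date
from pathlib import Path

# Lives in the repo and is reviewed like any other code change.
GUIDANCE = Path("agent_guidance/review_lessons.md")

def record_review_feedback(pr_id: str, pattern: str, correction: str) -> None:
    """Append a rejected pattern and its fix to the guidance agents load."""
    GUIDANCE.parent.mkdir(exist_ok=True)
    with GUIDANCE.open("a") as f:
        f.write(
            f"\n## {date.today()} (PR {pr_id})\n"
            f"- Rejected pattern: {pattern}\n"
            f"- Preferred approach: {correction}\n"
        )
```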
"We stopped asking 'how do we write this code?' and started asking 'how do we build the system that writes this code?' That shift changed everything."
— Engineering Lead
Ready to build your own code factory?
We can help you set up autonomous development systems that change how your team ships software.