Building Code Factories: The New Dawn of Software Engineering
24/7 Autonomous Development Cycles

The Challenge
We aimed to create a system capable of generating high-quality software without constant human intervention. However, we quickly discovered that naive loop solutions, where you simply cycle an LLM until a task is done, fall apart on complex software engineering.
LLM Drift & Context Loss
The primary obstacle was LLM drift. As the model worked, its limited context window filled with new requests and tool calls; earlier content was compressed away and the model forgot its original instructions. The result was an insidious kind of "graceful degradation": rather than failing loudly, the model ran out of valid solutions and began to hallucinate, producing broken code or simply lying to satisfy the prompt.
The Human Bottleneck
AI pair programming worked during business hours, but developers had to babysit it: approving each step, providing feedback, guiding the AI through complex tasks. Nobody was letting it run on its own, so progress stopped when everyone went home.
Coordination Chaos
Large features need architecture, implementation, testing, and documentation. Without a way to coordinate these different concerns, AI assistance stayed stuck on single tasks. We needed a system that maintained context and ensured code integrity.
Quality at Scale
As AI-generated code volume increased, the review burden threatened to overwhelm senior engineers. Manual code review couldn't keep pace with the output of AI-assisted development.
How We Solved It
We flipped the model: engineers should build the factories that create code, not the code itself. To bridge the gap between simple automation and true autonomy, we moved beyond basic Retrieval-Augmented Generation (RAG) and implemented advanced context engineering using intelligent agent harnesses powered by Claude Opus.
Claude's architecture proved ideal for this system. Claude Projects provide persistent, shared context across agent sessions—architectural decisions, coding standards, and domain knowledge that every agent inherits. Claude Skills encode reusable behaviors: how to write tests, how to structure PRs, how to handle specific frameworks. Together, they solve the context management problem that breaks naive agent loops.
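To make that concrete, here is a minimal sketch of how shared context can reach every session: project-level files (standards, domain notes) and the skill files a task needs are concatenated into the system prompt. The `project/` and `skills/` layout and file names are illustrative conventions of ours, not an Anthropic API.

```python
from pathlib import Path

def build_system_prompt(skills: list[str]) -> str:
    """Assemble a session's system prompt from shared project files
    plus task-specific skill documents (the layout is our convention)."""
    project = "\n\n".join(p.read_text() for p in sorted(Path("project").glob("*.md")))
    skill_docs = "\n\n".join(Path(f"skills/{name}.md").read_text() for name in skills)
    return f"{project}\n\n# Active skills\n{skill_docs}"

# Every agent inherits the same project context; only the skills vary.
prompt = build_system_prompt(["write_tests", "structure_prs"])
```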
Our solution relied on three core architectural principles:
Multi-Agent Decomposition
Instead of one agent doing everything, we use specialized sub-agents to handle distinct parts of the lifecycle. No "god-mode" agents: each has a specific toolbelt and clear boundaries.
Spec Coding & Context Management
Detailed planning sessions decompose features into small, focused tasks. Each task spawns a fresh agent session with only the relevant context loaded. This prevents the context bloat that plagues single-loop systems.
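A minimal sketch of the per-task session spawn, assuming a `file:`-prefixed convention in the spec for listing relevant files and a placeholder Opus model id; the real harness adds tool use and retries.

```python
from pathlib import Path
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY

client = anthropic.Anthropic()

def run_task(spec_path: str, model: str = "claude-opus-4-20250514") -> str:
    """Spawn a fresh, one-shot agent session scoped to a single task spec."""
    spec = Path(spec_path).read_text()
    # The spec names its own relevant files; nothing else enters the context.
    files = [line.split(":", 1)[1].strip()
             for line in spec.splitlines() if line.startswith("file:")]
    context = "\n\n".join(f"### {p}\n{Path(p).read_text()}" for p in files)
    response = client.messages.create(
        model=model,  # placeholder id; use your deployed Opus model
        max_tokens=4096,
        system="You are an engineer agent. Complete only the task in this spec.",
        messages=[{"role": "user", "content": f"{spec}\n\n{context}"}],
    )
    return response.content[0].text
```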
Validation Layers
Structured decision trees gate agent output. Each step must pass explicit validation (compile checks, test runs, linter rules) before proceeding. If validation fails, the system reports failure rather than allowing the model to hallucinate past it.
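In code, a gate is nothing exotic: run each check in order and raise on the first nonzero exit instead of letting the loop continue. The commands below are illustrative; ours vary per language.

```python
import subprocess

GATES = [
    ("compile", ["bazel", "build", "//..."]),
    ("lint",    ["ruff", "check", "."]),
    ("test",    ["bazel", "test", "//..."]),
]

def validate(workdir: str) -> None:
    """Run every gate in order; fail loudly on the first broken one."""
    for name, cmd in GATES:
        result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if result.returncode != 0:
            # Explicit failure with captured output; never silently proceed.
            raise RuntimeError(f"gate '{name}' failed:\n{result.stderr or result.stdout}")
```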
The Code Factory Model
Decompose
Break features into isolated tasks with their own context windows via DAG orchestration
Execute
Run parallel engineer agents in sandboxed Git worktrees until success criteria are met
Validate
Gate every output with compile checks, tests, and linting. Fail explicitly or pass cleanly.
What We Built
We constructed a "factory" orchestrated by a high-level agent that manages the entire software development lifecycle through a Directed Acyclic Graph (DAG).
The Orchestrator & Product Manager
An orchestrator agent receives a feature request and passes it to a Product Manager agent. This sub-agent decomposes the request into granular tasks populated on a Kanban board. Each task spawns a fresh agent session loaded only with relevant files and specs. Task outputs are persisted to disk, not carried in context.
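A sketch of the task artifact this stage emits (field names are our convention): each task is written to disk so downstream agents load it cold instead of inheriting a bloated context.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class Task:
    id: str
    title: str
    spec: str                                       # acceptance criteria + verification steps
    files: list[str] = field(default_factory=list)  # only these enter the agent's context
    depends_on: list[str] = field(default_factory=list)
    status: str = "todo"                            # todo -> in_progress -> review -> done

def persist(task: Task, board_dir: str = "tasks") -> Path:
    """Write the task to the board directory; agents read it from disk."""
    path = Path(board_dir) / f"{task.id}.json"
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(asdict(task), indent=2))
    return path
```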
Intelligent Dependency Management (DAG)
The system uses the DAG to visualize and manage task flow, identifying which tasks must be completed sequentially and which can run in parallel. Frontend UI and backend shopping-cart logic, for example, can be built simultaneously.
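The scheduling logic is standard topological sorting; here is a sketch using Python's built-in graphlib (task names are illustrative):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "spec":         set(),
    "frontend_ui":  {"spec"},
    "cart_backend": {"spec"},
    "integration":  {"frontend_ui", "cart_backend"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    batch = list(ts.get_ready())   # everything in a batch can run in parallel
    print("dispatch in parallel:", batch)
    for task in batch:             # in the factory, agents complete these concurrently
        ts.done(task)
```

Here `frontend_ui` and `cart_backend` land in the same batch, which is exactly the frontend/backend parallelism described above.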
Parallel Engineer Agents with Sandboxing
Specialized engineer sub-agents powered by Claude Opus pick up tasks from the queue. To prevent conflicts, we use sandboxing and Git worktrees, allowing agents to code in parallel without overwriting each other's work. Each agent session loads from a shared Claude Project containing the codebase's architectural standards, then applies task-specific Claude Skills for the work at hand—whether that's writing React components, implementing API endpoints, or expanding test coverage. Agents interact with external tools through MCP (Model Context Protocol)—the open standard for AI-to-system communication. MCP servers expose Git operations, file system access, Bazel builds, and issue tracker integration through a unified interface.
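Per-agent sandboxing reduces to one Git feature; a hedged sketch (paths and branch naming are ours):

```python
import subprocess
from pathlib import Path

def create_sandbox(repo: str, task_id: str) -> str:
    """Give an agent its own branch and working directory via git worktree."""
    branch = f"agent/{task_id}"
    workdir = f"/tmp/worktrees/{task_id}"
    Path(workdir).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, workdir],
        check=True,
    )
    return workdir  # agent runs with cwd=workdir; cleanup via `git worktree remove`
```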
Architecture Review Agent
Before code reaches QA, an Architecture Review Agent evaluates it against codified company standards. This agent runs on Claude Opus with a dedicated Claude Project containing the organization's architectural principles, naming conventions, security patterns, and code style. Claude Skills encode the review checklist: what to flag, what to approve, when to escalate. The same documentation that onboards humans now guides agents. It catches architectural drift early, ensuring the factory outputs code that fits the organization, not just code that compiles.
Automated Quality Assurance
A dedicated QA agent validates code quality. It runs the test suite and browser automation to verify the application actually works. When tests fail, it routes the task back to an engineer agent with failure context. Only when all validations pass does it mark the task ready for human review and open a PR.
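The routing logic, sketched with the agent and tooling calls injected as plain callables (all hypothetical names; the real system runs these as separate agent sessions):

```python
MAX_ATTEMPTS = 3  # illustrative escalation threshold

def qa_loop(task, run_engineer, run_tests, open_pr, escalate) -> None:
    """Route failures back to an engineer agent with context; PR only on green."""
    failure_context = None
    for _ in range(MAX_ATTEMPTS):
        run_engineer(task, failure_context)
        passed, output = run_tests(task)   # test suite + browser automation
        if passed:
            open_pr(task)                  # ready for human review
            return
        failure_context = output           # route back WITH the failing output
    escalate(task, failure_context)        # fail loudly; never fabricate a pass
```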
Structured Requirements: The Factory Input
Autonomous agents need clear success criteria to know when they're done
Spec Generation Agent
A dedicated agent uses a reusable prompt to generate structured specs from feature requests. It produces acceptance criteria and verification steps that downstream agents treat as their "contract."
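The shape of a generated spec, as an illustration (field names and values are our convention, not a standard):

```python
spec = {
    "task": "Add shopping-cart total endpoint",
    "acceptance_criteria": [
        "GET /cart/{id}/total returns the sum of line items",
        "An empty cart returns 0 with HTTP 200",
    ],
    "verification": [
        "bazel test //services/cart/...",
        "browser smoke test against the dev environment",
    ],
    "files": ["services/cart/api.py", "services/cart/models.py"],
}
```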
Test-Driven Specifications
For well-defined features, engineers write failing tests first. The agent's success criterion is simple: make all tests pass. This provides an unambiguous "done" signal for the autonomous loop.
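For example, a contract for a hypothetical `cart` module might be nothing more than this failing test file; the agent iterates until it is green:

```python
# test_cart.py -- the agent's only success criterion is making these pass.
from cart import Cart  # module the agent is asked to implement

def test_add_item_updates_total():
    cart = Cart()
    cart.add_item("sku-123", price=9.99, qty=2)
    assert cart.total() == 19.98

def test_empty_cart_total_is_zero():
    assert Cart().total() == 0
```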
Issue Tracker Integration
Linear and GitHub Issues feed directly into the factory. Structured templates capture requirements. Agents get assigned issues like human developers, pull context, and deliver PRs.
Architecture Decision Records
For complex features, ADRs provide the "why" alongside the "what." Agents reference these to make consistent architectural choices without human guidance.
Agent-Friendly Codebase Architecture
The infrastructure that makes autonomous development possible
Monorepo Structure
A unified monorepo gives agents complete visibility into the codebase. No cross-repo coordination, no version mismatches, no hunting for dependencies. Agents see the same source of truth humans do—every service, library, and config in one place with consistent tooling.
Bazel Build System
Google's open-source build tool provides fast, reproducible builds across languages. Agents get deterministic feedback: the same inputs always produce the same outputs. Hermetic builds mean "it works on my machine" is never an excuse—if a build passes in Bazel, it passes everywhere.
Stacked Diffs
Layered code changes where each diff builds on the previous one. Agents decompose large features into small, reviewable chunks that land independently. No more 2,000-line PRs. Each stacked diff is atomic, testable, and easy to reason about.
Merge Queues
Automated systems that serialize PR merges to prevent conflicts in high-traffic repos. When multiple agents complete tasks simultaneously, the merge queue tests each PR against the latest main, rebases automatically, and merges in order. No manual conflict resolution.
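Conceptually the queue is a serialized loop; a sketch under simplified assumptions (real systems like GitHub's merge queue also handle conflict recovery and batching):

```python
import subprocess

def process_queue(repo: str, pr_branches: list[str]) -> None:
    """Rebase each PR onto the latest main, re-test, and merge in order."""
    for branch in pr_branches:                    # strict FIFO
        steps = [
            ["git", "checkout", branch],
            ["git", "rebase", "main"],            # replay onto the latest main
            ["bazel", "test", "//..."],           # re-validate after the rebase
        ]
        if any(subprocess.run(cmd, cwd=repo).returncode != 0 for cmd in steps):
            # A real queue would also `git rebase --abort` before moving on.
            print(f"{branch}: rebase or tests failed, evicted from queue")
            continue
        subprocess.run(["git", "checkout", "main"], cwd=repo, check=True)
        subprocess.run(["git", "merge", "--ff-only", branch], cwd=repo, check=True)
```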
Code Factory Architecture
Multi-agent orchestration with validation gates at every layer

Spec + Tests (TASK.md) · Issues (Linear, GitHub) · ADRs (architecture)
→ Orchestrator (routes requests)
→ PM Agent (decomposes into a DAG)
→ Task Queue (Kanban board)
→ Parallel Engineer Agents on Claude Opus (Agent 1, Agent 2, Agent 3, … Agent N)
→ Architecture Review (standards, patterns)
→ QA Agent (tests, browser)
→ Stacked Diff (atomic PR)
→ Merge Queue (auto-rebase)
→ Main Branch (production)
Technology Stack
Claude Opus · Claude Projects & Skills · MCP · Git worktrees · Bazel · stacked diffs · merge queues · Linear · GitHub
Measurable Results
24/7 Autonomous Operation
3x More Features Shipped
100% Failures Made Explicit
70% Less Time in Review
True Autonomy
The system runs 24/7 without human hand-holding. Engineers queue tasks with clear specs and test criteria before leaving, and return to completed, tested implementations the next morning. For tasks with unambiguous acceptance criteria (bug fixes, feature additions with existing patterns, test coverage expansion), the autonomous loop eliminated the human bottleneck entirely.
Parallel Execution via DAG
Because the DAG identifies independent tasks, agents build frontend and backend simultaneously while humans sleep. Compared to pre-factory velocity (AI-assisted but human-supervised), the team ships roughly 3x more features per sprint. The architecture scales by adding specialized agents; no redesign needed.
Explicit Failures Over Hallucinations
Validation gates at every step (compile, lint, test) catch broken code before it propagates. When agents hit uncertainty or repeated failures, they escalate to humans with full context rather than fabricating solutions. The system is designed to fail loudly, not gracefully degrade into nonsense.
Production-Ready Output
The factory autonomously pushes code to GitHub, creating pull requests pre-verified by QA and architecture review agents. Human reviewers no longer spend time catching style issues, missing tests, or pattern violations; agents handle that. Review cycles dropped ~70% because seniors focus only on substantive questions the agents flag for human judgment.
Engineers as Architects
The team stopped writing code and started designing systems. They shifted from execution to orchestration: defining quality criteria, describing desired outcomes in natural language, and making architectural decisions. The skill isn't syntax anymore. It's clearly specifying what "done" looks like.
Feedback Flywheels
When engineers review agent PRs, their edits and comments are logged. Rejected patterns trigger updates to the Claude Projects and Skills that agents load. Repeated issues lead to prompt refinements stored in version control alongside the code. The system improves because humans review outputs and those reviews update the guidance agents receive. Claude Skills evolve with the codebase.
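The plumbing can be as simple as appending lessons to a versioned guidance file that future sessions load (the path and format below are our convention):

```python
from datetime import date
from pathlib import Path

# Lives in the repo and is reviewed like any other code change.
GUIDANCE = Path("agent_guidance/review_lessons.md")

def record_review_feedback(pr_id: str, pattern: str, correction: str) -> None:
    """Append a rejected pattern and its fix to the guidance agents load."""
    GUIDANCE.parent.mkdir(exist_ok=True)
    with GUIDANCE.open("a") as f:
        f.write(
            f"\n## {date.today()} (PR {pr_id})\n"
            f"- Rejected pattern: {pattern}\n"
            f"- Preferred approach: {correction}\n"
        )
```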
"We stopped asking 'how do we write this code?' and started asking 'how do we build the system that writes this code?' That shift changed everything."
— Engineering Lead
Ready to build your own code factory?
We can help you set up autonomous development systems that change how your team ships software.