Building GenAgent: A Journey Through Reliable Code Generation

November 18, 2025

Abhishek Pradhan and Rakesh Reddy

Building GenAgent marks an important step toward our broader vision of Agentic Data Analysis and the Autonomous Self-Driving Enterprise. As we imagine systems that can understand goals, generate the tools they need, and improve themselves over time, dependable code generation becomes a foundational capability. GenAgent is part of that path — not just a way to produce code, but a way to coordinate intelligent components that work together with increasing autonomy. This post explains how we approached that challenge and why it matters for creating enterprises that can eventually run, adapt, and evolve on their own.

We set out to build GenAgent with a clear goal: generate production-ready Python tools from natural language descriptions. Not just code that compiles, but tools that users could actually depend on—tools that work the first time, integrate cleanly into our Emergence Craft platform, and compose reliably when the orchestrator chains them together.

What we didn't anticipate was how many assumptions we'd need to challenge along the way.

The Deceptively Simple Beginning

"Create a tool to fetch YouTube video transcripts."

We typed this into a state-of-the-art LLM and got back code that looked perfect:

import youtube_dl

def fetch_transcript(video_url):
    ydl = youtube_dl.YoutubeDL({'writesubtitles': True})
    info = ydl.extract_info(video_url)
    return info['subtitles']['en']

Clean structure. Proper imports. Does exactly what the function name suggests. We hit run. It failed immediately.

The library (youtube_dl) had been deprecated for years. The API call was wrong—subtitles aren't accessed through extract_info. The code assumed every video had transcripts, and that they'd be in English. There were no type hints, making the function unusable for our platform's schema generation.

This wasn't an isolated case. We tried dozens of prompts across different domains. The pattern repeated: models generated code that looked right but broke when executed. The gap wasn't knowledge—these models had been trained on millions of lines of working code. The gap was process. Single-shot generation was skipping all the steps that working code actually requires: planning, research, interface design, validation.

We needed a different approach.

Breaking Down the Problem

We started with a hypothesis: what if we separated code generation into the same phases humans actually use when writing production code?

When you're asked to build a new feature, you don't immediately start typing implementation code. You think through the requirements. You research the APIs you'll need. You design the interface. You write the implementation. You test it. You iterate based on what breaks.

We built six specialized agents, each focused on one phase:

  • Planner: Break down the goal into concrete steps
  • Critic: Review plans and identify gaps before any code is written
  • Context Gatherer: Find relevant documentation and examples
  • Blueprint Designer: Create typed function specifications
  • Coder: Generate implementations
  • Executor: Run code in isolated environments and validate it works

An Orchestrator coordinates the workflow, deciding which agents to call and when.
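In outline, that first version can be sketched as a coordinator that threads one request through all six phases in a fixed order. The phase names are GenAgent's; the `ToolRequest` dataclass, call signatures, and stub agents below are illustrative, not our actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class ToolRequest:
    goal: str
    artifacts: dict = field(default_factory=dict)

def run_pipeline(request: ToolRequest, agents: dict) -> ToolRequest:
    # v1 behavior: every request walks all six phases in sequence,
    # each agent reading the state accumulated so far.
    for phase in ("planner", "critic", "context", "blueprint", "coder", "executor"):
        request.artifacts[phase] = agents[phase](request)
    return request

# Usage with trivial stub agents standing in for real LLM calls:
stubs = {name: (lambda req, n=name: f"{n} output for: {req.goal}")
         for name in ("planner", "critic", "context", "blueprint", "coder", "executor")}
pipeline_result = run_pipeline(ToolRequest("fetch YouTube transcripts"), stubs)
```

The rigidity is the point: this is the "every request went through all six phases" version that worked but wasted effort on simple tasks.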

The first version was clunky. Agents called tools in odd orders. The workflow was rigid—every request went through all six phases even when that made no sense. But something worked: the code that made it through the full pipeline actually ran.

The Planning Problem We Didn't Expect

Early on, we noticed something strange. The Planner would generate what seemed like solid plans. The Coder would follow those plans precisely. The code would compile and run without errors. But it wouldn't actually work correctly.

Here's what happened with the YouTube transcript request:

The Planner's initial output:

tasks:
- task: Define the entry point function `fetch_youtube_transcript`.
  description: Main function to retrieve transcript for a YouTube video.
  
- task: Extract video ID from URL.
  description: Parse the video_url to extract the unique video identifier.

- task: Fetch transcript using youtube-transcript-api.
  description: Use the YouTubeTranscriptApi to retrieve the transcript for the video ID.

- task: Return the transcript text.
  description: Return the fetched transcript as a string.

Technically accurate. The Coder implemented it exactly. The code ran. But it broke on videos without transcripts, failed on non-English videos, and didn't handle rate limiting.

The issue was confirmation bias. When we asked the Planner to review its own work, it consistently defended its approach. "Yes, this plan handles the core requirement." Missing edge cases stayed invisible.

We added a Critic agent.

Not just another prompt to the same model, but a separate agent with a different perspective. Its only job: poke holes in the plan. The Planner proposes. The Critic challenges. They debate until the plan is solid or they hit three rounds (we learned that more rounds rarely helped).
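The propose-challenge loop can be sketched in a few lines; the three-round cap matches what we describe above, while the stub planner and critic below are toy stand-ins for the real agents.

```python
def debate(planner, critic, goal, max_rounds=3):
    """Planner proposes; Critic challenges; stop when the Critic approves
    or after max_rounds (more rounds rarely helped in practice)."""
    plan = planner(goal, feedback=None)
    critique = None
    for _ in range(max_rounds):
        critique = critic(plan)
        if critique["recommendation"] == "approve":
            break
        plan = planner(goal, feedback=critique)  # revise using the critique
    return plan, critique

# Usage with stubs: the critic approves once error handling appears.
def stub_planner(goal, feedback):
    plan = ["fetch transcript"]
    if feedback:  # revision round: address the critique
        plan.append("handle TranscriptsDisabled")
    return plan

def stub_critic(plan):
    ok = "handle TranscriptsDisabled" in plan
    return {"recommendation": "approve" if ok else "major_revision_required"}

final_plan, verdict = debate(stub_planner, stub_critic, "youtube transcript tool")
```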

The dynamic changed immediately. On that same YouTube request:

Critic's response:

critique:

logical_completeness:
  missing_tasks:
    - issue: No error handling for transcript unavailability
      severity: critical
      description: The plan assumes all videos have transcripts available. Many videos don't have captions, which will cause the fetch to fail. Need to catch TranscriptsDisabled exception and provide clear error message to user.
      
    - issue: Language preference not specified
      severity: high
      description: Plan doesn't address how to handle non-English transcripts. Should allow users to specify preferred languages and fallback logic when preferred language is unavailable.
      
    - issue: Missing URL validation
      severity: medium
      description: No validation that the provided URL is actually a valid YouTube URL before attempting to extract video ID. Invalid URLs will cause cryptic errors downstream.

technical_accuracy:
  inconsistencies:
    - issue: Missing exception types
      description: Plan references catching exceptions but doesn't specify the exact exception types (TranscriptsDisabled, NoTranscriptFound) that youtube-transcript-api raises. This leads to overly broad exception handling.
      
    - issue: Return type ambiguity
      description: Task says "return the transcript" but doesn't specify the data structure. youtube-transcript-api returns a list of dictionaries with 'text', 'start', 'duration' fields, not a plain string.

clarity_and_structure:
  strengths:
    - Good separation into discrete tasks (extract ID, fetch transcript, return data)
    - Helper function approach is appropriate
  
  improvements_needed:
    - Specify the exact structure of transcript data being returned
    - Clarify whether to return raw transcript segments or concatenated text

robustness_and_error_handling:
  gaps:
    - issue: No rate limit handling
      severity: medium
      description: YouTube API has rate limits. Plan needs strategy for handling rate limit errors, implementing exponential backoff, or communicating limits to users.
      
    - issue: Network error handling missing
      severity: medium
      description: No handling for network failures when calling YouTube API. Should include retry logic for transient failures.
      
    - issue: Video ID extraction failure
      severity: high
      description: If URL parsing fails to extract valid video ID, what exception is raised? How should this be communicated to user?

assumptions_and_knowledge_gaps:
  implicit_assumptions:
    - Assumes video ID can always be extracted from any YouTube URL format (youtube.com, youtu.be, mobile URLs, embedded URLs all have different formats)
    - Assumes transcript, when available, is in a format directly usable without post-processing
    
  identified_knowledge_gaps:
    - "What's the exact exception type thrown by youtube-transcript-api when transcripts are disabled?"
    - "Does youtube-transcript-api handle language fallback automatically, or must we implement it?"
    - "What are all the YouTube URL formats we need to support for video ID extraction?"
    - "What's the structure of transcript segments returned - is it a list of dicts with 'text', 'start', 'duration'?"

recommendation: major_revision_required
reasoning: Critical gaps in error handling and unclear return types will lead to brittle implementation. Must address transcript unavailability and language handling before proceeding to implementation.

Planner's revision:

tasks:
- task: Define the entry point function `fetch_youtube_transcript`.
  description: Main function to retrieve transcript for a YouTube video with error handling and language support.
  
- task: Validate and parse the YouTube URL.
  description: Verify the URL is a valid YouTube format and extract the video ID. Raise InvalidVideoURL if format is incorrect.

- task: Fetch transcript with language preferences.
  description: Attempt to fetch transcript using the language preference list. Try each language in order until one succeeds.

- task: Handle transcript unavailability.
  description: Catch TranscriptsDisabled exception and provide a clear error message indicating the video has no captions.

- task: Implement rate limit handling.
  description: Wrap API calls in try-except to catch rate limit errors. Implement exponential backoff with 3 retries.

- task: Return structured transcript data.
  description: Return transcript segments as a list of dictionaries containing text, start time, and duration for each segment.

This debate happened before a single line of implementation code was written. Catching logical errors at the planning stage turned out to be dramatically cheaper than discovering them after execution failures.

The insight: Structured critique beats repeated generation. The Critic sees the plan from a different angle and catches issues that simple regeneration would miss.
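The revised plan's retry task reduces to a small exponential-backoff wrapper. This is a minimal sketch: the delays and three-retry cap follow the plan above, but the `RateLimited` exception name is an illustrative stand-in, not the transcript API's real error type.

```python
import time

class RateLimited(Exception):
    """Illustrative stand-in for a rate-limit error from the API."""

def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    # Try the call; on a rate-limit error, wait and double the delay
    # each attempt (1s, 2s, 4s), re-raising after the final retry.
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))

# Usage: a call that rate-limits twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return "transcript"

transcript_result = with_backoff(flaky, sleep=lambda s: None)  # skip real sleeping
```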

The Knowledge Cutoff Problem

Even with solid plans, we kept hitting a wall. Models consistently used deprecated libraries. They'd use youtube_dl when they should use youtube-transcript-api. They'd implement OAuth flows that APIs had moved away from years ago. They'd import packages that no longer existed.

The problem was obvious in retrospect: models are frozen in time. They can't know what's changed since their training cutoff. They're coding against a stale snapshot of the world.

But there was a deeper issue: models don't say "I don't know." Ask a model to generate code using the Stripe SDK, and it'll confidently write something that looks plausible. Completely wrong, but plausible. We'd only discover the hallucination during execution, after wasting time generating and validating code based on fantasy documentation.

We needed to make knowledge gaps explicit.

We modified the Planner to identify what it doesn't know:

framework_library_knowledge_gaps:
- "What Python library is recommended for interacting with the GitHub API (PyGithub vs requests)?"
- "How should we handle GitHub API pagination if repository metadata requires multiple API calls?"
- "What is the structure of the GitHub API response for the /repos endpoint, particularly the field names for stars, watchers, and forks?"
- "What are the recommended patterns for handling GitHub API rate limits and authentication errors?"

Now these explicit gaps trigger the Context Gatherer before code generation. The Coder receives accurate API information instead of generating from incorrect memory.

But context quality matters more than quantity.

Our first attempt at context retrieval: when the Planner identified missing information—say, the Stripe API spec—we'd dump the entire documentation into the prompt. Thousands of tokens of every endpoint, every parameter, every example.

Code generation slowed to a crawl. Costs multiplied. But worse: the generated code got less accurate. Models would find contradictory examples in different sections and pick the wrong one. Or they'd fixate on an irrelevant advanced feature instead of the simple approach.

We found that 5 highly relevant sections consistently outperformed comprehensive documentation dumps.

The Context Gatherer became a multi-tiered precision retrieval system:

Tier 1: Local Knowledge Base (curated docs)

  • Highest accuracy
  • Instant retrieval
  • Limited coverage (only what we've curated)
  • First check for common libraries and APIs

Tier 2: Context7 (structured API docs)

  • High accuracy
  • Fast retrieval
  • Good library coverage
  • Fallback for well-documented libraries

Tier 3: Web Search

  • Variable accuracy
  • Slower retrieval
  • Universal coverage
  • Used only for novel/recent information not in structured sources

The Context Gatherer uses semantic similarity scoring (vector embeddings) to select the top 5 most relevant sections from the winning tier. This reduced token usage substantially while improving code quality.
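The ranking step can be sketched as cosine similarity over embeddings. The toy bag-of-words embedder below stands in for the real embedding model; only the top-k selection logic mirrors what the Context Gatherer does.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_sections(query: str, sections: list[str], k: int = 5) -> list[str]:
    # Rank every candidate section by similarity to the query, keep top k.
    q = embed(query)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

docs = [
    "fetch transcript for a youtube video id",
    "upload a video to a channel",
    "list available transcript languages for a video",
]
best = top_sections("fetch youtube transcript", docs, k=2)
```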

For the YouTube transcript tool, instead of the full documentation, we provided just this:

from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled

ytt_api = YouTubeTranscriptApi()

# Get transcript
transcript = ytt_api.fetch(video_id)

# Handle errors
try:
    transcript = ytt_api.fetch(video_id)
except TranscriptsDisabled:
    # Video has no transcript
    pass
Five sections. Minimal. Precise. Current. This single example eliminated the deprecated library problem and showed exactly the pattern needed—nothing more.

The complete retrieval process:

  1. Planner identifies knowledge gaps explicitly in the plan
  2. Orchestrator triggers Context Gatherer if gaps exist
  3. Context Gatherer queries tiers sequentially: Local KB → Context7 → Web Search
  4. Semantic similarity scoring ranks all results
  5. Top 5 sections returned to downstream agents
  6. Coder generates with accurate context instead of hallucinated APIs

This approach had a side benefit: when we reviewed failed generations, we could see exactly what context was provided. If the context was wrong or incomplete, we knew to improve our knowledge base. If the context was right but the code was still wrong, the problem was elsewhere in the pipeline.

The lesson: External context is essential, but it must be focused. Make knowledge gaps explicit so models don't hallucinate. Use multi-tiered retrieval with relevance ranking. Five highly relevant sections beat comprehensive documentation dumps.

The Cost Paradox of Parallel Generation

By this point, we had a working pipeline: plan, critique, gather context, generate code. But the success rate still wasn't high enough. Code would fail with runtime errors. Retry loops would regenerate the same mistake over and over.

"What if we generate three implementations in parallel and pick the best one?"

"That's 3× the token cost."

But we tried it anyway.

The results surprised us. Yes, parallel generation cost 3× the tokens upfront. But:

  • Latency dropped (parallel execution vs. sequential retries)
  • Success rate jumped significantly (we could pick the best of three)
  • Total retry loops decreased, meaning fewer overall tokens spent
  • We got diversity—different approaches to the same problem, not the same mistake three times

The math worked out: paying 3× upfront to generate three options cost less than paying for sequential failures and retries. Plus, we could compare implementations and often got better code quality by selecting the best approach.
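Best-of-n generation is a short `asyncio` pattern. The `generate` stub below stands in for a real Coder call, and the numeric `score` stands in for our actual validation results; only the gather-then-select shape reflects what we run.

```python
import asyncio

async def generate(variant: int, goal: str) -> dict:
    # Stand-in for one Coder invocation; a real call would hit an LLM API.
    await asyncio.sleep(0)
    return {"variant": variant,
            "code": f"# solution {variant} for {goal}",
            "score": variant}  # placeholder for validation-based scoring

async def best_of_n(goal: str, n: int = 3) -> dict:
    # Fire n generations concurrently, then keep the highest-scoring one.
    candidates = await asyncio.gather(*(generate(i, goal) for i in range(n)))
    return max(candidates, key=lambda c: c["score"])

winner = asyncio.run(best_of_n("add two numbers"))
```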

The counterintuitive insight: Parallel diversity beats sequential retry. When the first attempt has a fundamental misunderstanding, retrying just repeats the same mistake. Parallel generation explores different solution spaces simultaneously.

The Validation Layers We Needed

Code that compiles isn't code that works. We learned this the hard way.

We tracked success rates at each stage:

  • Most generated code is syntactically valid (passes parsing)
  • Most syntactically valid code has correct imports
  • Most code with correct imports actually runs without exceptions
  • But far fewer produce correct outputs on the first try

The gap between "runs" and "works correctly" was massive. Static analysis missed an entire class of problems that only appeared at runtime: unexpected API response formats, edge cases in data processing, incorrect assumptions about return values.

We added validation at every layer:

  1. Syntax validation before we even try to run code:
import ast

try:
    ast.parse(generated_code)
except SyntaxError:
    # Catch parse errors, missing colons, unmatched parentheses;
    # retry immediately with the error message fed back to the Coder
    pass

This simple check eliminated an entire class of failures in milliseconds, before we paid the cost of creating virtual environments and installing dependencies.

  2. Blueprint specifications before implementation:

Instead of going straight from plan to code, we added an intermediate step: typed function specifications.

def fetch_transcript(video_id: str, languages: list[str] | None = None) -> list[dict]:
    """
    Fetches transcript for a YouTube video.
    
    Args:
        video_id: YouTube video identifier
        languages: Preferred languages in order of preference (default: ['en'])
    
    Returns:
        List of transcript segments, each with 'text', 'start', 'duration'
    
    Raises:
        TranscriptsDisabled: Video has no available transcripts
        NoTranscriptFound: Requested language not available
    """
    pass

This blueprint served multiple purposes:

  • Reduced type errors by establishing clear contracts
  • Made debugging easier with a clear specification to compare against
  • Enabled automatic schema generation for platform registration
  • Clarified requirements before writing implementation details

The blueprint became an intermediate representation that both humans and models could reason about more clearly than either prose plans or raw code.
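Because a blueprint is just a typed signature plus a docstring, platform schemas fall out of it mechanically. A minimal sketch of that derivation, using `inspect` (the real registration pipeline also maps Python types to JSON Schema types, which is omitted here):

```python
import inspect

def fetch_transcript(video_id: str, languages: "list[str] | None" = None) -> "list[dict]":
    """Blueprint stub: fetch transcript segments for a video."""

def schema_from_blueprint(fn) -> dict:
    # Walk the signature: each parameter contributes its annotation and
    # whether it has a default (no default means required).
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {
            name: {"type": str(p.annotation),
                   "required": p.default is inspect.Parameter.empty}
            for name, p in sig.parameters.items()
        },
    }

schema = schema_from_blueprint(fetch_transcript)
```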

  3. Isolated execution with real validation:

Generated code runs in separate virtual environments with:

  • Dedicated venv instances
  • Fresh dependency installation from generated requirements.txt
  • Timeout protection
  • Captured stdout/stderr

If execution fails due to missing packages, the Executor attempts to fix requirements.txt automatically. If execution fails due to logic errors, we feed the error back to the Coder with the full execution context for regeneration.
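The execution sandbox reduces to "write the code to a scratch directory, run it in a fresh interpreter, capture everything." A minimal sketch, with the venv creation and `requirements.txt` installation steps omitted:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def execute_isolated(code: str, timeout: float = 30.0) -> dict:
    # Write the generated code to a temporary directory and run it in a
    # separate interpreter process with a timeout, capturing stdout/stderr.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "tool.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout,
            )
            return {"ok": proc.returncode == 0,
                    "stdout": proc.stdout, "stderr": proc.stderr}
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout"}

report = execute_isolated("print('hello from sandbox')")
```

A failing run returns `ok: False` with the captured stderr, which is exactly the error context fed back to the Coder for regeneration.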

The multi-stage validation caught different errors at appropriate stages:

  • Planning: logic errors, missing requirements
  • Blueprint: interface mismatches, type inconsistencies
  • Syntax validation: parse errors, invalid Python
  • Execution: runtime errors, incorrect behavior

This defense-in-depth approach pushed our success rate from 60% to over 85% for complex tasks.

The Orchestration Challenge

With all the agents in place, we faced a new problem: how should the Orchestrator decide which agents to call and in what order?

Version 1 of the orchestrator: run every agent, every time, in a fixed sequence.

This worked but was wasteful. Simple requests like "Create a function to add two numbers" went through the full workflow: planning, critique, context gathering, blueprint, three parallel generations, execution. Overkill.

We tried letting the Orchestrator figure it out implicitly. It sometimes skipped steps. Sometimes called agents in bizarre orders. Sometimes got stuck in loops.

We were on version 18 of the orchestrator prompt before it finally worked reliably.

The breakthrough came from making workflows explicit:

For simple, well-defined tasks:
  blueprint → coder → executor

For complex tasks requiring research:
  planner → critic → context → blueprint → coder → executor

If execution fails with dependency errors:
  Re-run executor (it will auto-fix requirements)

If execution fails with logic errors:
  context (gather more examples) → coder (regenerate) → executor

Instead of expecting the Orchestrator to discover optimal workflows, we encoded the patterns we observed through hundreds of test runs. The Orchestrator's job became: analyze the request, classify its complexity, follow the appropriate workflow pattern.
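The routing then reduces to a small dispatcher. The keyword heuristic below is a toy stand-in for the Orchestrator's actual complexity classification, which is an LLM judgment; only the explicit workflow tables reflect the real design.

```python
SIMPLE = ["blueprint", "coder", "executor"]
COMPLEX = ["planner", "critic", "context", "blueprint", "coder", "executor"]

def classify(request: str) -> str:
    # Toy classifier: treat anything touching external APIs as complex.
    needs_research = any(w in request.lower() for w in ("api", "fetch", "scrape"))
    return "complex" if needs_research else "simple"

def workflow_for(request: str) -> list[str]:
    # Explicit routing: classify first, then follow the encoded pattern
    # instead of letting the Orchestrator improvise an agent order.
    return COMPLEX if classify(request) == "complex" else SIMPLE
```

The key property is that the workflow patterns are data, encoded from hundreds of observed test runs, rather than behavior the Orchestrator must rediscover per request.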

Results:

  • The majority of requests took the simple workflow (blueprint → code → execute)
  • Average latency dropped significantly for simple tasks
  • Complex tasks maintained the full workflow where it was justified

The lesson: Don't expect orchestrators to figure out optimal workflows through implicit learning. Explicit guidance based on observed patterns works dramatically better.

The Model Selection Revelation

We started with a simple approach: use the same model for all agents. The most capable model we could access.

Two problems emerged:

  1. Cost exploded. Using expensive, capable models for every agent—including simple tasks like context ranking—made the system economically unviable.
  2. Some agents didn't need that much capability. The Context Gatherer primarily ranks documents by relevance. A lighter, faster model worked fine and returned results quicker.

We experimented with mixing model tiers:

  • Planner (high tier): quality here cascades through the entire pipeline
  • Critic (high tier): needs nuanced, actionable feedback
  • Context Gatherer (medium tier): primarily retrieval and filtering
  • Blueprint (high tier): specifications must be precise
  • Coder (highest tier): code quality is paramount
  • Executor (medium tier): error interpretation, not generation

The expensive reasoning model for the Planner paid for itself by preventing downstream failures. The lighter model for the Context Gatherer was fast enough that latency didn't suffer, and it cost a fraction of the price.

Using tiered model selection reduced our costs significantly while maintaining (and in some cases improving) output quality.

The insight: Match model capabilities to agent responsibilities. Don't use expensive models where simpler ones suffice, but don't cheap out on critical reasoning tasks.

The Observability We Needed But Didn't Build Initially

For the first few weeks, when something went wrong, we couldn't figure out why. The Orchestrator made decisions we didn't understand. Agents produced outputs that seemed reasonable but led to failures downstream. Debugging felt like archaeology—trying to reconstruct what happened with incomplete information.

We added comprehensive logging:

import logging

logger = logging.getLogger("genagent")

# Hooks registered with the agent framework's tool-call lifecycle
@before_tool_callback
async def log_tool_start(tool_name: str, inputs: dict):
    logger.info(f"{tool_name} started with {inputs}")

@after_tool_callback
async def log_tool_end(tool_name: str, outputs: dict, duration: float):
    logger.info(f"{tool_name} completed in {duration:.2f}s")

Every agent call. Every decision point. Every error. All logged with timestamps and context.

This observability transformed our ability to improve the system:

  • We discovered the Context Gatherer was taking too long (optimized retrieval)
  • We saw the Orchestrator making odd decisions (improved prompts)
  • We measured per-agent success rates (identified weak points)
  • We caught patterns in failures (improved error recovery)

The lesson we should have learned earlier: instrument everything from day one. You can't improve what you can't measure, and debugging complex agent interactions is nearly impossible without detailed traces.

What We're Still Working On

GenAgent works reliably now—85%+ success rate for complex tasks, 95%+ for simple ones. But challenges remain:

The long tail of failures. Some tasks consistently fail despite our best efforts. Execution failures with runtime errors that aren't auto-fixable. Syntactically valid code that doesn't match the user's actual intent. Timeout or infrastructure issues. Each failure is unique, making them hard to solve systematically.

Latency tradeoffs. More agents and validation stages mean more sequential processing. We've optimized where we can—parallel generation, lighter models for simple tasks, adaptive workflows—but complex tasks still take more than a few minutes. The tradeoff between reliability and speed is real.

Model selection complexity. Managing a multi-model setup adds operational overhead. Tracking which models work best for which agents. Monitoring costs across different model families. Handling API differences between providers. Managing fallbacks when specific models are unavailable.

We're exploring continuous learning from failures. Every failed generation now gets stored with the user goal, generated plan, code produced, and error encountered. Mining this data could help us:

  • Improve prompts by identifying common error patterns
  • Improve context selection by understanding which documentation would have prevented errors
  • Improve critic instructions by learning what it should catch

What We Learned

Building GenAgent taught us that reliable code generation isn't about having the most powerful model. It's about having the right structure for the right problem.

Separation of concerns beats monolithic generation. Different cognitive tasks—planning, critique, research, design, implementation, validation—benefit from being done separately with specialized focus.

Adversarial dynamics improve quality. The Planner-Critic debate catches logical errors early through genuine critique, not just regeneration.

Parallel diversity beats sequential retry. When the first attempt has a fundamental misunderstanding, retrying repeats the same mistake. Parallel generation explores different solution spaces.

Focused context beats comprehensive documentation. Models work better with targeted examples than with complete API references. Surgical precision, not comprehensive coverage.

Validation must happen at multiple stages. Syntax validation catches parse errors. Blueprint specifications catch interface mismatches. Execution catches runtime errors. Each layer defends against different failure modes.

Explicit orchestration guidance beats implicit discovery. Encode observed workflow patterns directly. Don't expect orchestrators to figure out optimal paths on their own.

Different agents need different models. Match model capabilities to agent responsibilities. Use expensive models where quality matters, lighter models where speed and cost matter more.

Observability enables improvement. You can't debug what you can't see. Instrument everything from day one.

The multi-agent approach works because it mirrors how humans actually write code. We don't sit down and immediately type perfect implementation code. We plan. We research. We design interfaces. We implement. We test. We iterate based on what breaks.

GenAgent demonstrates that when you design for a specific ecosystem—in our case, generating tools for a multi-agent platform—the architecture naturally optimizes for the right properties: reliability, composability, integration.

The key insight: reliability comes from structure and specialization, not just model capability.

For teams building similar systems, the most important lesson might be this: be clear about what you're building and why, then let those requirements guide your architecture. A generic code generator faces different constraints than a tool generator for an agentic platform. Understanding your specific goal—what the code needs to be and how it will be used—should drive every architectural decision.

GenAgent isn't a generic code generator. It's a tool generator for a specific ecosystem. That focus made all the difference.
