The Complete Guide to Building Agentic AI Systems: From Basics to Production
Why This Guide Matters
If you’ve shipped LLM-based features, you’ve probably hit the wall where simple prompt engineering stops working. Your PM wants the AI to “do more,” your team is drowning in edge cases, and you’re not sure if you need better prompts, more tools, or an entirely different architecture.
I’ve been there. After building several production agentic systems, here’s what I wish someone had told me on day one: The difference between teams that ship working AI agents and those that don’t isn’t the model—it’s the discipline around evaluation and error analysis.
This guide distills lessons from Andrew Ng’s Agentic AI course and my own production experience into actionable patterns for senior engineers and technical leaders.
What You’ll Learn
This isn’t another “ChatGPT is amazing” article. We’re diving deep into:
- Why agentic workflows outperform even the best one-shot prompts (spoiler: GPT-3.5 with agents beats GPT-4 without them)
- Four core design patterns that power every production agent system
- The evaluation framework that separates high-performing teams from everyone else
- Real cost/latency trade-offs and when to optimize what
- Multi-agent architectures and when they’re worth the complexity
The Agentic AI Paradigm Shift
The Old Way: One-Shot Generation
Remember when we thought the future was just bigger context windows and smarter prompts?
# The old approach
prompt = "Write a comprehensive research report on quantum computing"
response = llm.generate(prompt)
# Hope it's good 🤞
This is like asking someone to write an essay without backspace—from first word to last, no revision, no research breaks.
The New Way: Iterative Agentic Workflows
# The agentic approach
outline = llm.generate("Create an outline for quantum computing report")
research_data = web_search(extract_topics(outline))
first_draft = llm.generate(f"Write report with this outline and data...")
critique = llm.reflect("Review this draft for gaps and weaknesses...")
final_report = llm.revise(first_draft, critique, research_data)
This mirrors how humans actually work—iteratively, with research, reflection, and revision.
The Performance Gap
Here’s data that should make you reconsider your architecture:
HumanEval Coding Benchmark:
- GPT-3.5 (direct): 48%
- GPT-4 (direct): 67%
- GPT-3.5 + agentic workflow: 95% 🎯
That’s right—last-gen models with good workflows beat current-gen models without them. The implication? Architecture matters more than model selection for many real-world tasks.
The Autonomy Spectrum
Not all agents need the same level of autonomy:
graph LR
A[Deterministic<br/>Fixed Pipeline] --> B[Semi-Autonomous<br/>Tool Selection] --> C[Highly Autonomous<br/>Dynamic Planning]
A1[Invoice Processing] -.-> A
B1[Customer Support] -.-> B
C1[Research Agent] -.-> C
style A fill:#e8f5e1
style B fill:#fff4e1
style C fill:#ffe1e1
Start simple. Add autonomy only when deterministic workflows fail.
Most production systems I’ve seen succeed with semi-autonomous agents—they choose from predefined tools but follow predictable patterns. The full autonomy of “do whatever it takes” agents is powerful but hard to control and debug.
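To make that middle ground concrete, here is a minimal sketch of a semi-autonomous loop: the model picks the next tool from a fixed whitelist, but the loop structure, step budget, and escalation path stay in your code. The llm.choose_tool helper and the tool functions are hypothetical placeholders, not a specific vendor API.

# Sketch of a semi-autonomous agent: the LLM chooses the next tool,
# but the loop, whitelist, and stop conditions stay deterministic.
# llm.choose_tool and the tool functions are hypothetical placeholders.
ALLOWED_TOOLS = {
    "lookup_order": lookup_order,       # assumed to exist in your codebase
    "search_kb": search_kb,
    "escalate": escalate_to_human,
}

def run_semi_autonomous(query: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        decision = llm.choose_tool(context, tools=list(ALLOWED_TOOLS))
        if decision.name == "finish":            # model signals it has an answer
            return decision.answer
        tool = ALLOWED_TOOLS[decision.name]      # only whitelisted tools ever run
        result = tool(**decision.arguments)
        context.append({"role": "tool", "name": decision.name, "content": str(result)})
    return escalate_to_human(query)              # step budget exhausted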
Pattern 1: Reflection—The Easy Win
Why It Works
Reflection is the simplest pattern that consistently improves output quality. The insight: LLMs are better critics than creators.
Here’s a real example from a production code generator:
Without Reflection:
def calculate_interest(principal, rate, time):
    return principal * rate * time  # Wrong formula!
With Reflection:
# Generation
code_v1 = llm.generate("Write compound interest calculator")
# Reflection
critique = llm.reflect("""
Review this code for correctness:
{code_v1}
Check: formulas, edge cases, variable names
""")
# Output: "Uses simple interest formula, should be compound"
# Revision
code_v2 = llm.revise(code_v1, critique)
# Correct implementation
The Secret Sauce: External Feedback
But here’s where it gets interesting. Reflection with external feedback is dramatically more effective:
sequenceDiagram
participant LLM
participant Executor
LLM->>LLM: Generate Code v1
LLM->>Executor: Run Code
Executor->>Executor: SyntaxError: line 5
Executor->>LLM: Error: "SyntaxError..."
LLM->>LLM: Reflect and Fix Error
LLM->>Executor: Run Code v2
Executor->>LLM: Success!
The principle: New information from outside the LLM’s context (execution results, API responses, validation checks) makes reflection exponentially more powerful.
Production Implementation
def generate_with_reflection(task: str, max_iterations: int = 3):
    """
    Generate with self-correction loop
    """
    output = llm.generate(task)
    for i in range(max_iterations):
        # External feedback (execution, validation, etc.)
        feedback = validate_output(output)
        if feedback.is_valid:
            return output
        # Reflect with concrete feedback
        critique = llm.reflect(f"""
        Task: {task}
        Output: {output}
        Issues found: {feedback.errors}
        Provide specific improvements.
        """)
        output = llm.revise(output, critique)
    return output
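The validate_output call above does the heavy lifting. Here is a minimal sketch of what it might look like when the output is generated Python code, using only the standard library; the Feedback shape is an assumption chosen to match the loop above, not part of any framework.

import ast
import subprocess
from dataclasses import dataclass, field

@dataclass
class Feedback:
    is_valid: bool
    errors: list = field(default_factory=list)

def validate_output(code: str) -> Feedback:
    """Hypothetical validator for generated Python: parse it, then run it in a subprocess."""
    # 1. Cheap static check: does it even parse?
    try:
        ast.parse(code)
    except SyntaxError as e:
        return Feedback(False, [f"SyntaxError: {e}"])
    # 2. Execute with a timeout; stderr becomes the feedback for the reflection step.
    try:
        proc = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return Feedback(False, ["Execution timed out after 10 seconds"])
    if proc.returncode != 0:
        return Feedback(False, [proc.stderr.strip()])
    return Feedback(True)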
When Reflection Helps
High-value use cases:
- ✅ Code generation (with execution feedback)
- ✅ Structured data (HTML, JSON) with validation
- ✅ Multi-step instructions
- ✅ Domain-specific content with expert critique
Lower value:
- ❌ Simple factual questions
- ❌ Tasks with single correct answers
- ❌ Latency-critical applications
A Word of Caution
Research shows reflection improves performance by 10-30% on average, but it’s not magic. I’ve seen teams waste weeks trying to fix fundamental prompt issues with reflection. Fix your base prompts first, then add reflection.
Pattern 2: Tool Use—Extending LLM Capabilities
The Mental Model
Think of tool use as giving your LLM an API to the outside world. Without tools, it’s a brain in a jar. With tools, it can:
- Access real-time data (web search, databases)
- Perform exact computations (code execution)
- Take actions (send emails, create calendar events)
- Integrate with your systems (CRM, ticketing, etc.)
How It Actually Works
Here’s what happens under the hood:
# 1. You define a function
def get_weather(city: str, units: str = "celsius") -> dict:
"""
Get current weather for a city.
Args:
city: City name
units: Temperature units (celsius/fahrenheit)
"""
return weather_api.fetch(city, units)
# 2. LLM sees this description
{
"name": "get_weather",
"description": "Get current weather for a city.",
"parameters": {
"city": {"type": "string"},
"units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
}
}
# 3. When user asks "What's the weather in Tokyo?"
# LLM outputs:
{
"tool_call": {
"function": "get_weather",
"arguments": {"city": "Tokyo", "units": "celsius"}
}
}
# 4. YOU execute the function (LLM doesn't do it)
result = get_weather("Tokyo", "celsius")
# 5. Feed result back to LLM
# LLM generates: "It's currently 18°C in Tokyo..."
Critical insight: The LLM never directly executes functions. It requests execution by generating structured output. Your code parses this and runs the function.
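Concretely, your application code sits in the middle of that loop. Below is a minimal sketch of the dispatch step, assuming an OpenAI-style chat completions response with tool_calls; adjust the field names to whatever provider you actually use.

import json
from openai import OpenAI

client = OpenAI()
AVAILABLE = {"get_weather": get_weather}   # whitelist of callables you defined yourself

def answer_with_tools(messages: list, tools: list, model: str = "gpt-4o") -> str:
    """One round of: model requests a tool -> your code runs it -> model sees the result."""
    response = client.chat.completions.create(model=model, messages=messages, tools=tools)
    message = response.choices[0].message
    if not message.tool_calls:                 # model answered directly, no tool needed
        return message.content
    messages.append(message)                   # keep the tool request in the history
    for call in message.tool_calls:
        fn = AVAILABLE[call.function.name]     # your code executes the function, never the LLM
        args = json.loads(call.function.arguments)
        result = fn(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
    final = client.chat.completions.create(model=model, messages=messages, tools=tools)
    return final.choices[0].message.content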
The Code Execution Game-Changer
Code execution is special—it’s the Swiss Army knife of tools:
# Instead of creating tools for:
# - calculate_compound_interest()
# - calculate_roi()
# - calculate_depreciation()
# - calculate_... (hundreds more)
# Just give it ONE tool:
def execute_code(code: str) -> str:
    """Execute Python code and return output"""
    return run_in_sandbox(code)
# Now it can do anything:
llm: "What's the compound interest on $1000 at 5% for 10 years?"
# Generates:
principal = 1000
rate = 0.05
time = 10
result = principal * (1 + rate) ** time
print(f"${result:.2f}")
# Output: $1628.89
Why this works: LLMs have seen massive amounts of code during training. They’re better at writing code than describing every possible calculation in natural language.
Security: The Elephant in the Room
Real story from our team:
# LLM generated this once:
import os
os.system("rm -rf *.py") # 😱
Always sandbox code execution:
# ❌ Dangerous
exec(llm_generated_code)
# ✅ Safer: run inside a throwaway container
import docker

def execute_safely(code: str) -> str:
    client = docker.from_env()
    output = client.containers.run(
        "python:3.11-slim",
        command=["python", "-c", code],  # list form avoids shell-quoting issues
        mem_limit="256m",
        network_disabled=True,
        remove=True,
    )
    # Note: containers.run has no timeout argument; enforce one by running
    # detached and calling container.wait(timeout=...), or at the OS level.
    return output.decode()
Production checklist:
- ✅ Run in isolated environment (Docker, E2B, etc.)
- ✅ Limit execution time
- ✅ Restrict file system access
- ✅ Disable network access (unless needed)
- ✅ Monitor resource usage
Model Context Protocol (MCP): The Ecosystem Play
Here’s the problem MCP solves:
Before MCP: Every team building an agent has to write their own GitHub integration, their own Google Drive connector, their own Slack wrapper…
M teams × N integrations = M×N total work 😫
With MCP: Someone builds a GitHub MCP server once. Everyone uses it.
M teams + N servers = M+N total work 🎉
# Using GitHub MCP server
async with mcp.connect("github") as github:
readme = await github.get_file("org/repo", "README.md")
prs = await github.list_pull_requests("org/repo")
# Your LLM can now naturally interact with GitHub
Strategic insight for tech leads: Invest in MCP servers for your internal systems. Write once, use across all your agent projects. The ROI compounds quickly.
The Evaluation Discipline: What Separates Winners from Losers
This is the section that will make or break your agentic AI project. I’ve reviewed dozens of struggling AI teams, and 90% of the time, the issue isn’t their prompts or their architecture—it’s that they have no systematic way to measure and improve.
The Hard Truth
Without evals:
- You’re flying blind
- Every change is a gamble
- You’ll spend weeks on things that don’t matter
- You can’t justify resource allocation to leadership
With evals:
- Data-driven decisions
- Clear ROI on improvements
- Efficient use of engineering time
- Objective progress tracking for stakeholders
The Build-Analyze Cycle
graph TB
Start[Build Quick<br/>End-to-End System] --> Examine[Examine<br/>Outputs & Traces]
Examine --> Identify[Identify<br/>Problems]
Identify --> Eval[Build<br/>Evaluations]
Eval --> Improve[Make Targeted<br/>Improvements]
Improve --> Measure[Measure<br/>Impact]
Measure --> Decision{Good Enough?}
Decision -->|No| Identify
Decision -->|Yes| Optimize[Optimize<br/>Cost/Latency]
style Start fill:#e8f5e1
style Eval fill:#ffe1e1
style Improve fill:#e1f5ff
What I see in struggling teams: 80% building, 20% analyzing.
What I see in high-performing teams: 50% building, 50% analyzing.
Start Small, Scale Smart
Week 1: Quick & Dirty
# Just collect 10-15 examples
test_cases = [
    {"input": "...", "expected": "..."},
    # ... 9 more
]

# Simple pass/fail
correct = sum(1 for t in test_cases if system(t["input"]) == t["expected"])
print(f"Accuracy: {correct}/{len(test_cases)}")
This is enough to get started. Don’t let perfect be the enemy of good.
Week 4: More Sophisticated
class EvaluationSuite:
    def __init__(self):
        self.test_cases = load_test_cases()  # Now 50+ examples

    def evaluate_end_to_end(self):
        """Overall system performance"""
        return {
            "accuracy": self._calculate_accuracy(),
            "latency_p95": self._measure_latency(),
            "cost_per_request": self._calculate_cost()
        }

    def evaluate_component(self, component_name):
        """Individual component performance"""
        return self._run_component_eval(component_name)
The Evaluation Matrix
Understanding when to use which eval type is crucial:
| Type | Use Code-Based Evals | Use LLM-as-Judge |
|---|---|---|
| Per-Example Ground Truth | ✅ Invoice date extraction<br>✅ Database query correctness<br>✅ Structured data validation | ✅ Essay contains required points<br>✅ Response addresses customer issue<br>✅ Code implements specified features |
| Fixed Criteria | ✅ Output length checks<br>✅ Format validation<br>✅ Regex pattern matching | ✅ Chart quality (rubric)<br>✅ Email professionalism<br>✅ Explanation clarity |
LLM-as-Judge: The Right Way
❌ Don’t do this (position bias is real):
prompt = """
Which output is better?
Output A: {output_a}
Output B: {output_b}
Answer: A or B
"""
# LLMs consistently favor the first option ~60% of the time
✅ Do this instead (rubric-based):
rubric = """
Evaluate this chart on these criteria (0 or 1 point each):
1. Has clear, descriptive title
2. Axis labels are present and readable
3. Uses appropriate chart type for data
4. Data points are clearly visible
5. Legend present (if needed)
6. Professional appearance
7. No visual clutter
Return JSON: {"scores": [1,1,0,1,1,1,0], "total": 5, "explanation": "..."}
"""
score = llm_judge(output, rubric)
Why this works:
- Binary criteria are more reliable than scales
- Avoids position bias
- Provides granular feedback
- Easier to debug and improve
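A minimal sketch of the llm_judge wrapper used above, assuming the judge model returns the JSON shape requested in the rubric; parsing defensively and recomputing the total yourself are where most of the real-world robustness work goes.

import json

def llm_judge(output: str, rubric: str, model: str = "gpt-4o-mini") -> dict:
    """Score one output against a binary-criteria rubric and return the parsed scores."""
    raw = llm.generate(
        model=model,
        prompt=f"{rubric}\n\nOutput to evaluate:\n{output}",
    )
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return {"scores": [], "total": 0, "explanation": "Judge returned invalid JSON"}
    # Recompute the total ourselves rather than trusting the judge's arithmetic.
    result["total"] = sum(1 for s in result.get("scores", []) if s == 1)
    return result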
Error Analysis: Your Secret Weapon
This is where senior engineers separate themselves. Anyone can build evals. Great engineers use them to drive strategic decisions.
The Process:
# 1. Focus on failures only
failures = [test for test in test_suite if test.failed()]

# 2. Examine traces for each failure
for failure in failures:
    traces = failure.get_intermediate_outputs()
    # Look at output of each component

# 3. Build error taxonomy spreadsheet
Example Error Analysis:
| Test Case | Search Terms | Search Results | Source Selection | Final Output | Root Cause |
|---|---|---|---|---|---|
| Quantum computing | ✅ Good | ❌ Too broad | ⚠️ Limited input | ❌ Shallow | Search results |
| Black holes | ✅ Good | ❌ Blog posts | ❌ Low quality | ❌ Missing key facts | Search results |
| AI ethics | ⚠️ Too generic | ❌ Opinion pieces | ❌ Wrong sources | ❌ One-sided | Search terms + results |
| Climate change | ✅ Good | ✅ Good | ✅ Good | ⚠️ Missing nuance | Final synthesis |
Component Error Rates:
- Search terms: 15%
- Search results: 65% ← Focus here first
- Source selection: 20%
- Final synthesis: 10%
This data tells you exactly where to invest your time.
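Getting to those component error rates is mostly counting. A small sketch that tallies the root-cause column of the error-analysis spreadsheet; the CSV filename and column name mirror the table above and are otherwise assumptions.

import csv
from collections import Counter

def component_error_rates(path: str = "error_analysis.csv") -> dict:
    """Tally root causes from the error-analysis spreadsheet (one row per failed test case)."""
    counts = Counter()
    total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            # A failure can have several root causes, e.g. "Search terms + results".
            for cause in row["Root Cause"].split("+"):
                counts[cause.strip().lower()] += 1
    return {cause: count / total for cause, count in counts.most_common()}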
Component-Level Evals
Once error analysis identifies a problematic component, component-level evals let you iterate faster:
# Instead of running full workflow every time...
# (which might take 60s and cost $0.50)
# Just evaluate the component you're improving
# (takes 5s, costs $0.02)
def evaluate_search_component():
    """Faster iteration on search quality"""
    gold_standard = {
        "quantum computing": ["arxiv.org/...", "nature.com/...", "quantum.gov/..."],
        "black holes": ["nasa.gov/...", "eso.org/...", "arxiv.org/..."],
        # ...
    }
    for query, expected_sources in gold_standard.items():
        actual_results = search_engine(query)
        f1_score = calculate_overlap(actual_results, expected_sources)
        # Quick feedback on search quality
Strategic value:
- Iterate 10x faster
- Test parameter changes quickly
- Don’t waste money on end-to-end runs
- Validate with full eval before deploying
Pattern 3: Planning—When Agents Decide the Strategy
The Paradigm Shift
Up until now, we’ve been hard-coding the sequence of operations:
# Developer decides the steps
step1 = search_web(query)
step2 = select_sources(step1)
step3 = write_essay(step2)
With planning, the LLM decides the sequence:
# LLM generates a plan, then executes it
plan = llm.plan(query, available_tools)
# Plan might be: [search, validate, search_more, synthesize]
for step in plan:
result = execute_step(step)
When You Need Planning
Indicators you’ve outgrown deterministic workflows:
- Queries require different approaches (can’t hard-code one path)
- Number of “if-else” conditions exploding
- User requests are too varied to predict
- You keep adding special cases
Example: Customer Service Agent
# Deterministic approach (breaks easily):
if "order" in query:
check_order_status()
elif "return" in query:
process_return()
elif "product" in query:
search_catalog()
# ... 50 more elif statements
# Planning approach:
plan = llm.create_plan(query, tools=[
check_order_status,
process_return,
search_catalog,
check_inventory,
contact_support_team
])
# LLM decides which tools to use and in what order
Planning with JSON
The System Prompt:
system_prompt = """
You have access to these tools:
- get_item_descriptions(category): Get product descriptions
- check_inventory(item_ids): Check stock levels
- get_item_price(item_ids): Get current prices
- process_return(order_id, item_id): Handle returns
Create a step-by-step plan in JSON to answer user queries.
Format:
{
"plan": [
{
"step": 1,
"description": "What this step does",
"tool": "tool_name",
"arguments": {...},
"depends_on": [] // Previous step numbers needed
}
]
}
"""
LLM Generates:
{
"plan": [
{
"step": 1,
"description": "Find all round sunglasses",
"tool": "get_item_descriptions",
"arguments": {"category": "sunglasses", "filter": "round"},
"depends_on": []
},
{
"step": 2,
"description": "Check which round sunglasses are in stock",
"tool": "check_inventory",
"arguments": {"item_ids": "from_step_1"},
"depends_on": [1]
},
{
"step": 3,
"description": "Get prices for in-stock items",
"tool": "get_item_price",
"arguments": {"item_ids": "from_step_2"},
"depends_on": [2]
}
]
}
Execution Engine:
def execute_plan(plan, context):
    results = {}
    for step in plan["plan"]:
        # Wait for dependencies
        for dep in step["depends_on"]:
            if dep not in results:
                raise Exception(f"Step {step['step']} depends on incomplete step {dep}")
        # Resolve arguments from previous steps
        args = resolve_arguments(step["arguments"], results)
        # Execute
        tool = get_tool(step["tool"])
        results[step["step"]] = tool(**args)
    return results
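The resolve_arguments helper above is where plan steps get wired together. A minimal sketch, assuming the convention from the example plan that a value like "from_step_1" means "use step 1's output"; production systems usually adopt a stricter placeholder syntax.

def resolve_arguments(arguments: dict, results: dict) -> dict:
    """Replace 'from_step_N' placeholders with the outputs of already-executed steps."""
    resolved = {}
    for key, value in arguments.items():
        if isinstance(value, str) and value.startswith("from_step_"):
            step_number = int(value.removeprefix("from_step_"))
            resolved[key] = results[step_number]   # a KeyError here means the plan is malformed
        else:
            resolved[key] = value
    return resolved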
Planning with Code: The Power Move
Here’s where planning gets really interesting. Instead of JSON, let the LLM write code as the plan:
The Problem:
# You'd need dozens of specific tools:
get_column_max()
get_column_mean()
filter_rows()
get_unique_values()
sort_by_column()
group_by()
# ... hundreds more
The Solution:
system_prompt = """
You have access to Python with pandas library.
Write code to answer the user's query.
Return format:
<execute_python>
# Your code here
</execute_python>
"""
# User: "What were the amounts of the last 5 transactions?"
# LLM generates:
import pandas as pd
# Step 1: Load data
df = pd.read_csv('transactions.csv')
# Step 2: Ensure proper date parsing
df['date'] = pd.to_datetime(df['date'])
# Step 3: Sort by date (most recent first)
df_sorted = df.sort_values('date', ascending=False)
# Step 4: Get last 5 transactions
last_5 = df_sorted.head(5)
# Step 5: Extract amounts
amounts = last_5['amount'].tolist()
print(amounts)
Why This Works:
- Massive function library: Python+pandas has 1000s of functions LLMs know
- Extensive training data: LLMs have seen millions of pandas examples
- One tool replaces hundreds: execute_code() vs. hundreds of specific functions
- Natural expressiveness: Code naturally expresses complex multi-step logic
Production Considerations:
def safe_code_execution(code: str, timeout: int = 30):
    """
    Execute LLM-generated code safely
    """
    # 1. Sandbox environment
    container = create_sandbox()
    # 2. Resource limits
    container.set_memory_limit("512MB")
    container.set_cpu_limit(1.0)
    # 3. Network isolation
    container.disable_network()
    # 4. Timeout
    try:
        result = container.run(code, timeout=timeout)
    except TimeoutError:
        return {"error": "Execution timeout"}
    return result
Real-World Example: Agentic Coding
The most successful application of planning today is in coding assistants:
# User: "Build a REST API for a todo app with auth"
# LLM generates plan:
"""
☐ Step 1: Design database schema
☐ Step 2: Implement user model and authentication
☐ Step 3: Create todo model with user association
☐ Step 4: Build API endpoints (CRUD for todos)
☐ Step 5: Add input validation and error handling
☐ Step 6: Write unit tests
☐ Step 7: Write integration tests
☐ Step 8: Create API documentation
"""
# Then executes each step systematically
# If tests fail → revise → retest
# Until working application delivered
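That "if tests fail → revise → retest" loop is just reflection with the test runner as the external critic. A minimal sketch under that assumption; llm.generate and llm.revise are the same pseudo-API used throughout this guide, and write_to_workspace is a hypothetical helper.

import subprocess

def implement_step(step_description: str, max_attempts: int = 3) -> str:
    """Generate code for one plan step and iterate until the test suite passes."""
    code = llm.generate(f"Implement: {step_description}")
    for _ in range(max_attempts):
        write_to_workspace(code)                              # hypothetical helper
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:                            # external feedback: tests pass
            return code
        code = llm.revise(code, critique=result.stdout + result.stderr)
    raise RuntimeError(f"Step still failing after {max_attempts} attempts: {step_description}")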
Why it works well for coding:
- Clear success criteria (tests pass/fail)
- Immediate feedback (code runs or doesn’t)
- Well-structured domain (software has clear components)
- LLMs trained extensively on code
When Planning Struggles
Challenges:
- Harder to control and predict
- More complex debugging
- Can get stuck in loops
- Higher token costs
Mitigation strategies:
def planning_with_guardrails(query, max_steps=10):
    """
    Planning with safety limits
    """
    plan = llm.generate_plan(query)
    # Validate plan before execution
    if len(plan) > max_steps:
        return {"error": "Plan too complex, please simplify query"}
    # Check for dangerous patterns
    if contains_risky_operations(plan):
        return {"error": "Plan contains operations requiring human approval"}
    # Execute with monitoring
    for step in plan:
        result = execute_with_monitoring(step)
        if result.error_rate > 0.5:
            return {"error": "Plan execution failing, aborting"}
    return result
Strategic advice: Start with deterministic workflows. Add planning only when the complexity of conditional logic becomes unmaintainable.
Pattern 4: Multi-Agent Systems—The Team Approach
Why Multiple Agents?
Just as you wouldn’t hire one person to do everything in a complex project, sometimes breaking AI work across specialized agents makes sense.
The mental model shift:
- From: “What’s the one AI that can do this?”
- To: “What team of specialized AIs should I assemble?”
When Multi-Agent Makes Sense
Good candidates:
- ✅ Tasks naturally split into specialized roles
- ✅ Different steps need different expertise
- ✅ Opportunities for parallel processing
- ✅ Clear hand-off points between phases
Example: Marketing Campaign
Instead of one mega-agent:
# One agent doing everything (gets confused)
marketing_agent = LLM("""
You're responsible for research, design, and copywriting...
""")
# Tries to do too much, does nothing well
Use specialized agents:
researcher = LLM("Expert at market analysis and competitor research")
designer = LLM("Skilled at data visualization and graphic design")
writer = LLM("Compelling marketing copywriter")
# Each does what it's best at
Communication Patterns
The architecture of how agents communicate is critical. Here are the four main patterns:
1. Linear (Sequential Pipeline)
graph LR
R[Researcher] --> D[Designer] --> W[Writer]
# Clean handoffs, predictable flow
research = researcher.analyze("sunglasses market trends")
graphics = designer.create(research)
brochure = writer.compose(research, graphics)
Best for: Assembly-line workflows with clear stages
Pros: Simple, debuggable, predictable
Cons: No parallelization, sequential bottleneck
2. Hierarchical (Manager-Worker)
graph TB
M[Manager Agent] --> R[Researcher]
M --> D[Designer]
M --> W[Writer]
manager_prompt = """
You're a marketing manager with this team:
- Researcher: Market analysis expert
- Designer: Visual content creator
- Writer: Marketing copywriter
Create a plan and delegate tasks.
"""
manager = LLM(manager_prompt, tools=[
call_researcher,
call_designer,
call_writer
])
# Manager decides who does what and when
plan = manager.create_plan("Summer sunglasses campaign")
# Might decide: researcher → designer + writer in parallel → final review
Best for: Tasks requiring coordination and dynamic work allocation
Pros: Flexible, can parallelize, adapts to results
Cons: Manager becomes bottleneck, more complex
3. Deep Hierarchy (Organizational)
Marketing Director
├── Research Team Lead
│ ├── Web Researcher
│ └── Competitor Analyst
├── Creative Team Lead
│ ├── Graphic Designer
│ └── Art Director
└── Content Team Lead
├── Copywriter
└── Editor
Best for: Very large, complex projects
Pros: Scales to many agents, mirrors org structure
Cons: High complexity, slower, expensive
Real talk: I’ve rarely seen this work well in production. The coordination overhead usually outweighs benefits. Consider carefully before going this deep.
4. All-to-All (Collaborative)
graph TB
R[Researcher] <--> D[Designer]
D <--> W[Writer]
W <--> E[Editor]
E <--> R
R <--> W
D <--> E
# Everyone can talk to everyone
agents = [researcher, designer, writer, editor]
for agent in agents:
agent.team_members = [a for a in agents if a != agent]
# Free-form collaboration
conversation = []
while not consensus_reached():
for agent in agents:
message = agent.respond_to(conversation)
conversation.append(message)
final_output = synthesize(conversation)
Best for: Creative brainstorming, exploratory problems
Pros: Maximum flexibility, emergent solutions
Cons: Unpredictable, expensive, may not converge
My experience: This pattern is exciting in demos, frustrating in production. Use for non-critical applications where you can tolerate variability.
Implementing Individual Agents
Each agent is just an LLM with a specialized prompt and toolset:
class Agent:
    def __init__(self, role: str, tools: list, prompt_template: str):
        self.role = role
        self.tools = tools
        self.prompt = prompt_template

    def execute(self, task: str, context: dict = None):
        """
        Execute task with role-specific behavior
        """
        full_prompt = self.prompt.format(
            task=task,
            context=context or {}
        )
        return llm.generate(
            full_prompt,
            tools=self.tools
        )
# Create specialized agents
researcher = Agent(
role="Market Researcher",
tools=[web_search, fetch_article],
prompt_template="""
You're an expert market researcher. Analyze:
- Market trends and patterns
- Competitor positioning
- Customer preferences
Task: {task}
Context: {context}
Provide data-driven insights with citations.
"""
)
designer = Agent(
role="Graphic Designer",
tools=[generate_image, create_chart, code_execution],
prompt_template="""
You're a skilled graphic designer. Create:
- Data visualizations
- Marketing graphics
- Brand-consistent artwork
Task: {task}
Context: {context}
Focus on visual impact and clarity.
"""
)
Multi-Agent Frameworks
Don’t build this from scratch. Leverage existing frameworks:
Popular options:
- LangGraph: Graph-based workflows (my current go-to)
- AutoGen: Microsoft’s conversational agents
- CrewAI: Role-based teams
- ChatDev: Simulated software company
Example with LangGraph:
from langgraph.graph import Graph
# Define workflow
workflow = Graph()
# Add agents as nodes
workflow.add_node("researcher", researcher.execute)
workflow.add_node("designer", designer.execute)
workflow.add_node("writer", writer.execute)
# Define edges (communication flow)
workflow.add_edge("researcher", "designer")
workflow.add_edge("researcher", "writer")
workflow.add_edge("designer", "writer")
# Compile and run
app = workflow.compile()
result = app.invoke({"query": "Create sunglasses campaign"})
When NOT to Use Multi-Agent
Red flags:
- Simple, linear tasks (just use a pipeline)
- Real-time requirements (coordination adds latency)
- Tight budget (more agents = more tokens)
- First version of anything (start simpler)
My rule of thumb: If you can’t clearly articulate why each agent exists and what unique value it provides, you don’t need multi-agent. Start with a single agent or simple pipeline.
Optimization: Making It Production-Ready
You’ve built something that works. Now let’s make it ready for real users.
The Priority Stack
In order of importance:
1. Output Quality (80% of your initial effort)
   - Nothing else matters if the output is wrong
   - Get to “good enough” before optimizing anything else
2. Latency (15% of initial effort)
   - Impacts user experience directly
   - But users tolerate slow if results are good
3. Cost (5% of initial effort)
   - Only becomes urgent at scale
   - Premature optimization wastes time
The tipping point: Optimize cost when your monthly bill makes you uncomfortable. For most teams, that’s around $5-10K/month.
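Knowing when you've hit that tipping point requires per-request cost numbers. A small sketch that derives cost from token usage; the prices below are illustrative placeholders, so check your provider's current rate card.

# Illustrative per-million-token prices; replace with your provider's current rate card.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call, computed from token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Example: a 3,000-token prompt with a 1,500-token completion on gpt-4o
print(f"${request_cost('gpt-4o', 3000, 1500):.4f}")   # ≈ $0.0225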
Latency Optimization
Step 1: Measure Everything
import time
from dataclasses import dataclass
@dataclass
class PerformanceMetrics:
    component: str
    duration_ms: float
    cost_dollars: float

def benchmark_workflow(query):
    metrics = []

    # Time each component
    start = time.time()
    terms = generate_search_terms(query)
    metrics.append(PerformanceMetrics(
        "generate_terms",
        (time.time() - start) * 1000,
        calculate_cost(terms)
    ))

    start = time.time()
    results = web_search(terms)
    metrics.append(PerformanceMetrics(
        "web_search",
        (time.time() - start) * 1000,
        0.016  # API cost
    ))

    # ... rest of workflow
    return metrics
# Example output:
# generate_terms: 7000ms, $0.04
# web_search: 5000ms, $0.016
# fetch_pages: 11000ms, $0.040
# write_essay: 18000ms, $2.80
# TOTAL: 41000ms, $2.896
Step 2: Attack the Biggest Bottlenecks
# Before: 18s for essay generation with GPT-4
essay = await llm.generate(
model="gpt-4o",
prompt=essay_prompt
)
# Strategy 1: Faster model (if quality acceptable)
essay = await llm.generate(
model="gpt-4o-mini", # 3x faster, 70% cheaper
prompt=essay_prompt
)
# Now: 6s
# Strategy 2: Streaming (feels faster)
async for chunk in llm.stream(model="gpt-4o", prompt=essay_prompt):
yield chunk # User sees progress immediately
# Strategy 3: Specialized provider (same model, faster hardware)
essay = await llm.generate(
model="groq:llama3-70b", # Specialized hardware
prompt=essay_prompt
)
# Now: 3s
Step 3: Parallelize Everything Possible
# Before: Sequential fetching (15s total)
page1 = fetch(url1) # 5s
page2 = fetch(url2) # 5s
page3 = fetch(url3) # 5s
# After: Parallel fetching (5s total)
import asyncio
pages = await asyncio.gather(
fetch(url1),
fetch(url2),
fetch(url3)
)
# Now: 5s (max of all requests)
# Savings: 10s per request
# At 1000 requests/day: 2.8 hours saved daily
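One caveat on "parallelize everything": most APIs will rate-limit you. A small sketch that keeps the latency win while capping concurrency with a semaphore; the limit of 5 is an arbitrary placeholder.

import asyncio

async def fetch_all(urls: list, max_concurrent: int = 5) -> list:
    """Fetch pages in parallel, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)   # `fetch` is the async helper from the example above

    return await asyncio.gather(*(bounded_fetch(url) for url in urls))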
Cost Optimization
Step 1: Profile Your Spend
def analyze_costs():
    """
    Break down costs by component
    """
    costs = {
        'generate_search_terms': {
            'model': 'gpt-4o',
            'avg_tokens': 200,
            'cost_per_request': 0.04,
            'requests_per_day': 1000,
            'daily_cost': 40.00
        },
        'web_search': {
            'cost_per_request': 0.016,
            'requests_per_day': 1000,
            'daily_cost': 16.00
        },
        'write_essay': {
            'model': 'gpt-4o',
            'avg_tokens': 3000,
            'cost_per_request': 2.80,
            'requests_per_day': 1000,
            'daily_cost': 2800.00  # ← 98% of total cost!
        }
    }
    total_daily = sum(c['daily_cost'] for c in costs.values())
    # $2,856/day = $85,680/month 😱
    return costs
Step 2: Optimize High-Impact Components
# Before: GPT-4 for everything
# write_essay: $2.80 × 1,000 requests = $2,800/day

# After: strategic model selection
costs = {
    # Non-critical steps: use cheap models
    'search_terms': ('claude-haiku', 0.01),     # Was $40/day, now $10
    'source_selection': ('gpt-3.5', 0.02),      # Was $40/day, now $20

    # Critical quality: keep a premium model...
    # 'essay_writing': ('gpt-4o', 2.80),        # Still $2,800/day
    # ...OR switch to a cheaper premium model:
    'essay_writing': ('claude-sonnet', 0.80),   # Now $800/day (71% savings!)
}

# New total: $830/day = $24,900/month
# Savings: $60,780/month 💰
Step 3: Caching and Deduplication
import hashlib

class SmartCache:
    def __init__(self, redis_client):
        self.cache = redis_client
        self.ttl = 3600  # 1 hour

    def cached_llm_call(self, prompt: str, model: str) -> str:
        """
        Cache LLM responses for identical prompts
        """
        cache_key = hashlib.sha256(
            f"{model}:{prompt}".encode()
        ).hexdigest()

        # Check cache
        if cached := self.cache.get(cache_key):
            return cached  # $0.00 cost!

        # Call LLM
        response = llm.generate(model=model, prompt=prompt)

        # Store in cache
        self.cache.setex(cache_key, self.ttl, response)
        return response
# Real-world impact:
# - Research queries: 30% cache hit rate
# - Savings: 30% of $24,900 = $7,470/month
Step 4: Batch Processing
# Before: Individual processing
for item in items:  # 1000 items
    result = llm.process(item)  # $0.10 each = $100

# After: Batch processing
batch_size = 10
for i in range(0, len(items), batch_size):
    batch = items[i:i + batch_size]
    results = llm.process_batch(batch)  # $0.50 per batch

# Cost: 100 batches × $0.50 = $50 (50% savings)
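If your provider has no batch endpoint, llm.process_batch can be approximated by packing several items into one prompt and asking for a JSON array back. A sketch of that approach; it trades some reliability (the model can drop or merge items) for lower cost, which is why the length check matters.

import json

def process_batch(items: list, model: str = "gpt-4o-mini") -> list:
    """Pack several small items into one prompt to amortize the fixed per-call cost."""
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    prompt = (
        "For each numbered item below, return a one-sentence summary.\n"
        f"{numbered}\n"
        "Respond with a JSON array of strings, one entry per item, in order."
    )
    summaries = json.loads(llm.generate(model=model, prompt=prompt))
    assert len(summaries) == len(items), "Model dropped or merged items; fall back to per-item calls"
    return summaries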
Real-World Optimization Case Study
Initial State:
Research Agent Performance:
- Latency: 65s average
- Cost: $2.93 per query
- Volume: 10,000 queries/month
- Monthly cost: $29,300
After Optimization:
Research Agent Performance:
- Latency: 27s average (58% improvement)
- Cost: $0.86 per query (71% reduction)
- Volume: 10,000 queries/month
- Monthly cost: $8,600
Savings: $20,700/month = $248,400/year
What we did:
- Switched search term generation to Claude Haiku ($0.04 → $0.005)
- Optimized search queries (3 calls → 1 call)
- Parallelized page fetching (25s → 8s)
- Switched essay writing to Claude Sonnet ($2.80 → $0.80)
- Added caching layer (30% hit rate)
Production Checklist: Shipping with Confidence
Before you deploy to real users, make sure you have these covered:
Quality Gates
class ProductionReadiness:
    def check_quality(self):
        """
        Minimum quality thresholds
        """
        assert self.end_to_end_accuracy > 0.90, "Not accurate enough"
        assert self.critical_component_accuracy > 0.95, "Component failing"
        assert self.error_analysis_complete, "Don't know failure modes"
        assert self.edge_cases_handled, "Will break on edge cases"

    def check_performance(self):
        """
        Performance requirements
        """
        assert self.p95_latency < self.sla_requirement
        assert self.cost_per_request < self.budget_limit
        assert self.parallelization_optimized

    def check_reliability(self):
        """
        Production hardening
        """
        assert self.error_handling_comprehensive
        assert self.retry_logic_implemented
        assert self.timeout_protection_enabled
        assert self.rate_limiting_configured
        assert self.graceful_degradation_works

    def check_safety(self):
        """
        Security and safety
        """
        assert self.code_execution_sandboxed
        assert self.input_validation_active
        assert self.output_filtering_enabled
        assert self.pii_protection_configured
        assert self.content_safety_checked

    def check_observability(self):
        """
        Monitoring and debugging
        """
        assert self.logging_comprehensive
        assert self.trace_storage_enabled
        assert self.metrics_dashboard_ready
        assert self.alerting_configured
        assert self.cost_tracking_active
The Monitoring Stack
import structlog
from prometheus_client import Counter, Histogram
# Structured logging
logger = structlog.get_logger()
# Metrics
requests_total = Counter('agent_requests_total', 'Total requests')
request_duration = Histogram('agent_request_duration_seconds', 'Request duration')
request_cost = Histogram('agent_request_cost_dollars', 'Request cost')
errors_total = Counter('agent_errors_total', 'Total errors', ['component'])
class AgentObservability:
    def trace_request(self, request):
        """
        Comprehensive request tracing
        """
        with request_duration.time():
            try:
                # Log request start
                logger.info("request_started",
                    request_id=request.id,
                    query=request.query
                )

                # Execute workflow
                result = self.execute_workflow(request)

                # Log success
                logger.info("request_completed",
                    request_id=request.id,
                    latency_ms=result.latency,
                    cost_dollars=result.cost,
                    output_length=len(result.output)
                )

                # Update metrics
                requests_total.inc()
                request_cost.observe(result.cost)
                return result

            except Exception as e:
                # Log failure
                logger.error("request_failed",
                    request_id=request.id,
                    error=str(e),
                    component=getattr(e, "component", "unknown")
                )

                # Update error metrics
                errors_total.labels(component=getattr(e, "component", "unknown")).inc()
                raise
The Incident Response Playbook
When things go wrong (and they will):
"""
INCIDENT RESPONSE PLAYBOOK
1. DETECT
- Alert fires from monitoring
- User reports issues
- Cost spike detected
2. TRIAGE
- Check dashboard: Which component failing?
- Review recent traces: Pattern in failures?
- Check error logs: New error types?
3. IMMEDIATE MITIGATION
- Can we roll back recent changes?
- Can we disable problematic component?
- Can we route to fallback system?
4. ROOT CAUSE ANALYSIS
- Examine traces of failed requests
- Compare to successful requests
- Check for:
* Model provider issues
* API rate limits hit
* Input pattern changes
* Prompt regressions
5. FIX & VERIFY
- Implement fix
- Test with recent failures
- Monitor closely for 24h
6. POSTMORTEM
- Document what happened
- Update evals to catch this
- Improve monitoring
- Update runbook
"""
The Strategic Roadmap: Building Your Agentic AI Team
As a tech lead, here’s how I’d approach building organizational capability:
Phase 1: Foundation (Month 1-2)
Goals:
- Ship first working agent
- Establish eval discipline
- Build team intuition
Actions:
week_1_2 = {
"build": "Simple agent with deterministic workflow",
"focus": "Output quality above all else",
"team_size": "2-3 engineers"
}
week_3_4 = {
"build": "Basic eval suite (20-30 examples)",
"focus": "Identify failure modes",
"deliverable": "Error analysis document"
}
week_5_8 = {
"build": "Iterate based on evals",
"focus": "Get to 85-90% accuracy",
"gate": "Quality threshold before proceeding"
}
Phase 2: Production Hardening (Month 3-4)
Goals:
- Make it reliable
- Optimize critical path
- Instrument everything
Actions:
month_3 = {
"build": [
"Comprehensive error handling",
"Retry logic and timeouts",
"Monitoring and alerting",
"Component-level evals"
],
"focus": "Reliability engineering"
}
month_4 = {
"build": [
"Latency optimization",
"Cost optimization (if needed)",
"Load testing",
"Incident playbooks"
],
"focus": "Performance and ops readiness"
}
Phase 3: Scale and Sophistication (Month 5-6)
Goals:
- Handle edge cases
- Add advanced patterns
- Build reusable components
Actions:
month_5 = {
"build": [
"Expand eval set (50-100 examples)",
"Add reflection where beneficial",
"Implement caching layer"
],
"focus": "Quality improvements"
}
month_6 = {
"build": [
"Planning capabilities (if needed)",
"Multi-agent (if justified)",
"Reusable agent components"
],
"focus": "Advanced capabilities"
}
Team Structure
Agentic AI Team (6-8 people)
├── Tech Lead (You)
│ ├── Architecture decisions
│ ├── Strategic direction
│ └── Stakeholder management
│
├── Senior ML Engineers (2)
│ ├── Prompt engineering
│ ├── Model selection
│ └── Agent design
│
├── ML Engineers (2-3)
│ ├── Eval development
│ ├── Error analysis
│ └── Component optimization
│
├── ML Ops Engineer (1)
│ ├── Infrastructure
│ ├── Monitoring
│ └── Cost optimization
│
└── Product Manager (1)
├── Use case prioritization
├── Success metrics
└── User research
Budget Planning
Realistic cost structure:
monthly_costs = {
    "compute": {
        "development": 2000,   # Experimentation
        "production": 10000,   # 10K queries/day
        "overhead": 2000       # Monitoring, logs
    },
    "people": {
        "engineers": 6 * 15000,  # Loaded cost
        "pm": 12000
    },
    "tools": {
        "monitoring": 500,
        "frameworks": 200,
        "misc": 300
    }
}

total_monthly = (
    sum(monthly_costs["compute"].values()) +
    sum(monthly_costs["people"].values()) +
    sum(monthly_costs["tools"].values())
)
# = $117,000/month

# With 10K queries/day at $5 value each
revenue_potential = 10000 * 5 * 30  # $1.5M/month
# ROI ≈ 1,200% (if you capture the value!)
Key Takeaways for Technical Leaders
After working on multiple production agentic systems, here are my hard-earned lessons:
1. Evals Are Not Optional
The #1 differentiator between teams that succeed and those that struggle is evaluation discipline. You cannot improve what you don’t measure.
Action item: Before writing a single line of production code, define how you’ll measure success.
2. Start Simple, Add Complexity Deliberately
Every team that jumped straight to multi-agent planning regretted it.
Progressive complexity:
- Deterministic pipeline
- Add reflection (if beneficial)
- Add tool use (as needed)
- Consider planning (rarely)
- Multi-agent (very rarely)
3. Error Analysis Drives Strategy
Don’t guess where the problems are. Count them. A spreadsheet tracking failure modes by component will tell you exactly where to invest engineering time.
4. Model Selection Matters Less Than You Think
Architecture and prompting usually matter more than GPT-4 vs Claude vs Llama. I’ve seen GPT-3.5 with good architecture outperform GPT-4 with poor architecture.
Exception: Reasoning-heavy tasks genuinely need frontier models.
5. Optimize in Order: Quality → Latency → Cost
Premature cost optimization is the root of much evil. Get it working well first, then make it fast, then make it cheap.
6. Plan for Observability from Day One
You will have production incidents. Comprehensive logging and tracing are the difference between 10-minute and 10-hour debugging sessions.
7. Human-in-the-Loop is Often Better Than Full Autonomy
For high-stakes decisions, having an agent draft + human review often beats trying to make the agent 100% autonomous.
8. External Feedback Makes Everything Better
Whether it’s code execution results, API responses, or validation checks, getting new information from outside the LLM dramatically improves reflection and planning.
What’s Next: The Future of Agentic AI
Emerging Trends
Constitutional AI: Training agents with values and principles, not just instructions
Memory Systems: Long-term memory across sessions, agents that learn from experience
Self-Improvement: Agents that write better versions of themselves
Human-AI Collaboration: Seamless blending of human expertise and agent capabilities
Research Frontiers
What’s being actively worked on:
- More reliable planning mechanisms
- Better confidence calibration
- Improved error recovery
- Lower computational costs
- Stronger safety guarantees
Your Learning Path
Next steps:
- Build something this week: Start with a simple reflection pattern
- Instrument everything: Add logging and metrics from day one
- Create your first eval: Even 10 examples beats intuition
- Join the community: Share your learnings, learn from others
- Read others’ prompts: Study open-source projects deeply
Resources:
- DeepLearning.AI courses (Andrew Ng)
- Anthropic’s prompt engineering guide
- OpenAI function calling documentation
- LangChain/LangGraph documentation
- Community forums and Discord servers
Final Thoughts
Building production agentic AI systems is equal parts engineering, science, and art:
Engineering: Designing robust workflows and integrating components.
Science: Measuring systematically and optimizing based on data.
Art: Crafting prompts and choosing appropriate patterns.
The field is evolving rapidly, but the core principles—systematic evaluation, error analysis, iterative improvement—will remain valuable regardless of which models or tools dominate tomorrow.
The teams winning with agentic AI aren’t the ones with the best models or the biggest budgets. They’re the ones with the best development discipline.
Now go build something amazing. 🚀
This guide synthesizes insights from Andrew Ng’s Agentic AI course and practical experience shipping production agentic systems. The patterns and practices described are battle-tested across multiple real-world deployments.