Why I Built This
Agentic AI is moving fast, but most tutorials stop at "here's a ReAct loop." I wanted to go further — to build and compare every major coordination pattern, not just read about them. The goal was a single, self-contained reference: ten patterns, ten notebooks, one consistent stack (Gemini API + Pydantic), and honest notes on what each pattern actually feels like to implement and operate.
I organized the patterns into two tiers. Single-Agent Systems (SAS) involve one agent reasoning through a task, possibly calling tools or pausing for a human. Multi-Agent Systems (MAS) involve two or more agents collaborating, competing, or coordinating. Both tiers matter — the right architecture depends on the problem, not on what sounds impressive.
The Stack
- Model: Google Gemini 2.5 Flash via the google-genai SDK
- Structured outputs: Pydantic models with JSON schema validation
- Safety rails: max_iterations limits on every loop-based pattern
- Temperature strategy: low (0.1) for evaluators and classifiers; higher for creative/generative roles
- Memory: Accumulated string-based context passed across iterations
- Format: Self-contained Jupyter notebooks — no shared utilities, no external databases
The simplicity was deliberate. By keeping every notebook standalone, each pattern is legible on its own — you can read the code and immediately see the coordination logic without tracing imports across files.
Part 1: Single-Agent Systems (SAS)
One agent. One context window. Two very different interaction models.
ReAct — Reason, Act, Observe
Notebook: Agentic_AI_ReAct_Pattern.ipynb
What It Is
ReAct is the foundational single-agent pattern. The agent loops through three steps: Reason (decide what to do), Act (select and call a tool), and Observe (read the result and update its context). Each iteration appends to a growing memory string, so the agent always sees its full reasoning history.
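The loop above can be sketched in a few lines. Here the Reason step is a stubbed lookup table rather than a Gemini call, so the control flow and memory accumulation are visible; all names are illustrative, not the notebook's actual code.

```python
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy calculator tool
    "date": lambda _: "2024-01-01",              # stubbed date lookup
}

def decide(context: str):
    """Stand-in for the LLM's Reason step: pick a tool or finish."""
    if "RESULT:" in context:
        return ("finish", context.rsplit("RESULT: ", 1)[-1])
    return ("calculator", "2 + 2")

def react(question: str, max_iterations: int = 5) -> str:
    context = f"QUESTION: {question}"
    for _ in range(max_iterations):            # safety rail on every loop
        action, arg = decide(context)          # Reason
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)       # Act
        context += f"\nRESULT: {observation}"  # Observe: append to memory
    return "max_iterations reached"

print(react("What is 2 + 2?"))  # → 4
```

The growing `context` string is the memory accumulation strategy the notebooks use; swapping `decide` for a real model call is the only structural change needed.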
What I Built
I wired up a small set of tools (search simulation, calculator, date lookup) and let the agent work through multi-step questions that required chaining them. The agent correctly decided which tool to call, in what order, and when it had enough information to stop.
What Worked Well
- Transparent reasoning — easy to debug by reading the trace
- Tool selection was reliable with clear tool descriptions
- Accumulated context kept the agent coherent across steps
What to Watch Out For
- Context grows with every iteration — token cost compounds
- Without max_iterations, loops can run indefinitely
- Poor tool descriptions cause wrong tool selection
Takeaway: ReAct is the right starting point for any agentic system. Master the loop, the tool schema design, and the memory accumulation strategy here before moving to multi-agent patterns.
Human-in-the-Loop (HITL)
Notebook: Agentic_AI_HITL_Pattern.ipynb
What It Is
HITL inserts a human checkpoint into the agent loop. The agent generates output, pauses, and waits for explicit approval or rejection. On rejection, it incorporates the human's feedback and regenerates. On approval, it proceeds. This pattern is essential anywhere a fully autonomous agent would be too risky.
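A minimal sketch of the checkpoint mechanic, assuming a `review` callback that stands in for the human and a `generate` function standing in for the model; both names are hypothetical.

```python
from pydantic import BaseModel

class Review(BaseModel):
    approved: bool       # the human's explicit gate
    feedback: str = ""   # structured rejection notes

def hitl_loop(prompt, generate, review, max_iterations=3):
    draft = generate(prompt)
    for _ in range(max_iterations):
        verdict = review(draft)       # in practice this blocks on the human
        if verdict.approved:
            return draft
        # On rejection, fold the human's feedback into the next cycle.
        draft = generate(f"{prompt}\nReviewer feedback: {verdict.feedback}")
    return draft
```

In a notebook, `review` can simply wrap `input()`; the Pydantic model is what keeps the approval signal unambiguous.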
What I Built
I used a draft-writing scenario where the agent generates content, the human reviews it in the notebook, and the feedback is fed back into the next generation cycle. Pydantic models structured the approval/rejection signal cleanly.
What Worked Well
- Structured feedback (Pydantic) made rejections unambiguous
- The agent genuinely improved output after incorporating notes
- Simple to implement — just a breakpoint in the ReAct loop
What to Watch Out For
- Vague human feedback produces marginal improvements
- Breaks async workflows — blocks on human response
- Must define clear approval criteria upfront
Takeaway: HITL is not a limitation — it's a feature in high-stakes domains (security, legal, finance). The quality of the feedback loop matters as much as the agent's generation quality.
Part 2: Multi-Agent Systems (MAS)
Eight patterns. Eight different ways agents can collaborate, delegate, compete, and self-organize.
Sequential — Linear Pipeline
Notebook: Agentic_AI_Sequential_Pattern.ipynb
What It Is
The simplest multi-agent topology: A feeds B feeds C. Each agent receives the prior agent's output as its input. No dynamic routing, no branching, no feedback — pure linear chaining.
What I Built
A three-stage pipeline: a research agent that summarizes a topic, a writing agent that drafts an article from the summary, and an editing agent that polishes the draft. Each agent had a focused system prompt tuned to its role.
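The chain reduces to plain function composition. The three stages below are stubs named after the roles described above; in the notebook each would be a Gemini call with its own system prompt.

```python
def research(topic):
    return f"summary of {topic}"          # stub for the research agent

def write(summary):
    return f"article based on {summary}"  # stub for the writing agent

def edit(draft):
    return f"polished {draft}"            # stub for the editing agent

def pipeline(topic):
    out = topic
    for stage in (research, write, edit):  # A feeds B feeds C, no branching
        out = stage(out)
    return out

print(pipeline("zero trust"))
# → polished article based on summary of zero trust
```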
What Worked Well
- Dead simple to reason about and debug
- Role specialization meaningfully improved output at each stage
- Easy to swap individual agents without changing the pipeline
What to Watch Out For
- Errors early in the chain cascade and compound
- No mechanism to loop back if a stage produces poor output
- Latency adds up — every stage is a sequential API call
Takeaway: Start here for any well-defined, ordered workflow. If the steps never need to loop back or branch, Sequential is both the simplest and the most operationally predictable architecture.
Loop — Writer and Critic
Notebook: Agentic_AI_Loop_Pattern.ipynb
What It Is
Two agents in a cycle: a writer generates content, a critic evaluates it with a structured output (score + approval flag), and the loop continues until the critic approves or max_iterations is hit. This is the multi-agent equivalent of HITL, but fully automated.
What I Built
The critic returned a Pydantic model with an approved: bool field, a numeric quality score, and specific feedback strings. The writer's next prompt incorporated the critic's feedback verbatim. I tested both strict (high quality threshold) and lenient critics and observed that threshold tuning directly controls iteration count — and therefore cost.
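A sketch of that writer-critic loop, with the critic's schema as described (approved flag, score, feedback strings) and both agents stubbed as plain functions.

```python
from pydantic import BaseModel

class Critique(BaseModel):
    approved: bool
    score: float
    feedback: list[str]

def refine(write, critique, prompt, max_iterations=5):
    draft = write(prompt)
    for _ in range(max_iterations):
        verdict = critique(draft)
        if verdict.approved:
            return draft, verdict.score
        # Feed the critic's feedback verbatim into the next writer prompt.
        draft = write(prompt + "\nAddress: " + "; ".join(verdict.feedback))
    return draft, verdict.score
```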
What Worked Well
- Structured critic outputs eliminated ambiguous feedback
- Output quality genuinely improved with each rejection cycle
- max_iterations guard prevented infinite loops cleanly
What to Watch Out For
- Overly strict critics can loop to the iteration cap without converging
- Critic and writer temperature need separate tuning
- Token cost scales linearly with iteration count
Takeaway: The Loop pattern is a quality amplifier. The writer-critic dynamic is most effective when the critic's feedback is structured, specific, and actionable — vague critiques produce marginal gains.
Iterative Refinement — Evolving the Prompt Itself
Notebook: Agentic_AI_Iterative_Refinement_Pattern.ipynb
What It Is
Three agents working together: a generator produces output, an evaluator scores it, and a prompt enhancer rewrites the generator's prompt based on the evaluation. The next cycle uses the enhanced prompt — meaning the system improves not just the output but its own instructions over time.
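The three-agent cycle can be sketched as below. The generator, evaluator, and prompt enhancer are stubs; the key mechanic is that the prompt, not the output, carries state across cycles.

```python
def generate(prompt):
    return f"output[{prompt}]"                 # generator stub

def evaluate(output):
    # Evaluator stub: returns (score, weakest dimension).
    return (0.9, None) if "clarity" in output else (0.5, "clarity")

def enhance(prompt, weak_dim):
    return f"{prompt} + emphasize {weak_dim}"  # prompt-enhancer stub

def refine_prompt(prompt, cycles=3, threshold=0.8):
    for _ in range(cycles):
        output = generate(prompt)
        score, weak = evaluate(output)
        if score >= threshold:
            break
        # Rewrite the instructions, not the output — the next cycle
        # starts from the enhanced prompt.
        prompt = enhance(prompt, weak)
    return prompt, output
```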
What I Built
I applied this to a security advisory writing task. The evaluator scored drafts on accuracy, clarity, and completeness. The prompt enhancer then rewrote the generator's system prompt to address the weak dimensions. Across three cycles, the prompt evolved meaningfully — specificity and structure improved in ways I hadn't explicitly designed for.
What Worked Well
- Prompt evolution produced genuinely different (better) outputs
- The pattern surfaced prompt weaknesses you'd miss manually
- Three-agent separation kept responsibilities clean
What to Watch Out For
- Prompt enhancer can drift toward verbosity without constraints
- Three API calls per iteration — cost adds up fast
- Need to constrain the prompt enhancer's output length
Takeaway: This is the most intellectually interesting single-pipeline pattern. When the evaluator dimensions are well-defined, watching prompts evolve across iterations is genuinely illuminating — it surfaces prompt engineering heuristics automatically.
Parallel — Concurrent Specialists
Notebook: Agentic_AI_Parallel_Pattern.ipynb
What It Is
An orchestrator fans out the same input to multiple specialist agents concurrently, then synthesizes their combined outputs into a final response. All specialist agents run at the same time — the orchestrator waits for all results before synthesizing.
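Because each specialist call is independent and I/O-bound, a thread pool is one natural way to express the fan-out. The specialists below are stubs, and the final synthesis (an LLM call in practice) is reduced to a join.

```python
from concurrent.futures import ThreadPoolExecutor

def vuln_analyst(app):   return f"vulns in {app}"           # specialist stubs
def compliance(app):     return f"compliance gaps in {app}"
def threat_modeler(app): return f"threats to {app}"

SPECIALISTS = [vuln_analyst, compliance, threat_modeler]

def assess(app_description):
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        # Same input fans out to every specialist concurrently.
        findings = list(pool.map(lambda s: s(app_description), SPECIALISTS))
    # The orchestrator's synthesis would be another LLM call; joined here.
    return "\n".join(findings)
```

`pool.map` preserves specialist order, which keeps the synthesis prompt deterministic even though execution is concurrent.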
What I Built
I built a security assessment scenario with three parallel specialists: a vulnerability analyst, a compliance reviewer, and a threat modeler. All three received the same application description, worked independently, and their findings were synthesized by the orchestrator into a unified risk report.
What Worked Well
- Total latency equals the slowest specialist's latency, not the sum of all — a big speedup
- Independent perspectives caught issues single-agent analysis missed
- Orchestrator synthesis was clean with structured specialist outputs
What to Watch Out For
- Specialists can produce contradictory findings — synthesis is non-trivial
- All agents share the same input — good for independent analysis, bad for sequential context
- Orchestrator prompt must handle variable-length, variable-quality specialist outputs
Takeaway: Parallel is the highest-leverage pattern for tasks that decompose into truly independent subtasks. In security contexts, running vulnerability, compliance, and threat analysis concurrently is a natural fit.
Coordinator — Intent Classification and Routing
Notebook: Agentic_AI_Coordinator_Pattern.ipynb
What It Is
A central coordinator classifies the user's intent using a Pydantic enum (structured output), routes the request to the appropriate specialist agent, and then synthesizes the specialist's response into the final output. Unlike Parallel (which fans out to all specialists), Coordinator picks exactly one.
What I Built
The coordinator classified incoming queries into one of four security domains (AppSec, CloudSec, DevSecOps, AI Security) using a strict enum — no free-form routing decisions. The right specialist then handled the query. The enum constraint was the key design choice: it made routing deterministic and auditable.
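The enum-constrained routing schema might look like this in Pydantic. The classification itself would come from a structured-output Gemini call; the point shown here is that validation rejects any label outside the taxonomy.

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class Domain(str, Enum):
    APPSEC = "appsec"
    CLOUDSEC = "cloudsec"
    DEVSECOPS = "devsecops"
    AI_SECURITY = "ai_security"

class Route(BaseModel):
    domain: Domain   # the enum makes free-form routing impossible

Route(domain="appsec")            # a valid label parses cleanly
try:
    Route(domain="network")       # outside the taxonomy
except ValidationError:
    print("rejected")
```

Adding a specialist means extending `Domain` — routing stays deterministic and auditable.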
What Worked Well
- Enum routing was reliable — no hallucinated specialist names
- Each specialist could be heavily specialized without generalist dilution
- Easy to add new specialists by extending the enum
What to Watch Out For
- Cross-domain queries (e.g., AI + AppSec) route to only one specialist
- Coordinator becomes a single point of failure
- Classification errors silently misdirect queries
Takeaway: The structured output enum for routing was the most important implementation detail — it transformed routing from a probabilistic LLM decision into a deterministic classification. Use this pattern for any system with a clear taxonomy of intent.
Hierarchical — Three-Tier Decomposition
Notebook: Agentic_AI_Hierarchical_Pattern.ipynb
What It Is
A three-tier architecture: a root coordinator decomposes the top-level task, mid-level coordinators manage domain subtasks, and worker agents execute leaf-level operations. This mirrors org charts, military command structures, and microservice architectures.
What I Built
A comprehensive security assessment pipeline: the root coordinator decomposed "assess this application" into three domains (infrastructure, code, compliance). Domain coordinators assigned subtasks to workers (e.g., port scanning, dependency analysis, policy checking). Workers returned results up the chain.
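A sketch of the three tiers, with the root plan as a static map and workers as stubs (task and domain names are illustrative); in the notebook, decomposition and assignment are themselves LLM calls.

```python
WORKERS = {                     # leaf-level worker stubs
    "port_scan": lambda: "22,443 open",
    "deps":      lambda: "1 outdated lib",
    "policy":    lambda: "2 policy gaps",
}

PLAN = {                        # root decomposition: domain -> subtasks
    "infrastructure": ["port_scan"],
    "code":           ["deps"],
    "compliance":     ["policy"],
}

def run_assessment():
    report = {}
    for domain, tasks in PLAN.items():                     # domain coordinators
        report[domain] = {t: WORKERS[t]() for t in tasks}  # workers execute
    return report                                          # results flow back up
```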
What Worked Well
- Naturally handles complex, multi-domain tasks
- Workers remain simple and focused — easy to test independently
- Mirrors real-world organizational structures intuitively
What to Watch Out For
- Most complex pattern to implement correctly
- Error propagation up the hierarchy is hard to handle gracefully
- Communication overhead grows with depth — more API calls per task
Takeaway: Hierarchical is the right pattern for enterprise-scale tasks with genuine domain decomposition. The coordination overhead is real — only worth it when the task complexity justifies the architecture complexity.
Swarm — Autonomous Agent Handoffs
Notebook: Agentic_AI_Swarm_Pattern.ipynb
What It Is
No central coordinator. Each agent autonomously decides which agent should handle the conversation next, communicates that via a structured handoff field in its output, and contributes to a shared history. The swarm terminates on consensus or when the iteration cap is reached.
What I Built
A vulnerability triage swarm with four agents: a detector, an analyzer, a prioritizer, and a remediator. Each agent decided whether to continue its own work or hand off to another. Shared history ensured all agents had full context of prior contributions. The handoff field was a Pydantic enum — structured, not free-form.
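The handoff mechanic in miniature: each agent returns a finding plus a structured "next agent" decision, and everyone appends to one shared history. Agent logic is stubbed here; the enum and the iteration cap are the load-bearing parts.

```python
from enum import Enum

class Next(str, Enum):
    DETECTOR = "detector"
    ANALYZER = "analyzer"
    PRIORITIZER = "prioritizer"
    REMEDIATOR = "remediator"
    DONE = "done"

# Stub agents: each reads the shared history, contributes a finding,
# and names its successor via the structured handoff field.
AGENTS = {
    Next.DETECTOR:    lambda hist: ("found SQLi", Next.ANALYZER),
    Next.ANALYZER:    lambda hist: ("exploitable", Next.PRIORITIZER),
    Next.PRIORITIZER: lambda hist: ("P1", Next.REMEDIATOR),
    Next.REMEDIATOR:  lambda hist: ("patch issued", Next.DONE),
}

def run_swarm(start=Next.DETECTOR, max_iterations=10):
    history, current = [], start
    for _ in range(max_iterations):   # the only hard stop in the system
        if current is Next.DONE:
            break
        finding, current = AGENTS[current](history)
        history.append(finding)
    return history
```

With real LLM agents the handoff sequence is emergent rather than fixed, which is exactly why the `max_iterations` guard is mandatory.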
What Worked Well
- No coordinator bottleneck — fully distributed decision-making
- Emergent coordination: agents naturally found efficient handoff sequences
- Shared history kept all agents coherent without explicit messaging
What to Watch Out For
- Agents can form handoff cycles — robust iteration cap is mandatory
- Hardest pattern to predict or audit — emergent behavior cuts both ways
- Shared history grows large — token cost escalates quickly
Takeaway: Swarm is the most fascinating and the most dangerous pattern to operate. The emergent coordination is real — but so is the emergent failure mode. The iteration cap is not optional; it's the only hard stop in a system with no central authority.
Review & Critic — Specialist Evaluation
Notebook: Agentic_AI_Review_Critic_Pattern.ipynb
What It Is
Similar to the Loop pattern but with a domain-expert critic rather than a generic quality evaluator. The generator produces output; the critic returns a structured evaluation with a numeric quality score and an approval flag. Rejected output loops back with the critic's feedback attached.
What I Built
A code security review scenario: a generator agent wrote Python code snippets, and a security-expert critic evaluated them specifically for OWASP vulnerabilities — not general code quality. The critic's Pydantic output included a score (0-10), an approved boolean, and a list of specific security findings. The generator used those findings as a diff-style patch list on the next cycle.
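The critic's output schema as described, sketched in Pydantic with field constraints enforcing the 0-10 score range; field names are illustrative.

```python
from pydantic import BaseModel, Field, ValidationError

class SecurityCritique(BaseModel):
    score: float = Field(ge=0, le=10)  # hard-bounded 0-10 quality score
    approved: bool                     # the hard gate
    findings: list[str]                # the generator's patch list

c = SecurityCritique(score=4, approved=False,
                     findings=["A03: SQL injection in query builder"])
try:
    SecurityCritique(score=11, approved=True, findings=[])  # out of range
except ValidationError:
    print("rejected")
```

The `ge`/`le` constraints mean an out-of-range score from the model is caught at the boundary instead of silently propagating into the loop's exit logic.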
What Worked Well
- Domain-expert critic caught security issues a generic critic missed
- Score + approval flag gave both a relative signal and a hard gate
- Structured finding list made generator improvements targeted
What to Watch Out For
- Critic expertise is prompt-dependent — weak system prompt = weak critique
- Generator can learn to game the critic without actually improving
- Convergence is slower for adversarial or complex domains
Takeaway: The difference between this and the Loop pattern is the critic's specificity. A generic critic improves writing; a security-expert critic surfaces OWASP violations. Domain expertise in the critic's system prompt is what makes this pattern powerful for security use cases.
Pattern Comparison at a Glance
| Pattern | Type | Complexity | Best For |
|---|---|---|---|
| ReAct | SAS | Low | Tool-use, multi-step reasoning |
| HITL | SAS | Low | High-stakes, human oversight required |
| Sequential | MAS | Low | Ordered, non-branching workflows |
| Loop | MAS | Low-Med | Automated quality refinement |
| Iterative Refinement | MAS | Medium | Prompt optimization, evolving criteria |
| Parallel | MAS | Medium | Independent concurrent analysis |
| Coordinator | MAS | Medium | Intent classification and routing |
| Hierarchical | MAS | High | Complex multi-domain decomposition |
| Swarm | MAS | High | Fully distributed, emergent coordination |
| Review & Critic | MAS | Med-High | Domain-expert quality gating |
Cross-Cutting Observations
Structured Outputs Are Not Optional
Every pattern that involves routing, evaluation, or handoffs depends on Pydantic-validated structured outputs. Free-form LLM text for these decisions is unreliable. The moment I switched from parsing free-form responses to Pydantic enums and models, error rates dropped to near-zero.
Temperature Is a First-Class Parameter
Evaluators, classifiers, and critics need low temperature (0.1–0.2) for consistency. Generators, writers, and creative agents benefit from higher temperature. Treating all agents identically produces worse results across the board. Every pattern had its own temperature profile.
Iteration Caps Are Safety Infrastructure
max_iterations is not a tuning parameter — it's a hard safety constraint. In Loop, Iterative Refinement, and Swarm patterns, an uncapped system will eventually burn through your token budget. Set it conservatively and treat hitting the cap as an alertable event, not a normal exit condition.
Context Accumulation Has Diminishing Returns
String-based memory accumulation works, but the signal-to-noise ratio degrades as context grows. In longer runs of the Loop and Swarm patterns, early iterations had more influence on later outputs than they should have. Sliding window context or summarization are worth implementing for production use.
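One simple mitigation can be sketched as follows: keep the last N entries verbatim and collapse everything older into a summary slot. The summarization step would be an LLM call in practice; truncation stands in here.

```python
def window_context(entries, keep_last=3):
    """Keep the last N entries verbatim; collapse older ones to a summary."""
    if len(entries) <= keep_last:
        return entries
    older, recent = entries[:-keep_last], entries[-keep_last:]
    # In practice this would be an LLM summarization call; truncate here.
    summary = "SUMMARY: " + "; ".join(e[:20] for e in older)
    return [summary] + recent

print(window_context(["a", "b", "c", "d", "e"]))
# → ['SUMMARY: a; b', 'c', 'd', 'e']
```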
Pattern Complexity Does Not Correlate With Output Quality
The Swarm pattern produced some of the most interesting outputs and some of the worst. Sequential produced consistent, predictable results. Match the pattern to the problem structure, not to what sounds architecturally impressive. Start with the simplest pattern that fits the task.
A Security Practitioner's Lens
As an AppSec practitioner, I couldn't build these patterns without thinking about their threat surfaces. A few observations:
- Prompt injection risk scales with complexity. In Hierarchical and Swarm patterns, a compromised agent can influence downstream agents through shared history or handoff fields. Each tier is an injection surface.
- Structured outputs reduce injection attack surface. Pydantic validation rejects malformed outputs before they reach the next agent. This is a concrete security benefit, not just an engineering nicety.
- HITL is a security control, not just a UX feature. In agentic systems with real-world consequences (code execution, API calls, financial transactions), human checkpoints are a last-resort guardrail. Design for HITL before you remove it.
- Swarm's lack of central authority is a governance problem. With no coordinator logging decisions, auditing a Swarm execution requires reconstructing the handoff chain from shared history. Build audit logging in from the start.
What's Next
The notebooks in this repo are a foundation, not a finish line. Next explorations:
- Red-teaming these patterns — specifically testing prompt injection across agent boundaries in Hierarchical and Swarm architectures
- Sliding window memory — replacing string accumulation with summarized context windows for long-running agents
- MCP tool integration — wiring real tools (file system, search, code execution) into the ReAct and Coordinator patterns
- Cross-pattern composition — running a Coordinator that routes to a Parallel fan-out, which feeds into a Loop refinement cycle
Run the Notebooks
All ten notebooks are open-source, self-contained, and runnable with a free Gemini API key. Clone the repo, install the two dependencies, and work through them in order — SAS patterns first, then MAS.
```shell
pip install google-genai pydantic
```