Why I Built This
Agentic AI is moving fast, but most tutorials stop at "here's a ReAct loop." I wanted to go further — to build and compare every major coordination pattern, not just read about them. The goal was a single, self-contained reference: ten patterns, ten notebooks, one consistent stack (Gemini API + Pydantic), and honest notes on what each pattern actually feels like to implement and operate.
I organized the patterns into two tiers. Single-Agent Systems (SAS) involve one agent reasoning through a task, possibly calling tools or pausing for a human. Multi-Agent Systems (MAS) involve two or more agents collaborating, competing, or coordinating. Both tiers matter — the right architecture depends on the problem, not on what sounds impressive.
The Stack
- Model: Google Gemini 2.5 Flash via the google-genai SDK
- Structured outputs: Pydantic models with JSON schema validation
- Safety rails: max_iterations limits on every loop-based pattern
- Temperature strategy: low (0.1) for evaluators and classifiers; higher for creative/generative roles
- Memory: Accumulated string-based context passed across iterations
- Format: Self-contained Jupyter notebooks — no shared utilities, no external databases
The simplicity was deliberate. By keeping every notebook standalone, each pattern is legible on its own — you can read the code and immediately see the coordination logic without tracing imports across files.
Part 1: Single-Agent Systems (SAS)
One agent. One context window. Two very different interaction models.
ReAct — Reason, Act, Observe
Notebook: Agentic_AI_ReAct_Pattern.ipynb
What It Is
ReAct is the foundational single-agent pattern. The agent loops through three steps: Reason (decide what to do), Act (select and call a tool), and Observe (read the result and update its context). Each iteration appends to a growing memory string, so the agent always sees its full reasoning history.
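The loop above can be sketched in a few lines. Here the Reason step is a stubbed lookup table rather than a Gemini call, so the control flow and memory accumulation are visible; all names are illustrative, not the notebook's actual code.

```python
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy calculator tool
    "date": lambda _: "2024-01-01",              # stubbed date lookup
}

def decide(context: str):
    """Stand-in for the LLM's Reason step: pick a tool or finish."""
    if "RESULT:" in context:
        return ("finish", context.rsplit("RESULT: ", 1)[-1])
    return ("calculator", "2 + 2")

def react(question: str, max_iterations: int = 5) -> str:
    context = f"QUESTION: {question}"
    for _ in range(max_iterations):            # safety rail on every loop
        action, arg = decide(context)          # Reason
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)       # Act
        context += f"\nRESULT: {observation}"  # Observe: append to memory
    return "max_iterations reached"

print(react("What is 2 + 2?"))  # → 4
```

The growing `context` string is the memory accumulation strategy the notebooks use; swapping `decide` for a real model call is the only structural change needed.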
What I Built
I wired up a small set of tools (search simulation, calculator, date lookup) and let the agent work through multi-step questions that required chaining them. The agent correctly decided which tool to call, in what order, and when it had enough information to stop.
What Worked Well
- Transparent reasoning — easy to debug by reading the trace
- Tool selection was reliable with clear tool descriptions
- Accumulated context kept the agent coherent across steps
What to Watch Out For
- Context grows with every iteration — token cost compounds
- Without max_iterations, loops can run indefinitely
- Poor tool descriptions cause wrong tool selection
Takeaway: ReAct is the right starting point for any agentic system. Master the loop, the tool schema design, and the memory accumulation strategy here before moving to multi-agent patterns.
Human-in-the-Loop (HITL)
Notebook: Agentic_AI_HITL_Pattern.ipynb
What It Is
HITL inserts a human checkpoint into the agent loop. The agent generates output, pauses, and waits for explicit approval or rejection. On rejection, it incorporates the human's feedback and regenerates. On approval, it proceeds. This pattern is essential anywhere a fully autonomous agent would be too risky.
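A minimal sketch of the checkpoint mechanic, assuming a `review` callback that stands in for the human and a `generate` function standing in for the model; both names are hypothetical.

```python
from pydantic import BaseModel

class Review(BaseModel):
    approved: bool       # the human's explicit gate
    feedback: str = ""   # structured rejection notes

def hitl_loop(prompt, generate, review, max_iterations=3):
    draft = generate(prompt)
    for _ in range(max_iterations):
        verdict = review(draft)       # in practice this blocks on the human
        if verdict.approved:
            return draft
        # On rejection, fold the human's feedback into the next cycle.
        draft = generate(f"{prompt}\nReviewer feedback: {verdict.feedback}")
    return draft
```

In a notebook, `review` can simply wrap `input()`; the Pydantic model is what keeps the approval signal unambiguous.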
What I Built
I used a draft-writing scenario where the agent generates content, the human reviews it in the notebook, and the feedback is fed back into the next generation cycle. Pydantic models structured the approval/rejection signal cleanly.
What Worked Well
- Structured feedback (Pydantic) made rejections unambiguous
- The agent genuinely improved output after incorporating notes
- Simple to implement — just a breakpoint in the ReAct loop
What to Watch Out For
- Vague human feedback produces marginal improvements
- Breaks async workflows — blocks on human response
- Must define clear approval criteria upfront
Takeaway: HITL is not a limitation — it's a feature in high-stakes domains (security, legal, finance). The quality of the feedback loop matters as much as the agent's generation quality.
Part 2: Multi-Agent Systems (MAS)
Eight patterns. Eight different ways agents can collaborate, delegate, compete, and self-organize.
Sequential — Linear Pipeline
Notebook: Agentic_AI_Sequential_Pattern.ipynb
What It Is
The simplest multi-agent topology: A feeds B feeds C. Each agent receives the prior agent's output as its input. No dynamic routing, no branching, no feedback — pure linear chaining.
What I Built
A three-stage pipeline: a research agent that summarizes a topic, a writing agent that drafts an article from the summary, and an editing agent that polishes the draft. Each agent had a focused system prompt tuned to its role.
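The chain reduces to plain function composition. The three stages below are stubs named after the roles described above; in the notebook each would be a Gemini call with its own system prompt.

```python
def research(topic):
    return f"summary of {topic}"          # stub for the research agent

def write(summary):
    return f"article based on {summary}"  # stub for the writing agent

def edit(draft):
    return f"polished {draft}"            # stub for the editing agent

def pipeline(topic):
    out = topic
    for stage in (research, write, edit):  # A feeds B feeds C, no branching
        out = stage(out)
    return out

print(pipeline("zero trust"))
# → polished article based on summary of zero trust
```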
What Worked Well
- Dead simple to reason about and debug
- Role specialization meaningfully improved output at each stage
- Easy to swap individual agents without changing the pipeline
What to Watch Out For
- Errors early in the chain cascade and compound
- No mechanism to loop back if a stage produces poor output
- Latency adds up — every stage is a sequential API call
Takeaway: Start here for any well-defined, ordered workflow. If the steps never need to loop back or branch, Sequential is both the simplest and the most operationally predictable architecture.
Loop — Writer and Critic
Notebook: Agentic_AI_Loop_Pattern.ipynb
What It Is
Two agents in a cycle: a writer generates content, a critic evaluates it with a structured output (score + approval flag), and the loop continues until the critic approves or max_iterations is hit. This is the multi-agent equivalent of HITL, but fully automated.
What I Built
The critic returned a Pydantic model with an approved: bool field, a numeric quality score, and specific feedback strings. The writer's next prompt incorporated the critic's feedback verbatim. I tested both strict (high quality threshold) and lenient critics and observed that threshold tuning directly controls iteration count — and therefore cost.
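A sketch of that writer-critic loop, with the critic's schema as described (approved flag, score, feedback strings) and both agents stubbed as plain functions.

```python
from pydantic import BaseModel

class Critique(BaseModel):
    approved: bool
    score: float
    feedback: list[str]

def refine(write, critique, prompt, max_iterations=5):
    draft = write(prompt)
    for _ in range(max_iterations):
        verdict = critique(draft)
        if verdict.approved:
            return draft, verdict.score
        # Feed the critic's feedback verbatim into the next writer prompt.
        draft = write(prompt + "\nAddress: " + "; ".join(verdict.feedback))
    return draft, verdict.score
```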
What Worked Well
- Structured critic outputs eliminated ambiguous feedback
- Output quality genuinely improved with each rejection cycle
- max_iterations guard prevented infinite loops cleanly
What to Watch Out For
- Overly strict critics can loop to the iteration cap without converging
- Critic and writer temperature need separate tuning
- Token cost scales linearly with iteration count
Takeaway: The Loop pattern is a quality amplifier. The writer-critic dynamic is most effective when the critic's feedback is structured, specific, and actionable — vague critiques produce marginal gains.
Iterative Refinement — Evolving the Prompt Itself
Notebook: Agentic_AI_Iterative_Refinement_Pattern.ipynb
What It Is
Three agents working together: a generator produces output, an evaluator scores it, and a prompt enhancer rewrites the generator's prompt based on the evaluation. The next cycle uses the enhanced prompt — meaning the system improves not just the output but its own instructions over time.
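The three-agent cycle can be sketched as below. The generator, evaluator, and prompt enhancer are stubs; the key mechanic is that the prompt, not the output, carries state across cycles.

```python
def generate(prompt):
    return f"output[{prompt}]"                 # generator stub

def evaluate(output):
    # Evaluator stub: returns (score, weakest dimension).
    return (0.9, None) if "clarity" in output else (0.5, "clarity")

def enhance(prompt, weak_dim):
    return f"{prompt} + emphasize {weak_dim}"  # prompt-enhancer stub

def refine_prompt(prompt, cycles=3, threshold=0.8):
    for _ in range(cycles):
        output = generate(prompt)
        score, weak = evaluate(output)
        if score >= threshold:
            break
        # Rewrite the instructions, not the output — the next cycle
        # starts from the enhanced prompt.
        prompt = enhance(prompt, weak)
    return prompt, output
```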
What I Built
I applied this to a security advisory writing task. The evaluator scored drafts on accuracy, clarity, and completeness. The prompt enhancer then rewrote the generator's system prompt to address the weak dimensions. Across three cycles, the prompt evolved meaningfully — specificity and structure improved in ways I hadn't explicitly designed for.
What Worked Well
- Prompt evolution produced genuinely different (better) outputs
- The pattern surfaced prompt weaknesses you'd miss manually
- Three-agent separation kept responsibilities clean
What to Watch Out For
- Prompt enhancer can drift toward verbosity without constraints
- Three API calls per iteration — cost adds up fast
- Need to constrain the prompt enhancer's output length
Takeaway: This is the most intellectually interesting single-pipeline pattern. When the evaluator dimensions are well-defined, watching prompts evolve across iterations is genuinely illuminating — it surfaces prompt engineering heuristics automatically.
Parallel — Concurrent Specialists
Notebook: Agentic_AI_Parallel_Pattern.ipynb
What It Is
An orchestrator fans out the same input to multiple specialist agents concurrently, then synthesizes their combined outputs into a final response. All specialist agents run at the same time — the orchestrator waits for all results before synthesizing.
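Because each specialist call is independent and I/O-bound, a thread pool is one natural way to express the fan-out. The specialists below are stubs, and the final synthesis (an LLM call in practice) is reduced to a join.

```python
from concurrent.futures import ThreadPoolExecutor

def vuln_analyst(app):   return f"vulns in {app}"           # specialist stubs
def compliance(app):     return f"compliance gaps in {app}"
def threat_modeler(app): return f"threats to {app}"

SPECIALISTS = [vuln_analyst, compliance, threat_modeler]

def assess(app_description):
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        # Same input fans out to every specialist concurrently.
        findings = list(pool.map(lambda s: s(app_description), SPECIALISTS))
    # The orchestrator's synthesis would be another LLM call; joined here.
    return "\n".join(findings)
```

`pool.map` preserves specialist order, which keeps the synthesis prompt deterministic even though execution is concurrent.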
What I Built
I built a security assessment scenario with three parallel specialists: a vulnerability analyst, a compliance reviewer, and a threat modeler. All three received the same application description, worked independently, and their findings were synthesized by the orchestrator into a unified risk report.
What Worked Well
- Total latency equals the slowest specialist's latency, not the sum of all — a big speedup
- Independent perspectives caught issues single-agent analysis missed
- Orchestrator synthesis was clean with structured specialist outputs
What to Watch Out For
- Specialists can produce contradictory findings — synthesis is non-trivial
- All agents share the same input — good for independent analysis, bad for sequential context
- Orchestrator prompt must handle variable-length, variable-quality specialist outputs
Takeaway: Parallel is the highest-leverage pattern for tasks that decompose into truly independent subtasks. In security contexts, running vulnerability, compliance, and threat analysis concurrently is a natural fit.
Coordinator — Intent Classification and Routing
Notebook: Agentic_AI_Coordinator_Pattern.ipynb
What It Is
A central coordinator classifies the user's intent using a Pydantic enum (structured output), routes the request to the appropriate specialist agent, and then synthesizes the specialist's response into the final output. Unlike Parallel (which fans out to all specialists), Coordinator picks exactly one.
What I Built
The coordinator classified incoming queries into one of four security domains (AppSec, CloudSec, DevSecOps, AI Security) using a strict enum — no free-form routing decisions. The right specialist then handled the query. The enum constraint was the key design choice: it made routing deterministic and auditable.
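The enum-constrained routing schema might look like this in Pydantic. The classification itself would come from a structured-output Gemini call; the point shown here is that validation rejects any label outside the taxonomy.

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class Domain(str, Enum):
    APPSEC = "appsec"
    CLOUDSEC = "cloudsec"
    DEVSECOPS = "devsecops"
    AI_SECURITY = "ai_security"

class Route(BaseModel):
    domain: Domain   # the enum makes free-form routing impossible

Route(domain="appsec")            # a valid label parses cleanly
try:
    Route(domain="network")       # outside the taxonomy
except ValidationError:
    print("rejected")
```

Adding a specialist means extending `Domain` — routing stays deterministic and auditable.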
What Worked Well
- Enum routing was reliable — no hallucinated specialist names
- Each specialist could be heavily specialized without generalist dilution
- Easy to add new specialists by extending the enum
What to Watch Out For
- Cross-domain queries (e.g., AI + AppSec) route to only one specialist
- Coordinator becomes a single point of failure
- Classification errors silently misdirect queries
Takeaway: The structured output enum for routing was the most important implementation detail — it transformed routing from a probabilistic LLM decision into a deterministic classification. Use this pattern for any system with a clear taxonomy of intent.
Hierarchical — Three-Tier Decomposition
Notebook: Agentic_AI_Hierarchical_Pattern.ipynb
What It Is
A three-tier architecture: a root coordinator decomposes the top-level task, mid-level coordinators manage domain subtasks, and worker agents execute leaf-level operations. This mirrors org charts, military command structures, and microservice architectures.
What I Built
A comprehensive security assessment pipeline: the root coordinator decomposed "assess this application" into three domains (infrastructure, code, compliance). Domain coordinators assigned subtasks to workers (e.g., port scanning, dependency analysis, policy checking). Workers returned results up the chain.
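A sketch of the three tiers, with the root plan as a static map and workers as stubs (task and domain names are illustrative); in the notebook, decomposition and assignment are themselves LLM calls.

```python
WORKERS = {                     # leaf-level worker stubs
    "port_scan": lambda: "22,443 open",
    "deps":      lambda: "1 outdated lib",
    "policy":    lambda: "2 policy gaps",
}

PLAN = {                        # root decomposition: domain -> subtasks
    "infrastructure": ["port_scan"],
    "code":           ["deps"],
    "compliance":     ["policy"],
}

def run_assessment():
    report = {}
    for domain, tasks in PLAN.items():                     # domain coordinators
        report[domain] = {t: WORKERS[t]() for t in tasks}  # workers execute
    return report                                          # results flow back up
```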
What Worked Well
- Naturally handles complex, multi-domain tasks
- Workers remain simple and focused — easy to test independently
- Mirrors real-world organizational structures intuitively
What to Watch Out For
- Most complex pattern to implement correctly
- Error propagation up the hierarchy is hard to handle gracefully
- Communication overhead grows with depth — more API calls per task
Takeaway: Hierarchical is the right pattern for enterprise-scale tasks with genuine domain decomposition. The coordination overhead is real — only worth it when the task complexity justifies the architecture complexity.
Swarm — Autonomous Agent Handoffs
Notebook: Agentic_AI_Swarm_Pattern.ipynb
What It Is
No central coordinator. Each agent autonomously decides which agent should handle the conversation next, communicates that via a structured handoff field in its output, and contributes to a shared history. The swarm terminates on consensus or when the iteration cap is reached.
What I Built
A vulnerability triage swarm with four agents: a detector, an analyzer, a prioritizer, and a remediator. Each agent decided whether to continue its own work or hand off to another. Shared history ensured all agents had full context of prior contributions. The handoff field was a Pydantic enum — structured, not free-form.
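The handoff mechanic in miniature: each agent returns a finding plus a structured "next agent" decision, and everyone appends to one shared history. Agent logic is stubbed here; the enum and the iteration cap are the load-bearing parts.

```python
from enum import Enum

class Next(str, Enum):
    DETECTOR = "detector"
    ANALYZER = "analyzer"
    PRIORITIZER = "prioritizer"
    REMEDIATOR = "remediator"
    DONE = "done"

# Stub agents: each reads the shared history, contributes a finding,
# and names its successor via the structured handoff field.
AGENTS = {
    Next.DETECTOR:    lambda hist: ("found SQLi", Next.ANALYZER),
    Next.ANALYZER:    lambda hist: ("exploitable", Next.PRIORITIZER),
    Next.PRIORITIZER: lambda hist: ("P1", Next.REMEDIATOR),
    Next.REMEDIATOR:  lambda hist: ("patch issued", Next.DONE),
}

def run_swarm(start=Next.DETECTOR, max_iterations=10):
    history, current = [], start
    for _ in range(max_iterations):   # the only hard stop in the system
        if current is Next.DONE:
            break
        finding, current = AGENTS[current](history)
        history.append(finding)
    return history
```

With real LLM agents the handoff sequence is emergent rather than fixed, which is exactly why the `max_iterations` guard is mandatory.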
What Worked Well
- No coordinator bottleneck — fully distributed decision-making
- Emergent coordination: agents naturally found efficient handoff sequences
- Shared history kept all agents coherent without explicit messaging
What to Watch Out For
- Agents can form handoff cycles — robust iteration cap is mandatory
- Hardest pattern to predict or audit — emergent behavior cuts both ways
- Shared history grows large — token cost escalates quickly
Takeaway: Swarm is the most fascinating and the most dangerous pattern to operate. The emergent coordination is real — but so is the emergent failure mode. The iteration cap is not optional; it's the only hard stop in a system with no central authority.
Review & Critic — Specialist Evaluation
Notebook: Agentic_AI_Review_Critic_Pattern.ipynb
What It Is
Similar to the Loop pattern but with a domain-expert critic rather than a generic quality evaluator. The generator produces output; the critic returns a structured evaluation with a numeric quality score and an approval flag. Rejected output loops back with the critic's feedback attached.
What I Built
A code security review scenario: a generator agent wrote Python code snippets, and a security-expert critic evaluated them specifically for OWASP vulnerabilities — not general code quality. The critic's Pydantic output included a score (0-10), an approved boolean, and a list of specific security findings. The generator used those findings as a diff-style patch list on the next cycle.
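The critic's output schema as described, sketched in Pydantic with field constraints enforcing the 0-10 score range; field names are illustrative.

```python
from pydantic import BaseModel, Field, ValidationError

class SecurityCritique(BaseModel):
    score: float = Field(ge=0, le=10)  # hard-bounded 0-10 quality score
    approved: bool                     # the hard gate
    findings: list[str]                # the generator's patch list

c = SecurityCritique(score=4, approved=False,
                     findings=["A03: SQL injection in query builder"])
try:
    SecurityCritique(score=11, approved=True, findings=[])  # out of range
except ValidationError:
    print("rejected")
```

The `ge`/`le` constraints mean an out-of-range score from the model is caught at the boundary instead of silently propagating into the loop's exit logic.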
What Worked Well
- Domain-expert critic caught security issues a generic critic missed
- Score + approval flag gave both a relative signal and a hard gate
- Structured finding list made generator improvements targeted
What to Watch Out For
- Critic expertise is prompt-dependent — weak system prompt = weak critique
- Generator can learn to game the critic without actually improving
- Convergence is slower for adversarial or complex domains
Takeaway: The difference between this and the Loop pattern is the critic's specificity. A generic critic improves writing; a security-expert critic surfaces OWASP violations. Domain expertise in the critic's system prompt is what makes this pattern powerful for security use cases.
Pattern Comparison at a Glance
| Pattern | Type | Complexity | Best For |
|---|---|---|---|
| ReAct | SAS | Low | Tool-use, multi-step reasoning |
| HITL | SAS | Low | High-stakes, human oversight required |
| Sequential | MAS | Low | Ordered, non-branching workflows |
| Loop | MAS | Low-Med | Automated quality refinement |
| Iterative Refinement | MAS | Medium | Prompt optimization, evolving criteria |
| Parallel | MAS | Medium | Independent concurrent analysis |
| Coordinator | MAS | Medium | Intent classification and routing |
| Hierarchical | MAS | High | Complex multi-domain decomposition |
| Swarm | MAS | High | Fully distributed, emergent coordination |
| Review & Critic | MAS | Med-High | Domain-expert quality gating |
Cross-Cutting Observations
Structured Outputs Are Not Optional
Every pattern that involves routing, evaluation, or handoffs depends on Pydantic-validated structured outputs. Free-form LLM text for these decisions is unreliable. The moment I switched from parsing free-form responses to Pydantic enums and models, error rates dropped to near-zero.
Temperature Is a First-Class Parameter
Evaluators, classifiers, and critics need low temperature (0.1–0.2) for consistency. Generators, writers, and creative agents benefit from higher temperature. Treating all agents identically produces worse results across the board. Every pattern had its own temperature profile.
Iteration Caps Are Safety Infrastructure
max_iterations is not a tuning parameter — it's a hard safety constraint. In Loop, Iterative Refinement, and Swarm patterns, an uncapped system will eventually burn through your token budget. Set it conservatively and treat hitting the cap as an alertable event, not a normal exit condition.
Context Accumulation Has Diminishing Returns
String-based memory accumulation works, but the signal-to-noise ratio degrades as context grows. In longer runs of the Loop and Swarm patterns, early iterations had more influence on later outputs than they should have. Sliding window context or summarization are worth implementing for production use.
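One simple mitigation can be sketched as follows: keep the last N entries verbatim and collapse everything older into a summary slot. The summarization step would be an LLM call in practice; truncation stands in here.

```python
def window_context(entries, keep_last=3):
    """Keep the last N entries verbatim; collapse older ones to a summary."""
    if len(entries) <= keep_last:
        return entries
    older, recent = entries[:-keep_last], entries[-keep_last:]
    # In practice this would be an LLM summarization call; truncate here.
    summary = "SUMMARY: " + "; ".join(e[:20] for e in older)
    return [summary] + recent

print(window_context(["a", "b", "c", "d", "e"]))
# → ['SUMMARY: a; b', 'c', 'd', 'e']
```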
Pattern Complexity Does Not Correlate With Output Quality
The Swarm pattern produced some of the most interesting outputs and some of the worst. Sequential produced consistent, predictable results. Match the pattern to the problem structure, not to what sounds architecturally impressive. Start with the simplest pattern that fits the task.
A Security Practitioner's Lens
As an AppSec practitioner, I couldn't build these patterns without thinking about their threat surfaces. A few observations:
- Prompt injection risk scales with complexity. In Hierarchical and Swarm patterns, a compromised agent can influence downstream agents through shared history or handoff fields. Each tier is an injection surface.
- Structured outputs reduce injection attack surface. Pydantic validation rejects malformed outputs before they reach the next agent. This is a concrete security benefit, not just an engineering nicety.
- HITL is a security control, not just a UX feature. In agentic systems with real-world consequences (code execution, API calls, financial transactions), human checkpoints are a last-resort guardrail. Design for HITL before you remove it.
- Swarm's lack of central authority is a governance problem. With no coordinator logging decisions, auditing a Swarm execution requires reconstructing the handoff chain from shared history. Build audit logging in from the start.
What's Next
The notebooks in this repo are a foundation, not a finish line. Next explorations:
- Red-teaming these patterns — specifically testing prompt injection across agent boundaries in Hierarchical and Swarm architectures
- Sliding window memory — replacing string accumulation with summarized context windows for long-running agents
- MCP tool integration — wiring real tools (file system, search, code execution) into the ReAct and Coordinator patterns
- Cross-pattern composition — running a Coordinator that routes to a Parallel fan-out, which feeds into a Loop refinement cycle
Run the Notebooks
All ten notebooks are open-source, self-contained, and runnable with a free Gemini API key. Clone the repo, install the two dependencies, and work through them in order — SAS patterns first, then MAS.
```shell
pip install google-genai pydantic
```