Overview

DagNats provides the infrastructure primitives that LLM agent pipelines need but ad-hoc scripts lack: durability, bounded execution, checkpointing, and observability.

The Problem with Raw API Calls

Most LLM integrations start as a script: call the API, parse the response, maybe loop a few times. This works until it does not:

No retry on failure. A transient 429 or 500 kills the entire run. You write retry logic. Then backoff. Then jitter.
No state persistence. If the process dies mid-conversation, you lose everything and start over. For a 20-iteration agent loop, that can mean up to 20 wasted API calls.
No execution bounds. An agent loop with no iteration cap burns tokens indefinitely. You add a counter. Then a timeout. Then a cost tracker.
No observability. When something goes wrong at 2 AM, you grep logs. No traces, no metrics, no structured events.

Each of these is solvable individually. Solving all of them together, reliably, across multiple agents running concurrently, means building a workflow engine.
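To make the first item in the list concrete, here is a minimal sketch in plain Go of the retry, backoff, and jitter logic a script tends to accumulate. It does not use any DagNats API. The names Retry, ErrTransient, and maxBackoff are hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"math/rand"
	"time"
)

// ErrTransient marks failures worth retrying (for example an HTTP 429 or 500).
// Hypothetical sentinel, not part of any library.
var ErrTransient = errors.New("transient failure")

// maxBackoff caps the sleep ceiling so the shift below cannot overflow.
const maxBackoff = 30 * time.Second

// Retry calls fn up to maxAttempts times. Between attempts it sleeps for a
// random duration in [0, base * 2^attempt), which is exponential backoff
// with full jitter. It returns early on success or on a non-transient error.
func Retry(maxAttempts int, base time.Duration, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	if base <= 0 {
		base = time.Millisecond
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = fn()
		if err == nil || !errors.Is(err, ErrTransient) {
			return err
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		ceiling := base << attempt
		if ceiling <= 0 || ceiling > maxBackoff {
			ceiling = maxBackoff
		}
		time.Sleep(time.Duration(rand.Int63n(int64(ceiling))))
	}
	return err
}
```

A caller would wrap each model request, for example Retry(5, 200*time.Millisecond, callModel), where callModel is whatever function performs the request and wraps retryable failures in ErrTransient. This covers only one of the four problems, and it still loses all progress if the process itself dies.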

What DagNats Provides

Concern	Raw Script	DagNats
Retry on failure	Manual retry loops	Retry policies with configurable backoff
State persistence	None (or ad-hoc files)	Checkpoints in NATS KV
Execution bounds	Manual counters	MaxIterations, MaxDuration on agent loops
Human intervention	Not possible mid-run	Signals and approval gates
Parallel execution	threading/async	Map steps with bounded concurrency
Dynamic planning	Hardcoded pipelines	Planner steps generate DAG fragments
Observability	print statements	Distributed traces, structured events, metrics
Cost control	Hope	Iteration caps, timeouts, rate limits

Core Primitives for LLM Pipelines

DagNats was not designed exclusively for AI, but the primitives it provides map directly to LLM agent requirements:

Agent Loops solve the variable-iteration problem. An LLM agent reasons in cycles (prompt, tool call, observe, decide). The number of cycles is unknown at design time. StepTypeAgentLoop with Continue() handles this natively with MaxIterations and MaxDuration as hard bounds.
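The idea of hard bounds on a variable-length loop can be sketched in plain Go as follows. This is not the DagNats implementation: Bounds, RunLoop, and the two error values are hypothetical names, and only the field names MaxIterations and MaxDuration are borrowed from the text above.

```go
package main

import (
	"errors"
	"time"
)

var (
	ErrMaxIterations = errors.New("agent loop: iteration cap reached")
	ErrMaxDuration   = errors.New("agent loop: duration cap reached")
)

// Bounds holds the two hard limits. Hypothetical type, not the DagNats one.
type Bounds struct {
	MaxIterations int
	MaxDuration   time.Duration
}

// RunLoop calls step until it reports done, returns an error, or a bound
// trips. It returns the number of iterations that actually ran.
func RunLoop(b Bounds, step func(iteration int) (done bool, err error)) (int, error) {
	deadline := time.Now().Add(b.MaxDuration)
	for i := 0; i < b.MaxIterations; i++ {
		// The deadline is checked between iterations only; a step that is
		// already running is not interrupted.
		if time.Now().After(deadline) {
			return i, ErrMaxDuration
		}
		done, err := step(i)
		if err != nil {
			return i + 1, err
		}
		if done {
			return i + 1, nil
		}
	}
	return b.MaxIterations, ErrMaxIterations
}
```

The step function plays the role of one reasoning cycle: returning done == false corresponds to asking for another iteration, and the loop refuses once either cap is hit.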

Checkpoints solve the state persistence problem. Conversation history, tool results, and intermediate reasoning are saved to KV after each iteration. A crash replays only the current iteration, not the entire conversation.
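A rough sketch of the save-after-each-iteration idea, assuming an in-memory map in place of a real KV bucket. Store, State, Load, Save, and Run are hypothetical names for illustration and are not DagNats or NATS APIs.

```go
package main

import "encoding/json"

// Store is a hypothetical stand-in for a key-value bucket such as NATS KV.
type Store map[string][]byte

// State is the data an agent carries between iterations.
type State struct {
	Iteration int      `json:"iteration"`
	Messages  []string `json:"messages"`
}

// Load returns the last checkpoint for key, or a zero State if none exists.
func Load(s Store, key string) (State, error) {
	var st State
	raw, ok := s[key]
	if !ok {
		return st, nil
	}
	err := json.Unmarshal(raw, &st)
	return st, err
}

// Save overwrites the checkpoint for key.
func Save(s Store, key string, st State) error {
	raw, err := json.Marshal(st)
	if err != nil {
		return err
	}
	s[key] = raw
	return nil
}

// Run resumes from the last checkpoint and saves after every iteration,
// so a failure costs at most the iteration that was in flight.
func Run(s Store, key string, total int, think func(i int) (string, error)) error {
	st, err := Load(s, key)
	if err != nil {
		return err
	}
	for st.Iteration < total {
		msg, err := think(st.Iteration)
		if err != nil {
			return err // the checkpoint still holds every completed iteration
		}
		st.Messages = append(st.Messages, msg)
		st.Iteration++
		if err := Save(s, key, st); err != nil {
			return err
		}
	}
	return nil
}
```

If think fails on iteration 3 of 5, calling Run again with the same store and key starts at iteration 3 and makes only two further calls, instead of repeating the first three.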

Signals solve the human-in-the-loop problem. A running agent can pause and wait for human input via WaitForSignal(). An external system (CLI, API, Slack bot) sends the signal when the human responds.
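The pause-and-wait behavior can be illustrated in-process with a buffered channel. This is only a sketch of the concept: in DagNats the signal crosses process boundaries, whereas the hypothetical Gate type below lives in a single process and is not related to the WaitForSignal() implementation.

```go
package main

import (
	"errors"
	"time"
)

var ErrSignalTimeout = errors.New("timed out waiting for signal")

// Gate is a hypothetical in-process stand-in for a named signal.
type Gate struct{ ch chan string }

// NewGate returns a gate that can hold one pending payload.
func NewGate() *Gate { return &Gate{ch: make(chan string, 1)} }

// Send delivers a payload without blocking. It reports false if a payload
// is already pending and has not been consumed yet.
func (g *Gate) Send(payload string) bool {
	select {
	case g.ch <- payload:
		return true
	default:
		return false
	}
}

// Wait blocks until a payload arrives or the timeout elapses.
func (g *Gate) Wait(timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case p := <-g.ch:
		return p, nil
	case <-timer.C:
		return "", ErrSignalTimeout
	}
}
```

The agent side calls Wait and stays parked; the CLI, API handler, or bot calls Send when the human responds. The timeout keeps a forgotten approval from holding the run open forever.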

Planner Steps solve the dynamic workflow problem. An LLM analyzes a task and generates a plan as a JSON DAG fragment, which the engine then executes. No predefined step graph is required.
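As an illustration of what executing a generated plan involves, the sketch below parses a JSON plan, validates it, and orders the steps with Kahn's algorithm. The JSON shape (a steps array with id and depends_on fields) and the names PlanStep, Plan, and Order are assumptions made for this example; the source does not specify the fragment schema DagNats uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PlanStep and Plan describe a hypothetical plan format.
type PlanStep struct {
	ID        string   `json:"id"`
	DependsOn []string `json:"depends_on"`
}

type Plan struct {
	Steps []PlanStep `json:"steps"`
}

// Order parses a plan and returns step IDs in an order that respects
// dependencies. It rejects duplicate IDs, unknown dependencies, and cycles.
func Order(raw []byte) ([]string, error) {
	var p Plan
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("decode plan: %w", err)
	}
	pending := make(map[string]int, len(p.Steps)) // unmet dependencies per step
	for _, s := range p.Steps {
		if _, dup := pending[s.ID]; dup {
			return nil, fmt.Errorf("duplicate step %q", s.ID)
		}
		pending[s.ID] = len(s.DependsOn)
	}
	dependents := make(map[string][]string)
	for _, s := range p.Steps {
		for _, d := range s.DependsOn {
			if _, ok := pending[d]; !ok {
				return nil, fmt.Errorf("step %q depends on unknown step %q", s.ID, d)
			}
			dependents[d] = append(dependents[d], s.ID)
		}
	}
	// Kahn's algorithm: repeatedly emit steps with no unmet dependencies.
	var ready, order []string
	for _, s := range p.Steps {
		if pending[s.ID] == 0 {
			ready = append(ready, s.ID)
		}
	}
	for len(ready) > 0 {
		id := ready[0]
		ready = ready[1:]
		order = append(order, id)
		for _, next := range dependents[id] {
			pending[next]--
			if pending[next] == 0 {
				ready = append(ready, next)
			}
		}
	}
	if len(order) != len(p.Steps) {
		return nil, fmt.Errorf("plan contains a dependency cycle")
	}
	return order, nil
}
```

Validating before executing matters more for generated plans than for hand-written ones, because a model can emit a dangling dependency or a cycle that no human author would have written.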

Streaming solves the real-time feedback problem. PutStream() publishes tokens as they arrive from the model API. Clients subscribe for live output without waiting for the full response.
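The publish and subscribe relationship can be sketched in-process with channels. The Broadcaster type below is hypothetical and only illustrates the shape of the interaction; it is not how PutStream() is implemented, and a real deployment would carry tokens over the network.

```go
package main

import "sync"

// Broadcaster is a hypothetical in-process stand-in for a token stream.
type Broadcaster struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe returns a channel that receives every token published after
// the call. The buffer size bounds how far a slow reader may fall behind.
func (b *Broadcaster) Subscribe(buffer int) <-chan string {
	ch := make(chan string, buffer)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers one token to every subscriber. It blocks while any
// subscriber's buffer is full.
func (b *Broadcaster) Publish(token string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		ch <- token
	}
}

// Close signals end of stream by closing every subscriber channel.
func (b *Broadcaster) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		close(ch)
	}
	b.subs = nil
}
```

The handler calls Publish once per token as the model API returns it, and each client ranges over its channel until Close ends the stream. A blocking Publish is the simplest policy; dropping tokens for slow readers is the usual alternative when the producer must never stall.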

Section Guide

The pages in this section show how to compose these primitives into practical LLM agent patterns:

Page	Pattern
Agent Loop Pattern	Core reasoning cycle with checkpointed state
Tool Use as Steps	Each LLM tool call as a typed DAG step
Planner Pattern	LLM generates DAG dynamically
Human in the Loop	Approval gates and signal-based review
Context Management	Conversation state across iterations and retries
Multi-Agent Orchestration	Delegation, fan-out, inter-agent communication
Cost and Safety Controls	Bounded execution, rate limits, spend caps
Prompt and Response Schemas	Typed I/O validation for LLM handlers

Each page includes working Go code patterns that reference actual DagNats APIs.

Agent Loop Pattern