Overview

DagNats provides the infrastructure primitives that LLM agent pipelines need but ad-hoc scripts lack: durability, bounded execution, checkpointing, and observability.

The Problem with Raw API Calls

Most LLM integrations start as a script: call the API, parse the response, maybe loop a few times. This works until it does not:

  • No retry on failure. A transient 429 or 500 kills the entire run. You write retry logic. Then backoff. Then jitter.
  • No state persistence. If the process dies mid-conversation, you lose everything and start over. If a 20-iteration agent loop dies near the end, nearly 20 API calls are wasted.
  • No execution bounds. An agent loop with no iteration cap burns tokens indefinitely. You add a counter. Then a timeout. Then a cost tracker.
  • No observability. When something goes wrong at 2 AM, you grep logs. No traces, no metrics, no structured events.

Each of these is solvable individually. Solving all of them together, reliably, across multiple agents running concurrently, amounts to building a workflow engine.
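To make the first bullet concrete, here is a minimal sketch of the retry logic a raw script tends to accumulate: capped exponential backoff with full jitter. It is plain Go, not DagNats code. The names `retry`, `backoffDelay`, and `errTransient` are hypothetical, and the error classification is deliberately simplistic.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errTransient is a hypothetical stand-in for a retryable failure
// such as an HTTP 429 or 500 from a model API.
var errTransient = errors.New("transient failure")

// backoffDelay returns base doubled once per prior attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// retry calls fn up to maxAttempts times. Between failed attempts it sleeps
// for a random duration in [0, backoffDelay] (full jitter). It returns the
// number of attempts made and the last error. Non-transient errors are
// returned immediately without retrying.
func retry(maxAttempts int, base, max time.Duration, fn func() error) (int, error) {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return attempt + 1, nil
		}
		if !errors.Is(err, errTransient) {
			return attempt + 1, err
		}
		if attempt < maxAttempts-1 {
			d := backoffDelay(attempt, base, max)
			time.Sleep(time.Duration(rand.Int63n(int64(d) + 1)))
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := retry(5, time.Millisecond, 20*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

This covers only one of the four bullets, and it still has no persistence, no global bounds, and no tracing, which is the point of the list above.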

What DagNats Provides

| Concern | Raw Script | DagNats |
| --- | --- | --- |
| Retry on failure | Manual retry loops | Retry policies with configurable backoff |
| State persistence | None (or ad-hoc files) | Checkpoints in NATS KV |
| Execution bounds | Manual counters | MaxIterations, MaxDuration on agent loops |
| Human intervention | Not possible mid-run | Signals and approval gates |
| Parallel execution | threading/async | Map steps with bounded concurrency |
| Dynamic planning | Hardcoded pipelines | Planner steps generate DAG fragments |
| Observability | print statements | Distributed traces, structured events, metrics |
| Cost control | Hope | Iteration caps, timeouts, rate limits |

Core Primitives for LLM Pipelines

DagNats was not designed exclusively for AI. But the primitives it provides map directly to LLM agent requirements:

Agent Loops solve the variable-iteration problem. An LLM agent reasons in cycles (prompt, tool call, observe, decide). The number of cycles is unknown at design time. StepTypeAgentLoop with Continue() handles this natively with MaxIterations and MaxDuration as hard bounds.
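The bounded-loop idea can be sketched in plain Go as follows. This is an illustration of the semantics only, not the DagNats API: `Bounds`, `StepFunc`, and `runBounded` are hypothetical names, and the boolean return value merely stands in for whatever Continue() signals in the real engine.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Bounds mirrors the two hard limits named above. The struct itself is a
// hypothetical illustration, not a DagNats type.
type Bounds struct {
	MaxIterations int
	MaxDuration   time.Duration
}

var (
	ErrMaxIterations = errors.New("max iterations reached")
	ErrMaxDuration   = errors.New("max duration exceeded")
)

// StepFunc runs one reasoning cycle and reports whether the agent wants
// another cycle.
type StepFunc func(iteration int) (again bool, err error)

// runBounded runs step until it asks to stop, fails, or hits a bound.
// It returns the number of cycles executed. The deadline is checked only
// between cycles, so a single slow cycle can overrun MaxDuration.
func runBounded(b Bounds, step StepFunc) (int, error) {
	deadline := time.Now().Add(b.MaxDuration)
	for i := 0; i < b.MaxIterations; i++ {
		if time.Now().After(deadline) {
			return i, ErrMaxDuration
		}
		again, err := step(i)
		if err != nil {
			return i + 1, err
		}
		if !again {
			return i + 1, nil
		}
	}
	return b.MaxIterations, ErrMaxIterations
}

func main() {
	b := Bounds{MaxIterations: 5, MaxDuration: 2 * time.Second}
	// An agent that never decides it is done gets cut off by the cap.
	n, err := runBounded(b, func(i int) (bool, error) { return true, nil })
	fmt.Println("cycles:", n, "err:", err)
}
```

Returning a distinct error for each bound lets the caller tell "the agent finished" apart from "the agent was stopped", which matters for cost reporting.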

Checkpoints solve the state persistence problem. Conversation history, tool results, and intermediate reasoning are saved to NATS KV after each iteration. After a crash, only the interrupted iteration is replayed, not the entire conversation.
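A minimal sketch of checkpoint-and-resume, assuming a key-value store with simple Put and Get operations. The `KV` interface, the in-memory `memKV`, the `State` layout, and `runWithCheckpoints` are all hypothetical stand-ins chosen for illustration; DagNats stores checkpoints in NATS KV and its actual API may look quite different.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// State is the data saved after every iteration (hypothetical layout).
type State struct {
	Iteration int      `json:"iteration"`
	History   []string `json:"history"`
}

// KV is a tiny stand-in for a key-value store.
type KV interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, bool)
}

// memKV is an in-memory KV used only for this example.
type memKV map[string][]byte

func (m memKV) Put(k string, v []byte) error { m[k] = v; return nil }
func (m memKV) Get(k string) ([]byte, bool)  { v, ok := m[k]; return v, ok }

var errCrash = errors.New("simulated crash")

// runWithCheckpoints resumes from the last saved State, if any, and runs the
// remaining iterations, saving a checkpoint after each successful turn.
func runWithCheckpoints(kv KV, key string, total int, turn func(i int) (string, error)) (State, error) {
	var s State
	if raw, ok := kv.Get(key); ok {
		if err := json.Unmarshal(raw, &s); err != nil {
			return s, fmt.Errorf("decode checkpoint: %w", err)
		}
	}
	for s.Iteration < total {
		out, err := turn(s.Iteration)
		if err != nil {
			return s, err // the last checkpoint is still intact
		}
		s.History = append(s.History, out)
		s.Iteration++
		raw, err := json.Marshal(s)
		if err != nil {
			return s, err
		}
		if err := kv.Put(key, raw); err != nil {
			return s, err
		}
	}
	return s, nil
}

func main() {
	kv := memKV{}
	calls, crashed := 0, false
	turn := func(i int) (string, error) {
		calls++
		if i == 3 && !crashed {
			crashed = true
			return "", errCrash // fail once on iteration 3
		}
		return fmt.Sprintf("turn %d", i), nil
	}
	_, err := runWithCheckpoints(kv, "agent-1", 5, turn)
	fmt.Println("first run:", err)
	s, err := runWithCheckpoints(kv, "agent-1", 5, turn)
	fmt.Println("resumed at end:", s.Iteration, "turn calls:", calls, "err:", err)
}
```

In this run the turn function is called six times in total rather than nine: iterations 0 to 2 are not repeated after the crash, only iteration 3 is replayed.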

Signals solve the human-in-the-loop problem. A running agent can pause and wait for human input via WaitForSignal(). An external system (CLI, API, Slack bot) sends the signal when the human responds.
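The pause-and-wait behaviour can be approximated in plain Go with a channel and a context deadline. This sketch is not the DagNats WaitForSignal() signature, which is not shown on this page. `Signal`, `awaitSignal`, and `awaitApproval` are hypothetical names, and the channel stands in for whatever transport delivers signals in the real engine.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Signal is a hypothetical message sent by an external system
// (CLI, API, Slack bot) to a paused agent.
type Signal struct {
	Name    string
	Payload string
}

var errInboxClosed = errors.New("signal inbox closed")

// awaitSignal blocks until a signal with the given name arrives, the inbox
// is closed, or ctx is done. Signals with other names are discarded.
func awaitSignal(ctx context.Context, inbox <-chan Signal, name string) (Signal, error) {
	for {
		select {
		case <-ctx.Done():
			return Signal{}, ctx.Err()
		case s, ok := <-inbox:
			if !ok {
				return Signal{}, errInboxClosed
			}
			if s.Name == name {
				return s, nil
			}
		}
	}
}

// awaitApproval waits up to timeout for an "approval" signal and returns
// its payload.
func awaitApproval(inbox <-chan Signal, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	s, err := awaitSignal(ctx, inbox, "approval")
	if err != nil {
		return "", err
	}
	return s.Payload, nil
}

func main() {
	inbox := make(chan Signal, 1)
	go func() {
		time.Sleep(10 * time.Millisecond) // the human takes a moment to respond
		inbox <- Signal{Name: "approval", Payload: "approved"}
	}()
	decision, err := awaitApproval(inbox, time.Second)
	fmt.Println("decision:", decision, "err:", err)
}
```

The timeout is a design choice worth keeping in any variant: an agent that waits forever for a human holds its resources forever.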

Planner Steps solve the dynamic workflow problem. An LLM can analyze a task, generate a plan as a JSON DAG fragment, and the engine executes it. No predefined step graph required.
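To illustrate what executing a generated plan involves, the sketch below parses a JSON DAG fragment, validates it, and derives an execution order with Kahn's topological sort. The JSON schema (`steps`, `id`, `depends_on`) and the function `executionOrder` are assumptions made for this example. The actual fragment format DagNats expects is not specified here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// PlanStep and Plan describe a hypothetical plan schema.
type PlanStep struct {
	ID        string   `json:"id"`
	DependsOn []string `json:"depends_on"`
}

type Plan struct {
	Steps []PlanStep `json:"steps"`
}

// executionOrder validates the plan and returns step IDs in an order that
// respects every dependency. It rejects duplicate IDs, unknown
// dependencies, and cycles.
func executionOrder(raw []byte) ([]string, error) {
	var p Plan
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("parse plan: %w", err)
	}
	indegree := make(map[string]int, len(p.Steps))
	dependents := make(map[string][]string)
	for _, s := range p.Steps {
		if _, dup := indegree[s.ID]; dup {
			return nil, fmt.Errorf("duplicate step %q", s.ID)
		}
		indegree[s.ID] = 0
	}
	for _, s := range p.Steps {
		for _, dep := range s.DependsOn {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("step %q depends on unknown step %q", s.ID, dep)
			}
			indegree[s.ID]++
			dependents[dep] = append(dependents[dep], s.ID)
		}
	}
	var queue, order []string
	for _, s := range p.Steps {
		if indegree[s.ID] == 0 {
			queue = append(queue, s.ID)
		}
	}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		order = append(order, id)
		for _, next := range dependents[id] {
			indegree[next]--
			if indegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}
	if len(order) != len(p.Steps) {
		return nil, errors.New("plan contains a cycle")
	}
	return order, nil
}

func main() {
	plan := []byte(`{"steps":[
		{"id":"search"},
		{"id":"fetch","depends_on":["search"]},
		{"id":"summarize","depends_on":["fetch","search"]}
	]}`)
	order, err := executionOrder(plan)
	fmt.Println(order, err)
}
```

Validation matters more here than in a hand-written pipeline, because the plan comes from a model and may be malformed or cyclic.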

Streaming solves the real-time feedback problem. PutStream() publishes tokens as they arrive from the model API. Clients subscribe for live output without waiting for the full response.
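The publish/subscribe shape of token streaming can be sketched with in-process channels. This is an analogy rather than the DagNats mechanism: `TokenStream`, `Subscribe`, `Publish`, and `collect` are hypothetical names, and the real PutStream() presumably publishes over NATS rather than Go channels.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// TokenStream fans tokens out to every subscriber (hypothetical type).
type TokenStream struct {
	mu     sync.Mutex
	subs   []chan string
	closed bool
}

// Subscribe returns a channel that receives every token published after
// the call. Subscribing to a closed stream yields an already-closed channel.
func (s *TokenStream) Subscribe(buffer int) <-chan string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan string, buffer)
	if s.closed {
		close(ch)
		return ch
	}
	s.subs = append(s.subs, ch)
	return ch
}

// Publish delivers one token to all subscribers. In this simple sketch a
// subscriber with a full buffer blocks the publisher.
func (s *TokenStream) Publish(token string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return
	}
	for _, ch := range s.subs {
		ch <- token
	}
}

// Close marks the end of the response and closes all subscriber channels.
func (s *TokenStream) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return
	}
	s.closed = true
	for _, ch := range s.subs {
		close(ch)
	}
}

// collect drains a subscription and joins the tokens into one string.
func collect(ch <-chan string) string {
	var b strings.Builder
	for tok := range ch {
		b.WriteString(tok)
	}
	return b.String()
}

func main() {
	s := &TokenStream{}
	sub := s.Subscribe(16)
	done := make(chan string)
	go func() { done <- collect(sub) }()
	for _, tok := range []string{"Hel", "lo", ", ", "world"} {
		s.Publish(tok) // tokens arrive one at a time from the model API
	}
	s.Close()
	fmt.Println(<-done)
}
```

A production design would need a policy for slow subscribers (drop, buffer, or disconnect), which this sketch sidesteps by blocking.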

Section Guide

The pages in this section show how to compose these primitives into practical LLM agent patterns:

| Page | Pattern |
| --- | --- |
| Agent Loop Pattern | Core reasoning cycle with checkpointed state |
| Tool Use as Steps | Each LLM tool call as a typed DAG step |
| Planner Pattern | LLM generates DAG dynamically |
| Human in the Loop | Approval gates and signal-based review |
| Context Management | Conversation state across iterations and retries |
| Multi-Agent Orchestration | Delegation, fan-out, inter-agent communication |
| Cost and Safety Controls | Bounded execution, rate limits, spend caps |
| Prompt and Response Schemas | Typed I/O validation for LLM handlers |

Each page includes working Go code patterns that reference actual DagNats APIs.