NATS Infrastructure

DagNats provisions all its NATS resources automatically on startup via natsutil.SetupAll.

JetStream Streams

Seven streams handle all durable messaging. dagnats serve creates these on first boot; distributed deployments must ensure they exist before components start.

| Stream | Subjects | Retention | Storage | Limits | Purpose |
| --- | --- | --- | --- | --- | --- |
| WORKFLOW_HISTORY | history.> | Limits | File | 5s dedup window | Immutable event log (source of truth) |
| TASK_QUEUES | task.> | WorkQueue | File | Atomic publish enabled | Work distribution to workers |
| EVENTS | event.> | Limits | File | | External event ingestion |
| DEAD_LETTERS | dead.> | Limits | File | 30-day retention | Permanent failures for inspection |
| TELEMETRY | telemetry.> | Limits | File | 7-day, 1 GiB max, 5s dedup | Spans, metrics, logs |
| SLEEP_TIMERS | sleep.>, scheduled.> | Limits | File | | Durable timers (sleep, timeout, rate-retry, scheduled runs) |
| STICKY_TASKS | sticky.> | Limits | Memory | 30-minute max age | Worker-affinity task routing |

WORKFLOW_HISTORY

The event log. Every state change for every workflow run is published as an immutable event to history.{runID}. Each event carries a Nats-Msg-Id header, so within the 5-second dedup window the server discards a retried publish that reuses the same ID. The orchestrator replays this stream on startup to rebuild in-memory actor state.
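The dedup behaviour can be pictured with a small in-memory model. This is only a sketch of the idea: JetStream tracks message IDs inside the server, and the names dedupWindow, accept, and eventMsgID, as well as the "runID:seq" ID layout, are assumptions made for illustration rather than DagNats code.

```go
package main

import (
	"fmt"
	"time"
)

// dedupWindow is an illustrative model of a duplicate window.
// The real tracking happens inside the NATS server.
type dedupWindow struct {
	window time.Duration
	seen   map[string]time.Duration // message ID -> offset at which it was stored
}

func newDedupWindow(window time.Duration) *dedupWindow {
	return &dedupWindow{window: window, seen: make(map[string]time.Duration)}
}

// accept reports whether a message with this ID would be stored.
// at is the elapsed time since an arbitrary start, which keeps the model deterministic.
func (d *dedupWindow) accept(id string, at time.Duration) bool {
	if first, ok := d.seen[id]; ok && at-first < d.window {
		return false // same ID inside the window: treated as a duplicate
	}
	d.seen[id] = at
	return true
}

// eventMsgID builds a deterministic ID so that a retried publish reuses it.
// The "runID:seq" layout is an assumption, not the format DagNats uses.
func eventMsgID(runID string, seq uint64) string {
	return fmt.Sprintf("%s:%d", runID, seq)
}

func main() {
	d := newDedupWindow(5 * time.Second)
	id := eventMsgID("run-42", 7)
	fmt.Println(d.accept(id, 0))             // true: first publish is stored
	fmt.Println(d.accept(id, 2*time.Second)) // false: retry inside the window
	fmt.Println(d.accept(id, 6*time.Second)) // true: the window has passed
}
```

The third call shows the limit of the mechanism: a retry that arrives after the window is stored again, so the ID only protects against duplicates from short-lived retries.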

TASK_QUEUES

Work distribution. The stream uses WorkQueuePolicy, so each message is consumed by a single consumer and removed from the stream once it is acknowledged. Tasks are published to task.{taskType}, and workers create pull consumers filtered to the task types they handle. Atomic publish is enabled for fan-out operations (map steps).

SLEEP_TIMERS

A shared timer stream using an action discriminator in the message payload:

| Action | Fires After | Result |
| --- | --- | --- |
| sleep_complete | Sleep duration | Publishes step.sleep.completed event |
| wait_timeout | Wait-for-event timeout | Publishes step.wait.timeout event |
| rate_retry | Rate limit refill delay | Re-publishes task to task.> |
| debounce_fire | Debounce window | Fires debounced trigger |
| batch_fire | Batch timeout | Fires accumulated batch |
| retry_after | Requested delay | Re-publishes task for retry |

All timers use NATS NakWithDelay – the message is negatively acknowledged with a delay, and NATS redelivers it after the specified duration. No external timer service needed.
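A timer consumer built on this idea might look like the sketch below. It assumes a payload with an action field and a fire_at_unix timestamp, and it assumes the handler decides between postponing and firing by comparing that timestamp with the current time; neither detail is confirmed by this page. The delivery interface lists only the calls the handler needs (the jetstream.Msg type in nats.go provides methods with these signatures), and fakeMsg exists solely so the example runs without a server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// timerMsg is a hypothetical payload layout; only the action values come from the table above.
type timerMsg struct {
	Action     string `json:"action"`
	FireAtUnix int64  `json:"fire_at_unix"`
}

// delivery is the subset of message methods the handler relies on.
type delivery interface {
	Data() []byte
	Ack() error
	NakWithDelay(delay time.Duration) error
}

// handleTimer postpones a timer that is not yet due and fires one that is.
func handleTimer(m delivery, nowUnix int64, fire func(action string) error) error {
	var t timerMsg
	if err := json.Unmarshal(m.Data(), &t); err != nil {
		return fmt.Errorf("decode timer: %w", err)
	}
	if wait := t.FireAtUnix - nowUnix; wait > 0 {
		// Not due yet: ask the server to redeliver after the remaining time.
		return m.NakWithDelay(time.Duration(wait) * time.Second)
	}
	if err := fire(t.Action); err != nil {
		return err // no ack, so the message stays eligible for redelivery
	}
	return m.Ack()
}

// fakeMsg records what the handler did so the sketch runs without NATS.
type fakeMsg struct {
	data     []byte
	acked    bool
	nakDelay time.Duration
}

func (f *fakeMsg) Data() []byte { return f.data }
func (f *fakeMsg) Ack() error   { f.acked = true; return nil }
func (f *fakeMsg) NakWithDelay(d time.Duration) error {
	f.nakDelay = d
	return nil
}

func main() {
	fire := func(action string) error {
		fmt.Println("fired:", action)
		return nil
	}
	m := &fakeMsg{data: []byte(`{"action":"sleep_complete","fire_at_unix":100}`)}

	_ = handleTimer(m, 40, fire) // 60 seconds early: postponed, not fired
	fmt.Println(m.nakDelay, m.acked)

	_ = handleTimer(m, 100, fire) // due: fired and acknowledged
	fmt.Println(m.acked)
}
```

Because the pending timer lives in the stream as an unacknowledged message, it survives a restart of the consumer, which is what lets this pattern work without an external timer service.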

KV Buckets

Sixteen KV buckets store workflow state, coordination data, and operational metadata.

| Bucket | TTL | History | Purpose |
| --- | --- | --- | --- |
| workflow_defs | | default | Immutable workflow definitions |
| workflow_runs | | default | Mutable run state snapshots |
| scheduled_runs | | default | One-shot scheduled workflow runs |
| workers | 60s | default | Worker directory (heartbeat) |
| event_waiters | | default | Wait-for-event correlation entries |
| rate_limits | | default | Token bucket state per task type |
| concurrency_tasks | | 1 | Per-task-type concurrency counters |
| approval_tokens | 7 days | 1 | Human approval gate tokens |
| debounce_state | 14 days | default | Subject trigger debounce windows |
| idempotency_keys | 24 hours | default | Workflow dedup key-to-runID mapping |
| sticky_bindings | ~25 hours | default | Run-to-worker affinity binding |
| singleton_locks | | default | Singleton execution locks |
| checkpoints | | default | Worker step state persistence |
| signals | | default | Cross-workflow KV-based signaling |
| triggers | | default | Trigger definitions |
| trigger_state | | default | Cron last-run timestamps |

Workers Bucket

The workers bucket has a 60-second TTL. Workers re-PUT their entry every 30 seconds, so a single missed heartbeat is tolerated. The engine never reads this bucket – it exists purely for observability (dagnats workers list). If the bucket is missing, workers function normally.

Concurrency Buckets

concurrency_tasks uses History: 1 to minimize storage for CAS-based counters. The engine checks these counters before dispatching tasks. If a limit has been reached, the task is retried via SLEEP_TIMERS with a 1-second delay.
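A compare-and-set counter of this kind can be sketched as follows. The revKV interface, the memKV stand-in, and tryAcquire are hypothetical names chosen for illustration; they model a revision-checked update, not the engine's actual code, and memKV is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

var errRevisionMismatch = errors.New("revision mismatch")

// revKV is the minimal compare-and-set surface this sketch needs (hypothetical).
type revKV interface {
	Get(key string) (value int, revision uint64)
	Update(key string, value int, lastRevision uint64) error
}

// memKV is an in-memory revKV used only to make the sketch runnable.
type memKV struct {
	vals map[string]int
	revs map[string]uint64
}

func newMemKV() *memKV {
	return &memKV{vals: map[string]int{}, revs: map[string]uint64{}}
}

func (m *memKV) Get(key string) (int, uint64) { return m.vals[key], m.revs[key] }

// Update writes value only if the caller saw the latest revision.
func (m *memKV) Update(key string, value int, lastRevision uint64) error {
	if m.revs[key] != lastRevision {
		return errRevisionMismatch
	}
	m.vals[key] = value
	m.revs[key]++
	return nil
}

// tryAcquire increments the counter for key unless it has already reached limit.
// It retries when another writer changed the entry between Get and Update.
func tryAcquire(kv revKV, key string, limit, maxAttempts int) (bool, error) {
	for i := 0; i < maxAttempts; i++ {
		n, rev := kv.Get(key)
		if n >= limit {
			return false, nil // limit reached: the caller schedules a delayed retry
		}
		err := kv.Update(key, n+1, rev)
		if err == nil {
			return true, nil
		}
		if !errors.Is(err, errRevisionMismatch) {
			return false, err
		}
	}
	return false, fmt.Errorf("acquire %q: too many conflicting writers", key)
}

func main() {
	kv := newMemKV()
	for i := 0; i < 3; i++ {
		ok, err := tryAcquire(kv, "summarize", 2, 5)
		fmt.Println(ok, err) // true <nil>, true <nil>, then false <nil>
	}
}
```

Only the latest value and its revision matter for this check, which is consistent with keeping a single history entry per key.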

Subject Hierarchy

All NATS subjects follow a dot-separated hierarchy. The > wildcard matches one or more trailing tokens, so task.> matches task.summarize but not task on its own.

history.{runID}                    # Workflow events
task.{taskType}                    # Task distribution
event.{eventType}                  # External events
dead.{runID}.{stepID}              # Dead letter entries
telemetry.spans                    # Trace spans
telemetry.metrics                  # Metrics
telemetry.logs                     # Log records
sleep.{runID}.{stepID}             # Timer messages
scheduled.{workflowName}           # Scheduled run triggers
sticky.{workerID}.{taskType}       # Worker-affinity tasks
stream.{runID}.{stepID}            # Real-time step output streaming
approval.{runID}.{stepID}          # Approval notifications

The subject design ensures that consumers can filter to exactly the messages they need. A worker subscribing to task.summarize only receives summarize tasks. The orchestrator subscribing to history.> receives all workflow events.
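To make the filtering rules concrete, here is a small matcher that follows NATS wildcard semantics: * matches exactly one token and > matches one or more trailing tokens. The function name subjectMatches is made up for this sketch. The server performs this matching itself, so the code is only a way to reason about which subjects a filter selects, and it does not validate malformed subjects.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether subject matches a NATS-style pattern.
// "*" matches exactly one token; ">" matches one or more trailing tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" must be the last pattern token and needs at least one token left.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false // subject ran out of tokens
		}
		if tok != "*" && tok != s[i] {
			return false // literal token mismatch
		}
	}
	return len(p) == len(s)
}

func main() {
	fmt.Println(subjectMatches("task.summarize", "task.summarize")) // true
	fmt.Println(subjectMatches("task.summarize", "task.translate")) // false
	fmt.Println(subjectMatches("history.>", "history.run-42"))      // true
	fmt.Println(subjectMatches("task.>", "task"))                   // false
	fmt.Println(subjectMatches("dead.*.step-1", "dead.run-42.step-1")) // true
}
```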

Resource Setup

On startup, natsutil.SetupAll(nc) calls:

  1. SetupStreams – creates the 5 core streams
  2. SetupKVBuckets – creates all KV buckets
  3. SetupTelemetryStream – creates the TELEMETRY stream
  4. SetupStickyStream – creates the STICKY_TASKS stream
  5. enableAtomicPublish – enables atomic batch publish on TASK_QUEUES

Each call uses CreateOrUpdateStream/CreateOrUpdateKeyValue, making them idempotent. Running dagnats serve multiple times against the same NATS data directory is safe.

All setup operations have a 30-second timeout. If NATS is unavailable, startup fails fast.
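The ordering and timeout behaviour can be sketched without a NATS connection. The example below is an illustration under stated assumptions: setupStep, runSetup, noop, and failing are hypothetical names, the steps are stand-ins for the real provisioning calls, and a single deadline shared by all steps is assumed because the page does not say whether the 30-second timeout applies per call or to the whole sequence.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// setupStep is a hypothetical stand-in for one provisioning call.
type setupStep struct {
	name string
	run  func(ctx context.Context) error
}

// runSetup executes steps in order under one shared deadline and stops at the
// first failure. It returns the names of the steps that completed.
func runSetup(timeout time.Duration, steps []setupStep) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var done []string
	for _, s := range steps {
		if err := ctx.Err(); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err) // deadline already passed
		}
		if err := s.run(ctx); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

// noop simulates an idempotent create-or-update call that succeeds.
func noop(name string) setupStep {
	return setupStep{name: name, run: func(context.Context) error { return nil }}
}

// failing simulates a step that cannot reach the server.
func failing(name string) setupStep {
	return setupStep{name: name, run: func(context.Context) error {
		return errors.New("simulated failure")
	}}
}

func main() {
	steps := []setupStep{
		noop("SetupStreams"),
		noop("SetupKVBuckets"),
		noop("SetupTelemetryStream"),
		noop("SetupStickyStream"),
		noop("enableAtomicPublish"),
	}
	// Running the same steps twice is harmless when every step is create-or-update.
	for i := 0; i < 2; i++ {
		done, err := runSetup(30*time.Second, steps)
		fmt.Println(len(done), err)
	}
}
```

Stopping at the first error means a partially provisioned server is reported at startup rather than discovered later by a component that expects a missing stream or bucket.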