
Circuit Breakers & Dead Letter Queue

Production workflows interact with external services that fail. Orch8 provides automatic circuit breaking to prevent cascade failures and a dead letter queue for instances that exhaust retries. Both are observable and actionable via the API.

Circuit Breakers

Every handler has an automatic circuit breaker. When a handler fails repeatedly, the circuit opens — subsequent calls are rejected immediately without contacting the external service. After a cooldown period, one probe request is allowed to test recovery.

Per-Handler

Each handler name has its own independent breaker. A failing Stripe handler doesn't affect the SendGrid handler.

Automatic

No configuration needed. Breakers are created on first use with registry defaults.

Observable

List all breaker states and manually reset them via the API when an outage is resolved.

Breaker States

Closed (normal)

All requests are allowed. Failures are counted. When consecutive failures reach failure_threshold, the circuit opens.

Open (tripped)

All requests are rejected immediately with a circuit-open error. Steps using this handler fail fast without making network calls. After cooldown_secs elapse, the circuit transitions to HalfOpen.

HalfOpen (probing)

One probe request is allowed through. If it succeeds, the circuit closes (resets failure count). If it fails, the circuit reopens for another cooldown period.

State Diagram

Closed   --(failures ≥ threshold)-->  Open
Open     --(cooldown elapsed)-->      HalfOpen
HalfOpen --(probe succeeds)-->        Closed
HalfOpen --(probe fails)-->           Open

Configuration

Parameter          Type  Description
failure_threshold  u32   Consecutive failures before opening. Registry default applied per handler.
cooldown_secs      u64   Seconds to wait in Open state before allowing a probe request.

Success resets everything: A single successful call (in Closed or HalfOpen state) resets the failure count to zero and transitions to Closed.

Breaker API

List All Breakers

GET /circuit-breakers

Response:
[
  {
    "handler": "http_request",
    "state": "closed",
    "failure_count": 0,
    "failure_threshold": 5,
    "cooldown_secs": 30,
    "opened_at": null
  },
  {
    "handler": "grpc://payments:50051/Pay.Charge",
    "state": "open",
    "failure_count": 5,
    "failure_threshold": 5,
    "cooldown_secs": 60,
    "opened_at": "2024-01-15T10:30:00Z"
  }
]

Get Single Breaker

GET /circuit-breakers/{handler_name}

Response: a single CircuitBreakerState object (404 if the handler has never been invoked)
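
A sketch of fetching breaker state, assuming the API is served at the base URL below. Handler names containing reserved characters (like the gRPC handler above) will likely need percent-encoding in the path:

ORCH8="http://localhost:8080"   # assumed base URL

# Plain handler names can go straight into the path.
curl -s "$ORCH8/circuit-breakers/http_request" | jq .

# Names with reserved characters (e.g. grpc://...) are percent-encoded here
# via jq's @uri; whether the server requires this is an assumption.
handler='grpc://payments:50051/Pay.Charge'
encoded=$(jq -rn --arg h "$handler" '$h | @uri')
curl -s "$ORCH8/circuit-breakers/$encoded" | jq '.state, .failure_count'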

Reset Breaker

POST /circuit-breakers/{handler_name}/reset

Response: 200 OK

// Immediately transitions to Closed state
// Clears failure count and opened_at
// Use after confirming external service is recovered
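
A quick reset-and-verify sketch, same assumed base URL:

ORCH8="http://localhost:8080"   # assumed base URL

# Reset, then confirm the breaker reports closed with a zeroed count.
curl -s -X POST "$ORCH8/circuit-breakers/http_request/reset"
curl -s "$ORCH8/circuit-breakers/http_request" | jq '{state, failure_count, opened_at}'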

Dead Letter Queue

When an instance exhausts all retries (or encounters a permanent error), it transitions to Failed state and enters the dead letter queue. The DLQ is a queryable view of all failed instances — nothing is lost, everything is recoverable.

How Instances Enter the DLQ

Retry Exhaustion

A step fails and its retry policy's max_attempts is reached. The instance transitions to Failed.

Permanent Error

A handler returns a permanent (non-retryable) error. No retries are attempted; the instance fails immediately.

Manual Transition

An operator manually sets the instance state to Failed via PATCH /instances/{id}/state (see the sketch after this list).

Circuit Breaker Open

If a handler's circuit is open and the step has no retry policy that would wait out the cooldown, the step may fail permanently.
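
A sketch of the manual transition. The body shape mirrors the "state" field of the bulk endpoint shown later; that the single-instance route accepts the same shape is an assumption:

ORCH8="http://localhost:8080"   # assumed base URL

# Force a specific instance into the DLQ (request body shape assumed).
curl -s -X PATCH "$ORCH8/instances/abc-123/state" \
  -H 'Content-Type: application/json' \
  -d '{"state": "failed"}'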

DLQ API

List Failed Instances

GET /instances/dlq?tenant_id=tenant-1&limit=50

Optional query params:
  tenant_id     Filter by tenant
  namespace     Filter by namespace
  sequence_id   Filter by sequence
  offset        Pagination offset (default: 0)
  limit         Page size (default: 100, max: 1000)

Response: Array of TaskInstance objects in Failed state

The DLQ endpoint is a filtered view of instances — it returns all instances in the Failed state, sorted by failure time.
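
One way to walk a large DLQ page by page with the offset/limit parameters above, assuming the base URL below:

ORCH8="http://localhost:8080"   # assumed base URL
offset=0
limit=100

# Page through the DLQ until an empty page comes back.
while :; do
  page=$(curl -s "$ORCH8/instances/dlq?tenant_id=tenant-1&offset=$offset&limit=$limit")
  [ "$(echo "$page" | jq 'length')" -eq 0 ] && break
  echo "$page" | jq -r '.[].id'
  offset=$((offset + limit))
done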

Retry from DLQ

Any failed instance can be retried. The engine resets execution state while preserving outputs from previously successful steps — preventing double-execution of side-effectful work.

Retry Endpoint

POST /instances/{instance_id}/retry

Response:
{
  "id": "abc-123",
  "state": "scheduled"
}
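
As a call, assuming the base URL below:

ORCH8="http://localhost:8080"   # assumed base URL

# Retry a failed instance; the API errors if it is not in Failed state.
curl -s -X POST "$ORCH8/instances/abc-123/retry" | jq .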

What Retry Does

1. Validate State
   Only works on Failed instances. Returns an error for any other state.

2. Delete Execution Tree
   Removes all ExecutionNode records. The scheduler rebuilds from the sequence definition.

3. Delete Sentinel Outputs
   Clears in-progress markers from the failed step. Allows the step to re-execute.

4. Preserve Real Outputs
   Outputs from previously successful steps remain. The scheduler sees them and skips those steps.

5. Reschedule
   Instance transitions to Scheduled with next_fire_at = now. Picked up on the next scheduler tick.

Idempotency guarantee: Because successful step outputs are preserved, retrying a partially-completed workflow won't re-execute steps that already produced results. Only the failed step and subsequent steps re-run.

Bulk Operations

Act on multiple instances at once using filter-based bulk endpoints.

Bulk State Update

PATCH /instances/bulk/state

{
  "filter": {
    "tenant_id": "tenant-1",
    "namespace": "production",
    "sequence_id": "order-processing",
    "states": ["failed"]
  },
  "state": "cancelled"
}

// Cancels all failed order-processing instances for tenant-1
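
The same request as an end-to-end call, assuming the base URL below:

ORCH8="http://localhost:8080"   # assumed base URL

curl -s -X PATCH "$ORCH8/instances/bulk/state" \
  -H 'Content-Type: application/json' \
  -d '{
        "filter": {
          "tenant_id": "tenant-1",
          "namespace": "production",
          "sequence_id": "order-processing",
          "states": ["failed"]
        },
        "state": "cancelled"
      }'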

Bulk Reschedule

PATCH /instances/bulk/reschedule

{
  "filter": {
    "tenant_id": "tenant-1",
    "states": ["scheduled"]
  },
  "offset_secs": 3600
}

// Shifts all scheduled instances forward by 1 hour
// Negative values shift backward (execute sooner)
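
And the reschedule example as a call, same assumed base URL:

ORCH8="http://localhost:8080"   # assumed base URL

curl -s -X PATCH "$ORCH8/instances/bulk/reschedule" \
  -H 'Content-Type: application/json' \
  -d '{"filter": {"tenant_id": "tenant-1", "states": ["scheduled"]}, "offset_secs": 3600}'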

Filter Field  Description
tenant_id     Required. Scopes the operation to a tenant.
namespace     Optional. Filter by namespace.
sequence_id   Optional. Filter by sequence.
states        Optional. Only affect instances in these states.

Production Patterns

Alerting on Open Circuits

Poll GET /circuit-breakers every 30s. Alert when any breaker enters open state. Include handler name and failure count in the alert for quick triage.
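
A minimal polling loop along these lines; the base URL and the echo-based alert are placeholders:

ORCH8="http://localhost:8080"   # assumed base URL

# Every 30s, print any breaker that is not closed; wire this into
# your alerting tool of choice (the echo-style output is a placeholder).
while :; do
  curl -s "$ORCH8/circuit-breakers" \
    | jq -r '.[] | select(.state != "closed")
             | "ALERT: \(.handler) is \(.state) (failures: \(.failure_count))"'
  sleep 30
done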

DLQ Monitoring

Track DLQ depth per sequence. A growing DLQ indicates a systemic issue (external service down, bad deployment, schema change). Alert when DLQ depth exceeds baseline.
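
One way to snapshot DLQ depth per sequence for a tenant; the base URL and the sequence_id field name on TaskInstance are assumptions:

ORCH8="http://localhost:8080"   # assumed base URL

# Group failed instances by sequence and count each group.
curl -s "$ORCH8/instances/dlq?tenant_id=tenant-1&limit=1000" \
  | jq -r 'group_by(.sequence_id) | .[] | "\(.[0].sequence_id): \(length)"'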

Retry After Fix

When an external service recovers: (1) reset the circuit breaker, (2) retry failed instances from DLQ. The circuit reset allows new requests through; the DLQ retry recovers past failures.

ORCH8="http://localhost:8080"   # assumed base URL

# 1. Reset the breaker
curl -s -X POST "$ORCH8/circuit-breakers/http_request/reset"

# 2. Retry each failed instance
for id in $(curl -s "$ORCH8/instances/dlq?sequence_id=..." | jq -r '.[].id'); do
  curl -s -X POST "$ORCH8/instances/$id/retry"
done

Bulk Cancel Stale DLQ

If failed instances are too old to be relevant (e.g., time-sensitive notifications from 2 weeks ago), bulk-cancel them instead of retrying.
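
The bulk filter has no age field, so one sketch is to cancel a stale, time-sensitive sequence wholesale (the sequence name below is hypothetical):

ORCH8="http://localhost:8080"   # assumed base URL

# Cancel every failed instance of a time-sensitive sequence.
curl -s -X PATCH "$ORCH8/instances/bulk/state" \
  -H 'Content-Type: application/json' \
  -d '{"filter": {"tenant_id": "tenant-1", "sequence_id": "send-reminder", "states": ["failed"]}, "state": "cancelled"}'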

Retry Budget

Set retry policies with sensible limits. For idempotent operations (reads, notifications): generous retries (5-10 attempts, exponential backoff). For non-idempotent operations (payments): fewer retries (2-3) with manual DLQ review.
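
A sketch of what those budgets might look like as retry policies; only max_attempts is documented above, so the backoff field is a hypothetical illustration:

// Idempotent operations (reads, notifications): generous budget
{ "max_attempts": 8, "backoff": "exponential" }

// Non-idempotent operations (payments): small budget, manual DLQ review
{ "max_attempts": 3, "backoff": "exponential" }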