Circuit Breakers & Dead Letter Queue
Production workflows interact with external services that fail. Orch8 provides automatic circuit breaking to prevent cascade failures and a dead letter queue for instances that exhaust retries. Both are observable and actionable via API.
Circuit Breakers
Every handler has an automatic circuit breaker. When a handler fails repeatedly, the circuit opens — subsequent calls are rejected immediately without contacting the external service. After a cooldown period, one probe request is allowed to test recovery.
Per-Handler
Each handler name has its own independent breaker. A failing Stripe handler doesn't affect the SendGrid handler.
Automatic
No configuration needed. Breakers are created on first use with registry defaults.
Observable
List all breaker states and manually reset them via the API when an outage is resolved.
Breaker States
Closed (normal)
All requests are allowed. Failures are counted. When consecutive failures reach failure_threshold, the circuit opens.
Open (tripped)
All requests are rejected immediately with a circuit-open error. Steps using this handler will fail fast without making network calls. After cooldown_secs elapse, transitions to HalfOpen.
HalfOpen (probing)
One probe request is allowed through. If it succeeds, the circuit closes (resets failure count). If it fails, the circuit reopens for another cooldown period.
State Diagram
Configuration
| Parameter | Type | Description |
|---|---|---|
| failure_threshold | u32 | Consecutive failures before opening. Registry default applied per handler. |
| cooldown_secs | u64 | Seconds to wait in Open state before allowing a probe request. |
Success resets everything: A single successful call (in Closed or HalfOpen state) resets the failure count to zero and transitions to Closed.
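The state machine above can be sketched in a few lines. This is an illustrative model, not the engine's implementation; only `failure_threshold` and `cooldown_secs` come from the configuration table, everything else (class and method names) is assumed for the example.

```python
import time

class CircuitBreaker:
    """Sketch of the Closed -> Open -> HalfOpen cycle described above."""

    def __init__(self, failure_threshold=5, cooldown_secs=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self.clock = clock
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = None

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_secs:
                self.state = "half_open"   # cooldown elapsed: allow one probe
                return True
            return False                   # fail fast, no network call
        return True                        # closed or half_open

    def record_success(self):
        # A single success resets everything and closes the circuit.
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        if self.state == "half_open":
            self._open()                   # probe failed: reopen for another cooldown
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
```

Note that only a *success* resets the failure count; rejected requests while Open do not count as failures.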
Breaker API
List All Breakers
GET /circuit-breakers
Response:
[
{
"handler": "http_request",
"state": "closed",
"failure_count": 0,
"failure_threshold": 5,
"cooldown_secs": 30,
"opened_at": null
},
{
"handler": "grpc://payments:50051/Pay.Charge",
"state": "open",
"failure_count": 5,
"failure_threshold": 5,
"cooldown_secs": 60,
"opened_at": "2024-01-15T10:30:00Z"
}
]

Get Single Breaker
GET /circuit-breakers/{handler_name}
Response: Single CircuitBreakerState object (404 if the handler has never been invoked)

Reset Breaker
POST /circuit-breakers/{handler_name}/reset
Response: 200 OK
// Immediately transitions to Closed state
// Clears failure count and opened_at
// Use after confirming the external service has recovered

Dead Letter Queue
When an instance exhausts all retries (or encounters a permanent error), it transitions to Failed state and enters the dead letter queue. The DLQ is a queryable view of all failed instances — nothing is lost, everything is recoverable.
How Instances Enter the DLQ
Retry Exhaustion
A step fails and its retry policy's max_attempts is reached. The instance transitions to Failed.
Permanent Error
A handler returns a permanent (non-retryable) error. No retries are attempted; instance fails immediately.
Manual Transition
An operator manually sets instance state to Failed via PATCH /instances/{id}/state.
Circuit Breaker Open
If a handler's circuit is open and the step has no retry policy that would wait out the cooldown, the step may fail permanently.
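The routing decision behind these entry paths can be sketched as a single function. Field names (`error_permanent`, `attempt`, `max_attempts`) are assumptions for illustration, not Orch8's actual schema.

```python
def next_state(error_permanent, attempt, max_attempts=None):
    """Return the instance state after a step failure (illustrative only)."""
    if error_permanent:
        return "failed"        # permanent error: straight to the DLQ, no retries
    if max_attempts is None:
        return "failed"        # no retry policy (e.g. circuit open with nothing to wait for)
    if attempt >= max_attempts:
        return "failed"        # retry exhaustion
    return "scheduled"         # retry again on a later scheduler tick
```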
DLQ API
List Failed Instances
GET /instances/dlq?tenant_id=tenant-1&limit=50
Optional query params:
tenant_id — Filter by tenant
namespace — Filter by namespace
sequence_id — Filter by sequence
offset — Pagination offset (default: 0)
limit — Page size (default: 100, max: 1000)
Response: Array of TaskInstance objects in Failed state

The DLQ endpoint is a filtered view of instances — it returns all instances in the Failed state, sorted by failure time.
Retry from DLQ
Any failed instance can be retried. The engine resets execution state while preserving outputs from previously successful steps — preventing double-execution of side-effectful work.
Retry Endpoint
POST /instances/{instance_id}/retry
Response:
{
"id": "abc-123",
"state": "scheduled"
}

What Retry Does
Validate State
Only works on Failed instances. Returns error for any other state.
Delete Execution Tree
Removes all ExecutionNode records. The scheduler rebuilds from the sequence definition.
Delete Sentinel Outputs
Clears in-progress markers from the failed step. Allows the step to re-execute.
Preserve Real Outputs
Outputs from previously successful steps remain. The scheduler sees them and skips those steps.
Reschedule
Instance transitions to Scheduled with next_fire_at = now. Picked up on next scheduler tick.
Idempotency guarantee: Because successful step outputs are preserved, retrying a partially-completed workflow won't re-execute steps that already produced results. Only the failed step and subsequent steps re-run.
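The five retry steps above can be sketched against an in-memory instance record. The record shape and key names (`execution_nodes`, `outputs`, `sentinel`) are assumptions for the example, not the engine's storage schema.

```python
import time

def retry_from_dlq(instance):
    """Illustrative version of POST /instances/{id}/retry."""
    # 1. Validate state: only Failed instances can be retried.
    if instance["state"] != "failed":
        raise ValueError("retry only applies to Failed instances")
    # 2. Delete the execution tree; the scheduler rebuilds it from the sequence.
    instance["execution_nodes"] = []
    # 3. Delete sentinel (in-progress) outputs so the failed step can re-execute,
    # 4. while preserving real outputs so completed steps are skipped.
    instance["outputs"] = {
        step: out for step, out in instance["outputs"].items()
        if not out.get("sentinel", False)
    }
    # 5. Reschedule immediately: picked up on the next scheduler tick.
    instance["state"] = "scheduled"
    instance["next_fire_at"] = time.time()
    return instance
```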
Bulk Operations
Act on multiple instances at once using filter-based bulk endpoints.
Bulk State Update
PATCH /instances/bulk/state
{
"filter": {
"tenant_id": "tenant-1",
"namespace": "production",
"sequence_id": "order-processing",
"states": ["failed"]
},
"state": "cancelled"
}
// Cancels all failed order-processing instances for tenant-1

Bulk Reschedule
PATCH /instances/bulk/reschedule
{
"filter": {
"tenant_id": "tenant-1",
"states": ["scheduled"]
},
"offset_secs": 3600
}
// Shifts all scheduled instances forward by 1 hour
// Negative values shift backward (execute sooner)

| Filter Field | Description |
|---|---|
| tenant_id | Required. Scopes the operation to a tenant. |
| namespace | Optional. Filter by namespace. |
| sequence_id | Optional. Filter by sequence. |
| states | Optional. Only affect instances in these states. |
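The filter semantics in the table can be expressed as a matching predicate: `tenant_id` is a required scope, the other fields narrow the match only when present. Field names come from the request bodies above; the matching logic itself is an illustrative assumption.

```python
def matches(instance, flt):
    """Would this bulk-operation filter select this instance? (sketch)"""
    if instance["tenant_id"] != flt["tenant_id"]:          # required scope
        return False
    for key in ("namespace", "sequence_id"):               # optional narrowing
        if key in flt and instance.get(key) != flt[key]:
            return False
    if "states" in flt and instance["state"] not in flt["states"]:
        return False
    return True
```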
Production Patterns
Alerting on Open Circuits
Poll GET /circuit-breakers every 30s. Alert when any breaker enters open state. Include handler name and failure count in the alert for quick triage.
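The alerting check itself is a one-liner over the JSON array returned by `GET /circuit-breakers`; fetching and alert delivery are left out of this sketch, and the alert format is an assumption.

```python
def open_breaker_alerts(breakers):
    """Given the parsed GET /circuit-breakers response, return alert strings."""
    return [
        f"circuit open: {b['handler']} ({b['failure_count']} consecutive failures)"
        for b in breakers
        if b["state"] == "open"
    ]
```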
DLQ Monitoring
Track DLQ depth per sequence. A growing DLQ indicates a systemic issue (external service down, bad deployment, schema change). Alert when DLQ depth exceeds baseline.
Retry After Fix
When an external service recovers: (1) reset the circuit breaker, (2) retry failed instances from DLQ. The circuit reset allows new requests through; the DLQ retry recovers past failures.
# 1. Reset the breaker
curl -X POST /circuit-breakers/http_request/reset
# 2. Retry each failed instance
for id in $(curl -s "/instances/dlq?sequence_id=..." | jq -r '.[].id'); do
  curl -X POST "/instances/$id/retry"
done

Bulk Cancel Stale DLQ
If failed instances are too old to be relevant (e.g., time-sensitive notifications from 2 weeks ago), bulk-cancel them instead of retrying.
Retry Budget
Set retry policies with sensible limits. For idempotent operations (reads, notifications): generous retries (5-10 attempts, exponential backoff). For non-idempotent operations (payments): fewer retries (2-3) with manual DLQ review.
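As a concrete sketch of these budgets: the policy shape below (`max_attempts`, `base_delay_secs`, `backoff`) is assumed for illustration, not Orch8's retry-policy schema, and the delay formula is plain exponential backoff.

```python
# Generous budget for idempotent work (reads, notifications).
IDEMPOTENT_POLICY = {"max_attempts": 8, "base_delay_secs": 2, "backoff": "exponential"}

# Tight budget for non-idempotent work (payments); failures go to DLQ review.
PAYMENT_POLICY = {"max_attempts": 3, "base_delay_secs": 5, "backoff": "exponential"}

def delay_before(attempt, policy):
    """Exponential backoff: base * 2^(attempt - 1) seconds before `attempt`."""
    return policy["base_delay_secs"] * 2 ** (attempt - 1)
```

With these numbers, a payment step waits 5s, 10s, then lands in the DLQ after its third failure, where an operator can inspect it before retrying.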