Menu

Workflow Orchestration / Failure Modeling

OrionScheduler

The defining choice is ownership: one loop owns mutable graph state, everything else produces events that can be persisted, replayed, or rejected.

OrionScheduler is a single-node DAG execution engine for dependency-aware workloads. It routes graph mutation through one scheduler loop, uses a priority queue plus worker pool for execution, persists acknowledged state through a write-ahead log, and exposes an interactive UI that makes failure and recovery inspectable instead of implicit.

GoDAGWALRecoveryPrometheusConcurrency

Worker Pool

4

default concurrent executors in the engine

Retry Budget

3

default retries with capped exponential backoff

Submission Cap

1000 tasks

max tasks accepted per request in the handler layer

The Core Problem

DAG execution is easy. Staying correct under failure is not.

Most task runners share mutable state across goroutines behind a mutex. Under concurrent completion signals, two workers can complete tasks simultaneously, both unlock dependents, and the same child task runs twice.

OrionScheduler eliminates this by design. One goroutine owns all graph state. Workers send events via typed channels into that loop — they never touch shared maps directly. The correctness invariant is structural, not a convention.

Every state change is persisted to a Write-Ahead Log before it is applied. A crash at any point leaves a recoverable, consistent snapshot.

Naive task runner

Shared maps + mutex

races under concurrent completions

OrionScheduler

Single-owner loop

zero mutexes on hot path

OrionScheduler system architecture
Expand

HTTP → WAL → Scheduler → Priority Queue → Worker Pool. Each boundary is a correctness guarantee, not an abstraction.

Architecture

Six components. One coherent ownership model.

Each component has a defined responsibility boundary. No component mutates state owned by another.

OrionScheduler full architecture
Expand

End-to-end: ingestion → WAL durability → single-owner scheduler → priority dispatch → worker execution → event telemetry.

Component Registry

ComponentDetailKey Invariant
HTTP ServerPOST /api/v1/dag validates payload, writes WAL.AppendIngest, then calls scheduler.Ingest. Rate-limited to 5 rps for non-local callers. Accepts up to 1000 tasks per request.WAL write before scheduler visibility
Write-Ahead LogAppend-only newline-delimited JSON. fsync() on every write. Mutex-guarded for concurrent callers. Recover() truncates at first corrupt record.fsync before returning to caller
Scheduler Run LoopOne goroutine. select over 10 typed channels. Owns tasks map, inDegree map, dependents map, readyQueue. Zero mutexes on hot path.All graph mutation in one goroutine
Priority QueueMin-heap ordered by task Priority field. pushLoop() pops ready tasks and feeds them to workers via readyChan. Stale cancelled tasks are skipped on dequeue.Priority-ordered dispatch
Worker Pool4 goroutines reading from readyChan. Each worker checks task.Cancelled.Store(1) before executing. Calls scheduler.Complete() or scheduler.Fail() after execution.Atomic cancellation guard
EventChan1000-buffered channel. emitEvent() is non-blocking — drops counted in droppedEvents. 1s ticker emits live metrics snapshot. Context cancellation stops the ticker on crash.Never blocks run loop

Write-Ahead Log

Tasks are durable before the scheduler sees them.

Every state transition is written to the WAL and fsynced before it is applied to in-memory state. Recovery reads entries in order to reconstruct exact pre-crash state.

01
HTTP handler validates payload

Max 1000 tasks, rate-limited

02
WAL.AppendIngest(tasks)

JSON marshal → file.Write → file.Sync() — durability boundary

03
scheduler.Ingest(tasks)

Sends to ingestChan, blocks until runLoop processes

04
runLoop: handleIngest()

Builds inDegree, dependents maps. Enqueues zero-dependency tasks to readyQueue.

05
pushLoop → Worker dequeues

Worker checks Cancelled flag, executes, then calls Complete() or Fail()

WAL Entry Types

INGESTtasks[]

When: Before scheduler sees new work

Recovery: Re-ingest all tasks from this batch

STARTtask_id

When: Worker picks up a task

Recovery: Task was running at crash — mark as orphan, re-enqueue

COMPLETEtask_id

When: Worker finishes successfully

Recovery: Mark completed, unblock dependents

FAILtask_id, is_cascade: false

When: Worker reports organic failure

Recovery: Apply retry logic; re-enqueue if retries remain

FAIL (cascade)task_id, is_cascade: true

When: Upstream failure propagates to this task

Recovery: Mark permanently failed, no retry, no re-enqueue

Task lifecycle state machine
Expand

Five states. Deterministic transitions. Retry uses exponential backoff via time.AfterFunc — the run loop is never blocked.

Task Lifecycle

Five states. Retries cap at 30s backoff.

PendingReady

All dependencies completed (inDegree == 0)

ReadyRunning

Worker dequeues from priority queue

RunningCompleted

Worker reports success

RunningPending (retry)

Worker fails, RetryCount < MaxRetries — exponential backoff via time.AfterFunc

Running/PendingFailed (permanent)

Retries exhausted — triggers BFS cascade to all dependents

Retry Backoff

delay = min(1 << RetryCount seconds, 30s)

Retry 1: 2s, Retry 2: 4s, Retry 3: 8s. Scheduled via time.AfterFunc — run loop never sleeps.

Failure Cascade

The dangerous bug is stale downstream work.

When a task exhausts retries, propagateFailure(rootID) runs BFS over the dependents map. Every transitive dependent is marked Failed and gets Cancelled.Store(1) — atomic signal for workers to abort.

The visited set prevents diamond-shaped graphs from double-processing any node. Each task is touched exactly once regardless of how many paths lead to it.

Cascade failures are written as AppendCascadeFail with IsCascade: true. During WAL replay this flag prevents retry logic from firing on tasks that were never running.

BFS failure cascade
Expand

Task A fails → BFS marks B, C, D Failed atomically. Workers checking Cancelled abort mid-execution.

Dual atomic cancellation guard
Expand

Dual guard: scheduler sets Cancelled=1, worker checks before executing and before reporting. No stale results delivered.

Dual Cancellation Guard

Two atomic checks. Zero stale results.

1

Scheduler sets Cancelled=1

atomic.Int32 store in propagateFailure(), before WAL write

2

Worker checks before executing

Loads atomic flag at pickup — skips if set

3

Worker checks before reporting

Second check before Complete()/Fail() — prevents stale delivery

Crash Recovery

A crash at any point leaves a recoverable state.

wal.Recover() replays entries in order, classifies task state, and requeues orphaned work. Execution resumes exactly where it stopped.

WAL crash recovery timeline
Expand

Normal execution → FATAL crash → server restart → WAL replay → orphan requeue → execution resumes. WAL is the single source of truth.

01

Open WAL

Server boots, opens wal.log in append mode. If file doesn't exist, starts with clean state.

02

Read Line-by-Line

bufio.Reader reads newline-delimited JSON. First corrupt record stops reading — WAL is truncated to last valid offset.

03

Classify State

Builds CompletedTasks, FailedTasks, InProgressTasks maps from COMPLETE/FAIL/START records.

04

Replay Entries

Re-feeds each WAL entry to the scheduler in order. INGEST restores graph, START marks running, COMPLETE unblocks dependents, FAIL applies retry or cascade logic.

05

Requeue Orphans

RequeueOrphans() moves InProgressTasks (were running at crash) back to Ready. Execution resumes from last consistent state.

The Pull the Plug Demo — 7 Steps

01

Open the UI at /playground

02

Configure a complex DAG using a provided template

03

Submit the graph — watch tasks execute in real-time

04

While tasks are running, click Pull the Plug — FATAL panic kills the process

05

The backend terminates mid-execution. WAL preserves all acknowledged state.

06

Click Recover System — server restarts and opens WAL

07

WAL replays: orphan tasks requeue, execution resumes from last consistent state

OrionScheduler observability dashboard
Expand

Live metrics streamed at 1s intervals via WebSocket. Dropped events are counted — never silently lost.

Observability

Non-blocking. Drop-counted. Always live.

A 1-second ticker goroutine emits a metrics snapshot into EventChan. EventChan is 1000-deep and uses non-blocking sends — the run loop is never stalled by a slow consumer.

Drops are counted atomically in droppedEvents and surfaced in /api/v1/metrics/live. A rising count indicates buffer pressure — not a silent correctness failure.

pending / running / completed / failed / retried

atomic.Int64 counters in Scheduler struct

queue_size

atomic.Int64 queueSize — updated on every enqueue/dequeue

dropped_events

atomic.Int64 droppedEvents — incremented on every non-blocking miss

Prometheus counters

TasksIngestedTotal, TasksCompletedTotal, TasksFailedTotal, TasksRetriedTotal, CurrentQueueSize

Engineering Decisions

5 Decisions That Define the System

Each choice has a clear reason and an honest tradeoff. Sourced directly from the implementation.

ConcurrencyDecision 01

Single-Owner Event Loop Over Mutexes

Problem

Shared mutable state protected by locks means any goroutine can corrupt graph state under a race condition. Locks also make it hard to reason about ordering.

Decision

The scheduler runs as a single goroutine. All graph mutations happen inside one select loop. Workers only send events (via channels) back into that loop — they never touch shared maps directly.

Tradeoff

Sequential dispatch only. No parallel compaction of the task graph. Throughput is bounded by channel round-trip latency, not by CPU parallelism.

Source
// runLoop owns all mutable state.
// Workers communicate via typed channels only.
select {
case req := <-s.ingestChan:
    s.handleIngest(req)
case taskID := <-s.completeChan:
    s.handleComplete(taskID)
case taskID := <-s.failChan:
    s.handleFailure(taskID)
case taskID := <-s.retryChan:
    s.handleRetryEnqueue(taskID)
}
DurabilityDecision 02

WAL fsync on Every Write

Problem

Without fsyncing on every WAL append, a crash between write() and the OS flush leaves the WAL with data the OS buffered but never persisted — tasks are lost silently.

Decision

Every WAL.append() calls file.Sync() before returning. Tasks are durable before the scheduler ever sees them. Ingest fails loudly if fsync fails.

Tradeoff

I/O latency on every ingest. On NVMe this is negligible. On spinning disks or networked storage, each fsync can cost 5–15ms, throttling ingest rate considerably.

Source
func (w *WAL) append(entry WalEntry) error {
    data, _ := json.Marshal(entry)
    data = append(data, '\n')
    w.mu.Lock()
    defer w.mu.Unlock()
    w.file.Write(data)
    // Strict write-ahead durability boundary.
    return w.file.Sync()
}
Failure PropagationDecision 03

BFS Failure Cascade With Visited Set

Problem

When a task permanently fails, all transitive dependents must also fail — but the dependency graph can have diamond shapes where multiple paths lead to the same node, risking double-processing.

Decision

propagateFailure() uses BFS with a visited map. Each node is processed exactly once. Cascade-killed tasks get task.Cancelled.Store(1) so any in-flight workers abort cleanly.

Tradeoff

All downstream tasks fail even if some could theoretically succeed via alternate paths. This is intentionally strict: partial graph execution after a failure creates ambiguous replay state.

Source
func (s *Scheduler) propagateFailure(rootID string) {
    queue := []string{rootID}
    visited := make(map[string]bool)
    for len(queue) > 0 {
        current := queue[0]; queue = queue[1:]
        if visited[current] { continue }
        visited[current] = true
        task.Cancelled.Store(1) // abort in-flight workers
        s.wal.AppendCascadeFail(current)
        queue = append(queue, s.dependents[current]...)
    }
}
WAL SchemaDecision 04

IsCascade Flag in WAL

Problem

During WAL replay, the engine needs to distinguish: did this task fail because a worker ran it and it crashed (organic), or because an upstream failure propagated to it (cascade)? The replay logic is different for each.

Decision

WAL FAIL entries carry IsCascade: true when written by propagateFailure. On replay, cascade tasks are immediately set to Failed — no retry logic, no running counter adjustment, no re-enqueue.

Tradeoff

Slightly more complex WAL schema. But without this flag, replaying a cascaded failure would incorrectly trigger retry logic for tasks that were never running — inflating retry counts and re-executing already-invalidated work.

Source
// WAL entry: cascade vs organic failure.
type WalEntry struct {
    Type      string `json:"type"`
    TaskID    string `json:"task_id,omitempty"`
    IsCascade bool   `json:"is_cascade,omitempty"`
}

// Replay: different paths for different failure origins.
if req.isCascade {
    task.Status = models.StatusFailed
    task.Cancelled.Store(1)
    return // never re-enqueue
}
// Organic: apply normal retry logic.
task.RetryCount++
ObservabilityDecision 05

Non-Blocking Event Emission

Problem

If the WebSocket event hub is slow (slow consumer, network backpressure), a blocking send from the scheduler run loop would stall task dispatch — a slow frontend would break the backend.

Decision

EventChan is 1000-deep. emitEvent() uses a non-blocking select with a default case. Dropped events are counted atomically via droppedEvents.Add(1) and exposed in /api/v1/metrics/live.

Tradeoff

Events can be silently dropped under load. The droppedEvents counter makes drops observable, but a slow consumer won't know which specific events were lost — only that drops occurred.

Source
// emitEvent never blocks the run loop.
func (s *Scheduler) emitEvent(event models.TaskEvent) {
    select {
    case s.EventChan <- event:
    default:
        // Buffer full — count the drop, move on.
        s.droppedEvents.Add(1)
    }
}

Running It

How to Use OrionScheduler

The scheduler ships with a real-time observability console — a Mission Control designed to make execution, failure cascades, and crash-recovery visually verifiable. The UI streams events over WebSocket and acts as a control plane for the Go backend.

Live Demo

Orion Testing Arena

A deployed instance of the orchestration engine with the full Mission Control console. Run the Auto-Demo to watch task scheduling, simulated worker execution, panic induction, and WAL-backed recovery unfold live.

Launch Arena

Run Backend Locally

# Clone the repository
git clone https://github.com/Hrushikesh-ramilla/OrionScheduler.git
cd OrionScheduler

# Start the Go scheduler engine
go run ./cmd/server

# Engine binds to http://localhost:8080
# Generates wal.log in the current directory

Run Frontend Locally

# Open a new terminal
cd frontend

# Install dependencies and start Next.js
npm ci
npm run dev

# Dashboard available at http://localhost:3000
# Points to localhost:8080 by default

Console Features

🌐

Live DAG Visualizer

Interactive node graph showing real-time task states: Pending, Ready, Running, Failed, Completed.

📡

WebSocket Telemetry

Stream of raw JSON events directly from the engine, avoiding polling overhead.

🏗️

Worker Pool Status

Live monitoring of execution concurrency and idle vs active worker threads.

📊

Throughput Metrics

Real-time execution speed, retry rates, and queue pressure visualized.

💥

Simulated Panics

A 'Pull the Plug' button that intentionally crashes the Go backend mid-execution.

🔄

WAL Recovery

A 'Recover System' command that restarts the engine and streams WAL replay events.

🤖

Auto-Demo

A 1-click sequence that orchestrates a full graph execution, crash, and recovery demo.

📝

Algorithm Tooltips

Hoverable nodes explaining the Kahn-style scheduling logic and BFS failure cascade rules.