Workflow Orchestration / Failure Modeling

OrionScheduler

The defining choice is ownership: one loop owns mutable graph state, everything else produces events that can be persisted, replayed, or rejected.

OrionScheduler is a single-node DAG execution engine for dependency-aware workloads. It routes graph mutation through one scheduler loop, uses a priority queue plus worker pool for execution, persists acknowledged state through a write-ahead log, and exposes an interactive UI that makes failure and recovery inspectable instead of implicit.

GoDAGWALRecoveryPrometheusConcurrency

Launch Testing Arena Open Repository Return to Index

Worker Pool

default concurrent executors in the engine

Retry Budget

default retries with capped exponential backoff

Submission Cap

1000 tasks

max tasks accepted per request in the handler layer

The Core Problem

DAG execution is easy. Staying correct under failure is not.

Most task runners share mutable state across goroutines behind a mutex. Under concurrent completion signals, two workers can complete tasks simultaneously, both unlock dependents, and the same child task runs twice.

OrionScheduler eliminates this by design. One goroutine owns all graph state. Workers send events via typed channels into that loop — they never touch shared maps directly. The correctness invariant is structural, not a convention.

Every state change is persisted to a Write-Ahead Log before it is applied. A crash at any point leaves a recoverable, consistent snapshot.

Naive task runner

Shared maps + mutex

races under concurrent completions

OrionScheduler

Single-owner loop

zero mutexes on hot path

Expand

HTTP → WAL → Scheduler → Priority Queue → Worker Pool. Each boundary is a correctness guarantee, not an abstraction.

Architecture

Six components. One coherent ownership model.

Each component has a defined responsibility boundary. No component mutates state owned by another.

Expand

End-to-end: ingestion → WAL durability → single-owner scheduler → priority dispatch → worker execution → event telemetry.

Component Registry

Component	Detail	Key Invariant
HTTP Server	POST /api/v1/dag validates payload, writes WAL.AppendIngest, then calls scheduler.Ingest. Rate-limited to 5 rps for non-local callers. Accepts up to 1000 tasks per request.	WAL write before scheduler visibility
Write-Ahead Log	Append-only newline-delimited JSON. fsync() on every write. Mutex-guarded for concurrent callers. Recover() truncates at first corrupt record.	fsync before returning to caller
Scheduler Run Loop	One goroutine. select over 10 typed channels. Owns tasks map, inDegree map, dependents map, readyQueue. Zero mutexes on hot path.	All graph mutation in one goroutine
Priority Queue	Min-heap ordered by task Priority field. pushLoop() pops ready tasks and feeds them to workers via readyChan. Stale cancelled tasks are skipped on dequeue.	Priority-ordered dispatch
Worker Pool	4 goroutines reading from readyChan. Each worker checks task.Cancelled.Store(1) before executing. Calls scheduler.Complete() or scheduler.Fail() after execution.	Atomic cancellation guard
EventChan	1000-buffered channel. emitEvent() is non-blocking — drops counted in droppedEvents. 1s ticker emits live metrics snapshot. Context cancellation stops the ticker on crash.	Never blocks run loop

Write-Ahead Log

Tasks are durable before the scheduler sees them.

Every state transition is written to the WAL and fsynced before it is applied to in-memory state. Recovery reads entries in order to reconstruct exact pre-crash state.

HTTP handler validates payload

Max 1000 tasks, rate-limited

WAL.AppendIngest(tasks)

JSON marshal → file.Write → file.Sync() — durability boundary

scheduler.Ingest(tasks)

Sends to ingestChan, blocks until runLoop processes

runLoop: handleIngest()

Builds inDegree, dependents maps. Enqueues zero-dependency tasks to readyQueue.

pushLoop → Worker dequeues

Worker checks Cancelled flag, executes, then calls Complete() or Fail()

WAL Entry Types

INGESTtasks[]

When: Before scheduler sees new work

Recovery: Re-ingest all tasks from this batch

STARTtask_id

When: Worker picks up a task

Recovery: Task was running at crash — mark as orphan, re-enqueue

COMPLETEtask_id

When: Worker finishes successfully

Recovery: Mark completed, unblock dependents

FAILtask_id, is_cascade: false

When: Worker reports organic failure

Recovery: Apply retry logic; re-enqueue if retries remain

FAIL (cascade)task_id, is_cascade: true

When: Upstream failure propagates to this task

Recovery: Mark permanently failed, no retry, no re-enqueue

Expand

Five states. Deterministic transitions. Retry uses exponential backoff via time.AfterFunc — the run loop is never blocked.

Task Lifecycle

Five states. Retries cap at 30s backoff.

Pending→Ready

All dependencies completed (inDegree == 0)

Ready→Running

Worker dequeues from priority queue

Running→Completed

Worker reports success

Running→Pending (retry)

Worker fails, RetryCount < MaxRetries — exponential backoff via time.AfterFunc

Running/Pending→Failed (permanent)

Retries exhausted — triggers BFS cascade to all dependents

Retry Backoff

delay = min(1 << RetryCount seconds, 30s)

Retry 1: 2s, Retry 2: 4s, Retry 3: 8s. Scheduled via time.AfterFunc — run loop never sleeps.

Failure Cascade

The dangerous bug is stale downstream work.

When a task exhausts retries, propagateFailure(rootID) runs BFS over the dependents map. Every transitive dependent is marked Failed and gets Cancelled.Store(1) — atomic signal for workers to abort.

The visited set prevents diamond-shaped graphs from double-processing any node. Each task is touched exactly once regardless of how many paths lead to it.

Cascade failures are written as AppendCascadeFail with IsCascade: true. During WAL replay this flag prevents retry logic from firing on tasks that were never running.

Expand

Task A fails → BFS marks B, C, D Failed atomically. Workers checking Cancelled abort mid-execution.

Expand

Dual guard: scheduler sets Cancelled=1, worker checks before executing and before reporting. No stale results delivered.

Dual Cancellation Guard

Two atomic checks. Zero stale results.

Scheduler sets Cancelled=1

atomic.Int32 store in propagateFailure(), before WAL write

Worker checks before executing

Loads atomic flag at pickup — skips if set

Worker checks before reporting

Second check before Complete()/Fail() — prevents stale delivery

Crash Recovery

A crash at any point leaves a recoverable state.

wal.Recover() replays entries in order, classifies task state, and requeues orphaned work. Execution resumes exactly where it stopped.

Expand

Normal execution → FATAL crash → server restart → WAL replay → orphan requeue → execution resumes. WAL is the single source of truth.

Open WAL

Server boots, opens wal.log in append mode. If file doesn't exist, starts with clean state.

Read Line-by-Line

bufio.Reader reads newline-delimited JSON. First corrupt record stops reading — WAL is truncated to last valid offset.

Classify State

Builds CompletedTasks, FailedTasks, InProgressTasks maps from COMPLETE/FAIL/START records.

Replay Entries

Re-feeds each WAL entry to the scheduler in order. INGEST restores graph, START marks running, COMPLETE unblocks dependents, FAIL applies retry or cascade logic.

Requeue Orphans

RequeueOrphans() moves InProgressTasks (were running at crash) back to Ready. Execution resumes from last consistent state.

The Pull the Plug Demo — 7 Steps

Open the UI at /playground

Configure a complex DAG using a provided template

Submit the graph — watch tasks execute in real-time

While tasks are running, click Pull the Plug — FATAL panic kills the process

The backend terminates mid-execution. WAL preserves all acknowledged state.

Click Recover System — server restarts and opens WAL

WAL replays: orphan tasks requeue, execution resumes from last consistent state

Expand

Live metrics streamed at 1s intervals via WebSocket. Dropped events are counted — never silently lost.

Observability

Non-blocking. Drop-counted. Always live.

A 1-second ticker goroutine emits a metrics snapshot into EventChan. EventChan is 1000-deep and uses non-blocking sends — the run loop is never stalled by a slow consumer.

Drops are counted atomically in droppedEvents and surfaced in /api/v1/metrics/live. A rising count indicates buffer pressure — not a silent correctness failure.

pending / running / completed / failed / retried

atomic.Int64 counters in Scheduler struct

queue_size

atomic.Int64 queueSize — updated on every enqueue/dequeue

dropped_events

atomic.Int64 droppedEvents — incremented on every non-blocking miss

Prometheus counters

TasksIngestedTotal, TasksCompletedTotal, TasksFailedTotal, TasksRetriedTotal, CurrentQueueSize

Engineering Decisions

5 Decisions That Define the System

Each choice has a clear reason and an honest tradeoff. Sourced directly from the implementation.

Key Decisions

01 — Concurrency

02 — Durability

03 — Failure Propagation

04 — WAL Schema

05 — Observability

Concurrency—Decision 01

Single-Owner Event Loop Over Mutexes

Problem

Shared mutable state protected by locks means any goroutine can corrupt graph state under a race condition. Locks also make it hard to reason about ordering.

Decision

The scheduler runs as a single goroutine. All graph mutations happen inside one select loop. Workers only send events (via channels) back into that loop — they never touch shared maps directly.

Tradeoff

Sequential dispatch only. No parallel compaction of the task graph. Throughput is bounded by channel round-trip latency, not by CPU parallelism.

Source

// runLoop owns all mutable state.
// Workers communicate via typed channels only.
select {
case req := <-s.ingestChan:
    s.handleIngest(req)
case taskID := <-s.completeChan:
    s.handleComplete(taskID)
case taskID := <-s.failChan:
    s.handleFailure(taskID)
case taskID := <-s.retryChan:
    s.handleRetryEnqueue(taskID)
}

Durability—Decision 02

WAL fsync on Every Write

Problem

Without fsyncing on every WAL append, a crash between write() and the OS flush leaves the WAL with data the OS buffered but never persisted — tasks are lost silently.

Decision

Every WAL.append() calls file.Sync() before returning. Tasks are durable before the scheduler ever sees them. Ingest fails loudly if fsync fails.

Tradeoff

I/O latency on every ingest. On NVMe this is negligible. On spinning disks or networked storage, each fsync can cost 5–15ms, throttling ingest rate considerably.

Source

func (w *WAL) append(entry WalEntry) error {
    data, _ := json.Marshal(entry)
    data = append(data, '\n')
    w.mu.Lock()
    defer w.mu.Unlock()
    w.file.Write(data)
    // Strict write-ahead durability boundary.
    return w.file.Sync()
}

Failure Propagation—Decision 03

BFS Failure Cascade With Visited Set

Problem

When a task permanently fails, all transitive dependents must also fail — but the dependency graph can have diamond shapes where multiple paths lead to the same node, risking double-processing.

Decision

propagateFailure() uses BFS with a visited map. Each node is processed exactly once. Cascade-killed tasks get task.Cancelled.Store(1) so any in-flight workers abort cleanly.

Tradeoff

All downstream tasks fail even if some could theoretically succeed via alternate paths. This is intentionally strict: partial graph execution after a failure creates ambiguous replay state.

Source

func (s *Scheduler) propagateFailure(rootID string) {
    queue := []string{rootID}
    visited := make(map[string]bool)
    for len(queue) > 0 {
        current := queue[0]; queue = queue[1:]
        if visited[current] { continue }
        visited[current] = true
        task.Cancelled.Store(1) // abort in-flight workers
        s.wal.AppendCascadeFail(current)
        queue = append(queue, s.dependents[current]...)
    }
}

WAL Schema—Decision 04

IsCascade Flag in WAL

Problem

During WAL replay, the engine needs to distinguish: did this task fail because a worker ran it and it crashed (organic), or because an upstream failure propagated to it (cascade)? The replay logic is different for each.

Decision

WAL FAIL entries carry IsCascade: true when written by propagateFailure. On replay, cascade tasks are immediately set to Failed — no retry logic, no running counter adjustment, no re-enqueue.

Tradeoff

Slightly more complex WAL schema. But without this flag, replaying a cascaded failure would incorrectly trigger retry logic for tasks that were never running — inflating retry counts and re-executing already-invalidated work.

Source

// WAL entry: cascade vs organic failure.
type WalEntry struct {
    Type      string `json:"type"`
    TaskID    string `json:"task_id,omitempty"`
    IsCascade bool   `json:"is_cascade,omitempty"`
}

// Replay: different paths for different failure origins.
if req.isCascade {
    task.Status = models.StatusFailed
    task.Cancelled.Store(1)
    return // never re-enqueue
}
// Organic: apply normal retry logic.
task.RetryCount++

Observability—Decision 05

Non-Blocking Event Emission

Problem

If the WebSocket event hub is slow (slow consumer, network backpressure), a blocking send from the scheduler run loop would stall task dispatch — a slow frontend would break the backend.

Decision

EventChan is 1000-deep. emitEvent() uses a non-blocking select with a default case. Dropped events are counted atomically via droppedEvents.Add(1) and exposed in /api/v1/metrics/live.

Tradeoff

Events can be silently dropped under load. The droppedEvents counter makes drops observable, but a slow consumer won't know which specific events were lost — only that drops occurred.

Source

// emitEvent never blocks the run loop.
func (s *Scheduler) emitEvent(event models.TaskEvent) {
    select {
    case s.EventChan <- event:
    default:
        // Buffer full — count the drop, move on.
        s.droppedEvents.Add(1)
    }
}

Running It

How to Use OrionScheduler

The scheduler ships with a real-time observability console — a Mission Control designed to make execution, failure cascades, and crash-recovery visually verifiable. The UI streams events over WebSocket and acts as a control plane for the Go backend.

Live Demo

Orion Testing Arena

A deployed instance of the orchestration engine with the full Mission Control console. Run the Auto-Demo to watch task scheduling, simulated worker execution, panic induction, and WAL-backed recovery unfold live.

Launch Arena

Run Backend Locally

# Clone the repository
git clone https://github.com/Hrushikesh-ramilla/OrionScheduler.git
cd OrionScheduler

# Start the Go scheduler engine
go run ./cmd/server

# Engine binds to http://localhost:8080
# Generates wal.log in the current directory

Run Frontend Locally

# Open a new terminal
cd frontend

# Install dependencies and start Next.js
npm ci
npm run dev

# Dashboard available at http://localhost:3000
# Points to localhost:8080 by default

Console Features

🌐

Live DAG Visualizer

Interactive node graph showing real-time task states: Pending, Ready, Running, Failed, Completed.

📡

WebSocket Telemetry

Stream of raw JSON events directly from the engine, avoiding polling overhead.

🏗️

Worker Pool Status

Live monitoring of execution concurrency and idle vs active worker threads.

📊

Throughput Metrics

Real-time execution speed, retry rates, and queue pressure visualized.

💥

Simulated Panics

A 'Pull the Plug' button that intentionally crashes the Go backend mid-execution.

🔄

WAL Recovery

A 'Recover System' command that restarts the engine and streams WAL replay events.

🤖

Auto-Demo

A 1-click sequence that orchestrates a full graph execution, crash, and recovery demo.

📝

Algorithm Tooltips

Hoverable nodes explaining the Kahn-style scheduling logic and BFS failure cascade rules.

Next Case Study

Case 04

Noctivana

A Raspberry Pi 4 infant safety monitor that fuses pose, occlusion, respiration, audio, and nursery conditions locally instead of streaming private video to the cloud.