How It Works

This page provides detailed information about the core working principles and technical architecture of Gthulhu and Qumun schedulers.

Overall Architecture

Dual-Component Design

Gthulhu scheduler adopts a modern dual-component architecture:

1. BPF Scheduler

A BPF scheduler implemented based on the Linux kernel's sched_ext framework, responsible for low-level scheduling functions such as task queue management, CPU selection logic, and scheduling execution. The BPF scheduler communicates with the user-space Gthulhu scheduler through two types of eBPF Maps: ring buffer and user ring buffer.

2. Gthulhu (User Space Scheduler)

The Gthulhu scheduler, developed using the qumun framework, receives information about tasks to be scheduled from the ring buffer eBPF Map and makes decisions based on scheduling policies. Finally, the scheduling results are sent back to the BPF Scheduler through the user ring buffer eBPF Map.

The Gthulhu scheduler supports a plugin-based design, allowing developers to extend and customize scheduling policies according to their needs. Gthulhu/plugin currently implements two schedulers:

Simple Scheduler: A simple scheduler implemented with reference to scx_simple, with core logic of approximately 200 lines.
Gthulhu Scheduler: A virtual runtime-based scheduler with latency-sensitive optimization and CPU topology-aware features.

CPU Topology-Aware Scheduling

Hierarchical CPU Selection

graph TB
    A[Task Needs CPU] --> AA{Single CPU Allowed?}
    AA -->|Yes| AB[Check if CPU is Idle]
    AA -->|No| B{SMT System?}

    AB -->|Idle| AC[Use Previous CPU]
    AB -->|Not Idle| AD[Fail with EBUSY]

    B -->|Yes| C{Previous CPU Full-Idle Core?}
    B -->|No| G{Previous CPU Idle?}

    C -->|Yes| D[Use Previous CPU]
    C -->|No| E{Full-Idle CPU in L2 Cache?}

    E -->|Yes| F[Use CPU in Same L2 Cache]
    E -->|No| H{Full-Idle CPU in L3 Cache?}

    H -->|Yes| I[Use CPU in Same L3 Cache]
    H -->|No| J{Any Full-Idle Core Available?}

    J -->|Yes| K[Use Any Full-Idle Core]
    J -->|No| G

    G -->|Yes| L[Use Previous CPU]
    G -->|No| M{Any Idle CPU in L2 Cache?}

    M -->|Yes| N[Use CPU in Same L2 Cache]
    M -->|No| O{Any Idle CPU in L3 Cache?}

    O -->|Yes| P[Use CPU in Same L3 Cache]
    O -->|No| Q{Any Idle CPU Available?}

    Q -->|Yes| R[Use Any Idle CPU]
    Q -->|No| S[Return EBUSY]

API and Scheduling Policy Design

Gthulhu implements a flexible mechanism to dynamically adjust its scheduling behavior through a RESTful API interface. This allows operators to fine-tune the scheduler's performance characteristics without restarting or recompiling the code.

API Architecture

The API server provides endpoints for retrieving and setting scheduling strategies:

graph TB
    A[Gthulhu Scheduler] -->|Periodic Requests| B[API Server]
    C[Operators/Admins] -->|Configure Strategies| B
    B -->|Return Strategies| A
    A -->|Apply Strategies| D[Task Scheduling]

    subgraph "External Management"
        C
    end

    subgraph "Scheduling System"
        A
        D
    end

API Endpoints

The API server exposes two primary endpoints for scheduling strategy management:

GET /api/v1/scheduling/strategies: Retrieves current scheduling strategies
POST /api/v1/scheduling/strategies: Sets new scheduling strategies

Scheduling Strategy Data Model

A scheduling strategy is represented using the following structure:

{
  "scheduling": [
    {
      "priority": true,
      "execution_time": 20000000,
      "pid": 12345
    },
    {
      "priority": false,
      "execution_time": 10000000,
      "selectors": [
        {
          "key": "tier",
          "value": "control-plane"
        }
      ]
    }
  ]
}

Key components of a scheduling strategy:

Priority (boolean): When true, the task's virtual runtime is set to the minimum value, effectively giving it the highest scheduling priority
Execution Time (uint64): Custom time slice in nanoseconds for the task
PID (int): Process ID to which the strategy applies
Selectors (array): Optional Kubernetes label selectors for targeting groups of processes

Strategy Application Flow

The process of fetching and applying scheduling strategies follows this sequence:

sequenceDiagram
    participant S as Scheduler
    participant A as API Server
    participant T as Task Pool

    S->>S: Initialize scheduler
    S->>S: Start strategy fetcher

    loop Every interval seconds
        S->>A: Request current strategies
        A->>S: Return strategy list
        S->>S: Update strategy map
    end

    Note over S,T: During task scheduling
    T->>S: Task needs scheduling
    S->>S: Check if task has custom strategy
    S->>S: Apply priority setting if needed
    S->>S: Apply custom execution time if specified
    S->>T: Schedule task with applied strategy

Kubernetes Integration

For containerized environments, Gthulhu can map scheduling strategies to specific pods using label selectors:

Label Selector Resolution: The API server translates label selectors into specific PIDs by scanning the system for matching pods
PID Mapping: Each pod's processes are identified and associated with the appropriate scheduling strategy
Dynamic Updates: As pods are created, destroyed, or moved, the scheduler adapts by periodically refreshing its strategies

Strategy Prioritization Logic

When applying scheduling strategies, Gthulhu follows these rules:

Direct PID Match: Strategies that explicitly specify a PID have highest precedence
Label Selector Match: Strategies using label selectors apply to all matching processes
Default Behavior: Processes without specific strategies use the standard scheduling algorithm

Configuration Parameters

The strategy fetching behavior can be configured through the scheduler's configuration file:

api:
  url: "http://api-server:8080"   # API server endpoint
  interval: 10                    # Refresh interval in seconds

This architecture allows for dynamic, fine-grained control over scheduling behavior without interrupting the scheduler's operation.

BPF and User Space Communication

Communication Mechanism

sequenceDiagram
    participant K as BPF (Kernel Space)
    participant U as Go (User Space)

    K->>U: Task Creation Event
    U->>U: Analyze Task Characteristics
    U->>K: Set Scheduling Parameters
    K->>K: Apply Scheduling Decision
    K->>U: Statistics Update
    U->>U: Dynamic Strategy Adjustment

Debugging and Monitoring

BPF Tracing

# Monitor BPF program execution
sudo cat /sys/kernel/debug/tracing/trace_pipe

# Check BPF statistics
sudo bpftool prog show
sudo bpftool map dump name task_info_map

Differences from CFS

Feature	CFS (Completely Fair Scheduler)	Gthulhu
Scheduling Policy	Virtual runtime based	Virtual runtime + latency optimization
Task Classification	Unified processing	Automatic classification optimization
CPU Selection	Basic load balancing	Topology-aware + cache affinity
Dynamic Adjustment	Limited	Comprehensive adaptive adjustment
Extensibility	Kernel built-in	User-space extensible

Future Development Directions

Machine Learning Integration: Use ML models to predict task behavior
Container-Aware Scheduling: Optimization for containerized environments
Energy Optimization: Integration of power management considerations
Real-Time Task Support: Support for hard real-time task scheduling

Deep Dive

For more implementation details, refer to the API Reference and source code comments.