Skip to content

How It Works

This page provides detailed information about the core working principles and technical architecture of Gthulhu and Qumun schedulers.

Overall Architecture

Dual-Component Design

Gthulhu scheduler adopts a modern dual-component architecture:

1. BPF Scheduler

A BPF scheduler implemented based on the Linux kernel's sched_ext framework, responsible for low-level scheduling functions such as task queue management, CPU selection logic, and scheduling execution. The BPF scheduler communicates with the user-space Gthulhu scheduler through two types of eBPF Maps: ring buffer and user ring buffer.

2. Gthulhu (User Space Scheduler)

The Gthulhu scheduler, developed using the qumun framework, receives information about tasks to be scheduled from the ring buffer eBPF Map and makes decisions based on scheduling policies. Finally, the scheduling results are sent back to the BPF Scheduler through the user ring buffer eBPF Map.

The Gthulhu scheduler supports a plugin-based design, allowing developers to extend and customize scheduling policies according to their needs. Gthulhu/plugin currently implements two schedulers:

  • Simple Scheduler: A simple scheduler implemented with reference to scx_simple, with core logic of approximately 200 lines.
  • Gthulhu Scheduler: A virtual runtime-based scheduler with latency-sensitive optimization and CPU topology-aware features.

CPU Topology-Aware Scheduling

Hierarchical CPU Selection

graph TB
    A[Task Needs CPU] --> AA{Single CPU Allowed?}
    AA -->|Yes| AB[Check if CPU is Idle]
    AA -->|No| B{SMT System?}

    AB -->|Idle| AC[Use Previous CPU]
    AB -->|Not Idle| AD[Fail with EBUSY]

    B -->|Yes| C{Previous CPU Full-Idle Core?}
    B -->|No| G{Previous CPU Idle?}

    C -->|Yes| D[Use Previous CPU]
    C -->|No| E{Full-Idle CPU in L2 Cache?}

    E -->|Yes| F[Use CPU in Same L2 Cache]
    E -->|No| H{Full-Idle CPU in L3 Cache?}

    H -->|Yes| I[Use CPU in Same L3 Cache]
    H -->|No| J{Any Full-Idle Core Available?}

    J -->|Yes| K[Use Any Full-Idle Core]
    J -->|No| G

    G -->|Yes| L[Use Previous CPU]
    G -->|No| M{Any Idle CPU in L2 Cache?}

    M -->|Yes| N[Use CPU in Same L2 Cache]
    M -->|No| O{Any Idle CPU in L3 Cache?}

    O -->|Yes| P[Use CPU in Same L3 Cache]
    O -->|No| Q{Any Idle CPU Available?}

    Q -->|Yes| R[Use Any Idle CPU]
    Q -->|No| S[Return EBUSY]

API and Scheduling Policy Design

Gthulhu implements a flexible mechanism to dynamically adjust its scheduling behavior through a RESTful API interface. This allows operators to fine-tune the scheduler's performance characteristics without restarting or recompiling the code.

API Architecture

The API server provides endpoints for retrieving and setting scheduling strategies:

graph TB
    A[Gthulhu Scheduler] -->|Periodic Requests| B[API Server]
    C[Operators/Admins] -->|Configure Strategies| B
    B -->|Return Strategies| A
    A -->|Apply Strategies| D[Task Scheduling]

    subgraph "External Management"
        C
    end

    subgraph "Scheduling System"
        A
        D
    end

API Endpoints

The API server exposes two primary endpoints for scheduling strategy management:

  • GET /api/v1/scheduling/strategies: Retrieves current scheduling strategies
  • POST /api/v1/scheduling/strategies: Sets new scheduling strategies

Scheduling Strategy Data Model

A scheduling strategy is represented using the following structure:

{
  "scheduling": [
    {
      "priority": true,
      "execution_time": 20000000,
      "pid": 12345
    },
    {
      "priority": false,
      "execution_time": 10000000,
      "selectors": [
        {
          "key": "tier",
          "value": "control-plane"
        }
      ]
    }
  ]
}

Key components of a scheduling strategy:

  1. Priority (boolean): When true, the task's virtual runtime is set to the minimum value, effectively giving it the highest scheduling priority
  2. Execution Time (uint64): Custom time slice in nanoseconds for the task
  3. PID (int): Process ID to which the strategy applies
  4. Selectors (array): Optional Kubernetes label selectors for targeting groups of processes

Strategy Application Flow

The process of fetching and applying scheduling strategies follows this sequence:

sequenceDiagram
    participant S as Scheduler
    participant A as API Server
    participant T as Task Pool

    S->>S: Initialize scheduler
    S->>S: Start strategy fetcher

    loop Every interval seconds
        S->>A: Request current strategies
        A->>S: Return strategy list
        S->>S: Update strategy map
    end

    Note over S,T: During task scheduling
    T->>S: Task needs scheduling
    S->>S: Check if task has custom strategy
    S->>S: Apply priority setting if needed
    S->>S: Apply custom execution time if specified
    S->>T: Schedule task with applied strategy

Kubernetes Integration

For containerized environments, Gthulhu can map scheduling strategies to specific pods using label selectors:

  1. Label Selector Resolution: The API server translates label selectors into specific PIDs by scanning the system for matching pods
  2. PID Mapping: Each pod's processes are identified and associated with the appropriate scheduling strategy
  3. Dynamic Updates: As pods are created, destroyed, or moved, the scheduler adapts by periodically refreshing its strategies

Strategy Prioritization Logic

When applying scheduling strategies, Gthulhu follows these rules:

  1. Direct PID Match: Strategies that explicitly specify a PID have highest precedence
  2. Label Selector Match: Strategies using label selectors apply to all matching processes
  3. Default Behavior: Processes without specific strategies use the standard scheduling algorithm

Configuration Parameters

The strategy fetching behavior can be configured through the scheduler's configuration file:

api:
  url: "http://api-server:8080"   # API server endpoint
  interval: 10                    # Refresh interval in seconds

This architecture allows for dynamic, fine-grained control over scheduling behavior without interrupting the scheduler's operation.

BPF and User Space Communication

Communication Mechanism

sequenceDiagram
    participant K as BPF (Kernel Space)
    participant U as Go (User Space)

    K->>U: Task Creation Event
    U->>U: Analyze Task Characteristics
    U->>K: Set Scheduling Parameters
    K->>K: Apply Scheduling Decision
    K->>U: Statistics Update
    U->>U: Dynamic Strategy Adjustment

Debugging and Monitoring

BPF Tracing

# Monitor BPF program execution
sudo cat /sys/kernel/debug/tracing/trace_pipe

# Check BPF statistics
sudo bpftool prog show
sudo bpftool map dump name task_info_map

Differences from CFS

Feature CFS (Completely Fair Scheduler) Gthulhu
Scheduling Policy Virtual runtime based Virtual runtime + latency optimization
Task Classification Unified processing Automatic classification optimization
CPU Selection Basic load balancing Topology-aware + cache affinity
Dynamic Adjustment Limited Comprehensive adaptive adjustment
Extensibility Kernel built-in User-space extensible

Future Development Directions

  1. Machine Learning Integration: Use ML models to predict task behavior
  2. Container-Aware Scheduling: Optimization for containerized environments
  3. Energy Optimization: Integration of power management considerations
  4. Real-Time Task Support: Support for hard real-time task scheduling

Deep Dive

For more implementation details, refer to the API Reference and source code comments.