How It Works
This page provides detailed information about the core working principles and technical architecture of Gthulhu.
Overall Architecture
Gthulhu provides an orchestrable, distributed scheduler solution for the cloud-native ecosystem. The architecture consists of multiple components working together:
graph TB
subgraph "Control Plane"
U[User / Web UI] -->|Configure Strategies| M[Manager<br/>Central Management]
M -->|Persist| DB[(MongoDB)]
M -->|Query Pods| K8S[Kubernetes API<br/>Pod Informer]
end
M -->|Distribute Intents| DM1
M -->|Distribute Intents| DM2
M -->|Distribute Intents| DMN
subgraph "Node 1"
DM1[Decision Maker] --> S1[Gthulhu Scheduler<br/>sched_ext / eBPF]
end
subgraph "Node 2"
DM2[Decision Maker] --> S2[Gthulhu Scheduler<br/>sched_ext / eBPF]
end
subgraph "Node N"
DMN[Decision Maker] --> SN[Gthulhu Scheduler<br/>sched_ext / eBPF]
end
Component Overview
1. Manager (Control Plane)
The Manager serves as the central management service, responsible for:
- User authentication and authorization (JWT)
- Role and permission management (RBAC)
- CRUD operations for scheduling strategies
- Monitoring Pod status via Kubernetes Informer
- Distributing scheduling intents to Decision Makers on each node
- Data persistence to MongoDB (for v1.0.0+, refer to Deploying Gthulhu with Kubernetes for CRD-based management)
2. Decision Maker (Per-Node Agent)
The Decision Maker runs as a Sidecar alongside the Gthulhu Scheduler and is responsible for:
- Receiving scheduling intents from the Manager
- Scanning the /proc filesystem to discover Pod processes
- Converting scheduling strategies (label-based) into concrete PID-based scheduling decisions
- Providing PID-level strategies to the local Gthulhu Scheduler
- Collecting eBPF scheduler metrics and exposing them via Prometheus
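One common way to tie a process back to a Pod while scanning /proc is to read /proc/<pid>/cgroup and look for the pod<uid> segment that the kubelet places in the cgroup path. The sketch below shows only that parsing step. It is an illustration under that assumption, not necessarily how the Decision Maker does it, and the function name podUIDFromCgroup is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podUIDPattern matches the "pod<uid>" segment of a container cgroup path.
// The UID is written with dashes under the cgroupfs driver and with
// underscores under the systemd driver.
var podUIDPattern = regexp.MustCompile(
	`pod([0-9a-fA-F]{8}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{12})`)

// podUIDFromCgroup (hypothetical helper) extracts the Pod UID from the
// contents of /proc/<pid>/cgroup. It returns false when no Pod segment
// is present, which is the case for ordinary host processes.
func podUIDFromCgroup(cgroup string) (string, bool) {
	m := podUIDPattern.FindStringSubmatch(cgroup)
	if m == nil {
		return "", false
	}
	// Normalize the systemd form (underscores) to the usual dashed UID.
	return strings.ReplaceAll(m[1], "_", "-"), true
}

func main() {
	// Sample content; a real scan would read /proc/<pid>/cgroup for every
	// numeric directory under /proc.
	sample := "0::/kubepods.slice/kubepods-burstable.slice/" +
		"kubepods-burstable-pod1234abcd_12ab_34cd_56ef_1234567890ab.slice/" +
		"cri-containerd-abc.scope"
	uid, ok := podUIDFromCgroup(sample)
	fmt.Println(uid, ok)
}
```

With the Pod UID in hand, an agent can match the process against the Pods selected by a label-based strategy and emit one PID-level entry per matching process.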
3. Gthulhu Scheduler (sched_ext)
The Gthulhu Scheduler is the core scheduling component running on each node, built on a dual-component design:

BPF Scheduler
A BPF scheduler implemented based on the Linux kernel's sched_ext framework, responsible for low-level scheduling functions such as task queue management, CPU selection logic, and scheduling execution. The BPF scheduler communicates with the user-space Gthulhu scheduler through two types of eBPF Maps: ring buffer and user ring buffer.
User Space Scheduler
The user-space scheduler, developed using the qumun framework, receives information about tasks to be scheduled from the ring buffer eBPF Map and makes decisions based on scheduling policies. Finally, the scheduling results are sent back to the BPF Scheduler through the user ring buffer eBPF Map.
Plugin System

Gthulhu supports a plugin-based design using a factory pattern with a plugin registry, allowing developers to extend and customize scheduling policies.
Plugin Interface
The plugin system defines two core interfaces:
- Sched: Low-level scheduler operations (DequeueTask, DefaultSelectCPU, GetNrQueued)
- CustomScheduler: Plugin-level operations that each scheduler must implement:
  - DrainQueuedTask: Drain queued tasks from eBPF
  - SelectQueuedTask: Select a task from the queue
  - SelectCPU: Select an appropriate CPU for the task
  - DetermineTimeSlice: Calculate the time slice for execution
  - GetPoolCount: Get the number of tasks in the dispatch pool
  - SendMetrics: Send metrics to the monitoring system
  - GetChangedStrategies: Retrieve changed scheduling strategies
Available Plugins
| Mode | Description |
|---|---|
| gthulhu | Advanced scheduler with API integration, scheduling strategies, JWT authentication, and metrics reporting |
| simple | Simple weighted virtual runtime (vtime) scheduler |
| simple-fifo | Simple FIFO (First-In, First-Out) scheduler |
Plugin Registration
Plugins are registered via Go's init() mechanism using the factory pattern:
func init() {
	plugin.RegisterNewPlugin("myplugin", func(ctx context.Context, config *plugin.SchedConfig) (plugin.CustomScheduler, error) {
		return NewMyPlugin(config), nil
	})
}
Scheduler Execution Flow
The main scheduling loop processes tasks as follows:
flowchart TD
A[Start Scheduler Loop] --> B{Check context Done}
B -->|Yes| D[End]
B -->|No| E[DrainQueuedTask]
E --> F[SelectQueuedTask]
F --> G{Task available?}
G -->|No| H[Block until ready]
H --> B
G -->|Yes| J[Create DispatchedTask]
J --> K[Calculate deadline / vtime]
K --> L[DetermineTimeSlice]
L --> M{Custom execution time?}
M -->|Yes| O[Use custom time slice]
M -->|No| P[Use default algorithm]
O --> Q[SelectCPU]
P --> Q
Q --> R{CPU selected?}
R -->|No| B
R -->|Yes| U[DispatchTask]
U --> V{Dispatch successful?}
V -->|No| B
V -->|Yes| X[NotifyComplete]
X --> B
CPU Topology-Aware Scheduling
Hierarchical CPU Selection
graph TB
A[Task Needs CPU] --> AA{Single CPU Allowed?}
AA -->|Yes| AB[Check if CPU is Idle]
AA -->|No| B{SMT System?}
AB -->|Idle| AC[Use Previous CPU]
AB -->|Not Idle| AD[Fail with EBUSY]
B -->|Yes| C{Previous CPU Full-Idle Core?}
B -->|No| G{Previous CPU Idle?}
C -->|Yes| D[Use Previous CPU]
C -->|No| E{Full-Idle CPU in L2 Cache?}
E -->|Yes| F[Use CPU in Same L2 Cache]
E -->|No| H{Full-Idle CPU in L3 Cache?}
H -->|Yes| I[Use CPU in Same L3 Cache]
H -->|No| J{Any Full-Idle Core Available?}
J -->|Yes| K[Use Any Full-Idle Core]
J -->|No| G
G -->|Yes| L[Use Previous CPU]
G -->|No| M{Any Idle CPU in L2 Cache?}
M -->|Yes| N[Use CPU in Same L2 Cache]
M -->|No| O{Any Idle CPU in L3 Cache?}
O -->|Yes| P[Use CPU in Same L3 Cache]
O -->|No| Q{Any Idle CPU Available?}
Q -->|Yes| R[Use Any Idle CPU]
Q -->|No| S[Return EBUSY]
API and Scheduling Strategy Design
Gthulhu lets operators adjust scheduling behavior at runtime through RESTful APIs. The API is split across two components: a central Manager and a Decision Maker on each node.
API Architecture
graph TB
A[User / Web UI] -->|Manage Strategies| B[Manager]
B -->|Store| C[(MongoDB)]
B -->|Query Pods| D[Kubernetes API]
B -->|Distribute Intents| E[Decision Maker<br/>Node 1]
B -->|Distribute Intents| F[Decision Maker<br/>Node N]
E -->|Provide PID Strategies| G[Gthulhu Scheduler]
F -->|Provide PID Strategies| H[Gthulhu Scheduler]
G -->|Report Metrics| E
H -->|Report Metrics| F
Manager Endpoints
The Manager handles user-facing operations:
- POST /api/v1/auth/login: User authentication
- POST /api/v1/strategies: Create scheduling strategy
- GET /api/v1/strategies/self: List own strategies
- GET /api/v1/intents/self: List scheduling intents
Decision Maker Endpoints
The Decision Maker runs on each node and interacts with the Gthulhu Scheduler:
- GET /api/v1/scheduling/strategies: Retrieves PID-level scheduling strategies for the local scheduler
- POST /api/v1/metrics: Receives scheduler metrics data
Scheduling Strategy Data Model
A scheduling strategy at the Decision Maker level is represented using the following structure:
{
  "success": true,
  "scheduling": [
    {
      "priority": 1,
      "execution_time": 20000000,
      "pid": 12345
    },
    {
      "priority": 0,
      "execution_time": 10000000,
      "pid": 67890
    }
  ]
}
Key components of a scheduling strategy:
- Priority (int): When greater than 0, the task's virtual runtime is set to the minimum value, giving it the highest scheduling priority
- Execution Time (uint64): Custom time slice in nanoseconds for the task
- PID (int): Process ID to which the strategy applies
Note
Label selectors for Kubernetes Pods are handled at the Manager/Decision Maker level.
The Decision Maker resolves label selectors into specific PIDs by scanning /proc for matching Pod processes before passing them to the scheduler.
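A client of the Decision Maker could decode the response shown above with Go structs like the following. The JSON field names come from the example payload; the type names Strategy and StrategyResponse and the helper decodeStrategies are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Strategy is one PID-level entry of the "scheduling" array.
type Strategy struct {
	Priority      int    `json:"priority"`
	ExecutionTime uint64 `json:"execution_time"` // nanoseconds
	PID           int    `json:"pid"`
}

// StrategyResponse is the envelope returned by the strategies endpoint.
type StrategyResponse struct {
	Success    bool       `json:"success"`
	Scheduling []Strategy `json:"scheduling"`
}

// decodeStrategies parses a response body and indexes the entries by PID,
// which makes the per-task lookup during scheduling a single map access.
func decodeStrategies(body []byte) (map[int]Strategy, error) {
	var resp StrategyResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode strategies: %w", err)
	}
	if !resp.Success {
		return nil, errors.New("decision maker reported failure")
	}
	byPID := make(map[int]Strategy, len(resp.Scheduling))
	for _, s := range resp.Scheduling {
		byPID[s.PID] = s
	}
	return byPID, nil
}

func main() {
	body := []byte(`{"success": true, "scheduling": [
		{"priority": 1, "execution_time": 20000000, "pid": 12345},
		{"priority": 0, "execution_time": 10000000, "pid": 67890}]}`)
	strategies, err := decodeStrategies(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", strategies[12345])
}
```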
Strategy Application Flow
sequenceDiagram
participant M as Manager
participant DM as Decision Maker
participant S as Gthulhu Scheduler
participant T as Task Pool
M->>DM: Distribute scheduling intents
DM->>DM: Resolve Pod labels → PIDs
loop Every interval seconds
S->>DM: Fetch PID-level strategies
DM->>S: Return strategy list
S->>S: Update strategy map
end
Note over S,T: During task scheduling
T->>S: Task needs scheduling
S->>S: Check if task has custom strategy
S->>S: Apply priority setting if needed
S->>S: Apply custom execution time if specified
S->>T: Schedule task with applied strategy
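The "during task scheduling" half of this sequence can be sketched as a single function. The types and the applyStrategy name are hypothetical, and the "minimum value" for virtual runtime is passed in as minVtime because this page does not define how it is obtained.

```go
package main

import "fmt"

// Strategy holds the per-PID settings fetched from the Decision Maker.
type Strategy struct {
	Priority      int
	ExecutionTime uint64 // nanoseconds; 0 means "not specified"
}

// Task is a simplified view of a task about to be dispatched.
type Task struct {
	PID     int
	Vtime   uint64
	SliceNs uint64
}

// applyStrategy starts from the default slice, then applies any strategy
// registered for the task's PID: a positive priority pulls the virtual
// runtime down to minVtime, and a non-zero execution time replaces the slice.
func applyStrategy(t *Task, strategies map[int]Strategy, minVtime, defaultSliceNs uint64) {
	t.SliceNs = defaultSliceNs
	s, ok := strategies[t.PID]
	if !ok {
		return // no custom strategy for this task
	}
	if s.Priority > 0 {
		t.Vtime = minVtime
	}
	if s.ExecutionTime > 0 {
		t.SliceNs = s.ExecutionTime
	}
}

func main() {
	strategies := map[int]Strategy{12345: {Priority: 1, ExecutionTime: 20_000_000}}
	t := Task{PID: 12345, Vtime: 9_000}
	applyStrategy(&t, strategies, 0, 5_000_000)
	fmt.Printf("%+v\n", t)
}
```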
Authentication and Security
The Gthulhu API supports multiple security mechanisms:
- JWT Authentication: Token-based authentication between the Scheduler and the Decision Maker, backed by an RSA key pair
- Mutual TLS (mTLS): Optional mutual TLS for secure communication between components
- RBAC: Role-Based Access Control for user management on the Manager
Kernel Mode
Gthulhu supports an experimental kernel mode where scheduling decisions are made entirely in BPF space without the user-space scheduling loop. In this mode:
- The BPF scheduler handles task dispatching directly in the kernel
- The user-space component only manages strategy updates and monitoring
- Strategy changes are pushed to the BPF scheduler via eBPF map updates (UpdatePriorityTaskWithPrio, RemovePriorityTask)
This mode can reduce latency by avoiding the kernel-to-user-space round trip for each scheduling decision.
BPF and User Space Communication
Communication Mechanism
sequenceDiagram
participant K as BPF (Kernel Space)
participant U as Go (User Space)
K->>U: Enqueue tasks via ring buffer
U->>U: Drain queued tasks
U->>U: Select task & determine time slice
U->>U: Select CPU (topology-aware)
U->>K: Dispatch task via user ring buffer
K->>K: Execute scheduling decision
Note over K,U: Periodic metrics reporting
U->>U: Collect BSS data (nr_queued, nr_scheduled, etc.)
U-->>U: Send metrics to API server
Metrics Data
The scheduler collects and reports the following metrics:
| Metric | Description |
|---|---|
| nr_queued | Number of tasks queued in the scheduler |
| nr_scheduled | Number of tasks scheduled |
| nr_running | Number of tasks currently running |
| nr_online_cpus | Number of online CPUs |
| nr_user_dispatches | Number of user-space dispatches |
| nr_kernel_dispatches | Number of kernel-space dispatches |
| nr_cancel_dispatches | Number of canceled dispatches |
| nr_bounce_dispatches | Number of bounce dispatches |
| nr_failed_dispatches | Number of failed dispatches |
| nr_sched_congested | Number of scheduler congestion events |
Configuration
Gthulhu uses a YAML configuration file for all settings:
scheduler:
  slice_ns_default: 20000000 # Default time slice (20ms)
  slice_ns_min: 1000000 # Minimum time slice (1ms)
  mode: gthulhu # Plugin mode: gthulhu, simple, simple-fifo
  kernel_mode: false # Experimental kernel-mode scheduling
  max_time_watchdog: true # Detect scheduling stalls
api:
  url: http://127.0.0.1:8080 # Decision Maker endpoint
  interval: 5 # Strategy fetch interval (seconds)
  public_key_path: ./config/jwt_public_key.pem
  enabled: true # Enable API communication
  auth_enabled: true # Enable JWT authentication
  mtls:
    enable: false # Enable mutual TLS
    cert_pem: ""
    key_pem: ""
    ca_pem: ""
debug: false # Enable debug mode (pprof on :6060)
early_processing: false # Early task processing in BPF
builtin_idle: false # Built-in idle CPU selection in BPF
Debugging and Monitoring
BPF Tracing
# Monitor BPF program execution
sudo cat /sys/kernel/debug/tracing/trace_pipe
# Check BPF statistics
sudo bpftool prog show
sudo bpftool map dump name task_info_map
Differences from CFS
| Feature | CFS (Completely Fair Scheduler) | Gthulhu |
|---|---|---|
| Scheduling Policy | Virtual runtime based | Virtual runtime + latency optimization |
| Task Classification | Unified processing | Automatic classification optimization |
| CPU Selection | Basic load balancing | Topology-aware + cache affinity |
| Dynamic Adjustment | Limited | Comprehensive adaptive adjustment |
| Extensibility | Kernel built-in | User-space extensible (plugin system) |
| Multi-Node | Not applicable | Distributed scheduling via Manager |
| Strategy Management | Static kernel parameters | Dynamic REST API + Kubernetes integration |
Deep Dive
For more implementation details, refer to the API Reference and source code comments.