The Hook
Taking control away from conservative default OS behaviors to maximize throughput.
Your Kernel Is Playing It Safe — And Leaving Performance on the Table
In Part 1, we mapped the thermal landscape. We learned how to read the sensors, spot the throttling, and understand why your SFF box runs hot. Now we cross the boundary from hardware into software — specifically, into the kernel’s power management stack, which makes the real-time decisions that determine whether your CPU runs at 2.8GHz or 800MHz under load.
The default Linux kernel configuration is conservative. It prioritizes stability, battery life, and thermal safety over raw throughput. That’s the right default for a laptop on an airplane. It’s the wrong default for a mini PC sitting on your desk with a compiler running, a game streaming, and a browser with forty tabs.
This post is about understanding what the kernel is doing, why it defaults to caution, and how to tune it for machines that live in thermal pressure cookers.
The Stack: How the Kernel Controls Power
The power management path on modern x86 Linux looks like this:
User space (sysfs writes, cpupower)
↓
cpufreq governor (kernel: schedutil, ondemand, performance)
↓
cpufreq driver (intel_pstate, amd-pstate, acpi-cpufreq)
↓
CPU microcode (MSR writes, P-state transitions)
↓
Silicon (voltage/frequency changes)
Every layer in that chain can be tuned. Most users never touch anything below the governor. We’re going to look at all of it.
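To see which pieces of that chain are active on a given machine, the top two kernel layers can be read straight from sysfs. The following is a minimal sketch, assuming the standard cpufreq policy directories; `describe_policy` is a hypothetical helper name, not part of any existing tool.

```bash
#!/bin/bash
# Print driver, governor and frequency range for one cpufreq policy directory.
# describe_policy is a hypothetical helper, not part of any existing tool.
describe_policy() {
    local dir=$1 driver governor min max
    driver=$(cat "$dir/scaling_driver" 2>/dev/null || echo "unknown")
    governor=$(cat "$dir/scaling_governor" 2>/dev/null || echo "unknown")
    min=$(cat "$dir/scaling_min_freq" 2>/dev/null || echo 0)
    max=$(cat "$dir/scaling_max_freq" 2>/dev/null || echo 0)
    # sysfs reports kHz; convert to MHz for readability
    printf '%s: driver=%s governor=%s range=%d-%d MHz\n' \
        "$(basename "$dir")" "$driver" "$governor" "$((min / 1000))" "$((max / 1000))"
}

# Walk every policy the kernel exposes; prints nothing if cpufreq is absent
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    [ -d "$policy" ] || continue
    describe_policy "$policy"
done
```

On most desktop CPUs each core has its own policy, so expect one line per core.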
cpufreq Governors: The Decision Makers
The governor is the kernel’s policy engine for frequency scaling. It decides, every few milliseconds, what clock speed the CPU should run at. The decision is based on load history and the governor’s internal algorithm, within frequency limits that the thermal subsystem can lower when the chip runs hot.
The Stock Governors
| Governor | Behavior | Use Case |
|---|---|---|
| performance | Max frequency, always | Benchmarking, real-time workloads |
| powersave | Min frequency, always | Battery saving, thermal emergency |
| ondemand | Ramp to max under load, drop when idle | Legacy default, aggressive |
| conservative | Slower ramp than ondemand | Battery-powered systems, rarely useful on desktops |
| schedutil | Uses scheduler load data directly | Modern default, kernel 4.7+ |
What You Actually Want
For a desktop mini PC, schedutil is usually the right choice — but not the stock schedutil. The default schedutil is tuned for laptops and servers. It ramps slowly, drops quickly, and treats sustained load as a reason to back off rather than lean in.
We want schedutil with faster ramping, higher sustained frequencies, and awareness of our thermal headroom. The kernel exposes tunable parameters for this.
Reading and Setting the Governor
# See current governor for all CPUs
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# See available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# Set governor to schedutil (temporary, until reboot)
echo schedutil | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
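Writing a governor name that the current driver does not offer simply fails, so it can help to check first. A minimal sketch, assuming the list file shown above; `governor_available` is a hypothetical helper name.

```bash
#!/bin/bash
# governor_available is a hypothetical helper: it succeeds if the named
# governor appears in the given scaling_available_governors file.
governor_available() {
    local wanted=$1 list_file=$2 g
    [ -r "$list_file" ] || return 1
    for g in $(cat "$list_file"); do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

LIST=/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
if governor_available schedutil "$LIST"; then
    echo "schedutil is available"
else
    echo "schedutil is not offered by the current driver"
fi
```

If schedutil is missing on an Intel system, the driver is most likely running in active mode, which is covered further down.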
To make it permanent, you have two paths: the systemd service path (works everywhere) or the initramfs path (cleaner, but distro-specific).
Persistent Governor: systemd Service
cat << 'EOF' | sudo tee /etc/systemd/system/cpufreq-performance.service
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo schedutil | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cpufreq-performance.service
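After the next reboot it is worth confirming that the service actually took effect on every CPU. The sketch below counts CPUs whose governor differs from the expected one; `count_mismatched` is a hypothetical helper name.

```bash
#!/bin/bash
# count_mismatched is a hypothetical helper: it prints how many of the given
# scaling_governor files do NOT contain the expected governor.
count_mismatched() {
    local expected=$1 f n=0
    shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        [ "$(cat "$f")" = "$expected" ] || n=$((n + 1))
    done
    echo "$n"
}

bad=$(count_mismatched schedutil /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor)
echo "CPUs not on schedutil: $bad"
```

A nonzero count usually means another service (a power profile daemon, for example) changed the governor after this unit ran.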
Tuning schedutil: The Hidden Parameters
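schedutil isn’t a black box. It exposes tunables through sysfs that control how aggressively it responds to load. Depending on the cpufreq driver, the schedutil directory sits either directly under /sys/devices/system/cpu/cpufreq/ or under each policyN/ directory there.
Key Tunables
| Parameter | Path | Default | What It Does |
|---|---|---|---|
| rate_limit_us | /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us | 1000 (varies by driver) | Minimum microseconds between frequency changes |
| up_rate_limit_us | /sys/devices/system/cpu/cpufreq/schedutil/up_rate_limit_us | 500 | Min time between upward changes (out-of-tree patch, absent from mainline) |
| down_rate_limit_us | /sys/devices/system/cpu/cpufreq/schedutil/down_rate_limit_us | 50000 | Min time between downward changes (out-of-tree patch, absent from mainline) |
Mainline kernels ship only rate_limit_us. The split up/down limits come from Android-derived and other patched kernels, so check which files your kernel actually exposes before relying on them.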
On kernels that carry it, the stock down_rate_limit_us is 50ms. That means once schedutil ramps up, it waits 50 milliseconds before it will even consider dropping frequency. That’s fine for bursty workloads. For sustained compile jobs or gaming, it means the CPU holds a high frequency through every short gap in the load, which keeps latency low but also spends thermal budget that a constrained machine would rather save for the next burst of real work.
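The rate limits translate directly into how often the governor is allowed to act. As a quick worked example, a limit of N microseconds caps the governor at 1,000,000 / N frequency changes per second; `max_changes_per_sec` is a hypothetical helper that does the arithmetic.

```bash
#!/bin/bash
# max_changes_per_sec: upper bound on frequency changes per second for a
# given rate limit in microseconds (hypothetical helper, integer arithmetic).
max_changes_per_sec() {
    local limit_us=$1
    [ "$limit_us" -gt 0 ] || { echo "unlimited"; return; }
    echo $((1000000 / limit_us))
}

for limit in 500 1000 50000; do
    printf '%6d us -> at most %s changes/s\n' "$limit" "$(max_changes_per_sec "$limit")"
done
```

A 50000μs limit therefore allows at most 20 changes per second in that direction, while a 500μs limit allows up to 2000.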
Fast-Ramp schedutil Profile
#!/bin/bash
# schedutil_fast_ramp.sh: tune schedutil for SFF desktop performance
set -euo pipefail
# Write a tunable only if this kernel exposes it; report anything skipped
set_tunable() {
    if [ -w "$1" ] && echo "$2" > "$1" 2>/dev/null; then
        echo "  $1 = $2"
    else
        echo "  skipped (not available): $1" >&2
    fi
}
# Fast ramp up, slow ramp down
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    [ -d "$policy" ] || continue
    # Rate limits in microseconds
    set_tunable "$policy/schedutil/up_rate_limit_us" 100
    set_tunable "$policy/schedutil/down_rate_limit_us" 2000
    # IO boost: ramp frequency when waiting on disk
    set_tunable "$policy/schedutil/iowait_boost_enable" 1
    echo "Tuned: $(basename "$policy")"
done
echo "schedutil fast-ramp profile applied."
Save as schedutil_fast_ramp.sh and run it with sudo. The 100μs up-limit lets the governor raise frequency well inside a single scheduler tick (1 to 4 ms on typical kernels). The 2000μs down-limit keeps frequency high through brief idle gaps without the 50ms penalty of stock.
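Before changing any tunable, it is cheap to record the current values so they can be put back without a reboot. A minimal sketch, assuming the per-policy schedutil directories used by the script above; `save_tunables` and `restore_tunables` are hypothetical helper names.

```bash
#!/bin/bash
# save_tunables: record "path value" pairs for every readable file in the
# given schedutil directories. restore_tunables writes them back.
# Both names are hypothetical helpers.
save_tunables() {
    local out=$1 dir f
    shift
    : > "$out"
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            if [ -f "$f" ] && [ -r "$f" ]; then
                printf '%s %s\n' "$f" "$(cat "$f")" >> "$out"
            fi
        done
    done
    return 0
}

restore_tunables() {
    local src=$1 path value
    while read -r path value; do
        # Skip entries that no longer exist or are not writable
        if [ -w "$path" ]; then echo "$value" > "$path"; fi
    done < "$src"
    return 0
}

# Usage (as root), before applying the fast-ramp profile:
#   save_tunables /tmp/schedutil.bak /sys/devices/system/cpu/cpufreq/policy*/schedutil
# and to undo it later:
#   restore_tunables /tmp/schedutil.bak
```

The backup is a plain text file, so it also doubles as a record of what the stock values were on this particular kernel.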
The P-State Drivers: intel_pstate vs amd-pstate
The governor decides what frequency to request. The P-state driver decides how to translate that request into actual hardware states.
intel_pstate
Intel’s driver has two modes:
- Active mode (intel_pstate=active): The driver makes its own frequency decisions and bypasses the generic governors entirely. You see this as the powersave and performance “governors” on Intel systems.
- Passive mode (intel_pstate=passive): The driver registers as an ordinary cpufreq driver (reported as intel_cpufreq) and follows the governor you select. This is what you want for tuning.
Forcing Passive Mode
Add to your kernel boot parameters (via GRUB_CMDLINE_LINUX_DEFAULT or systemd-boot):
intel_pstate=passive
Then update your bootloader and reboot. After reboot, you’ll see the real governors (schedutil, ondemand, performance) available instead of Intel’s fake ones.
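A quick way to confirm the mode after reboot is to look at the driver name the kernel reports. The sketch below assumes the driver registers as `intel_pstate` in active mode and as `intel_cpufreq` in passive mode; `pstate_mode_from_driver` is a hypothetical helper name.

```bash
#!/bin/bash
# pstate_mode_from_driver: map a scaling_driver name to the intel_pstate mode.
# Assumption: "intel_pstate" means active mode, "intel_cpufreq" means passive.
pstate_mode_from_driver() {
    case $1 in
        intel_pstate)  echo "active" ;;
        intel_cpufreq) echo "passive" ;;
        *)             echo "not intel_pstate ($1)" ;;
    esac
}

DRIVER_FILE=/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
if [ -r "$DRIVER_FILE" ]; then
    pstate_mode_from_driver "$(cat "$DRIVER_FILE")"
fi
```

On kernels that expose it, /sys/devices/system/cpu/intel_pstate/status reports the same information directly.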
amd-pstate
AMD’s driver is newer and cleaner. It supports:
- Guided mode: Kernel suggests frequencies, firmware has final say
- Passive mode: Full governor control, like intel_pstate passive
- Active mode: Firmware autonomously manages P-states
For tuning, you want guided or passive. Active mode on AMD is actually quite good — the firmware knows the silicon better than the kernel — but it doesn’t expose the tunables we want.
Switching amd-pstate Modes
# Check current mode
cat /sys/devices/system/cpu/amd_pstate/status
# Switch to passive at runtime (needs a recent kernel; lasts until reboot)
echo passive | sudo tee /sys/devices/system/cpu/amd_pstate/status
To make passive persistent, add amd_pstate=passive (the spelling used in the kernel documentation) to kernel boot parameters.
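Boot parameters are easy to mistype or to lose in a bootloader update, so it helps to check what the running kernel actually received. A minimal sketch that searches a command line string for a parameter; `cmdline_has` is a hypothetical helper, and treating dashes and underscores as equivalent is an assumption about how the kernel matches parameter names.

```bash
#!/bin/bash
# cmdline_has: succeed if the kernel command line string contains the given
# parameter. Dashes and underscores are normalized on both sides, which is an
# assumption about kernel parameter matching. Hypothetical helper.
cmdline_has() {
    local want="${1//-/_}" cmdline="${2//-/_}" word words
    read -ra words <<< "$cmdline"
    for word in "${words[@]}"; do
        [ "$word" = "$want" ] && return 0
    done
    return 1
}

if [ -r /proc/cmdline ]; then
    if cmdline_has "amd_pstate=passive" "$(cat /proc/cmdline)"; then
        echo "amd_pstate=passive is set at boot"
    else
        echo "amd_pstate=passive is not on the kernel command line"
    fi
fi
```

The same check works for intel_pstate=passive by changing the first argument.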
The MSR Layer: When You Need to Bypass Everything
Model-Specific Registers (MSRs) are the CPU’s control panel. The kernel’s P-state driver writes to these. When the driver isn’t giving you what you need, you can write directly — but you’re bypassing all the safety checks.
Reading Current P-States
sudo apt install msr-tools # or equivalent for your distro
sudo modprobe msr
# Read IA32_PERF_STATUS (current P-state multiplier)
sudo rdmsr -a 0x198
# Read IA32_PERF_CTL (requested P-state)
sudo rdmsr -a 0x199
# Read IA32_THERM_STATUS (thermal throttle flags)
sudo rdmsr -a 0x19c
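The output is hex. In 0x19c, bit 0 is the thermal status flag, set while the CPU’s own thermal monitor is actively throttling, and bit 1 is its sticky log, which stays set after a throttle event until cleared. Bit 2 is the PROCHOT# active flag, set when external hardware (thermal sensor, power regulator) is forcing a throttle, and bit 3 is the matching sticky log.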
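Reading raw hex gets tedious, so a small decoder helps. The sketch below interprets one IA32_THERM_STATUS value; the bit positions are my reading of the Intel SDM (bit 0 thermal status, bit 2 PROCHOT#/FORCEPR# event, bits 22:16 digital readout, bit 31 readout valid) and should be verified for your CPU. `decode_therm_status` is a hypothetical helper name.

```bash
#!/bin/bash
# decode_therm_status: interpret one IA32_THERM_STATUS value given as hex,
# with or without a 0x prefix (rdmsr prints it without).
# Assumed layout, to be verified against the Intel SDM for your CPU:
#   bit 0      thermal status (CPU thermal monitor currently throttling)
#   bit 2      PROCHOT#/FORCEPR# currently asserted by the platform
#   bits 22:16 digital readout, in degrees C below TjMax
#   bit 31     digital readout valid
decode_therm_status() {
    local v=$((16#${1#0x}))
    echo "thermal_throttle_active=$(( v & 1 ))"
    echo "prochot_active=$(( (v >> 2) & 1 ))"
    if (( (v >> 31) & 1 )); then
        echo "degrees_below_tjmax=$(( (v >> 16) & 0x7f ))"
    else
        echo "degrees_below_tjmax=invalid"
    fi
}

# Example with a literal value; on real hardware feed it from:
#   sudo rdmsr -p 0 0x19c
decode_therm_status 88140004
```

The readout is relative: a value of 20 means the core is 20°C below its TjMax, not at 20°C.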
Undervolting via MSR (Intel)
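Intel’s overclocking mailbox MSR (0x150) controls voltage offsets. The intel-undervolt tool wraps this safely. Note that firmware on many newer systems locks this interface, in which case the offsets are ignored:
sudo pacman -S intel-undervolt # Arch
sudo intel-undervolt read
# Offsets are set per voltage plane in /etc/intel-undervolt.conf, for example:
# undervolt 0 'CPU' -80
# undervolt 2 'CPU Cache' -80
# undervolt 4 'Analog I/O' -50
sudo intel-undervolt apply
- CPU (plane 0): CPU cores
- CPU Cache (plane 2): Cache and ring bus
- Analog I/O (plane 4): Analog I/O
The GPU (plane 1) and System Agent (plane 3) have their own entries in the same file.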
Negative values reduce voltage. Start conservative (-50mV), test with stress-ng, then iterate. An unstable undervolt can hang or crash the system, sometimes only under specific loads. A working undervolt can drop temps 5–15°C at the same frequency.
Curve Optimizer (AMD)
AMD’s equivalent lives in firmware and is normally set in the BIOS. On Linux, ryzenadj can tweak power limits and curve optimizer settings on the APUs it supports:
sudo pacman -S ryzenadj
sudo ryzenadj --info # Read current settings
sudo ryzenadj --stapm-limit=25000 # Set STAPM to 25W
sudo ryzenadj --set-coall=0xFFFF6 # Curve optimizer -10 on all cores, encoded as 0x100000 - 10; verify the flag and encoding against your ryzenadj version
Thermal-Aware Tuning: The Integration Point
Here’s where Part 1 and Part 2 connect. The kernel’s thermal subsystem (thermal_zone) can trigger actions when temperature thresholds are crossed. By default, it may force a frequency reduction or shut down cores. We want to tune this so the kernel uses the thermal headroom we mapped in Part 1, rather than panicking early.
Reading Thermal Trip Points
for zone in /sys/class/thermal/thermal_zone*; do
[ -d "$zone" ] || continue
type=$(cat "$zone/type" 2>/dev/null || echo "unknown")
echo "=== $type ==="
for trip in "$zone"/trip_point_*_temp; do
[ -f "$trip" ] || continue
temp=$(cat "$trip")
type_file="${trip/_temp/_type}"
trip_type=$(cat "$type_file" 2>/dev/null || echo "unknown")
printf " %-15s %6.1f°C\n" "$trip_type" "$((temp/1000)).$((temp%1000/100))"
done
done
Typical trip points on x86:
- Critical: Hardware emergency shutdown (usually 100–105°C)
- Passive: Kernel starts reducing performance (usually 85–95°C)
- Active: Fan speed increase or other cooling action
The passive trip point is the one that matters. When crossed, the kernel’s thermal subsystem tells cpufreq to drop frequency. On many systems this is set conservatively: the kernel starts throttling around 85°C even though the chip’s rated Tjunction is often close to 100°C.
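To see how close a zone currently runs to its first trip point, subtract the live temperature from the trip value. Both are reported in millidegrees Celsius. A minimal sketch; `headroom_c` is a hypothetical helper, and trip 0 is used only as an example index since its type varies by zone.

```bash
#!/bin/bash
# headroom_c: whole degrees Celsius between a current temperature and a trip
# point, both given in millidegrees as sysfs reports them. Hypothetical helper.
headroom_c() {
    local current_mc=$1 trip_mc=$2
    echo $(( (trip_mc - current_mc) / 1000 ))
}

for zone in /sys/class/thermal/thermal_zone*; do
    # Some zones refuse reads or have no trip points; skip those quietly
    cur=$(cat "$zone/temp" 2>/dev/null) || continue
    trip=$(cat "$zone/trip_point_0_temp" 2>/dev/null) || continue
    printf '%s: %s°C below trip 0\n' "$(cat "$zone/type")" "$(headroom_c "$cur" "$trip")"
done
```

A negative result means the zone is already past that trip point.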
Raising the Passive Trip Point (Use With Caution)
# Find the x86_pkg_temp zone and its passive trip point
ZONE=$(grep -l "x86_pkg_temp" /sys/class/thermal/thermal_zone*/type 2>/dev/null | head -1 | xargs -r dirname)
TRIP="$ZONE/trip_point_0_temp"
# Only touch trip 0 if it exists and really is the passive trip on this machine
if [ -n "$ZONE" ] && [ -f "$TRIP" ] && [ "$(cat "$ZONE/trip_point_0_type")" = "passive" ]; then
    CURRENT=$(cat "$TRIP")
    echo "Current passive trip: $((CURRENT/1000))°C"
    # Set to 95°C (95000 millidegrees). Know your chip's Tjunction first!
    # The write fails if the kernel or driver does not allow writable trip points.
    if echo 95000 | sudo tee "$TRIP" >/dev/null; then
        echo "New passive trip: $(( $(cat "$TRIP") / 1000 ))°C"
    fi
fi
Warning: Only do this if you verified in Part 1 that your cooling can sustain 95°C without hitting critical. The kernel’s default conservative threshold exists because most systems have bad cooling. If you fixed the cooling, you can raise the threshold.
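One way to make that warning mechanical is to refuse any passive trip that sits too close to the critical one. The sketch below clamps a requested value so it stays a fixed margin below critical; `clamp_trip` is a hypothetical helper, and the 5°C default margin is an arbitrary assumption, not a recommendation from any vendor.

```bash
#!/bin/bash
# clamp_trip: print the requested trip temperature, lowered if necessary so it
# stays at least margin_mc below the critical trip. All values in millidegrees.
# Hypothetical helper; the 5000 default margin is an arbitrary assumption.
clamp_trip() {
    local requested_mc=$1 critical_mc=$2 margin_mc=${3:-5000}
    local ceiling=$(( critical_mc - margin_mc ))
    if [ "$requested_mc" -gt "$ceiling" ]; then
        echo "$ceiling"
    else
        echo "$requested_mc"
    fi
}

clamp_trip 95000 100000   # prints 95000
clamp_trip 95000 97000    # prints 92000
```

Feed the result, rather than a hard-coded 95000, into the write if you want the script to be safe across machines with different critical points.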
The Complete SFF Tuning Profile
Putting it together — a script that applies everything from this post:
cat << 'EOF' > sff_kernel_tuning.sh
#!/bin/bash
# sff_kernel_tuning.sh — Kernel tuning profile for SFF mini PCs
# Run as root. Apply after understanding your thermal limits (see Part 1).
set -euo pipefail
echo "=== SFF Kernel Tuning Profile ==="
echo "Date: $(date)"
echo ""
# --- Governor Setup ---
echo "--- Setting Governor ---"
GOVERNOR="schedutil"
echo "$GOVERNOR" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
echo "Governor set to: $GOVERNOR"
echo ""
# --- schedutil Tuning ---
echo "--- Tuning schedutil ---"
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    [ -d "$policy/schedutil" ] || continue
    # Write each tunable only if this kernel exposes it, and report what was skipped
    for kv in up_rate_limit_us=100 down_rate_limit_us=2000 iowait_boost_enable=1; do
        file="$policy/schedutil/${kv%%=*}"
        if [ -w "$file" ] && echo "${kv#*=}" > "$file" 2>/dev/null; then
            echo " $(basename "$policy"): ${kv%%=*}=${kv#*=}"
        else
            echo " $(basename "$policy"): ${kv%%=*} not available, skipped"
        fi
    done
done
echo ""
# --- Intel P-State (if present) ---
if [ -d /sys/devices/system/cpu/intel_pstate ]; then
echo "--- Intel P-State Status ---"
cat /sys/devices/system/cpu/intel_pstate/status 2>/dev/null || echo " Status file not readable"
echo " (Switch to passive mode via kernel param: intel_pstate=passive)"
echo ""
fi
# --- AMD P-State (if present) ---
if [ -f /sys/devices/system/cpu/amd_pstate/status ]; then
echo "--- AMD P-State Status ---"
cat /sys/devices/system/cpu/amd_pstate/status
echo " (Switch to passive/guided via kernel param: amd_pstate=passive)"
echo ""
fi
# --- Thermal Trip Points ---
echo "--- Thermal Trip Points ---"
for zone in /sys/class/thermal/thermal_zone*; do
[ -d "$zone" ] || continue
type=$(cat "$zone/type" 2>/dev/null || echo "unknown")
[ "$type" = "x86_pkg_temp" ] || continue
for trip in "$zone"/trip_point_*_temp; do
[ -f "$trip" ] || continue
temp=$(cat "$trip")
type_file="${trip/_temp/_type}"
trip_type=$(cat "$type_file" 2>/dev/null || echo "unknown")
printf " %-15s %6.1f°C\n" "$trip_type" "$((temp/1000)).$((temp%1000/100))"
done
done
echo ""
# --- Current Frequencies ---
echo "--- Current Frequencies ---"
awk '/MHz/ {printf " %-12s %5.0f MHz\n", $1, $4}' /proc/cpuinfo 2>/dev/null | head -n $(nproc) || true
echo ""
echo "=== Done ==="
echo ""
echo "Next steps:"
echo " 1. Verify with: watch -n1 'grep MHz /proc/cpuinfo | head -1'"
echo " 2. Stress test: stress-ng --cpu \$(nproc) --timeout 120s"
echo " 3. Monitor: ./thermal_audit.sh (from Part 1)"
echo ""
echo "To make permanent:"
echo " - Copy governor set to systemd service"
echo " - Add schedutil tunables to /etc/rc.local or systemd service"
echo " - Set kernel params for P-state driver mode"
EOF
chmod +x sff_kernel_tuning.sh
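After running the profile, a single summary line of where the cores currently sit is often easier to read than per-CPU output. A minimal sketch that reduces the per-CPU scaling_cur_freq files to min/avg/max; `freq_summary` is a hypothetical helper name.

```bash
#!/bin/bash
# freq_summary: print min/avg/max in MHz across the given scaling_cur_freq
# files (sysfs reports kHz). Hypothetical helper.
freq_summary() {
    local f khz n=0 sum=0 min=0 max=0
    for f in "$@"; do
        khz=$(cat "$f" 2>/dev/null) || continue
        [ -n "$khz" ] || continue
        if [ "$n" -eq 0 ] || [ "$khz" -lt "$min" ]; then min=$khz; fi
        if [ "$khz" -gt "$max" ]; then max=$khz; fi
        sum=$((sum + khz)); n=$((n + 1))
    done
    [ "$n" -gt 0 ] || { echo "no frequency data"; return 1; }
    echo "cpus=$n min=$((min / 1000)) avg=$((sum / n / 1000)) max=$((max / 1000)) MHz"
}

freq_summary /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq || true
```

Run it a few times during a stress test: a falling max under constant load is the signature of throttling.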
From Kernel to System
We’ve covered the decision makers (governors), the translators (P-state drivers), and the raw controls (MSRs). The thread that ties them together is this: the kernel defaults to survival, not performance. Every conservative threshold, every slow ramp rate, every early throttle point exists because the kernel was compiled for unknown hardware in unknown environments.
Your SFF machine is known hardware in a known environment. You measured its thermal limits in Part 1. Now you’ve given the kernel permission to operate closer to those limits. The result isn’t just higher benchmark numbers — it’s a system that stays responsive under load, completes compiles faster, and doesn’t mysteriously drop to 800MHz because the default governor panicked at 75°C.
In Part 3, we’ll go deeper into memory — how the kernel allocates, swaps, and caches, and why SFF machines with soldered-down RAM need different tuning than towers with 128GB of DDR5.
The silicon has limits. The kernel was cautious. Now both know the same truth, and the machine is finally allowed to work.