Taming the Beast – Linux Kernel Optimization & Power Management

The Hook

Taking control away from conservative default OS behaviors to maximize throughput.

Your Kernel Is Playing It Safe — And Leaving Performance on the Table

In Part 1, we mapped the thermal landscape. We learned how to read the sensors, spot the throttling, and understand why your SFF box runs hot. Now we cross the boundary from hardware into software — specifically, into the kernel’s power management stack, which makes the real-time decisions that determine whether your CPU runs at 2.8GHz or 800MHz under load.

The default Linux kernel configuration is conservative. It prioritizes stability, battery life, and thermal safety over raw throughput. That’s the right default for a laptop on an airplane. It’s the wrong default for a mini PC sitting on your desk with a compiler running, a game streaming, and a browser with forty tabs.

This post is about understanding what the kernel is doing, why it defaults to caution, and how to tune it for machines that live in thermal pressure cookers.

The Stack: How the Kernel Controls Power

The power management path on modern x86 Linux looks like this:

User space (sysfs writes, cpupower)
    ↓
cpufreq governor (kernel: schedutil, ondemand, performance)
    ↓
cpufreq driver (intel_pstate, amd-pstate, acpi-cpufreq)
    ↓
CPU microcode (MSR writes, P-state transitions)
    ↓
Silicon (voltage/frequency changes)

Every layer in that chain can be tuned. Most users never touch anything below the governor. We’re going to look at all of it.
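To see which pieces of that chain are live on a given machine, a read-only sketch like the following prints the scaling driver and governor for each cpufreq policy. The function name show_cpufreq_stack and its optional root argument are illustrative. The sysfs file names (scaling_driver, scaling_governor, scaling_cur_freq) are the standard cpufreq ones.

```shell
#!/bin/bash
# show_cpufreq_stack [ROOT]
# Lists driver, governor and current frequency for every cpufreq policy.
# ROOT defaults to the real sysfs directory; another path is only useful
# when testing against a copy of the tree.
show_cpufreq_stack() {
    local root="${1:-/sys/devices/system/cpu/cpufreq}"
    local policy driver governor khz
    for policy in "$root"/policy*; do
        [ -d "$policy" ] || continue
        driver=$(cat "$policy/scaling_driver" 2>/dev/null || echo "unknown")
        governor=$(cat "$policy/scaling_governor" 2>/dev/null || echo "unknown")
        khz=$(cat "$policy/scaling_cur_freq" 2>/dev/null || echo 0)
        # scaling_cur_freq is reported in kHz
        printf '%s driver=%s governor=%s freq=%dMHz\n' \
            "$(basename "$policy")" "$driver" "$governor" "$((khz / 1000))"
    done
}
```

Running show_cpufreq_stack with no arguments reads the live system. A driver of intel_pstate or amd-pstate-epp usually means the driver is making its own decisions, while intel_cpufreq, amd-pstate or acpi-cpufreq means a generic governor is in charge.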

cpufreq Governors: The Decision Makers

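The governor is the kernel’s policy engine for frequency scaling. It decides, every few milliseconds, what clock speed the CPU should request. The decision is based on recent load (scheduler utilization for schedutil, sampled busy time for ondemand and conservative) and the governor’s internal algorithm. Thermal data does not feed the governor directly: the thermal subsystem caps the maximum frequency the governor is allowed to request.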

The Stock Governors

Governor	Behavior	Use Case
`performance`	Max frequency, always	Benchmarking, real-time workloads
`powersave`	Min frequency, always	Battery saving, thermal emergency
`ondemand`	Ramp to max under load, drop when idle	Legacy default, aggressive
`conservative`	Slower, stepwise ramp compared with ondemand	Rarely useful on desktops
`schedutil`	Uses scheduler load data directly	Modern default, kernel 4.7+

What You Actually Want

For a desktop mini PC, schedutil is usually the right choice — but not the stock schedutil. The default schedutil is tuned for laptops and servers. It ramps slowly, drops quickly, and treats sustained load as a reason to back off rather than lean in.

We want schedutil with faster ramping, higher sustained frequencies, and awareness of our thermal headroom. The kernel exposes tunable parameters for this.

Reading and Setting the Governor

# See current governor for all CPUs
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# See available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Set governor to schedutil (temporary, until reboot)
echo schedutil | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

To make it permanent, you have two paths: the systemd service path (works everywhere) or the initramfs path (cleaner, but distro-specific).

Persistent Governor: systemd Service

cat << 'EOF' | sudo tee /etc/systemd/system/cpufreq-governor.service
[Unit]
Description=Set CPU frequency governor to schedutil

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo schedutil | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now cpufreq-governor.service

Tuning schedutil: The Hidden Parameters

schedutil isn’t a black box, but how much it exposes depends on the kernel you run. Its tunables live in sysfs, not debugfs: under /sys/devices/system/cpu/cpufreq/policy*/schedutil/ on drivers with per-policy governors, or /sys/devices/system/cpu/cpufreq/schedutil/ otherwise.

Key Tunables

Parameter	Availability	What It Does
`rate_limit_us`	Mainline kernels	Minimum microseconds between frequency changes in either direction; the default is derived from the driver’s transition latency
`up_rate_limit_us`	Some Android and vendor kernels only	Minimum time between upward changes
`down_rate_limit_us`	Some Android and vendor kernels only	Minimum time between downward changes

On a mainline kernel, rate_limit_us is the only rate knob. The split up/down limits come from out-of-tree patches, so check that the files exist before relying on them. Where they do exist, the values matter. With a down_rate_limit_us of 50000 (50ms), once schedutil ramps up it waits 50 milliseconds before it will even consider dropping frequency. That avoids re-ramp latency in bursty workloads, but it also holds the core at high frequency and voltage after the load has gone, and on a thermally constrained machine that idle heat eats into the headroom you need for the next burst.

Fast-Ramp schedutil Profile

#!/bin/bash
# schedutil_fast_ramp.sh — Tune schedutil for SFF desktop performance

set -euo pipefail

# Fast ramp up, slow ramp down
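# Write a value to a tunable only if this kernel exposes it.
set_tunable() {
    if [ -w "$1" ]; then
        echo "$2" > "$1" || echo "  rejected: $1" >&2
    else
        echo "  skipped (missing or not writable): $1" >&2
    fi
}

for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    [ -d "$policy" ] || continue

    # Rate limits in microseconds (the split up/down limits are not in mainline)
    set_tunable "$policy/schedutil/up_rate_limit_us" 100
    set_tunable "$policy/schedutil/down_rate_limit_us" 2000

    # IO boost: ramp frequency when waiting on disk
    set_tunable "$policy/schedutil/iowait_boost_enable" 1

    echo "Tuned: $(basename "$policy")"
done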

echo "schedutil fast-ramp profile applied."

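Save as schedutil_fast_ramp.sh and run it with sudo. Where the kernel supports them, the 100μs up-limit lets the governor raise frequency well within a single scheduler tick (1 to 4ms on typical kernel configurations), and the 2000μs down-limit keeps frequency high through brief idle gaps without a hold as long as 50ms. On a mainline kernel these files do not exist, so the writes are skipped and nothing changes; rate_limit_us is the only schedutil rate knob there.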
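Before tuning, it helps to know which schedutil tunables your kernel actually exposes. The sketch below simply lists every file under the per-policy and global schedutil directories along with its current value. The function name list_schedutil_tunables is illustrative, and the optional root argument exists only so the function can be pointed at a copy of the tree.

```shell
#!/bin/bash
# list_schedutil_tunables [ROOT]
# Prints "relative/path=value" for every schedutil tunable found.
list_schedutil_tunables() {
    local root="${1:-/sys/devices/system/cpu/cpufreq}"
    local dir f
    # Per-policy directories first, then the global directory that drivers
    # without per-policy governors use.
    for dir in "$root"/policy*/schedutil "$root"/schedutil; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            [ -f "$f" ] || continue
            printf '%s=%s\n' "${f#"$root"/}" "$(cat "$f" 2>/dev/null || echo '?')"
        done
    done
}
```

If the output is empty, schedutil is not the active governor for any policy. If it shows only rate_limit_us, you are on a kernel without the split up/down limits.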

The P-State Drivers: intel_pstate vs amd-pstate

The governor decides what frequency to request. The P-state driver decides how to translate that request into actual hardware states.

intel_pstate

Intel’s driver has two modes:

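Active mode (intel_pstate=active): The driver makes its own frequency decisions and bypasses the generic cpufreq governors. You see this as the powersave and performance “governors” on Intel systems, which are the driver’s internal policies (or hints to the hardware when HWP is enabled), not the generic governors of the same name.
Passive mode (intel_pstate=passive): The driver registers as intel_cpufreq and acts as a plain scaling driver for whichever generic governor you select, such as schedutil. This is what you want for tuning.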

Forcing Passive Mode

Add to your kernel boot parameters (via GRUB_CMDLINE_LINUX_DEFAULT or systemd-boot):

intel_pstate=passive

Then update your bootloader and reboot. After reboot, you’ll see the real governors (schedutil, ondemand, performance) available instead of Intel’s fake ones.
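For GRUB users, the edit can be scripted. The following is a minimal sketch that assumes the usual /etc/default/grub layout with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. The function name add_kernel_param is hypothetical, and systemd-boot users would edit their loader entry instead.

```shell
#!/bin/bash
# add_kernel_param PARAM [FILE]
# Appends PARAM to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# FILE defaults to /etc/default/grub (an assumption about your distro).
add_kernel_param() {
    local param="$1" file="${2:-/etc/default/grub}"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Idempotent: do nothing if the parameter is already on the line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    cp "$file" "$file.bak"    # keep a backup next to the original
    # Insert the parameter just before the closing quote (GNU sed).
    sed -i -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}
```

Run it as root, for example add_kernel_param intel_pstate=passive, then regenerate the GRUB configuration (sudo update-grub on Debian and Ubuntu, sudo grub-mkconfig -o /boot/grub/grub.cfg on most other distros) and reboot.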

amd-pstate

AMD’s driver is newer and cleaner. It supports:

Guided mode: Kernel suggests frequencies, firmware has final say
Passive mode: Full governor control, like intel_pstate passive
Active mode: Firmware autonomously manages P-states

For tuning, you want guided or passive. Active mode on AMD is actually quite good — the firmware knows the silicon better than the kernel — but it doesn’t expose the tunables we want.

Switching amd-pstate Modes

# Check current mode
cat /sys/devices/system/cpu/amd_pstate/status

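# Switch to passive at runtime (needs a kernel whose amd_pstate driver supports
# mode switching through this file; otherwise use the boot parameter)
echo passive | sudo tee /sys/devices/system/cpu/amd_pstate/status

To make passive persistent, add amd_pstate=passive to your kernel boot parameters. The kernel documents the parameter with an underscore.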
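After rebooting, it is worth confirming that the parameter actually reached the kernel. This sketch looks up a name=value pair on the kernel command line. The function name cmdline_value is illustrative, and the optional file argument only exists so the function can be exercised against a sample file instead of /proc/cmdline.

```shell
#!/bin/bash
# cmdline_value NAME [FILE]
# Prints the value of NAME=... from the kernel command line and returns 0,
# or returns 1 if the parameter is not present.
cmdline_value() {
    local name="$1" file="${2:-/proc/cmdline}" word
    local -a words
    read -r -a words < "$file" || true
    for word in "${words[@]}"; do
        case "$word" in
            "$name"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}
```

Pass the parameter name exactly as you wrote it in the bootloader configuration, then compare the result with what the driver's status file reports.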

The MSR Layer: When You Need to Bypass Everything

Model-Specific Registers (MSRs) are the CPU’s control panel. The kernel’s P-state driver writes to these. When the driver isn’t giving you what you need, you can write directly — but you’re bypassing all the safety checks.

Reading Current P-States

sudo apt install msr-tools   # or equivalent for your distro
sudo modprobe msr

# Read IA32_PERF_STATUS (current P-state multiplier)
sudo rdmsr -a 0x198

# Read IA32_PERF_CTL (requested P-state)
sudo rdmsr -a 0x199

# Read IA32_THERM_STATUS (thermal throttle flags)
sudo rdmsr -a 0x19c

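The output is hex. In IA32_THERM_STATUS (0x19c), bit 0 is the thermal status flag, set while the CPU’s own thermal monitor is throttling. Bit 2 is the PROCHOT# flag, set while PROCHOT# or FORCEPR# is asserted, for example when external hardware (a thermal sensor or power regulator) forces a throttle. Bits 1 and 3 are the sticky log versions of those two flags: they stay set after the event ends until software clears them, so they tell you that throttling has happened since the last clear, not that it is happening now.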
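As a sketch of how to read that hex value, the function below decodes a few fields of IA32_THERM_STATUS using the bit positions from Intel's Software Developer's Manual: bit 0 (thermal status), bit 2 (PROCHOT# or FORCEPR# active), bit 4 (critical temperature), and bits 22:16 (degrees below TjMax, valid only when bit 31 is set). The function name decode_therm_status is illustrative.

```shell
#!/bin/bash
# decode_therm_status VALUE
# VALUE must carry a 0x prefix so bash parses it as hexadecimal.
decode_therm_status() {
    local v=$(( $1 ))
    printf 'thermal_status=%d prochot_active=%d critical=%d' \
        $(( v & 1 )) $(( (v >> 2) & 1 )) $(( (v >> 4) & 1 ))
    # The digital readout is only meaningful when the "reading valid" bit is set.
    if (( (v >> 31) & 1 )); then
        printf ' below_tjmax=%dC' $(( (v >> 16) & 0x7f ))
    fi
    printf '\n'
}
```

rdmsr prints hex without a prefix, so a typical call is decode_therm_status "0x$(sudo rdmsr -p 0 0x19c)". A below_tjmax value near zero means the core is running close to its throttle point.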

Undervolting via MSR (Intel)

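Intel’s overclocking mailbox MSR (0x150) controls voltage offsets. The intel-undervolt tool wraps it: you set per-plane offsets in /etc/intel-undervolt.conf and then apply them. On many systems, firmware updates issued after the Plundervolt disclosure lock this interface, in which case the offsets are ignored.

sudo pacman -S intel-undervolt   # Arch
sudo intel-undervolt read

# Edit these lines in /etc/intel-undervolt.conf (offsets in mV)
undervolt 0 'CPU' -80
undervolt 2 'CPU Cache' -80
undervolt 4 'Analog I/O' -50

sudo intel-undervolt apply

Plane 0 (CPU): CPU cores
Plane 2 (CPU Cache): Cache and ring bus
Plane 4 (Analog I/O): Analog I/O
The remaining planes are 1 (GPU, the integrated graphics) and 3 (System Agent).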

Negative values reduce voltage. Start conservative (-50mV), test with stress-ng, then iterate. An unstable undervolt crashes the kernel, sometimes only at light load when voltage is already at its lowest. A working undervolt can drop temps by 5–15°C at the same frequency, depending on the chip.

Curve Optimizer (AMD)

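AMD’s equivalent lives in firmware and is normally set in the BIOS. On Linux, ryzenadj can adjust power limits and, on supported APUs, Curve Optimizer offsets:

sudo pacman -S ryzenadj           # if your repositories carry it; otherwise build from source
sudo ryzenadj --info              # Read current settings
sudo ryzenadj --stapm-limit=25000  # Set STAPM to 25W (value in mW)
sudo ryzenadj --set-coall=<offset> # All-core Curve Optimizer; start conservative (around -10) and see ryzenadj --help for how negative offsets are encoded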

Thermal-Aware Tuning: The Integration Point

Here’s where Part 1 and Part 2 connect. The kernel’s thermal subsystem (thermal_zone) can trigger actions when temperature thresholds are crossed. By default, it may force a frequency reduction or shut down cores. We want to tune this so the kernel uses the thermal headroom we mapped in Part 1, rather than panicking early.

Reading Thermal Trip Points

for zone in /sys/class/thermal/thermal_zone*; do
    [ -d "$zone" ] || continue
    type=$(cat "$zone/type" 2>/dev/null || echo "unknown")
    echo "=== $type ==="
    for trip in "$zone"/trip_point_*_temp; do
        [ -f "$trip" ] || continue
        temp=$(cat "$trip")
        type_file="${trip/_temp/_type}"
        trip_type=$(cat "$type_file" 2>/dev/null || echo "unknown")
        printf "  %-15s %6.1f°C\n" "$trip_type" "$((temp/1000)).$((temp%1000/100))"
    done
done

Typical trip points on x86:

Critical: Hardware emergency shutdown (usually 100–105°C)
Passive: Kernel starts reducing performance (usually 85–95°C)
Active: Fan speed increase or other cooling action

The passive trip point is the one that matters. When crossed, the kernel’s thermal subsystem tells cpufreq to drop frequency. On most systems, this is set conservatively — the kernel starts throttling at 85°C even though the chip can hit 100°C safely.
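To put a number on that gap, the sketch below reports how many degrees separate a zone's current temperature from its lowest passive trip point. The function name passive_headroom is illustrative. It reads the standard temp, trip_point_N_type and trip_point_N_temp files, all in millidegrees Celsius, and assumes that a trip value of zero or below means the trip is disabled, which is how some drivers report it.

```shell
#!/bin/bash
# passive_headroom ZONE_DIR
# Prints whole degrees C between the current temperature and the lowest
# passive trip point of the given thermal zone directory.
passive_headroom() {
    local zone="$1" temp trip t best=""
    temp=$(cat "$zone/temp") || return 1
    for trip in "$zone"/trip_point_*_type; do
        [ -f "$trip" ] || continue
        [ "$(cat "$trip")" = "passive" ] || continue
        t=$(cat "${trip%_type}_temp" 2>/dev/null) || continue
        # Assumption: zero or negative means the trip is disabled.
        [ "$t" -gt 0 ] || continue
        if [ -z "$best" ] || [ "$t" -lt "$best" ]; then best=$t; fi
    done
    [ -n "$best" ] || { echo "no passive trip" >&2; return 1; }
    echo $(( (best - temp) / 1000 ))
}
```

For example, passive_headroom /sys/class/thermal/thermal_zone0 run under a sustained load shows how close the zone sits to the point where the kernel starts pulling frequency down.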

Raising the Passive Trip Point (Use With Caution)

# Find the x86_pkg_temp zone and confirm trip point 0 really is a passive trip
ZONE=$(grep -l "x86_pkg_temp" /sys/class/thermal/thermal_zone*/type 2>/dev/null | head -1 | xargs -I{} dirname {})
if [ -n "$ZONE" ] && [ "$(cat "$ZONE/trip_point_0_type" 2>/dev/null)" = "passive" ]; then
    CURRENT=$(cat "$ZONE/trip_point_0_temp")
    echo "Current passive trip: $((CURRENT/1000))°C"

    # Set to 95°C (95000 millidegrees). Know your chip's Tjunction first!
    echo 95000 | sudo tee "$ZONE/trip_point_0_temp" >/dev/null
    # Read the value back: the driver may reject or clamp the write
    echo "Passive trip now: $(( $(cat "$ZONE/trip_point_0_temp") / 1000 ))°C"
fi

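Warning: Only do this if you verified in Part 1 that your cooling can sustain 95°C without hitting critical. The kernel’s default conservative threshold exists because most systems have bad cooling. If you fixed the cooling, you can raise the threshold. Not every zone accepts the write: it depends on the driver and on the kernel being built with writable trip points, so confirm the new value by reading the file back. The change does not survive a reboot.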
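Because a raised threshold is easy to forget, one cautious pattern is to record the original value before changing it so you can roll back without rebooting. The following is a sketch of that idea. The function names and the default backup location under /run are assumptions made for illustration.

```shell
#!/bin/bash
# set_trip_with_backup TRIP_FILE NEW_VALUE [BACKUP_FILE]
# Saves the original trip value once, writes the new one, and succeeds
# only if the value read back matches what was written.
set_trip_with_backup() {
    local trip_file="$1" new_value="$2" backup="${3:-/run/trip_point.orig}"
    # Keep the first value ever seen so repeated runs do not overwrite it.
    [ -f "$backup" ] || cat "$trip_file" > "$backup" || return 1
    echo "$new_value" > "$trip_file" || return 1
    [ "$(cat "$trip_file")" = "$new_value" ]
}

# restore_trip TRIP_FILE [BACKUP_FILE]
# Writes the saved original value back to the trip point.
restore_trip() {
    local trip_file="$1" backup="${2:-/run/trip_point.orig}"
    [ -f "$backup" ] || return 1
    cat "$backup" > "$trip_file"
}
```

Both functions need root when pointed at real sysfs files. If set_trip_with_backup returns non-zero, the driver rejected or adjusted the value and the threshold should be treated as unchanged.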

The Complete SFF Tuning Profile

Putting it together — a script that applies everything from this post:

cat << 'EOF' > sff_kernel_tuning.sh
#!/bin/bash
# sff_kernel_tuning.sh — Kernel tuning profile for SFF mini PCs
# Run as root. Apply after understanding your thermal limits (see Part 1).

set -euo pipefail

echo "=== SFF Kernel Tuning Profile ==="
echo "Date: $(date)"
echo ""

# --- Governor Setup ---
echo "--- Setting Governor ---"
GOVERNOR="schedutil"
echo "$GOVERNOR" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
echo "Governor set to: $GOVERNOR"
echo ""

# --- schedutil Tuning ---
echo "--- Tuning schedutil ---"
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    [ -d "$policy" ] || continue
    for kv in up_rate_limit_us=100 down_rate_limit_us=2000 iowait_boost_enable=1; do
        f="$policy/schedutil/${kv%%=*}"
        # These tunables are absent on mainline kernels; skip what is not there.
        if [ -w "$f" ]; then echo "${kv#*=}" > "$f" || true; else echo "  skipped (not present): $f"; fi
    done
    echo "  $(basename "$policy"): requested up=100μs down=2000μs io_boost=on"
done
echo ""

# --- Intel P-State (if present) ---
if [ -d /sys/devices/system/cpu/intel_pstate ]; then
    echo "--- Intel P-State Status ---"
    cat /sys/devices/system/cpu/intel_pstate/status 2>/dev/null || echo "  Status file not readable"
    echo "  (Switch to passive mode via kernel param: intel_pstate=passive)"
    echo ""
fi

# --- AMD P-State (if present) ---
if [ -f /sys/devices/system/cpu/amd_pstate/status ]; then
    echo "--- AMD P-State Status ---"
    cat /sys/devices/system/cpu/amd_pstate/status
    echo "  (Switch to passive/guided via kernel param: amd_pstate=passive)"
    echo ""
fi

# --- Thermal Trip Points ---
echo "--- Thermal Trip Points ---"
for zone in /sys/class/thermal/thermal_zone*; do
    [ -d "$zone" ] || continue
    type=$(cat "$zone/type" 2>/dev/null || echo "unknown")
    [ "$type" = "x86_pkg_temp" ] || continue
    
    for trip in "$zone"/trip_point_*_temp; do
        [ -f "$trip" ] || continue
        temp=$(cat "$trip")
        type_file="${trip/_temp/_type}"
        trip_type=$(cat "$type_file" 2>/dev/null || echo "unknown")
        printf "  %-15s %6.1f°C\n" "$trip_type" "$((temp/1000)).$((temp%1000/100))"
    done
done
echo ""

# --- Current Frequencies ---
echo "--- Current Frequencies ---"
awk '/^cpu MHz/ {printf "  cpu%-9d %5.0f MHz\n", n++, $4}' /proc/cpuinfo 2>/dev/null || true
echo ""

echo "=== Done ==="
echo ""
echo "Next steps:"
echo "  1. Verify with: watch -n1 'grep MHz /proc/cpuinfo | head -1'"
echo "  2. Stress test: stress-ng --cpu \$(nproc) --timeout 120s"
echo "  3. Monitor: ./thermal_audit.sh (from Part 1)"
echo ""
echo "To make permanent:"
echo "  - Copy governor set to systemd service"
echo "  - Add schedutil tunables to /etc/rc.local or systemd service"
echo "  - Set kernel params for P-state driver mode"
EOF

chmod +x sff_kernel_tuning.sh
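To check whether the profile changes anything under load, a simple approach is to sample the average clock across all policies before and after applying it. This is a minimal sketch: the function name avg_freq_mhz is illustrative, and it relies only on the standard scaling_cur_freq files, which report kHz.

```shell
#!/bin/bash
# avg_freq_mhz [ROOT]
# Prints the mean of scaling_cur_freq across all cpufreq policies, in MHz.
avg_freq_mhz() {
    local root="${1:-/sys/devices/system/cpu/cpufreq}" f khz sum=0 n=0
    for f in "$root"/policy*/scaling_cur_freq; do
        [ -r "$f" ] || continue
        khz=$(cat "$f") || continue
        sum=$(( sum + khz ))
        n=$(( n + 1 ))
    done
    # No readable policies: report failure instead of dividing by zero.
    [ "$n" -gt 0 ] || return 1
    echo $(( sum / n / 1000 ))
}
```

While stress-ng runs in another terminal, a loop such as while sleep 1; do avg_freq_mhz; done gives one reading per second. A figure that holds steady under load is the goal; one that sags after a minute or two points back at thermals.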

From Kernel to System

We’ve covered the decision makers (governors), the translators (P-state drivers), and the raw controls (MSRs). The thread that ties them together is this: the kernel defaults to survival, not performance. Every conservative threshold, every slow ramp rate, every early throttle point exists because the kernel was compiled for unknown hardware in unknown environments.

Your SFF machine is known hardware in a known environment. You measured its thermal limits in Part 1. Now you’ve given the kernel permission to operate closer to those limits. The result isn’t just higher benchmark numbers — it’s a system that stays responsive under load, completes compiles faster, and doesn’t mysteriously drop to 800MHz because the default governor panicked at 75°C.

In Part 3, we’ll go deeper into memory — how the kernel allocates, swaps, and caches, and why SFF machines with soldered-down RAM need different tuning than towers with 128GB of DDR5.

The silicon has limits. The kernel was cautious. Now both know the same truth, and the machine is finally allowed to work.
