When the Compiler Is the Side Channel

003 · 2026-03-30 · three ways the optimizer silently breaks security-critical code

Compilers optimize your code. That's their job—fewer instructions, faster execution. But when you're writing security-critical code, "faster" and "correct" can be in direct conflict. The optimizer doesn't know which parts of your code exist for security reasons. It just sees inefficiency and removes it.

I found this out firsthand while checking the machine code output of a function I'd written for RISC-V. At -O0—the "no optimization" setting, where the compiler does exactly what you wrote—it looked right. Then I switched to -O2—aggressive optimization—and the compiler had rewritten my carefully constructed security code into something that leaks secrets through timing.

I started looking at what else the optimizer does to security-critical patterns, and it keeps getting worse. The same logic that introduces timing leaks will also erase code that scrubs secrets from memory, and silently drop commands your hardware was waiting for.

All output below is real—clang 23.0.0, RISC-V 32-bit target, not pseudocode.

THE TIMING LEAK

When code handles a secret—a cryptographic key, a password comparison—it must not behave differently depending on what the secret is. If the code takes one path for a 1 bit and a different path for a 0 bit, an attacker can measure which path was taken by timing how long the function takes. Different paths take different amounts of time. That's a side channel: information leaking through execution time, not through the return value.

The defense is to write code that always does the same work regardless of the secret. No if statements, no branches. Instead, you use bitwise arithmetic: turn the flag into a mask, then select with AND and OR. The same instructions run every time, whichever value comes out.

uint32_t ct_select(uint32_t x, uint32_t y, uint32_t flag) {
    uint32_t mask = -(flag & 1);
    return (x & mask) | (y & ~mask);
}

At -O0, the compiler stays close to what you wrote. Five arithmetic instructions, no branches (the xor/and/xor sequence computes the same select as the AND/OR in the source):

# clang -O0 (no optimization)
ct_select:
    # ... stack setup ...
    slli    a2, a2, 31       # move flag's low bit into the sign position
    srai    a2, a2, 31       # sign-extend: mask = -(flag & 1)
    xor     a0, a0, a1       # x ^ y
    and     a0, a0, a2       # (x ^ y) & mask
    xor     a0, a0, a1       # ((x ^ y) & mask) ^ y
    ret

Same instructions run whether the flag is 0 or 1. That's the whole point.

At -O2, LLVM rewrites it:

# clang -O2 (optimized)
ct_select:
    andi    a2, a2, 1
    beqz    a2, .LBB0_2      # ← branch on secret
    mv      a1, a0
.LBB0_2:
    mv      a0, a1
    ret

I traced it through every optimization pass to find where this happens. It's a pass called InstCombine, LLVM's pattern-matching simplifier. It recognizes that the mask arithmetic is really just choosing between x and y based on a condition, and rewrites the whole thing as a select in LLVM's intermediate representation. The base RISC-V instruction set has no conditional move, so the backend lowers that select to a branch. Fewer instructions, faster code. The return value is identical for every input, so functionally nothing changed.

But everything changed for security. The CPU now takes a different path depending on the flag. If that flag is a secret—a key bit, a password comparison result—an attacker who can measure timing just learned its value. LLVM doesn't know the flag is secret. It sees a chance to cut three instructions and takes it.
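
The listings above show the leak but not a mitigation. One commonly used workaround is a value barrier: pass the mask through an empty inline-asm statement so the optimizer can no longer prove it is either all zeros or all ones, and so has nothing to pattern-match into a select. A minimal sketch, assuming GCC/Clang extended asm. The names value_barrier_u32 and ct_select_barrier are hypothetical, and this has not been checked against the clang 23 output shown here, so inspect the disassembly before relying on it:

```c
#include <stdint.h>

/* Hypothetical helper: hide a value from the optimizer.
 * The empty asm claims it may modify v, so the compiler must treat
 * the result as unknown. No instructions are emitted for it. */
static inline uint32_t value_barrier_u32(uint32_t v) {
    __asm__ __volatile__("" : "+r"(v));
    return v;
}

/* Same select as ct_select, but the mask is opaque to the optimizer. */
uint32_t ct_select_barrier(uint32_t x, uint32_t y, uint32_t flag) {
    uint32_t mask = value_barrier_u32(-(flag & 1));
    return (x & mask) | (y & ~mask);
}
```

Nothing in the C standard promises branch-free output, so this is a workaround rather than a guarantee. Checking the generated assembly stays part of the job.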

THE VANISHING CLEANUP

The timing leak is bad, but at least the code still runs—it just runs differently depending on the secret. This one's worse. The compiler deletes your security code entirely and you get no warning.

When a function handles a secret key, standard practice is: copy the key in, do your work, then overwrite the key with zeros before returning. This is called scrubbing or zeroing. You don't want that key sitting in memory after you're done—any code that later reuses the same region of memory could read it. A crash dump, a memory scanner, a speculative-execution attack—all of these can harvest secrets left behind on the stack.

uint32_t process_secret(const uint8_t *input) {
    uint8_t key[32];
    for (int i = 0; i < 32; i++)
        key[i] = input[i];

    uint32_t result = 0;
    for (int i = 0; i < 32; i++)
        result ^= key[i];

    // scrub the key
    for (int i = 0; i < 32; i++)
        key[i] = 0;

    return result;
}

At -O0, the zeroing loop is there:

# clang -O0 — third loop: zeroing
    li      a0, 0
    sb      a0, 0(a1)        # key[i] = 0
    # ... loop continues for all 32 bytes ...

At -O2, the zeroing is gone:

# clang -O2 (optimized)
process_secret:
    # ... copy input, compute xor ...
    xor     a0, a0, a1
    xor     a0, a0, t0       # final xor for result
    #
    # no zeroing. no stores. nothing.
    #
    addi    sp, sp, 48       # stack frame reclaimed
    ret

The optimization pass responsible is called Dead Store Elimination—it removes writes to memory that nothing will ever read. From LLVM's perspective it's making the right call. key is a local variable, nobody reads it after the zeroing loop, and the function returns right after—so the writes are "dead." What LLVM doesn't model is that "dead" here means your secret key is still sitting in memory when the next function reuses that same stack space.

Fix 1: volatile cast

The blunt fix: cast the pointer to volatile. The compiler isn't allowed to optimize away volatile stores.

volatile uint8_t *vp = (volatile uint8_t *)key;
for (int i = 0; i < 32; i++)
    vp[i] = 0;

# clang -O2, volatile version: zeroing is emitted
    sb      zero, 0(sp)
    sb      zero, 1(sp)
    sb      zero, 2(sp)
    # ... 32 byte-stores total ...
    sb      zero, 31(sp)
    ret

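32 byte-stores, all present. It works, but it's byte-at-a-time; on a 32-bit target you'd rather do word-aligned stores. There are also two caveats. volatile accesses are ordered only relative to other volatile accesses, so the compiler may still move ordinary loads and stores across them. And key itself isn't declared volatile: the guarantee rests on the compiler treating every access through a volatile-qualified pointer as volatile, which GCC and Clang do in practice.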
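
If the byte stores are the concern, the same volatile trick works a word at a time. A sketch under the assumption that you control the buffer's declaration: the key_buf_t union and scrub_words are hypothetical, introduced here so the word view is correctly aligned and legal to access in C. The resulting assembly was not generated for this article.

```c
#include <stdint.h>

/* Hypothetical buffer type: byte view for the algorithm,
 * word view for scrubbing. The union guarantees 4-byte alignment. */
typedef union {
    uint8_t  bytes[32];
    uint32_t words[8];
} key_buf_t;

/* Zero the buffer with 8 volatile word stores instead of 32 byte stores. */
static void scrub_words(key_buf_t *k) {
    volatile uint32_t *vp = k->words;
    for (int i = 0; i < 8; i++)
        vp[i] = 0;
}
```

The function body would use k.bytes where the original used key, then call scrub_words(&k) before returning.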

Fix 2: compiler barrier

The cleaner fix: keep the regular zeroing loop, but add a compiler barrier after it.

for (int i = 0; i < 32; i++)
    key[i] = 0;
__asm__ __volatile__("" :: "r"(key) : "memory");

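The asm block is empty: zero instructions at runtime. But the "memory" clobber is a lie to the compiler: "something here might read all of memory." Passing the buffer's address in through the "r"(key) input matters too, because it tells the compiler the asm can reach key. Without it, the compiler could reason that a local whose address never escapes is out of the asm's reach. With both in place, LLVM can't prove the zeroing stores are dead, because something might observe them.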

# clang -O2 — barrier version
    sb      zero, 28(sp)
    sb      zero, 29(sp)
    # ... zeroing stores ...
    sb      zero, 59(sp)
    #APP
    #NO_APP                    # ← the barrier: zero runtime cost
    ret

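libsodium's sodium_memzero uses this kind of barrier on some of its code paths, when there is no platform primitive such as explicit_bzero or memset_s to call instead. The Rust zeroize crate uses a similar approach: volatile writes plus a compiler fence. C23 added memset_explicit to standardize this, but compiler and libc support is still patchy.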
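
The barrier is easy to forget at an individual call site, so it is usually wrapped once and reused. A minimal sketch, assuming GCC or Clang. The names secure_zero and process_secret_scrubbed are hypothetical; this is not libsodium's code or the C23 API, just the barrier pattern from above packaged as a helper:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero n bytes at p, then tell the compiler that
 * "something" may read that memory, so the stores can't be proven dead. */
static void secure_zero(void *p, size_t n) {
    memset(p, 0, n);
    __asm__ __volatile__("" : : "r"(p) : "memory");
}

/* process_secret from above, with the scrub routed through the helper. */
uint32_t process_secret_scrubbed(const uint8_t *input) {
    uint8_t key[32];
    memcpy(key, input, sizeof key);

    uint32_t result = 0;
    for (size_t i = 0; i < sizeof key; i++)
        result ^= key[i];

    secure_zero(key, sizeof key);   /* scrub the key */
    return result;
}
```

Where the toolchain provides memset_explicit, the helper body can shrink to that single call, and the call sites don't need to change.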

THE MISSING COMMAND

This isn't just a crypto problem. It shows up anywhere software talks to hardware.

In embedded systems, software controls hardware by writing to specific memory addresses. These aren't normal memory: they're registers mapped into the address space, and every write triggers a physical action. It's called memory-mapped I/O. Suppose you're writing firmware for a crypto accelerator. The command register at 0x40000008 controls the hardware state machine: write 0x00 to reset, write 0x01 to start. Two writes, in a specific order.

void reset_and_start(void) {
    uint32_t *cmd = (uint32_t *)0x40000008;
    *cmd = 0x00;   // clear
    *cmd = 0x01;   // start
}

At -O0, two stores:

# clang -O0 (no optimization)
    li      a0, 0
    sw      a0, 0(a1)        # *cmd = 0x00
    li      a0, 1
    sw      a0, 0(a1)        # *cmd = 0x01

At -O2:

# clang -O2 (optimized)
reset_and_start:
    lui     a0, 262144
    li      a1, 1
    sw      a1, 8(a0)        # *cmd = 0x01 only
    ret

Dead Store Elimination again—the same pass that erased the key-zeroing. Two writes to the same address, no read in between, so LLVM assumes the first one is pointless. For regular memory, it is. But this is a hardware register where every write triggers a physical state transition. The accelerator never got its reset command.

The fix is the same one that worked for the scrubbing loop: volatile.

volatile uint32_t *cmd = (volatile uint32_t *)0x40000008;
*cmd = 0x00;
*cmd = 0x01;

# clang -O2, volatile version: both writes emitted
reset_and_start:
    lui     a0, 262144
    li      a1, 1
    sw      zero, 8(a0)      # clear
    sw      a1, 8(a0)        # start
    ret

Both stores present, correct order. volatile tells LLVM these accesses have side effects it can't reason about.

One thing to know: volatile constrains the compiler, not the CPU. On ARM or POWER, the processor itself can still reorder stores even when the compiler emits them in order, unless the region is mapped as device or strongly-ordered memory or you insert barrier instructions. On RISC-V the equivalent tool is the FENCE instruction, whose predecessor and successor sets include I and O bits for device input and output. volatile is a floor, not a ceiling.
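
Combining the two layers might look like the sketch below: a volatile store so the compiler emits the write, followed by a fence so the hardware keeps device writes in order. The names io_fence, mmio_write32 and reset_and_start_at are hypothetical, and the register address is passed in as a parameter (an assumption made here so the code can be exercised against ordinary memory). Whether a fence is actually required between two writes to the same register depends on how the platform defines that I/O region, so treat this as the conservative version.

```c
#include <stdint.h>

/* Hypothetical helper: order earlier device writes before later ones.
 * On RISC-V this emits FENCE with the O (device output) bits set;
 * elsewhere it falls back to a compiler-only barrier. */
static inline void io_fence(void) {
#if defined(__riscv)
    __asm__ __volatile__("fence o, o" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Volatile store (compiler must emit it) plus fence (ordering). */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t value) {
    *reg = value;
    io_fence();
}

/* Same sequence as reset_and_start, with the register passed in. */
void reset_and_start_at(volatile uint32_t *cmd) {
    mmio_write32(cmd, 0x00);   /* clear */
    mmio_write32(cmd, 0x01);   /* start */
}
```

In firmware the call would be reset_and_start_at((volatile uint32_t *)0x40000008).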

THE PATTERN

Three examples, one root cause. The compiler's job is to produce the fastest correct program—but its definition of "correct" is purely functional: same inputs, same outputs. It has no concept of how long something should take, whether memory should be zeroed after use, or whether two writes to the same address both matter. Security properties that depend on timing, on cleanup, on side effects—those are invisible to the optimizer.

This isn't academic. Every TLS library, every cryptocurrency wallet, every smart card runtime, every disk encryption driver has to deal with exactly these problems. OpenSSL, libsodium, the Linux kernel's crypto subsystem, BoringSSL—all of them contain workarounds for the patterns described above. When they get it wrong, the result is a key extraction vulnerability in production.

All three examples have fixes—volatile, compiler barriers, memset_explicit, inline assembly. They work today. But you're annotating source code to fight a general-purpose optimizer, and new passes ship with every LLVM release. A pattern that's safe in clang 23 might not survive clang 24. You end up in an arms race with your own toolchain.

The alternative is to push the requirement into hardware. A fixed-cycle datapath doesn't branch on secrets because it can't: the timing is set at the circuit level. A hardware register that accepts write commands can't have those writes optimized away, because the bus transaction happens unconditionally. RISC-V-based designs already take this approach: OTBN, the OpenTitan big-number coprocessor, sits alongside the chip's RISC-V core with its own register file and its own instruction set, designed so that execution time doesn't depend on the secret data it processes. A compiler has far less room to introduce a timing leak when the ISA is built not to expose one.

Software workarounds treat the symptoms. Hardware treats the cause.

