Zero-Allocation Audio Callbacks in Rust
The cpal audio callback in dictate has roughly five milliseconds to return. Miss the deadline and it’s audible — clicks, dropouts, silence where sound should be. I traced the glitches to the heap allocator. The initial implementation touched it twice per callback. Each allocation risked priority inversion.
Two allocations per callback
The original audio pipeline used Rust’s standard mpsc::sync_channel to move samples from the audio thread to a consumer thread. I found two allocations hiding in this path, both touching the global allocator — which turned out to be the problem.
sequenceDiagram
participant Audio as Audio thread
participant Channel as mpsc channel
participant Consumer as Consumer thread
Audio->>Audio: cpal callback fires
Audio->>Audio: convert samples → Vec(f32) scratch
Audio->>Audio: resampler.process(scratch)
Note over Audio: ALLOCATES new Vec(f32)
Audio->>Channel: tx.try_send(vec)
Note over Channel: stores owned Vec(f32)
alt channel Full
Note over Audio: drop vec, count lost
else channel has space
Consumer->>Channel: rx.recv_timeout(100ms)
Channel->>Consumer: receives Vec(f32)
Note over Consumer: takes ownership
Consumer->>Consumer: chunker.push_samples()
Note over Consumer: drops Vec (DEALLOCATES)
end
Allocation one: FrameResampler::process() returned a freshly allocated Vec<f32> on every call. The resampler created a new output buffer each time, wrote samples into it, and returned ownership.
Allocation two: mpsc::SyncSender::try_send(Vec<f32>) moved that Vec into the channel. The channel internally stores the heap-allocated message. On the consumer side, receiving the message and then dropping the Vec triggered a deallocation.
Both operations called into the global allocator. In the Rust standard library implementation I was using, the global allocator uses a lock. On the audio thread, contending on that lock meant priority inversion: the highest-priority thread in the process could end up waiting for a lock held by whatever else was doing heap allocation. In my case, the OS audio subsystem didn’t wait — if the callback didn’t return in time, it filled the output buffer with silence.
Zero allocations after warmup
I made two changes to eliminate both allocations.
First, I added a new resampler method process_into() that appends output to a caller-owned &mut Vec<f32> instead of returning a new Vec. The caller reuses the same buffer — along with the scratch buffer used for sample format conversion — across callbacks. After the first few invocations (while the Vec capacity grows to its stable size), this path doesn’t allocate.
Second, I replaced the mpsc channel with a ringbuf::HeapRb<f32> — a lock-free single-producer, single-consumer ring buffer. Samples get push_slice’d directly into a pre-allocated circular buffer. No ownership transfer. No heap allocation. A memcpy into the ring.
sequenceDiagram
participant Audio as Audio thread
participant Ring as ring buffer
participant Consumer as Consumer thread
Audio->>Audio: cpal callback fires
Audio->>Audio: convert samples → scratch (reused)
Audio->>Audio: resampler.process_into(scratch, output_buf)
Note over Audio: appends to REUSED Vec
Audio->>Ring: producer.push_slice(output_buf)
Note over Ring: memcpy into ring, lock-free
Audio->>Consumer: signal new data
alt partial push
Note over Audio: count dropped samples
end
Consumer->>Consumer: condvar.wait_timeout(100ms)
Consumer->>Ring: consumer.pop_slice()
Ring->>Consumer: into reused read_buf
Consumer->>Consumer: chunker.push_samples()
Note over Consumer: no deallocation
After warmup: zero allocations per callback in dictate’s audio thread.
But eliminating the channel changed something I hadn’t thought about — the channel had been defining backpressure semantics invisibly.
Backpressure changes meaning
The mpsc channel had a fixed capacity measured in messages. Each message was a Vec<f32> of variable length. When the channel was full, try_send failed, and the entire chunk was dropped. If one callback produced 480 samples and another produced 512, the loss accounting was imprecise. “Dropped 3 chunks” could mean 1,440 samples or 1,536.
The ring buffer has a fixed capacity measured in samples. When the buffer fills up, push_slice performs a partial write — it pushes as many samples as will fit and returns how many were written. The remainder gets dropped. Loss accounting is exact: “dropped 47 samples” means 47 samples.
I realized the channel had been treating audio as discrete packages. A chunk either arrives intact or doesn’t arrive at all. The ring buffer treats audio as a continuous stream — under pressure, it drops the samples that don’t fit rather than entire chunks.
The capacity also became independent of callback behavior. The old system’s effective buffer size depended on how many samples each callback produced — a property of the audio driver, not something I control. The new system allocates a fixed number of sample slots. The buffer means the same thing regardless of how the driver chunks its callbacks.
The notification problem
The ring buffer is a data structure, not a channel. It has no built-in “wait for data” mechanism. The producer pushes samples. The consumer pops samples. But if the consumer has drained the buffer and no new data has arrived, it needs to sleep efficiently — not spin.
I reached for a condition variable. The obvious implementation I tried first had a race condition.
1. Consumer checks if the buffer is empty
2. If empty, calls condvar.wait_timeout()
3. Producer pushes data
4. Producer calls condvar.notify_one()
The race: the producer pushes data and calls notify_one() between step 1 and step 2. The consumer has already checked — the buffer was empty. It enters wait_timeout(). The notification fires into the void — a missed wakeup. In an audio pipeline with 100ms timeouts, that means 100ms of dead air every time the race hits.
The condvar pattern
The fix I found uses a mutex, but not how I initially expected. The mutex here isn’t protecting shared data. The ring buffer handles its own thread safety through atomic operations. I discovered the mutex exists solely to close the race window between “check if empty” and “start waiting.”
sequenceDiagram
participant Consumer
participant Mutex
participant Producer
Consumer->>Mutex: lock(mutex)
Consumer->>Consumer: check: buffer empty?
alt buffer empty
Note over Consumer: condvar.wait_timeout() (atomically releases mutex)
Producer->>Producer: push_slice(data)
Producer->>Mutex: lock(mutex)
Note over Producer,Mutex: serializes with consumer
Producer->>Consumer: condvar.notify_one()
Producer->>Mutex: unlock(mutex)
Note over Consumer: wakes up, reacquires mutex
else buffer has data
Consumer->>Consumer: pop_slice()
end
Consumer->>Mutex: unlock(mutex)
What makes this work: the consumer holds the mutex lock from the moment it checks the buffer through the moment it enters wait_timeout(). The Condvar::wait_timeout() call atomically releases the mutex and begins waiting — this turned out to be the key guarantee the standard library provides. There’s no gap between “I see the buffer is empty” and “I am now waiting for a notification.”
The producer must acquire the same mutex before notifying. This means the producer’s notify_one() can only execute in one of two windows. Either it runs before the consumer acquires the lock — in which case the consumer sees data when it checks and never enters wait_timeout() at all. Or it runs after the consumer has atomically released the lock and entered the wait — in which case the notification wakes the consumer immediately.
There’s no third window — the lock eliminates the gap.
When I first looked at this pattern, my instinct was to ask “what data does this mutex protect?” The answer turned out to be “none.” In this case, the mutex was protecting a timing relationship, not a memory location. I’ve seen this pattern in other codebases since, though I’m still learning to recognize when it’s needed.
Clean shutdown with Drop
A ring buffer has no concept of disconnection — unlike mpsc, it can't signal "no data ever again." I added a Drop implementation for the producer that sets an AtomicBool and fires a final condvar notification. The consumer checks the flag only after draining the remaining samples; checking it before draining would discard the producer's final writes. Disconnection in dictate means the producer is gone and there's nothing left to read.
What I learned about real-time constraints
The heap allocator in the standard library is a shared mutable resource with a global lock. I initially considered faster allocators — jemalloc, mimalloc — but realized they’re still allocators with locks, and the worst case is still unbounded. The fix I landed on wasn’t speed. It was elimination. Pre-allocate during setup, reuse in the callback, and move data through structures that never touch the heap.
After these changes, I ran dictate through several hour-long recording sessions without hearing a single click or dropout. Previously, I’d hear glitches within the first few minutes. The audio thread now runs without touching the allocator after the initial warmup period. The problem could return under conditions I haven’t tested — heavier system load, different audio hardware, longer recording durations. But in the scenarios I’ve tested so far, the glitches are gone.
There are probably other ways to solve this — lock-free allocators, different ring buffer implementations, or architectural changes I haven’t considered. This is what I tried, and so far, it’s held up.