Zero-Allocation Audio Callbacks in Rust
The cpal audio callback in dictate has roughly five milliseconds to return. Miss the deadline and it’s audible — clicks, dropouts, silence where sound should be. I traced the glitches to the heap allocator. The initial implementation touched it twice per callback. Each allocation risked priority inversion.
Two allocations per callback
The original audio pipeline used Rust’s standard mpsc::sync_channel to move samples from the audio thread to a consumer thread. I found two allocations hiding in this path, both touching the global allocator — which turned out to be the problem.
sequenceDiagram
participant Audio as Audio thread
participant Channel as mpsc channel
participant Consumer as Consumer thread
Audio->>Audio: cpal callback fires
Audio->>Audio: convert samples → Vec(f32) scratch
Audio->>Audio: resampler.process(scratch)
Note over Audio: ALLOCATES new Vec(f32)
Audio->>Channel: tx.try_send(vec)
Note over Channel: stores owned Vec(f32)
alt channel Full
Note over Audio: drop vec, count lost
else channel has space
Consumer->>Channel: rx.recv_timeout(100ms)
Channel->>Consumer: receives Vec(f32)
Note over Consumer: takes ownership
Consumer->>Consumer: chunker.push_samples()
Note over Consumer: drops Vec (DEALLOCATES)
end
Allocation one: FrameResampler::process() returned a freshly allocated Vec<f32> on every call. The resampler created a new output buffer each time, wrote samples into it, and returned ownership.
Allocation two: mpsc::SyncSender::try_send(Vec<f32>) moved that Vec into the channel. The channel internally stores the heap-allocated message. On the consumer side, receiving the message and then dropping the Vec triggered a deallocation.
Both operations called into the global allocator. In the Rust standard library implementation I was using, the global allocator uses a lock. On the audio thread, contending on that lock meant priority inversion: the highest-priority thread in the process could end up waiting for a lock held by whatever else was doing heap allocation. In my case, the OS audio subsystem didn’t wait — if the callback didn’t return in time, it filled the output buffer with silence.
Zero allocations after warmup
I made two changes to eliminate both allocations.
First, I added a new resampler method process_into() that appends output to a caller-owned &mut Vec<f32> instead of returning a new Vec. The caller reuses the same buffer — along with the scratch buffer used for sample format conversion — across callbacks. After the first few invocations (while the Vec capacity grows to its stable size), this path doesn’t allocate.
Second, I replaced the mpsc channel with a ringbuf::HeapRb<f32> — a lock-free single-producer, single-consumer ring buffer. Samples get push_slice’d directly into a pre-allocated circular buffer. No ownership transfer. No heap allocation. A memcpy into the ring.
sequenceDiagram
participant Audio as Audio thread
participant Ring as ring buffer
participant Consumer as Consumer thread
Audio->>Audio: cpal callback fires
Audio->>Audio: convert samples → scratch (reused)
Audio->>Audio: resampler.process_into(scratch, output_buf)
Note over Audio: appends to REUSED Vec
Audio->>Ring: producer.push_slice(output_buf)
Note over Ring: memcpy into ring, lock-free
Audio->>Consumer: signal new data
alt partial push
Note over Audio: count dropped samples
end
Consumer->>Consumer: condvar.wait_timeout(100ms)
Consumer->>Ring: consumer.pop_slice()
Ring->>Consumer: into reused read_buf
Consumer->>Consumer: chunker.push_samples()
Note over Consumer: no deallocation
After warmup: zero allocations per callback in dictate’s audio thread.
But eliminating the channel changed something I hadn’t thought about — the channel had been defining backpressure semantics invisibly.
Backpressure changes meaning
The mpsc channel had a fixed capacity measured in messages. Each message was a Vec<f32> of variable length. When the channel was full, try_send failed, and the entire chunk was dropped. If one callback produced 480 samples and another produced 512, the loss accounting was imprecise. “Dropped 3 chunks” could mean 1,440 samples or 1,536.
The ring buffer has a fixed capacity measured in samples. When the buffer fills up, push_slice performs a partial write — it pushes as many samples as will fit and returns how many were written. The remainder gets dropped. Loss accounting is exact: “dropped 47 samples” means 47 samples.
I realized the channel had been treating audio as discrete packages. A chunk either arrives intact or doesn’t arrive at all. The ring buffer treats audio as a continuous stream — under pressure, it drops the samples that don’t fit rather than entire chunks.
The capacity also became independent of callback behavior. The old system’s effective buffer size depended on how many samples each callback produced — a property of the audio driver, not something I control. The new system allocates a fixed number of sample slots. The buffer means the same thing regardless of how the driver chunks its callbacks.
The notification problem
The ring buffer is a data structure, not a channel. It has no built-in “wait for data” mechanism. The producer pushes samples. The consumer pops samples. But if the consumer has drained the buffer and no new data has arrived, it needs to sleep efficiently — not spin.
I reached for a condition variable. The obvious implementation I tried first had a race condition.
1. Consumer checks if the buffer is empty
2. If empty, calls condvar.wait_timeout()
3. Producer pushes data
4. Producer calls condvar.notify_one()
The race: the producer pushes data and calls notify_one() between step 1 and step 2. The consumer has already checked — the buffer was empty. It enters wait_timeout(). The notification fires into the void — a missed wakeup. In an audio pipeline with 100ms timeouts, that means 100ms of dead air every time the race hits.
The condvar pattern
The fix I found uses a mutex, but not how I initially expected. The mutex here isn’t protecting shared data. The ring buffer handles its own thread safety through atomic operations. I discovered the mutex exists solely to close the race window between “check if empty” and “start waiting.”
sequenceDiagram
participant Consumer
participant Mutex
participant Producer
Consumer->>Mutex: lock(mutex)
Consumer->>Consumer: check: buffer empty?
alt buffer empty
Note over Consumer: condvar.wait_timeout() (atomically releases mutex)
Producer->>Producer: push_slice(data)
Producer->>Mutex: lock(mutex)
Note over Producer,Mutex: serializes with consumer
Producer->>Consumer: condvar.notify_one()
Producer->>Mutex: unlock(mutex)
Note over Consumer: wakes up, reacquires mutex
else buffer has data
Consumer->>Consumer: pop_slice()
end
Consumer->>Mutex: unlock(mutex)
What makes this work: the consumer holds the mutex lock from the moment it checks the buffer through the moment it enters wait_timeout(). The Condvar::wait_timeout() call atomically releases the mutex and begins waiting — this turned out to be the key guarantee the standard library provides. There’s no gap between “I see the buffer is empty” and “I am now waiting for a notification.”
The producer must acquire the same mutex before notifying. This means the producer’s notify_one() can only execute in one of two windows. Either it runs before the consumer acquires the lock — in which case the consumer sees data when it checks and never enters wait_timeout() at all. Or it runs after the consumer has atomically released the lock and entered the wait — in which case the notification wakes the consumer immediately.
There’s no third window — the lock eliminates the gap.
When I first looked at this pattern, my instinct was to ask “what data does this mutex protect?” The answer turned out to be “none.” In this case, the mutex was protecting a timing relationship, not a memory location. I’ve seen this pattern in other codebases since, though I’m still learning to recognize when it’s needed.
Clean shutdown with Drop
A ring buffer has no concept of disconnection — unlike mpsc, it can't signal "no data ever again." I added a Drop implementation for the producer that sets an AtomicBool and fires a final condvar notification. The consumer checks the flag only after draining the remaining samples; checking it before draining would discard the producer's final writes. Disconnection in dictate means the producer is gone and there's nothing left to read.
What I learned about real-time constraints
The heap allocator in the standard library is a shared mutable resource with a global lock. I initially considered faster allocators — jemalloc, mimalloc — but realized they’re still allocators with locks, and the worst case is still unbounded. The fix I landed on wasn’t speed. It was elimination. Pre-allocate during setup, reuse in the callback, and move data through structures that never touch the heap.
After these changes, I ran dictate through several hour-long recording sessions without hearing a single click or dropout. Previously, I’d hear glitches within the first few minutes. The audio thread now runs without touching the allocator after the initial warmup period. The problem could return under conditions I haven’t tested — heavier system load, different audio hardware, longer recording durations. But in the scenarios I’ve tested so far, the glitches are gone.
There are probably other ways to solve this — lock-free allocators, different ring buffer implementations, or architectural changes I haven’t considered. This is what I tried, and so far, it’s held up.