8 · intermezzo: ctrl — threads, atomics & control rate

An interlude, not a new program. Since osc (part i) almost every real-time example has carried a latent bug: a data race between the audio thread and the control thread that shares a parameter with it. This chapter follows the original closely — it is a whirlwind tour of computer architecture: torn reads, atomicity, mutexes, atomic variables, message passing, a single-producer/single-consumer ring buffer, memory reordering, and control rate.

Offline note. Our offline course is single-threaded, so it has none of these races; you can skip this and still generate every WAV. But it is the chapter you need the moment you go real-time, and it is the strongest showcase of Zig’s concurrency tools — first-class std.atomic.Value, mandatory explicit memory ordering, std.Thread, std.Io.Mutex. As in the other chapters, the original C sits beside each Zig port.

A data race happens when two or more threads read and write the same global variable with no synchronization:

var global: i32 = 0;
fn mainThread() void { global = 1; }
fn audioThread() void { const a = global; _ = a; }

That is exactly the shape of vol in vol_exp and freq in jphs (osc i): written on the OSC/joystick thread, read on the audio thread.

GCC and Clang have -fsanitize=thread (TSan), which prints a warning when it spots a race at runtime. Compiling vol_exp with it and then sending one OSC volume message produces:

WARNING: ThreadSanitizer: data race
Write of size 4 ... in on_message (the liblo/OSC thread T1)
Previous read of size 4 ... in vol_tick → on_process (the JACK audio thread T4)
Location is global 'vol' of size 4

vol is written by on_message on the OSC thread and read by vol_tick on the audio thread — a textbook race.

Zig note: catching races. Zig threads are explicit (const t = try std.Thread.spawn(.{}, func, .{args}); then t.join()), and you build with -fsanitize-thread to get the same TSan report. There is no hidden runtime; if two threads touch a var, you wrote that sharing yourself and can see it.

So why did I teach you to write “incorrect” programs? Because a data race is not always a bug. C11/C++11 call it undefined behavior, but our programs were intentionally arranged so the bad outcome can never happen — they always work. Still, you must understand why, or you will eventually hit a race that is a bug and is brutal to reproduce.

Data races cause two problems: torn reads and memory reordering. For sharing a control parameter, only torn reads matter (reordering matters when the variable is used for synchronization — section 5).

A torn read happens when a single read or write cannot be done atomically: it can be interrupted midway, exposing a corrupted value. In C, global = 42; is a single statement, but it may compile to several instructions. On 32-bit x86, writing a 64-bit variable becomes two movs:

global = 0x1111111122222222;
mov DWORD PTR [eax+0x20],0x22222222 ; low 32 bits
mov DWORD PTR [eax+0x24],0x11111111 ; high 32 bits

A reader landing between them sees half of each value. The original demonstrates it with torn_64:

var p: u64 = 0;
const p1: u64 = 0x1111111133333333;
const p2: u64 = 0x2222222244444444;
fn thWrite() void {
var ctr: u1 = 0;
while (true) : (ctr ^= 1) {
p = if (ctr == 1) p1 else p2; // a u64 store is atomic on a 64-bit target
}
}
fn thRead() void {
while (true) {
const q = p;
if (q != p1 and q != p2) std.debug.print("torn: 0x{x}\n", .{q});
}
}
// main(): spawn both with std.Thread.spawn(.{}, thWrite, .{}) / thRead, then join.

Compiled -m32, it prints hundreds of torn values per second (0x2222222233333333, 0x1111111144444444 — mixes of the two). Compiled 64-bit, the warnings vanish: an aligned 64-bit access is a single atomic mov. (Even one instruction is not always atomic — ARMv7’s strd may split into two word writes per the manual — but x86_64 guarantees atomicity for aligned 1/2/4/8-byte accesses.)
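The mixed values reported above can be reproduced on paper. The following is a minimal single-threaded sketch in C that simulates what a reader would see if it landed between the two 32-bit stores; it assumes the low half is written first, as in the two-mov listing earlier. The function name torn_value is hypothetical and is not part of the original torn_64 program.

```c
#include <assert.h>
#include <stdint.h>

/* Simulate a reader that observes a 64-bit variable after the low 32 bits
 * of new_val have been stored but while the high 32 bits still hold old_val.
 * This is arithmetic only: no threads are involved. */
static uint64_t torn_value(uint64_t old_val, uint64_t new_val)
{
    uint64_t new_low  = new_val & 0x00000000FFFFFFFFull; /* already written */
    uint64_t old_high = old_val & 0xFFFFFFFF00000000ull; /* not yet written */
    return old_high | new_low;
}
```

Feeding it p1 as the old value and p2 as the new one gives 0x1111111144444444; swapping the roles gives 0x2222222233333333. These are the two mixes quoted above.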

Concept note — the rule that saved us. Reading/writing an aligned value no wider than the machine word is atomic on any 32-/64-bit CPU. Every parameter we shared was a plain 32-bit f32, not packed in a struct — so it could never tear. That is why the races were harmless.

Zig note — testing atomicity. Port torn_64 by spawning th_write/th_read with std.Thread; a u64 on a 64-bit target stays intact, a non-atomic wider type does not. Zig also exposes @atomicLoad/@atomicStore builtins and std.atomic.Value(T) when you need to guarantee it.

Simplest fix: keep every shared parameter small enough to be atomic (≤ word size, aligned). Then cross-thread access needs no synchronization at all. VST2 and LADSPA enforce exactly this — every plugin parameter is a 32-bit float, just like ours. It is inflexible (no integers, structs, strings, sample buffers) but simple and battle-tested.

// take 1: an inherently-atomic shared parameter
var freq = std.atomic.Value(f32).init(440.0);
// control thread: freq.store(newHz, .monotonic);
// audio thread: const hz = freq.load(.monotonic);

If the shared data is larger than a word, it tears even on 64-bit. Two 64-bit fields:

const Data = struct { a: u64, b: u64 }; // 128 bits — a writer can be caught mid-update

Running the same swap experiment, the reader sees impossible mixes like {.a=1, .b=4} and {.a=2, .b=3}.

Torn reads also bite when a parameter is spread across multiple variables. The original optimizes gate_tick by precomputing b = 1-a:

static float a = 0.9999f;
static float b = 0.0001f;
static float
gate_tick(float gate)
{
static float mem = 0;
mem = b * gate + a * mem;
return mem;
}

Now a and b must satisfy a + b = 1 together, so the two assignments in on_message must be atomic as a pair:

static int
on_message(...)
{
/* atomic start */
a = tau2pole(argv[0]->f);
b = 1-a;
/* atomic end */
return 1;
}

var a: f32 = 0.9999;
var b: f32 = 0.0001; // invariant: a + b == 1
fn gateTick(gate: f32) f32 {
const S = struct { var mem: f32 = 0; };
S.mem = b * gate + a * S.mem;
return S.mem;
}
// the robust fix: bundle the dependent values so they travel as ONE message
const Coeffs = struct { a: f32, b: f32 };
// onMessage(): rb.write(asBytes(&Coeffs{ .a = pole, .b = 1 - pole }));

If the audio thread reads a and b between those two writes, it gets a mismatched pair. Individually-atomic variables do not help when the invariant spans several of them — which is the cue to switch to message passing (§2), sending the whole Coeffs struct as one unit.
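On the C side, the same bundling idea might look like the sketch below. It is an illustration under assumptions, not the original's code: struct coeffs, make_coeffs, and the reworked gate_tick signature are hypothetical, and the pole is passed in directly rather than computed by tau2pole.

```c
#include <assert.h>

/* Hypothetical bundle: both one-pole coefficients travel as one value. */
struct coeffs {
    float a; /* feedback pole */
    float b; /* input gain, always 1 - a */
};

/* Build the pair in one place, so the invariant a + b == 1 is established
 * before anything is handed to another thread. */
static struct coeffs make_coeffs(float pole)
{
    struct coeffs c;
    c.a = pole;
    c.b = 1.0f - pole;
    return c;
}

/* The audio side filters with its own private copy of the pair, which is
 * only ever replaced by a complete message, so it never combines a new a
 * with an old b. The filter memory is passed in explicitly here. */
static float gate_tick(const struct coeffs *c, float *mem, float gate)
{
    *mem = c->b * gate + c->a * *mem;
    return *mem;
}
```

The control thread would call make_coeffs and write the resulting struct into the ring buffer as one message; the audio thread copies it out whole before the next call to gate_tick.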

The standard tool for making several operations atomic is a mutex:

var mtx: std.Io.Mutex = .init;
var global: i32 = 0;
fn mainThread(io: std.Io) void {
mtx.lockUncancelable(io);
defer mtx.unlock(io);
global = 1;
}
fn audioThread(io: std.Io) i32 {
mtx.lockUncancelable(io); // blocks until free — risky on the audio thread
defer mtx.unlock(io);
return global;
}

The danger for audio: if the OS suspends the low-priority control thread while it holds the lock, the real-time audio thread blocks indefinitely (non-deterministically) — a missed deadline and an audible glitch. This is priority inversion. Because a little control latency is inaudible (≲ 10 ms), the real-time-safe move is trylock: update only if the lock is free right now, otherwise skip it this block.

You still set PTHREAD_PRIO_INHERIT on the mutex so the scheduler boosts the holder’s priority. The Zig equivalent with 0.16’s std.Io.Mutex:

var mtx: std.Io.Mutex = .init;
var global: i32 = 0;
fn audioTryUpdate(io: std.Io) void {
const S = struct { var a: i32 = 0; }; // last good value, kept between calls
if (mtx.tryLock()) { // never blocks
defer mtx.unlock(io);
S.a = global;
} // if the lock was busy, keep S.a unchanged and move on
}

Zig note — std.Io.Mutex, tryLock, defer. In 0.16 the mutex moved under the I/O interface: tryLock() is non-blocking and needs no io, while lock/unlock take an io (they may call the scheduler). defer mtx.unlock(io) runs on scope exit, so the unlock can’t be forgotten. Even with trylock, unlock can be a slow syscall, so for hard real-time the next solution — message passing — is preferred.
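For reference, here is a minimal C sketch of the same trylock pattern using POSIX threads, including the PTHREAD_PRIO_INHERIT attribute mentioned above. The names param_lock_init, control_set, audio_try_update, and global_param are hypothetical, and the _XOPEN_SOURCE define is an assumption about what the platform headers need to expose the priority-inheritance API.

```c
#define _XOPEN_SOURCE 700 /* assumption: exposes pthread_mutexattr_setprotocol */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mtx;
static int global_param = 0; /* written by the control thread under mtx */

/* Create the mutex with priority inheritance, so a low-priority holder is
 * boosted while a higher-priority thread waits on it. Returns 0 on success. */
static int param_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0) rc = pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Control thread: allowed to block. */
static void control_set(int v)
{
    pthread_mutex_lock(&mtx);
    global_param = v;
    pthread_mutex_unlock(&mtx);
}

/* Audio thread: never blocks. If the lock is busy, keep the last good value
 * and pick up the change on a later block. */
static int audio_try_update(void)
{
    static int last = 0;
    if (pthread_mutex_trylock(&mtx) == 0) {
        last = global_param;
        pthread_mutex_unlock(&mtx);
    }
    return last;
}
```

A caller would run param_lock_init once at startup, call control_set from the control thread, and call audio_try_update at the top of each audio callback.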

C11’s _Atomic (std.atomic.Value(T) in Zig) makes loads/stores indivisible — but it is often the wrong tool for “just stop torn reads”:

var global = std.atomic.Value(u32).init(0); // store/load are indivisible
// global.store(1, .monotonic); const x = global.load(.monotonic);

  • If the type is already atomic (aligned u32/f32), _Atomic changes nothing about tearing and may add cost (default sequentially consistent ordering).
  • If the type is too big, libatomic’s generic atomic load/store is implemented with a hidden mutex — reintroducing priority inversion, and worse than a real mutex because you cannot trylock it or set its attributes.

Check with atomic_is_lock_free:

// Zig has no runtime `atomic_is_lock_free`; the answer is moved to compile time.
// std.atomic.Value(T) only accepts a T the target can access atomically —
// a word-sized integer/pointer/float is fine:
var ok = std.atomic.Value(u64).init(0);
// a 192-bit struct is NOT a valid atomic Value: the atomic ops simply won't
// compile, so you reach for message passing instead (next section).

So atomics’ real job (section 6) is memory ordering, not bundling big data. For big or multi-field data, pass messages.

The real-time-friendly answer is message passing — which you already know from OSC. It fixes torn reads and lets you move arbitrarily large data between threads. LV2 and VST3 both use it. Read-only messages are enough for parameters, and it brings sample-accurate control and an event system for free.

Sharing one variable risks torn reads:

            ┌───────┐
            │ param │
            └──△─┬──┘
╭─────────╮    │ │    ╭─────────╮
│  main   │    │ │    │  audio  │
│         ├────┘ └───▷│         │
╰─────────╯           ╰─────────╯

Instead, give each thread its own copy and send updates through a buffer between them:

╭─────────╮        ┌───┐        ╭─────────╮
│  main   │ write  │ ■ │  read  │  audio  │
│  param  ├───────▷├───┤───────▷│  param  │
│         │        │   │        │         │
╰─────────╯        └───┘        ╰─────────╯

The audio thread only pulls an update once it sees the message was written completely. In pseudocode:

var buf: RingBuffer = undefined; // a thread-safe buffer (built next)
fn mainThread() void {
var a: i32 = 1;
_ = buf.write(std.mem.asBytes(&a));
}
fn audioThread() void { // invoked regularly
var a: i32 = undefined;
while (buf.readSpace() >= @sizeOf(i32)) _ = buf.read(std.mem.asBytes(&a));
}

As long as the buffer itself is thread-safe, we can pass arbitrarily large data with no torn reads. The buffer of choice is a ring buffer.

A ring buffer (circular FIFO) makes a fixed array act endless: a write pointer wp and read pointer rp, each wrapping at the end. As long as the reader keeps up, the writer can keep writing.

 write
───────▷┌───┐
        │   │
        ├───┤
        │   │ read
        ├───┤◁──────
        │ ■ │
        ├───┤
        │ ■ │
        └───┘

The author’s warning. Many “lock-free ring buffer” implementations online are subtly broken (JACK’s own was, for ~20 years, until 1.9.22). A correct one must use at least one of: atomic variables / stdatomic, memory barriers, compiler atomic intrinsics, or a real-time-safe mutex. If it uses none of these synchronization primitives, it can be broken. We will see exactly why in section 5.

The C interface — a single-producer, single-consumer buffer with capacity size - 1:

The Zig port is a struct; rb_init/rb_free become init/deinit with an allocator:

const RingBuffer = struct {
rp: std.atomic.Value(usize), // only the reader advances this
wp: std.atomic.Value(usize), // only the writer advances this
size: usize,
buf: []u8,
alloc: std.mem.Allocator,
fn init(alloc: std.mem.Allocator, size: usize) !RingBuffer {
return .{
.rp = std.atomic.Value(usize).init(0),
.wp = std.atomic.Value(usize).init(0),
.size = size,
.buf = try alloc.alloc(u8, size),
.alloc = alloc,
};
}
fn deinit(self: *RingBuffer) void {
self.alloc.free(self.buf);
}
// ... space/write/peek/read below ...
};

Bytes available to read, in three cases (wp == rp: empty, so 0; wp > rp: wp - rp; wp < rp: size - (rp - wp)):

fn readSpace(self: *const RingBuffer) usize {
const rp = self.rp.load(.acquire);
const wp = self.wp.load(.acquire);
return if (wp >= rp) wp - rp else self.size - (rp - wp);
}

Bytes available to write. To disambiguate full from empty, “wp one slot behind rp” is treated as full, so usable capacity is size - 1:

fn writeSpace(self: *const RingBuffer) usize {
const rp = self.rp.load(.acquire);
const wp = self.wp.load(.acquire);
if (rp == wp) return self.size - 1;
if (rp > wp) return rp - wp - 1;
return self.size - (wp - rp) - 1;
}

Concept note — why single-producer/single-consumer is safe. Only the writer moves wp, only the reader moves rp. So after rb_write_space returns, a concurrent reader can only increase free space, never shrink it — the check stays valid. This is the assumption the whole design rests on; multiple producers/consumers need a far more complex algorithm.

Check space, copy (one memcpy, or two if it wraps the end), then advance wp:

fn write(self: *RingBuffer, data: []const u8) bool {
if (self.writeSpace() < data.len) return false;
const wp = self.wp.load(.monotonic); // only the writer touches wp
if (wp + data.len < self.size) {
@memcpy(self.buf[wp .. wp + data.len], data);
} else { // wraps the end → two copies
const s = self.size - wp;
@memcpy(self.buf[wp..self.size], data[0..s]);
@memcpy(self.buf[0 .. data.len - s], data[s..]);
}
self.wp.store((wp + data.len) % self.size, .release); // publish AFTER the copy
return true;
}

rb_peek copies out without moving rp (useful to read a message’s size field first); rb_read peeks then advances rp:

fn peek(self: *const RingBuffer, out: []u8) bool {
if (self.readSpace() < out.len) return false;
const rp = self.rp.load(.monotonic);
if (rp + out.len < self.size) {
@memcpy(out, self.buf[rp .. rp + out.len]);
} else {
const s = self.size - rp;
@memcpy(out[0..s], self.buf[rp..self.size]);
@memcpy(out[s..], self.buf[0 .. out.len - s]);
}
return true;
}
fn read(self: *RingBuffer, out: []u8) bool {
if (!self.peek(out)) return false;
const rp = self.rp.load(.monotonic);
self.rp.store((rp + out.len) % self.size, .release);
return true;
}

Zig note — slices vs. pointer arithmetic. C does memcpy(rbbuf + wp, buf, size); Zig writes @memcpy(self.buf[wp..wp+len], data), a bounds-checked slice copy. rp/wp are std.atomic.Value(usize) so cross-thread access is defined. read only ever moves rp and write only ever moves wp — that, plus the orderings in section 6, is what makes it correct.

This whole implementation was compiled and run on Zig 0.16 with two real std.Threads passing 100-byte sequence-numbered messages: zero corruption.

Written the way the original C first presents it, with plain non-atomic rp and wp, the ring buffer still has a subtle bug: rb_read will occasionally read garbage. It does not reproduce on x86, only on weak-memory CPUs like ARM (Apple Silicon, phones). This is the bug that hid in JACK for ~20 years. Two facts:

  1. The compiler can reorder memory accesses at compile time.
  2. The CPU can reorder them at runtime.

Both for performance. The danger: if the wp increment is reordered before the data copy finishes, the reader sees space and reads a half-written value.

This program reorders under gcc -O2 (the store to b is emitted before the store to a):

int a, b;
void reorder(void)
{
a = b + 1;
b = 0;
}

A compiler barrier — an inline-asm "memory" clobber — stops it:

void no_reorder(void)
{
a = b + 1;
__asm__ __volatile__("" ::: "memory");
b = 0;
}

// Zig avoids the inline-asm barrier: a release atomic store fixes BOTH the
// compiler and the CPU reordering at once (this is what the ring buffer uses).
fn noReorder() void {
a = b + 1;
@atomicStore(i32, &b, 0, .release); // everything above is published before b
}

A compiler barrier does not stop the CPU from reordering the two stores at runtime (impossible on x86, allowed on ARM). That needs a hardware memory barrier (mfence on x86, dmb on ARM) or, portably, __sync_synchronize(). The clean fix is to make the pointers atomic, which inserts the right barriers automatically.

That is exactly what the Zig RingBuffer does, with the orderings written out: the writer copies, then publishes with a .release store; the reader observes with an .acquire load before touching the data:

self.wp.store((wp + data.len) % self.size, .release); // writer: copy first, THEN publish wp
const wp = self.wp.load(.acquire); // reader: see published wp, THEN read data

Concept note: acquire/release in one sentence. A .release store guarantees that everything written before it is visible to any thread whose .acquire load observes the stored value. So wp.store(.., .release) after the copy publishes the bytes, and wp.load(.acquire) in readSpace ensures that once the reader sees the advanced pointer, it also sees every byte copied before it: no half-written message, on any architecture.

Zig note — ordering is mandatory and explicit. C’s plain _Atomic access defaults to (slow) sequential consistency. Zig forces you to name the order on every atomic op: .monotonic (atomic but unordered — fine where a single thread owns the pointer), .acquire, .release, .acq_rel, .seq_cst. The synchronization is visible in the source, and you pay only for what you use.
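C11 can spell out the same orderings with the _explicit functions from stdatomic.h instead of relying on the sequentially consistent default. The sketch below is a compact C rendering of the single-producer/single-consumer buffer with those orderings written out. It is an illustration, not the original's rb implementation: the fixed RB_SIZE, the struct layout, and the clamp-to-wrap copy are simplifying assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RB_SIZE 8 /* illustrative; usable capacity is RB_SIZE - 1 */

struct rb {
    atomic_size_t rp; /* advanced only by the reader */
    atomic_size_t wp; /* advanced only by the writer */
    unsigned char buf[RB_SIZE];
};

static void rb_init(struct rb *rb)
{
    atomic_init(&rb->rp, 0);
    atomic_init(&rb->wp, 0);
}

static size_t rb_read_space(struct rb *rb)
{
    size_t rp = atomic_load_explicit(&rb->rp, memory_order_acquire);
    size_t wp = atomic_load_explicit(&rb->wp, memory_order_acquire);
    return wp >= rp ? wp - rp : RB_SIZE - (rp - wp);
}

static size_t rb_write_space(struct rb *rb)
{
    /* one slot stays empty so that full and empty are distinguishable */
    return RB_SIZE - 1 - rb_read_space(rb);
}

static bool rb_write(struct rb *rb, const void *data, size_t len)
{
    if (rb_write_space(rb) < len) return false;
    /* relaxed: only the writer ever modifies wp */
    size_t wp = atomic_load_explicit(&rb->wp, memory_order_relaxed);
    size_t first = RB_SIZE - wp; /* bytes that fit before the wrap */
    if (first > len) first = len;
    memcpy(rb->buf + wp, data, first);
    memcpy(rb->buf, (const unsigned char *)data + first, len - first);
    /* release: the copied bytes are published before the new wp */
    atomic_store_explicit(&rb->wp, (wp + len) % RB_SIZE, memory_order_release);
    return true;
}

static bool rb_read(struct rb *rb, void *out, size_t len)
{
    if (rb_read_space(rb) < len) return false; /* acquire load of wp inside */
    /* relaxed: only the reader ever modifies rp */
    size_t rp = atomic_load_explicit(&rb->rp, memory_order_relaxed);
    size_t first = RB_SIZE - rp;
    if (first > len) first = len;
    memcpy(out, rb->buf + rp, first);
    memcpy((unsigned char *)out + first, rb->buf, len - first);
    /* release: the slots are handed back only after the copy-out is done */
    atomic_store_explicit(&rb->rp, (rp + len) % RB_SIZE, memory_order_release);
    return true;
}
```

With RB_SIZE of 8, a 5-byte write followed by a 5-byte read and a second 5-byte write exercises the wrap-around path: the second write lands in slots 5 to 7 and then 0 to 1.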

Even with safe sharing: when do you apply updates? Usually once per block, at the top of the callback — so parameters change every nframes samples, not every sample. That rate is the control rate:

$$k_r = \frac{f_s}{\text{nframes}}$$

At 48 kHz with 512-sample blocks, 48000/512 = 93.75 updates/second.

while (rb.readSpace() >= @sizeOf(Msg)) {
var m: Msg = undefined;
_ = rb.read(std.mem.asBytes(&m));
// update parameter p
}
for (out) |*s| s.* = audioTick(); // p is fixed for the whole block

Concept note: block size leaks into the output. Because kr depends on nframes, the same project can render differently at a different block size (some DAWs, even offline; the original calls out REAPER). For reproducibility, decouple control from block size with a fixed control period kdur = fs/kr samples (fs being the sample rate), pulling updates every kdur samples regardless of block size.

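// ktick must outlive a single call: a local would reset to 0 at every block
// boundary and tie the control ticks to the block size again.
var ktick: usize = 0;
fn process(out: []f32, kdur: usize) void {
    for (out) |*s| {
        if (ktick == 0) ctrlTick(); // recompute parameters
        ktick = (ktick + 1) % kdur;
        s.* = audioTick();
    }
}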
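To see that a fixed control period really is independent of block size, here is a small C sketch that counts control ticks while rendering the same number of samples with different block sizes. Everything in it is hypothetical scaffolding (struct ctrl, ctrl_tick, render, the 1024-sample scratch block); the one point it relies on is that the tick counter lives in state that persists from one process call to the next.

```c
#include <assert.h>
#include <stddef.h>

struct ctrl {
    size_t kdur;    /* control period in samples */
    size_t ktick;   /* position inside the current period; persists across blocks */
    size_t updates; /* number of control ticks run so far (for the demo) */
};

/* Stand-in for recomputing parameters. */
static void ctrl_tick(struct ctrl *c) { c->updates++; }

static void process(struct ctrl *c, float *out, size_t nframes)
{
    for (size_t i = 0; i < nframes; i++) {
        if (c->ktick == 0) ctrl_tick(c);
        c->ktick = (c->ktick + 1) % c->kdur;
        out[i] = 0.0f; /* placeholder for the audio tick */
    }
}

/* Render `total` samples in blocks of at most `nframes` (nframes <= 1024)
 * and return how many control ticks happened. */
static size_t render(size_t total, size_t nframes, size_t kdur)
{
    struct ctrl c = { kdur, 0, 0 };
    float block[1024];
    while (total > 0) {
        size_t n = total < nframes ? total : nframes;
        process(&c, block, n);
        total -= n;
    }
    return c.updates;
}
```

With kdur = 480 (a 100 Hz control rate at 48 kHz), rendering 4800 samples yields 10 control ticks whether the block size is 64 or 512.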

A fixed rate still squashes fast events — two changes between ticks collapse to the last, losing the original timing.

Two ways to recover exact timing. Variable block size: the host calls process with block boundaries placed exactly at each event (Ardour, FL Studio — the latter even has a “Use fixed size buffers” toggle to disable it for fragile plugins). Time-stamped events: since the ring buffer carries arbitrary data, attach a sample offset to each message:

const Event = struct { timestamp: usize, value: f32 };
fn processEvents(out: []f32, events: []const Event) void {
var i: usize = 0;
for (events) |e| {
while (i < e.timestamp and i < out.len) : (i += 1) out[i] = audioTick();
if (i >= out.len) break;
applyParam(e.value); // exact-sample parameter change
}
while (i < out.len) : (i += 1) out[i] = audioTick();
}
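A C rendering of the same loop, with a tiny worked case, may help check the indexing. This is a sketch under assumptions: param and audio_tick are hypothetical placeholders (the tick simply outputs the current parameter), and events are assumed to be sorted by timestamp within the block.

```c
#include <assert.h>
#include <stddef.h>

struct event {
    size_t timestamp; /* sample offset inside this block */
    float  value;     /* new parameter value */
};

static float param = 0.0f;                      /* hypothetical parameter */
static float audio_tick(void) { return param; } /* placeholder DSP */

static void process_events(float *out, size_t nframes,
                           const struct event *ev, size_t nev)
{
    size_t i = 0;
    for (size_t k = 0; k < nev; k++) {
        /* render up to, but not including, the event's sample */
        while (i < ev[k].timestamp && i < nframes) out[i++] = audio_tick();
        if (i >= nframes) break;
        param = ev[k].value; /* change takes effect exactly at this sample */
    }
    while (i < nframes) out[i++] = audio_tick(); /* rest of the block */
}
```

For an 8-sample block with events at offsets 2 and 5 carrying the values 1 and 2, samples 0 to 1 keep the initial 0, samples 2 to 4 output 1, and samples 5 to 7 output 2.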

A real-time message that arrives mid-block can only be seen next block, so sample-accurate real-time control costs exactly one block of latency — fixed, and easy to compensate. (JACK can timestamp incoming messages via jack_time_to_frames(client, jack_get_time()); most plugin formats cannot, because a DAW’s timeline is not linear.)

Concept/offline note — this is just automation. In an offline renderer, “control rate” is your automation resolution, and time-stamped events are how you place a change at a precise sample. The gate array in adsr and the frequency glide in osc i are tiny instances of exactly this.

  • kb: main → audio, sample-accurate.
  • oso: audio → main.

Original chapter — with the full disassembly, the JACK ring-buffer bug story, and the lock-free discussion: mu.krj.st/ctrl. Paul McKenney, Is Parallel Programming Hard?; Jeff Preshing’s blog on memory ordering; Herb Sutter, “atomic<> weapons”; Timur Doumler, “Using locks in real-time audio processing, safely”; the Zig docs for std.atomic and std.Thread.


An intermezzo: it sits between adsr and delay.