2 · wave — sound, sampling, and the WAV file

Adapted from μ — a modular approach to audio programming (Yü Fang, CMSC388V), rewritten for Zig 0.16 with extra explanations of both the Zig and the math. The original C is kept beside every Zig port so you can compare. Everything here runs offline: we compute samples and write a .wav file you play afterwards — no real-time audio, no drivers.

Interactivity inspires musicians: press a piano key and you hear the string instantly. That immediacy is what makes an instrument feel alive — but it is expensive to build. Real-time audio means hard latency deadlines and a lot of program complexity. It is far easier to generate, transform, and analyze sound when you are allowed a long delay before hearing the result.

So we start offline. Our goal: write, from scratch, a program that generates an audio file in WAV format that any media player can open. Tedious once, illuminating forever. But first, a question: what is sound?

Sound is a rapid change in air pressure caused by mechanical vibration. Clap your hands and the air between them is forced out; that compression pushes the next layer of air, and the disturbance travels until it reaches your eardrum.

A computer cannot touch air pressure directly, so getting sound into and out of a computer takes a chain of conversions:

pressure → microphone → analog signal → ADC → digital signal → ... DAC → speaker

A microphone (a transducer) turns pressure into a continuous electrical voltage. We model that as a function of time:

$$x(t), \qquad t \in \mathbb{R}_{\ge 0}$$

$t$ is a real number, so time is infinitely fine. A speaker runs the chain backwards, turning $x(t)$ back into vibration. Analog processing (guitar pedals, analog synths) transforms that voltage with circuits; we will not go there. We work entirely on the computer.

A computer has finite resolution, so to store $x(t)$ as binary we must throw away some detail. An analog-to-digital converter (ADC) records; a digital-to-analog converter (DAC) plays back. You will see these two acronyms constantly.

To generate audio we only need to mimic an ADC:

  1. Have an analog signal (a continuous function) in mind.
  2. Sample it — record its value at regular instants.
  3. Trust that it can be reconstructed with minimal distortion.

On a computer we never record real air — we construct a function in code and sample it. Two steps make a signal digital: sampling and quantization.

Sampling records the signal every $T$ seconds, producing a discrete sequence indexed by an integer $n$:

$$x[n] = x(nT)$$

Notation. Square brackets x[n] mean a discrete (sampled) signal; parentheses x(t) mean a continuous one.

$T$ is the sampling period (seconds). Its inverse is the sample rate $f_s$ (written sr in code), measured in samples/second, i.e. hertz (Hz):

$$f_s = \frac{1}{T}$$
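
As a minimal sketch of x[n] = x(nT) in C: sampling a signal we hold as a function is just evaluating it at multiples of T. The names `sample_signal` and `ramp` are made up for illustration and are not part of the original course code.

```c
#include <stddef.h>

/* Hypothetical helper: evaluate a continuous signal x(t) at t = n*T, T = 1/fs. */
void sample_signal(double (*x)(double), double fs, double *out, size_t count) {
    const double T = 1.0 / fs;          /* sampling period in seconds */
    for (size_t n = 0; n < count; n++) {
        out[n] = x((double)n * T);      /* x[n] = x(nT) */
    }
}

/* Example "analog" signal, chosen only because it is easy to check: x(t) = t. */
double ramp(double t) { return t; }
```

Calling `sample_signal(ramp, 4.0, out, 4)` fills `out` with the ramp's values at 0, 0.25, 0.5 and 0.75 seconds.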

It seems like we lose enormous information by sampling, yet, remarkably, sampling is theoretically lossless as long as the signal contains no frequency at or above $f_s/2$. That threshold is the Nyquist frequency, and the result is the Nyquist–Shannon sampling theorem.

Math note: why a half? Intuitively, to pin down a wave going up and down you need at least two samples per cycle: one near the peak, one near the trough. With fewer than two, a fast wave is indistinguishable from a slow one because they produce identical samples. That impostor is called an alias. CD audio uses $f_s = 44100$ Hz, so it captures everything below $44100/2 = 22050$ Hz, comfortably above the ~20 kHz ceiling of human hearing.

Because frequencies above Nyquist masquerade as lower ones, a real ADC first runs the signal through an analog low-pass filter (an anti-aliasing filter) to remove them. That is a hardware concern; when we synthesize, we simply avoid generating those frequencies in the first place.

To store a sample as bits we must round it to one of a finite set of levels. The common scheme is linear pulse-code modulation (LPCM): the value is the amplitude, quantized in uniform steps. With a bit depth of $N$ bits there are $2^N$ levels. CD audio uses 16-bit ($65536$ levels); studios use 24-bit.

Quantization is lossy — each sample is nudged to the nearest level, off by up to half a step. That error behaves like added noise. We measure it with the signal-to-noise ratio (SNR), usually in decibels:

$$\mathrm{SNR_{dB}} = 20\log_{10}\!\left(\frac{A_\text{signal}}{A_\text{noise}}\right)$$

Math note: the 6 dB rule. Each extra bit halves the quantization step, which doubles the signal-to-noise amplitude ratio. Since $20\log_{10}(2) \approx 6.02$, every bit buys about 6 dB of dynamic range. So 16-bit gives ≈ 96 dB and 24-bit ≈ 144 dB. Human hearing spans ≈ 120 dB, so 20-plus bits is effectively lossless. Quieter signals use fewer levels, so their SNR is worse, which is exactly why you set a healthy recording level instead of a faint one.

The practical workflow everyone uses: decode LPCM → process in floating point in the range $[-1, 1]$ → encode back to LPCM. Floating point has a huge dynamic range, so intermediate math practically never overflows. We will compute in f32 and convert to i16 only at the moment of writing.

Zig note — types up front. Zig never converts between number types silently. You will write @floatFromInt, @intFromFloat, and @as(T, x) explicitly. It feels verbose at first, but it means every rounding or truncation is visible in the source — no surprise conversions.

A WAV file is just a container for encoded samples, built from the RIFF format (Microsoft/IBM, 1991). RIFF is made of chunks — tagged containers:

┌──────────────┐
│ chunk        │
│ ┌──────────┐ │
│ │ id(4)    │ │   id   : 4-byte tag, e.g. "RIFF"
│ ├──────────┤ │   size : 4-byte length of the body
│ │ size(4)  │ │   body : the actual data (may be sub-chunks)
│ ├──────────┤ │   pad  : 1 zero byte if size is odd
│ │ body     │ │
│ └──────────┘ │
└──────────────┘

A whole WAV file is one RIFF chunk whose body, in the simplest form, holds two sub-chunks: fmt (the format metadata) and data (the samples). Grouping everything before the samples into a header, the layout is:

WAV file
├── RIFF header
│   ├── id(4)            "RIFF"
│   ├── size(4)          36 + data_size
│   └── type(4)          "WAVE"
├── fmt chunk
│   ├── id(4)            "fmt " (note the trailing space!)
│   ├── size(4)          16
│   ├── fmt_tag(2)       1 = linear PCM
│   ├── channels(2)      1 = mono
│   ├── samples/sec(4)   44100
│   ├── bytes/sec(4)     channels × sr × bits/8
│   ├── block_align(2)   channels × bits/8
│   └── bits/sample(2)   16
└── data chunk
    ├── id(4)            "data"
    ├── size(4)          number of sample bytes
    └── data             the samples themselves

Two rules: the 4-byte tags ("RIFF", "WAVE", "fmt ", "data") are raw ASCII and endian-less (do not reverse them); everything else is little-endian.
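
One common way to honor the little-endian rule in portable C is to emit each byte with shifts, so the host CPU's byte order never matters. The helpers below are a sketch with made-up names; they are a different technique from the original course's is_le() check and byte-swapping wrappers.

```c
#include <stdint.h>

/* Store v at p[0..3], least significant byte first, regardless of host CPU. */
void put_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Same idea for the 2-byte header fields. */
void put_u16_le(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
}
```

For the sample rate 44100 (0xAC44) this yields the bytes 44 AC 00 00.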

Math note — the derived fields. bytes/sec = channels × samples/sec × bits/8; a player uses it to know how fast to stream. block_align = channels × bits/8 is the size of one frame (all channels of a single instant). For stereo, samples are interleaved L R L R ….
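
The two derived fields can be computed rather than typed in by hand. A small C sketch, with function names chosen for illustration and assuming the bit depth is a multiple of 8:

```c
#include <stdint.h>

/* Size in bytes of one frame: every channel's sample for a single instant. */
uint16_t block_align(uint16_t channels, uint16_t bits_per_sample) {
    return (uint16_t)(channels * (bits_per_sample / 8));
}

/* Bytes a player must stream per second of audio. */
uint32_t bytes_per_sec(uint16_t channels, uint32_t sample_rate,
                       uint16_t bits_per_sample) {
    return sample_rate * block_align(channels, bits_per_sample);
}
```

Mono 16-bit audio at 44100 Hz gives a block_align of 2 and 88200 bytes/sec; the stereo equivalent gives 4 and 176400.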

The original course models the header as a nested struct:

#include <stdint.h>

typedef int8_t fourcc[4];

struct riff_hdr {
    fourcc id;
    uint32_t size;
    fourcc type;
};

struct fmt_ck {
    fourcc id;
    uint32_t size;
    uint16_t fmt_tag;
    uint16_t channels;
    uint32_t samples_per_sec;
    uint32_t bytes_per_sec;
    uint16_t block_align;
    uint16_t bits_per_sample;
};

struct data_hdr {
    fourcc id;
    uint32_t size;
};

struct wav_hdr {
    struct riff_hdr riff;
    struct fmt_ck fmt;
    struct data_hdr data;
};

The parameters, hard-coded for simplicity:

typedef int16_t sample_t;
#define SAMPLE_MAX 32767
#define DURATION 5
#define SR 44100
#define NCHANNELS 1
#define NSAMPLES (NCHANNELS*DURATION*SR)

Then fill the struct and fwrite it:

/* Fragment from inside main(); needs <stdio.h> and <string.h>. */
struct wav_hdr hdr = {0};
FILE *fp = fopen("output.wav", "wb");
if (!fp) { perror("output.wav"); return 1; }

/* RIFF header */
memcpy(&hdr.riff.id, "RIFF", 4);
hdr.riff.size = 36 + NSAMPLES*sizeof(sample_t);
memcpy(&hdr.riff.type, "WAVE", 4);

/* FMT chunk */
memcpy(&hdr.fmt.id, "fmt ", 4);
hdr.fmt.size = 16;
hdr.fmt.fmt_tag = 1; /* linear PCM */
hdr.fmt.channels = NCHANNELS;
hdr.fmt.samples_per_sec = SR;
hdr.fmt.bytes_per_sec = NCHANNELS*SR*sizeof(sample_t);
hdr.fmt.block_align = NCHANNELS*sizeof(sample_t);
hdr.fmt.bits_per_sample = 8*sizeof(sample_t);

/* DATA header */
memcpy(&hdr.data.id, "data", 4);
hdr.data.size = NSAMPLES*sizeof(sample_t);

fwrite(&hdr, sizeof(struct wav_hdr), 1, fp);

Writing a struct directly with fwrite is fragile: the compiler may insert padding between fields, and the bytes come out in the machine’s native endianness (wrong on a big-endian CPU). This particular struct happens to be padding-free on x86, but it is a trap.
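
If you do keep the struct-write approach in C, one way to at least catch the padding trap is a compile-time check. The sketch below assumes a C11 compiler (for `_Static_assert`) and repeats the header structs in compact form. It guards only against padding and does nothing about endianness.

```c
#include <stddef.h>
#include <stdint.h>

typedef int8_t fourcc[4];
struct riff_hdr { fourcc id; uint32_t size; fourcc type; };
struct fmt_ck {
    fourcc id; uint32_t size;
    uint16_t fmt_tag, channels;
    uint32_t samples_per_sec, bytes_per_sec;
    uint16_t block_align, bits_per_sample;
};
struct data_hdr { fourcc id; uint32_t size; };
struct wav_hdr { struct riff_hdr riff; struct fmt_ck fmt; struct data_hdr data; };

/* Compile-time guards: the build fails if the compiler padded the header. */
_Static_assert(sizeof(struct wav_hdr) == 44, "wav_hdr must be exactly 44 bytes");
_Static_assert(offsetof(struct wav_hdr, fmt) == 12, "fmt chunk must start at byte 12");
_Static_assert(offsetof(struct wav_hdr, data) == 36, "data header must start at byte 36");
```

The expected sizes follow from the layout: 12 bytes of RIFF header, 24 bytes of fmt chunk, and 8 bytes of data header.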

Zig sidesteps both traps by writing each field explicitly, in chosen byte order, through a buffered Writer. No structs, no padding, no endian guesswork:

const std = @import("std");

const sample_t = i16; // 16-bit signed samples
const sample_max: f32 = 32767;
const duration = 5; // seconds
const sr = 44100; // sample rate (Hz)
const nchannels = 1; // mono
const nsamples = nchannels * duration * sr;

fn writeWavHeader(w: *std.Io.Writer, data_len: u32) !void {
    // RIFF header
    try w.writeAll("RIFF");
    try w.writeInt(u32, 36 + data_len, .little);
    try w.writeAll("WAVE");

    // fmt chunk
    try w.writeAll("fmt ");
    try w.writeInt(u32, 16, .little); // chunk size
    try w.writeInt(u16, 1, .little); // 1 = linear PCM
    try w.writeInt(u16, nchannels, .little);
    try w.writeInt(u32, sr, .little);
    try w.writeInt(u32, nchannels * sr * @sizeOf(sample_t), .little); // bytes/sec
    try w.writeInt(u16, nchannels * @sizeOf(sample_t), .little); // block align
    try w.writeInt(u16, 8 * @sizeOf(sample_t), .little); // bits/sample

    // data header
    try w.writeAll("data");
    try w.writeInt(u32, data_len, .little);
}

Zig note — writeInt(u32, x, .little). This single call solves the C program’s two hardest portability problems at once. It writes exactly the bytes you ask for (no struct padding) in the byte order you name (.little), so the output is identical on any CPU. writeAll writes raw bytes — perfect for the endian-less ASCII tags. The try propagates any write error out of the function (its return type is !void, “void or an error”).

The “hello world” of audio is a 440 Hz sine — the note A above middle C:

$$x(t) = \sin(2\pi \cdot 440 \cdot t)$$

To sample it, substitute $t = i / f_s$ for $i = 0, 1, \dots, N-1$, and scale the $[-1, 1]$ result up to the 16-bit range. In C:

(lrint rounds to the nearest integer.) The complete Zig program — header plus samples plus the all-important flush:

pub fn main() !void {
    // Allocator and I/O handle (Zig 0.16 passes I/O around explicitly).
    var dbg: std.heap.DebugAllocator(.{}) = .init;
    defer _ = dbg.deinit();
    const gpa = dbg.allocator();
    var threaded: std.Io.Threaded = .init(gpa, .{});
    defer threaded.deinit();
    const io = threaded.io();

    var file = try std.Io.Dir.cwd().createFile(io, "output.wav", .{});
    defer file.close(io);
    var wbuf: [4096]u8 = undefined;
    var fw = file.writer(io, &wbuf);
    const w = &fw.interface;

    const data_len: u32 = nsamples * @sizeOf(sample_t);
    try writeWavHeader(w, data_len);

    var i: usize = 0;
    while (i < nsamples) : (i += 1) {
        const n: f32 = @floatFromInt(i);
        const t = n / @as(f32, sr); // time of sample i, in seconds
        const s = std.math.sin(2.0 * std.math.pi * 440.0 * t);
        // @round matches C's lrint; a bare @intFromFloat truncates toward zero.
        const v: sample_t = @intFromFloat(@round(sample_max * s));
        try w.writeInt(sample_t, v, .little);
    }
    try w.flush(); // <-- buffered bytes only hit disk on flush; forgetting this truncates the file
}

Zig note — the I/O setup (new in 0.16). Zig 0.16 made I/O an explicit value you pass around. The three lines DebugAllocator → Threaded → io() give you a memory allocator and an io handle; file.writer(io, &wbuf) wraps the file in a buffered writer (it batches bytes for speed). &fw.interface is the *std.Io.Writer we hand to writeWavHeader. The catch beginners always hit: a buffered writer holds the last chunk in memory until you call w.flush() — miss it and the file ends up short or empty. defer runs cleanup (deinit, close) automatically when the scope exits.

Math note — reading the sine line. t = i / sr is the time of sample i in seconds. Feeding 2π·440·t to sin makes the wave complete 440 cycles per second — that is the pitch. Double 440 → one octave up; halve it → one octave down. Pitch is multiplicative, which is why octaves are ratios, not fixed offsets.

Run it (zig run wave.zig), then turn your volume below 20 % before playing — a full-scale sine is genuinely loud, and a bug can be worse. The complete C program lives on tig.

A full-scale sine fills the entire amplitude range, which is far too loud for one instrument in a mix (you rarely want a single voice above ~0.5). Audio signals are functions, so to make one quieter you multiply by a factor below 1:

const v: sample_t = @intFromFloat(sample_max * 0.2 * s); // 0.2 ≈ −14 dB, plenty of headroom

Even at 0.2 you will hear it clearly, because hearing is logarithmic (more on that in mix). From the oscillator chapter on, we keep generation and volume separate and leave signals at full amplitude — so keep your system volume down.

For fun: bytebeat (discovered 2011 by viznut) makes chiptune-ish music from a single integer expression. Drop to 8-bit samples at 8 kHz:

const sample_t = u8; // 8-bit samples, 0..255
const sr = 8000; // lo-fi sample rate

Then fill the buffer with one expression built only from integer math, bitwise ops, and comparisons:

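var t: usize = 0;
while (t < nsamples) : (t += 1) {
    // Zig gives & and | the same precedence, so the parentheses are needed
    // to get the C grouping: (t*5 & t>>7) | (t*3 & t>>10).
    const x: u8 = @truncate((t *% 5 & t >> 7) | (t *% 3 & t >> 10));
    try w.writeByte(x);
}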

Zig note — wrapping vs. checked math. In C, unsigned overflow silently wraps. Zig makes that a choice: plain * panics on overflow in safe builds, while *% is the explicit wrapping multiply — which is what bytebeat relies on. @truncate keeps only the low 8 bits to land back in u8. Making “I really do want wraparound here” visible is very on-brand for Zig.
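
For comparison, the same expression as a C function. This is a sketch: the name `bytebeat` and the choice of `uint32_t` for the counter are assumptions made here. The parentheses only spell out C's default grouping, in which & binds tighter than |.

```c
#include <stdint.h>

/* One bytebeat sample; unsigned arithmetic wraps, the cast keeps the low 8 bits. */
uint8_t bytebeat(uint32_t t) {
    return (uint8_t)(((t * 5) & (t >> 7)) | ((t * 3) & (t >> 10)));
}
```

Calling it for t = 0, 1, 2, … and writing each byte out gives the 8-bit sample stream.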

A few characters and it already sounds like music. Tweak the expression and see what falls out.

Portability / endianness. The C struct-write breaks on a big-endian machine, so the original adds an is_le() check and per-field byte-swapping wrappers. In Zig there is nothing to fix: writeInt(..., .little) already pins the byte order, so the program is correct everywhere by construction.

Variable duration. If you do not know the length up front (e.g. recording until Stop), you cannot fill the size fields in advance. The fix is the same in both languages: remember the file position, write a placeholder, and after the samples seek back and patch the two size fields (ftell/fseek in C; fw.seekTo / file.seekTo in Zig).
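
On the C side, the seek-back-and-patch step might be sketched as follows. It assumes the canonical 44-byte header laid out earlier, so the RIFF size sits at byte 4 and the data size at byte 40. The helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value at an absolute file offset, little-endian. */
int patch_u32_le(FILE *fp, long offset, uint32_t v) {
    uint8_t b[4] = {
        (uint8_t)(v & 0xFF), (uint8_t)((v >> 8) & 0xFF),
        (uint8_t)((v >> 16) & 0xFF), (uint8_t)((v >> 24) & 0xFF),
    };
    if (fseek(fp, offset, SEEK_SET) != 0) return -1;
    return fwrite(b, 1, 4, fp) == 4 ? 0 : -1;
}

/* After all samples are written: derive the data length from the end
   position, then patch the RIFF size (byte 4) and the data size (byte 40). */
int finalize_wav_sizes(FILE *fp) {
    if (fseek(fp, 0, SEEK_END) != 0) return -1;
    long end = ftell(fp);
    if (end < 44) return -1;                     /* no complete header present */
    uint32_t data_len = (uint32_t)(end - 44);
    if (patch_u32_le(fp, 4, 36 + data_len) != 0) return -1;
    if (patch_u32_le(fp, 40, data_len) != 0) return -1;
    return fseek(fp, 0, SEEK_END);               /* leave the position at the end */
}
```

The intended use is to write the header with placeholder sizes, stream samples for as long as needed, and call `finalize_wav_sizes(fp)` once just before `fclose`.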

  1. Change 440 to 220 and 880 Hz; confirm each is “the same note” an octave away. Why is octave = ×2, not +constant?
  2. Write a stereo file: set nchannels = 2, fix bytes/sec and block_align, and write samples interleaved L, R. Pan a tone hard left by writing it to L and 0 to R.
  3. Inspect the bytes: xxd output.wav | head -3. Find RIFF, WAVE, fmt , data, and confirm 44 AC 00 00 (44100) appears.
  4. Comment out w.flush() and observe what happens to the file. Feel the bug once now.

Gareth Loy’s Musimathics (vol. 1) for the physics of sound; Ken Pohlmann’s Principles of Digital Audio for sampling/quantization/codecs; xiph.org’s “Digital Show & Tell” video for an intuitive sampling demo. Original chapter and full source: mu.krj.st/wave.


Next: 3 · osc (part i) — turning a counter into any waveform.