Project · DSP / Embedded

Real-Time FM Software-Defined Radio

Real-time FM SDR on Raspberry Pi 4 — mono, stereo, and RDS from RF. Three-thread pipeline with polyphase resampling, holds real-time at 600 MHz, 101 taps.

Platform Raspberry Pi 4
Modes 4 (Mode 0/1/2/3 — RF 1.84–2.4 MS/s, audio 38.4–48 kS/s)
Filter taps 101 (verified at 13 / 75 / 301 across all modes)
Real-time threshold 5 min run, no underrun @ 600 MHz, 101 taps
Resampler speedup 1.4× from naive convolve-decimate to polyphase
Threads 3 — IF (producer) → mono/stereo + RDS (consumers)

The 3DY4 capstone was a real-time FM software-defined radio on a Raspberry Pi 4. The receiver consumes IQ samples from an RF dongle and produces three outputs in parallel: mono audio, stereo audio, and decoded RDS metadata (station name, song title, traffic info). Front-end filtering, FM demodulation, three independent audio/data paths, and the threading layer that ties them together all run in software on a $50 board, with custom RF / IF / audio rates assigned per group.

It was a group project (Ratish Gupta, Gorazd Bocev, Manav Patel, Jaavin Mohanakumar). My piece was the mono-path optimization that unblocked real-time operation, plus the RDS modelling and most of the C++ debugging on the RDS side: band-pass channel extraction, carrier recovery via squaring non-linearity, demodulation, and Manchester / differential decoding.

Mode set (group 62)

Mode   RF Fs (kS/s)   IF Fs (kS/s)   Audio Fs (kS/s)   Block (ms)   RDS
0      2400.0         240.0          48.0              40           yes
1      1843.2         307.2          38.4              25           no
2      2400.0         240.0          44.1              70           yes
3      1920.0         384.0          44.1              30           no

The four modes force every block to be parameterized in tap count and sample rate. A naive resampler that closes timing for one mode will underrun on another.
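To make the parameterization concrete, here is a minimal sketch that derives the front-end decimation factor and the audio resampling ratio U/D for each mode from the table above. The names `MODES` and `rate_plan` are hypothetical and only illustrate the arithmetic; they are not the project's code.

```python
from fractions import Fraction

# Rates in kS/s, copied from the mode table (group 62).
MODES = {
    0: {"rf": Fraction("2400.0"), "if": Fraction("240.0"), "audio": Fraction("48.0")},
    1: {"rf": Fraction("1843.2"), "if": Fraction("307.2"), "audio": Fraction("38.4")},
    2: {"rf": Fraction("2400.0"), "if": Fraction("240.0"), "audio": Fraction("44.1")},
    3: {"rf": Fraction("1920.0"), "if": Fraction("384.0"), "audio": Fraction("44.1")},
}


def rate_plan(mode):
    """Return (rf_decimation, audio_up, audio_down) for one mode."""
    rates = MODES[mode]
    rf_decim = rates["rf"] / rates["if"]
    if rf_decim.denominator != 1:
        raise ValueError("RF-to-IF ratio is not an integer decimation")
    # Fraction reduces to lowest terms, which gives the smallest U and D.
    audio_ratio = rates["audio"] / rates["if"]
    return int(rf_decim), audio_ratio.numerator, audio_ratio.denominator


for mode in sorted(MODES):
    print(mode, rate_plan(mode))
```

Modes 0 and 1 come out as plain integer decimation (U = 1), while the 44.1 kS/s modes need a true rational resampler, which is why a block tuned for one mode does not automatically hold timing on another.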

Architecture

The receiver is a three-stage pipeline. The RF front end runs an anti-aliasing FIR on the raw IQ stream and decimates to the IF rate. The FM demodulator turns IQ into a real-valued baseband. From there the signal forks into three independent paths:

  • Mono path. LPF and decimate to the audio rate.
  • Stereo path. 19 kHz pilot BPF → PLL recovers the 38 kHz stereo carrier (NCO phase locked) → mixer downconverts the L−R difference signal → LPF + decimation → combined with the mono branch into L/R channels.
  • RDS path. 57 kHz channel BPF → squaring non-linearity (the BPSK carrier phase flips between 0 and π with the data; squaring doubles the frequency to 114 kHz and maps both phases to the same value, leaving an unmodulated tone) → PLL on 114 kHz with NCO + delay match → mixer + LPF → polyphase rational resampler to SPS × 2375 sym/s → root-raised cosine matched filter → peak-tracking CDR → Manchester + differential decode → frame sync.

Threading is producer-consumer with bounded queues. The IF thread is the producer. Mono+stereo and RDS are independent consumers. There are two thread-safe queues, each guarded by a mutex and two condition variables (full / empty) so threads block in wait() rather than polling. Queue depth is 10, enough to absorb jitter without bloating latency. End-of-stream is signalled by a wrapper struct with an eos flag pushed into both queues.
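The queue design above can be sketched in a few lines. The project's implementation is in C++; this Python version is only an illustration of the same pattern (one lock, two condition variables, depth 10, an eos flag on the block wrapper), and the names `Block`, `BoundedQueue`, and `run_pipeline` are hypothetical.

```python
import threading
from collections import deque
from dataclasses import dataclass


@dataclass
class Block:
    data: list
    eos: bool = False  # end-of-stream marker pushed after the last real block


class BoundedQueue:
    def __init__(self, depth=10):
        self._items = deque()
        self._depth = depth
        lock = threading.Lock()
        # Two condition variables share one mutex, as in the description above.
        self._not_full = threading.Condition(lock)
        self._not_empty = threading.Condition(lock)

    def push(self, item):
        with self._not_full:
            while len(self._items) >= self._depth:
                self._not_full.wait()  # block instead of polling
            self._items.append(item)
            self._not_empty.notify()

    def pop(self):
        with self._not_empty:
            while not self._items:
                self._not_empty.wait()
            item = self._items.popleft()
            self._not_full.notify()
            return item


def run_pipeline(num_blocks):
    audio_q, rds_q = BoundedQueue(), BoundedQueue()
    results = {"audio": [], "rds": []}

    def producer():
        for i in range(num_blocks):
            for q in (audio_q, rds_q):
                q.push(Block([i]))
        for q in (audio_q, rds_q):
            q.push(Block([], eos=True))

    def consumer(q, name):
        while True:
            block = q.pop()
            if block.eos:
                break
            results[name].extend(block.data)

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=consumer, args=(audio_q, "audio")),
        threading.Thread(target=consumer, args=(rds_q, "rds")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each consumer has its own queue, a slow RDS block stalls the producer only once that queue fills, and the audio consumer keeps draining its own queue in the meantime.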

Mono path: the optimization that unblocked real time

The mono path was the foundation, and getting it real-time was the gating problem. The lab’s convolve-and-decimate fused operation worked at 1.5 GHz / 75 taps but underran the moment we pushed to lower clocks or higher tap counts.

Three rounds of optimization, in the order I made them:

  1. Polyphase rational resampler. Modelled it in Python first to confirm the math, then ported it. The polyphase commutator skips the multiplications by stuffed zeros and computes only the output samples actually retained. For a U/D resampler at 101 taps, that cuts the multiply count by roughly the upsample factor U. The result was a 1.4× wall-clock speedup on the Pi.
  2. Loop unrolling and precomputed constants. Hot inner loops were rewritten to pull invariants (resize bounds, division factors) out of the body.
  3. Conditional resize / reserve. Output buffers were reserve()-ed once, and the resize-vs-clear path was guarded so the allocator wasn’t hit per block.
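The idea behind the first step can be checked with a small NumPy model. This is a sketch of the technique, not the project's Python model or its C++ port: `naive_resample` zero-stuffs, filters every upsampled sample, and then decimates (more wasteful than the lab's fused baseline), while `polyphase_resample` touches only the taps that line up with real input samples and only the outputs that survive decimation. Both function names are hypothetical.

```python
import numpy as np


def naive_resample(x, h, up, down):
    """Zero-stuff by `up`, filter with h, then keep every `down`-th sample."""
    stuffed = np.zeros(len(x) * up)
    stuffed[::up] = x
    filtered = np.convolve(stuffed, h)[: len(stuffed)]
    return filtered[::down]


def polyphase_resample(x, h, up, down):
    """Same output, computed without multiplying by the stuffed zeros."""
    n_out = -(-len(x) * up // down)  # ceil(len(x) * up / down)
    y = np.zeros(n_out)
    for n in range(n_out):
        t = n * down            # index of this output in the upsampled stream
        phase = t % up          # which polyphase branch of h applies
        base = t // up          # newest real input sample that contributes
        taps = h[phase::up]     # only taps that land on non-zero samples
        idx = base - np.arange(len(taps))
        valid = idx >= 0        # samples before the start count as zero
        y[n] = np.dot(taps[valid], x[idx[valid]])
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = rng.standard_normal(101)
print(np.allclose(naive_resample(x, h, 3, 5), polyphase_resample(x, h, 3, 5)))
```

Each output uses about len(h)/U taps instead of len(h), which is where the "roughly the upsample factor" saving comes from.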

After all three, mono ran clean at 600 MHz, 101 taps, no underrun for 5 minutes across every mode. The same FIR machinery then served stereo and RDS.

Stereo path: PLL and the off-by-one

Stereo was modelled in Python first, with PSD plots, isolated-channel inspection, and PLL debugging. The first port to C++ had two real bugs:

  • PLL block-state transition. The last sample of one block was supposed to be reused as the first of the next. The state was being carried across, but the sample itself was duplicated, kept in the previous block AND prepended to the next, making each block one sample too long. That off-by-one smeared into stereo “bleed” between channels.
  • Missing Hann window on the C++ BPF. The Python BPF used a Hann window; the C++ port did not. The DC offset that introduced was caught by a unit test that I initially dismissed as overly strict. In hindsight, failed validation should be investigated even when the measured error looks small.

The fixes were to compute the all-pass mono-branch delay dynamically from the tap count (it had been hardcoded), correct the block-boundary handoff in the PLL, and apply the Hann window in the C++ BPF. After the fixes, channel separation was clean.
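A minimal sketch of the delay fix, assuming an odd-length linear-phase FIR so the group delay is (taps − 1) / 2 samples. The state carried between blocks is exactly the samples not yet emitted, so nothing is duplicated or dropped at a block boundary. `group_delay` and `DelayBlock` are hypothetical names, not the project's classes.

```python
import numpy as np


def group_delay(num_taps):
    """Delay in samples of an odd-length linear-phase FIR."""
    return (num_taps - 1) // 2


class DelayBlock:
    """All-pass delay that matches a FIR branch, processed block by block."""

    def __init__(self, num_taps):
        # State holds the tail of the previous block (zeros before the first).
        self.state = np.zeros(group_delay(num_taps))

    def process(self, block):
        joined = np.concatenate((self.state, block))
        out = joined[: len(block)]          # emit exactly len(block) samples
        self.state = joined[len(block):]    # keep the rest for the next call
        return out


x = np.arange(1, 1001, dtype=float)
delay = DelayBlock(num_taps=101)
out = np.concatenate([delay.process(x[i:i + 100]) for i in range(0, len(x), 100)])
```

Processing in blocks gives the same result as delaying the whole stream at once, which is the property the original PLL handoff violated by emitting one sample twice.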

RDS path

This is where I spent most of my time on the model and the debugging. The signal flow:

# Signal-flow sketch: pll_114k, lowpass, mixer, polyphase_resample and rrc stand in for the project's blocks.
# Carrier recovery: squaring strips the BPSK modulation.
# Signal at 57 kHz with phase in {0, π} → squared signal at 114 kHz with phase in {0, 2π}, i.e. a clean tone.
sq = rds_band ** 2
# Gain of 2 restores the amplitude of the 114 kHz term (squaring halves it)
sq *= 2
# 114 kHz BPF + PLL + NCO recovers the carrier; delay-match the I path
i_carrier, q_carrier = pll_114k(sq)
# Mix back to baseband, LPF, polyphase resample to SPS × 2375 sym/s
i_bb = lowpass(mixer(rds_band_delayed, i_carrier))
i_resampled = polyphase_resample(i_bb, U, D)
# Root-raised cosine matched filter shapes pulses for zero-ISI sampling
i_rrc = rrc(i_resampled, sps, beta=0.5)
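The squaring step is easy to verify in isolation. The sketch below builds a synthetic BPSK signal on a 57 kHz carrier (random ±1 data, an arbitrary 200 samples per symbol, 240 kS/s as in Modes 0 and 2), squares it with the gain of 2, and locates the strongest spectral line. All parameters and the name `squaring_demo` are assumptions chosen for the demo, not values from the project.

```python
import numpy as np


def squaring_demo(fs=240_000.0, fc=57_000.0, n=24_000, samples_per_symbol=200, seed=1):
    rng = np.random.default_rng(seed)
    t = np.arange(n) / fs
    n_sym = n // samples_per_symbol
    # Random +/-1 data: the carrier phase flips between 0 and pi.
    data = np.repeat(rng.choice([-1.0, 1.0], size=n_sym), samples_per_symbol)
    bpsk = data * np.cos(2 * np.pi * fc * t)
    # 2*cos^2(a) = 1 + cos(2a): the data term squares to 1 and drops out.
    sq = 2.0 * bpsk ** 2
    spectrum = np.abs(np.fft.rfft(sq - np.mean(sq)))
    peak_hz = np.fft.rfftfreq(n, d=1.0 / fs)[np.argmax(spectrum)]
    return t, sq, peak_hz


t, sq, peak_hz = squaring_demo()
print(peak_hz)
```

The squared signal is a DC term plus a unit-amplitude 114 kHz tone regardless of the data, which gives the PLL a steady reference to lock to.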

Two pieces took real debugging.

Peak-tracking CDR. The first version used a fixed decimation phase and fell out of sync within seconds. The replacement is a peak-tracking CDR with early-late timing detection. The first block (after a 0.5 s warmup for PLL lock) brute-forces every possible decimation phase and picks the one with the highest “score”, which is the mean absolute value of samples spaced sps apart. After that, a drift accumulator tracks the running difference between the early and late samples, and when it crosses a threshold, the phase shifts by one. Constellation plots before and after were night and day. Points went from a smear to two well-separated clusters at ±0.5 on the I axis.
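A minimal sketch of the two CDR stages, under assumptions of my own: the early and late samples are taken one sample either side of the current sampling point, the threshold is 1.0, and the accumulator resets after each shift. The project's exact definitions may differ, and `pick_initial_phase` and `track_symbols` are hypothetical names.

```python
import numpy as np


def pick_initial_phase(x, sps):
    """Brute-force every decimation phase; score = mean |sample| at that phase."""
    scores = [np.mean(np.abs(x[p::sps])) for p in range(sps)]
    return int(np.argmax(scores))


def track_symbols(x, sps, phase, threshold=1.0):
    """Sample once per symbol, nudging the phase with an early-late detector."""
    symbols = []
    drift = 0.0
    pos = phase if phase >= 1 else phase + sps  # need one sample before pos
    while pos + 1 < len(x):
        symbols.append(x[pos])
        # Positive when the late sample is stronger, i.e. the peak lies later.
        drift += abs(x[pos + 1]) - abs(x[pos - 1])
        if drift > threshold:
            pos += 1
            drift = 0.0
        elif drift < -threshold:
            pos -= 1
            drift = 0.0
        pos += sps
    return np.array(symbols)


# Synthetic test signal: +/-1 symbols with a cosine-shaped pulse peaking at phase 5.
sps, true_phase = 16, 5
rng = np.random.default_rng(0)
bits = rng.choice([-1.0, 1.0], size=200)
pulse = np.cos(np.pi * (np.arange(sps) - true_phase) / sps)
x = np.repeat(bits, sps) * np.tile(pulse, len(bits))
print(pick_initial_phase(x, sps))
```

Starting the tracker a few samples early on this signal, the accumulator walks the sampling point onto the pulse peak one step at a time and then stays there.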

Manchester decoding edge case. The lab tested only against Samples3.raw, where the “next sample” decoding rule worked. With other captures, one rule produced a flood of HH/LL invalid pairs and the other didn’t, so the right choice depends on the recording. The fix was to brute-force both rules on the first block at the start of a stream, pick whichever produced fewer invalid pairs, then commit to it.
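The selection step can be sketched as follows. I am assuming here that the two "rules" correspond to the two possible pairings of half-symbols (starting at an even or an odd index), and that a high-low pair decodes to 1; both are labelled assumptions, and the function names are hypothetical. The polarity choice matters little in practice because the differential decode that follows cancels a global inversion for every bit after the first.

```python
def manchester_decode(halves, offset):
    """Pair half-symbols starting at `offset`; return (bits, invalid_pair_count)."""
    bits, invalid = [], 0
    for i in range(offset, len(halves) - 1, 2):
        first, second = halves[i], halves[i + 1]
        if first == second:
            invalid += 1      # HH or LL is not a legal Manchester pair
            bits.append(0)    # placeholder so bit positions stay aligned
        else:
            bits.append(1 if first > second else 0)  # assumed: HL -> 1, LH -> 0
    return bits, invalid


def choose_offset(halves):
    """Try both pairings on the first block and keep the one with fewer invalid pairs."""
    return min((0, 1), key=lambda off: manchester_decode(halves, off)[1])


def differential_decode(bits, prev=0):
    """Each output bit is the XOR of the current and previous input bits."""
    out = []
    for b in bits:
        out.append(b ^ prev)
        prev = b
    return out


def manchester_encode(bits):
    """Test helper: 1 -> high, low and 0 -> low, high."""
    halves = []
    for b in bits:
        halves.extend([1, 0] if b else [0, 1])
    return halves


message = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
halves = manchester_encode(message)
print(choose_offset(halves), choose_offset([1] + halves))
```

A single stray half-symbol at the start of a capture is enough to flip which pairing is valid, which matches the observation that the right rule depends on the recording.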

Frame synchronization (Gorazd’s piece, worth documenting because it’s the last hop) used per-block syndrome computation against the RDS spec, with state saved across chunk boundaries so a lock formed in block N would persist into block N+1. False-positive syndromes in invalid block transitions (A→C, for example) were filtered by enforcing the expected ordering and 26-bit spacing.

Measurements

At 101 taps, frontend time on Mode 0 ranged from 8.27 ms/block at 1.5 GHz to 20.57 ms/block at 600 MHz, roughly linear with clock as expected for a MAC-bound workload. RDS adds another ~25 ms/block at 600 MHz on top, dominated by the RDS LPF (13.5 ms) and carrier recovery (12.4 ms). At 13 taps the audio was legible but staticky. At 301 taps it was indistinguishable from 101 taps. So 101 sat at the knee of the diminishing-returns curve, and the final system ran there.
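The "roughly linear with clock" claim can be checked directly from the two frontend numbers. This sketch assumes run time scales inversely with clock frequency for a compute-bound block; `predicted_ms` is a hypothetical helper.

```python
def predicted_ms(t_ref_ms, f_ref_mhz, f_new_mhz):
    """Scale a per-block time assuming run time is inversely proportional to clock."""
    return t_ref_ms * f_ref_mhz / f_new_mhz


measured_600 = 20.57                               # ms/block at 600 MHz, Mode 0, 101 taps
predicted_600 = predicted_ms(8.27, 1500.0, 600.0)  # scaled from the 1.5 GHz figure
rel_error = abs(predicted_600 - measured_600) / measured_600
print(f"predicted {predicted_600:.3f} ms, measured {measured_600} ms, error {rel_error:.1%}")
```

The prediction lands within about one percent of the measurement, consistent with a workload limited by arithmetic rather than memory or I/O.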

Lessons

  • Optimize once, reuse three times. The polyphase resampler unlocked mono first and then carried us through stereo and RDS without re-paying the optimization cost.
  • Trust the unit tests, even the strict ones. The Hann-window DC-offset failure was a real bug that I had dismissed as a flaky check. It cost a day of stereo-bleed debugging.
  • MACs aren’t the whole picture. Resampling, when not implemented as polyphase, can dominate timing while still looking cheap in a MAC count. Profile, don’t estimate.