INT8 Systolic MAC Array
Parameterized 8×8 output-stationary systolic MAC array in SystemVerilog, with INT8 operands and INT32 accumulators. Designed for transformer Q/K/V/O and FFN matmuls. Closes timing at 100 MHz on Artix-7 with +3.76 ns slack, using 64 DSP48E1 slices, for a peak of 12.8 GOPS (64 MACs × 2 ops per MAC × 100 MHz).
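To make the output-stationary dataflow concrete, here is a small cycle-level software model. It is a sketch under my own assumptions, not the project's RTL: the names `systolic_matmul`, `wrap_int32`, and `peak_gops` are hypothetical, and the skewed feeding schedule (row i of A delayed by i cycles, column j of B delayed by j cycles) is the textbook arrangement rather than something taken from the design. Each processing element keeps its own accumulator while operands shift right and down.

```python
def wrap_int32(x):
    """Wrap a Python int to signed 32-bit, mimicking an INT32 accumulator."""
    return ((x + 2**31) % 2**32) - 2**31


def systolic_matmul(A, B, n=8):
    """Cycle-level model of an n x n output-stationary array.

    A is n x K and B is K x n (lists of lists of ints). PE(i, j) holds
    the running sum for output C[i][j]; A values move right, B values
    move down, one hop per cycle.
    """
    K = len(A[0])
    acc = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # operand register flowing right
    b_reg = [[0] * n for _ in range(n)]  # operand register flowing down

    # The last PE sees its final operand pair 2*(n-1) cycles after the first PE.
    for t in range(K + 2 * (n - 1)):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i  # row i enters i cycles late
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    k = t - j  # column j enters j cycles late
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b

        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] = wrap_int32(acc[i][j] + a_reg[i][j] * b_reg[i][j])
    return acc


def peak_gops(rows, cols, clock_hz):
    """Peak throughput, counting each MAC as two operations."""
    return rows * cols * 2 * clock_hz / 1e9
```

With this schedule, PE(i, j) sees the operand pair for inner index k at cycle t = k + i + j, so a full pass takes K + 2(n − 1) cycles. `peak_gops(8, 8, 100e6)` reproduces the 12.8 GOPS figure.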
RTL design and verification on FPGA.
Aiming for graduate research in ML-accelerator microarchitecture.
I'm a Computer Engineering undergrad at McMaster. Most of my work is digital RTL and hardware verification on FPGA, with a focus on ML hardware accelerators. The most recent piece I've finished is an 8×8 INT8 systolic MAC array. It verifies bit-exact against a NumPy reference and closes timing at 100 MHz on Artix-7.
After undergrad I want to do research on ML-accelerator microarchitecture. It sits between the algorithm choices and the silicon you can actually build, and that interface is the part of the field I find most interesting.
Real-time FM SDR on Raspberry Pi 4. Recovers mono audio, stereo audio, and RDS metadata from RF input through a three-thread producer-consumer pipeline with polyphase resampling. The polyphase rewrite runs 1.4× faster than the naive version. Holds real time at 600 MHz with 101 taps and no underruns over five minutes.
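The idea behind the polyphase rewrite can be shown in a few lines. This is a generic illustration, not the project's code: the function names and the `up`/`down` parameters are hypothetical, and the usual gain correction by `up` is left out. The naive resampler zero-stuffs, filters every sample, then throws most of the results away. The polyphase form computes only the outputs that are kept, and only with the taps that line up with non-zero inputs.

```python
def resample_naive(x, h, up, down):
    """Zero-stuff by `up`, run the full FIR `h`, keep every `down`-th sample."""
    stuffed = []
    for s in x:
        stuffed.append(s)
        stuffed.extend([0] * (up - 1))

    filtered = []
    for m in range(len(stuffed)):
        acc = 0
        for j, c in enumerate(h):
            if m - j >= 0:
                acc += c * stuffed[m - j]
        filtered.append(acc)
    return filtered[::down]


def resample_polyphase(x, h, up, down):
    """Same result, but only kept outputs are computed, using one tap phase each."""
    out = []
    n = 0
    while n * down < len(x) * up:
        m = n * down                # index into the (never built) stuffed signal
        phase, base = m % up, m // up
        acc = 0
        # Taps h[phase], h[phase + up], ... are the only ones that meet
        # non-zero samples of the zero-stuffed signal.
        for k, c in enumerate(h[phase::up]):
            if base - k >= 0:
                acc += c * x[base - k]
        out.append(acc)
        n += 1
    return out
```

Each output costs roughly `len(h) / up` multiplies instead of `len(h)`, and no work is spent on samples that decimation would discard. Both functions return identical sequences for integer inputs, which makes the naive version a convenient reference when testing the fast one.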
JPEG-style FPGA image decoder on the Altera DE1-SoC at 50 MHz. Around 2,600 lines of SystemVerilog. The pipeline does chroma upsampling, then YCbCr→RGB, then a 2-D inverse DCT. Four hardware-multiplexed multipliers feed six outputs per pixel pair, and a dual-port RAM hides the IDCT transpose.
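The transpose matters because the 2-D inverse DCT is separable: run a 1-D IDCT over every row, transpose, run the same 1-D IDCT again, and transpose back. The floating-point sketch below illustrates that row-column decomposition and checks it against the direct double sum. It is my own illustrative model, assuming the orthonormal 8-point DCT; the function names are hypothetical and the hardware's fixed-point scaling is not modelled.

```python
import math

N = 8  # block size, assumed 8x8 as in baseline JPEG


def _scale(k):
    """Orthonormal DCT scale factor for frequency index k."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def idct_1d(coeffs):
    """1-D inverse DCT (DCT-III) of N coefficients."""
    return [
        sum(
            _scale(k) * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for k in range(N)
        )
        for n in range(N)
    ]


def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]


def idct2_row_column(block):
    """2-D IDCT as two 1-D passes separated by a transpose."""
    pass1 = [idct_1d(row) for row in block]             # transform along each row
    pass2 = [idct_1d(row) for row in transpose(pass1)]  # transform along each column
    return transpose(pass2)                             # restore row-major order


def idct2_direct(block):
    """Reference 2-D IDCT written as the full double sum."""
    return [
        [
            sum(
                _scale(u) * _scale(v) * block[u][v]
                * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                * math.cos((2 * j + 1) * v * math.pi / (2 * N))
                for u in range(N)
                for v in range(N)
            )
            for j in range(N)
        ]
        for i in range(N)
    ]
```

The row-column form needs 2 × 8 one-dimensional transforms per block instead of a 64-term sum per pixel. In hardware, the intermediate buffer between the two passes is where a dual-port RAM can be written in row order and read in column order, so the transpose costs no extra cycles.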