Hardware Implementation
Ternary neural network inference on an AMD Kria KV260 FPGA: from HLS-generated bloat to hand-written RTL, with 69x fewer flip-flops for the same inference task.
v11 - HLS Generated
First deployment. Proved ternary AI runs on FPGA.
Vitis HLS auto-generated the RTL from C++: ternary weights with float ReLU activations. It worked, but the tool unrolled loops into parallel hardware, burning resources.
Activations: float32 (BN + ReLU)
500 test images, exact match with C-sim
True TGN - Hand-Written RTL
The breakthrough. Fully ternary. 69x smaller.
Hand-written Verilog: one compute unit, time-multiplexed across all layers, like a keyboard controller scanning a key matrix.
Activations: ternary {-1, 0, +1} - no floating-point anywhere
100 test images, bit-exact with C-sim
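Replacing float BN + ReLU with ternary activations amounts to folding the batch-norm math into integer thresholds on the accumulator. A minimal sketch of the idea, assuming a symmetric two-threshold scheme (the threshold values here are illustrative, not the project's actual constants):

```python
def ternary_activation(acc, t_neg=-8, t_pos=8):
    """Map an integer accumulator to a ternary activation {-1, 0, +1}.

    acc   : integer partial sum from the compute unit
    t_neg : values at or below this threshold map to -1
    t_pos : values at or above this threshold map to +1

    Thresholds are hypothetical; in practice they would be derived by
    folding the trained batch-norm parameters into integer constants.
    """
    if acc >= t_pos:
        return 1
    if acc <= t_neg:
        return -1
    return 0

print([ternary_activation(a) for a in (-20, -3, 0, 5, 12)])  # [-1, 0, 0, 0, 1]
```

Because both inputs and outputs of every layer are integers, the comparison against the C-sim can be bit-exact rather than tolerance-based.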
How Ternary Computation Works
No multiply instruction exists in the datapath. A ternary weight of +1 means add. A weight of -1 means subtract. A weight of 0 generates no hardware at all, eliminated during synthesis. At runtime, approximately 45-50% of weights are zero, meaning half the network is physically absent from the chip.
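The multiply-free dot product above can be sketched in a few lines. This is an illustrative software model, not the project's RTL; in hardware the `w == 0` branch generates no logic at all:

```python
def ternary_dot(activations, weights):
    """Multiply-free dot product over ternary weights.

    +1 adds the activation, -1 subtracts it, 0 contributes nothing.
    In synthesized hardware the zero-weight terms are simply absent.
    """
    acc = 0
    for a, w in zip(activations, weights):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0: no operation, no hardware
    return acc

print(ternary_dot([3, -2, 5, 1], [1, -1, 0, 1]))  # 3 + 2 + 0 + 1 = 6
```

With roughly half the weights at zero, half the terms of every dot product vanish before synthesis even begins.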
The RTL Core
Two Verilog modules handle all inference. One compute unit processes every layer sequentially, like a keyboard controller scanning a key matrix. The same hardware runs any model size by swapping weights between passes.
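The time-multiplexing scheme can be modeled in software: one compute path, applied layer after layer, with only the weight matrix swapped between passes. A hedged sketch (helper names, thresholds, and the toy network below are all hypothetical):

```python
def ternary_dot(acts, weights):
    """Single multiply-free compute unit: +1 adds, -1 subtracts, 0 skips."""
    acc = 0
    for a, w in zip(acts, weights):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc

def ternary_act(acc, t=2):
    """Illustrative symmetric threshold back to {-1, 0, +1}."""
    return 1 if acc >= t else (-1 if acc <= -t else 0)

def run_network(x, layers):
    # The same compute path serves every layer; only the weight
    # matrix changes between passes, mirroring the single RTL unit
    # scanning through layers like a keyboard matrix.
    for w_matrix in layers:
        x = [ternary_act(ternary_dot(x, row)) for row in w_matrix]
    return x

layers = [
    [[1, -1, 0], [0, 1, 1]],    # layer 1: 3 inputs -> 2 outputs
    [[1, 1], [-1, 0], [0, 1]],  # layer 2: 2 inputs -> 3 outputs
]
print(run_network([1, -1, 1], layers))  # [0, 0, 0]
```

The design choice this models: throughput is traded for area. A fully unrolled network is fast but fixed; one reused unit is slower per inference but fits any layer shape that the weight memory can hold.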
Verification and Methodology
True TGN RTL: 70 of 100 CIFAR-10 test images classified correctly = 70.00% accuracy. Tested on the KV260 via an AXI-Lite register interface using kv260_true_tgn_v2.py (pure numpy, no TensorFlow dependency). On-board results match the iverilog RTL simulation bit-for-bit (integer outputs).
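Because every output is an integer, bit-exact verification reduces to element-wise equality between the RTL simulation's outputs and the software golden model. A minimal sketch of such a check (the sample data is hypothetical, standing in for parsed iverilog logs and the numpy model's logits):

```python
def compare_outputs(rtl_outputs, golden_outputs):
    """Bit-exact check: every integer output from the RTL simulation
    must equal the corresponding golden-model output.

    Returns a list of (index, rtl_value, golden_value) mismatches;
    an empty list means the run is bit-for-bit identical.
    """
    return [
        (i, r, g)
        for i, (r, g) in enumerate(zip(rtl_outputs, golden_outputs))
        if r != g
    ]

# Hypothetical integer logits from both sides of the comparison.
rtl = [12, -3, 7, 0]
golden = [12, -3, 7, 0]
print(compare_outputs(rtl, golden))  # [] -> bit-exact
```

Integer-only outputs are what make this kind of strict equality viable; a float pipeline would force tolerance-based comparison instead.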