Hardware Implementation

Ternary neural network inference on an AMD Kria KV260 FPGA. From HLS-generated bloat to hand-written RTL, achieving 69x fewer flip-flops for the same inference task.

The result: A fully ternary neural network running on an FPGA in 955 flip-flops. Zero multipliers. The ternary operation is just add, subtract, or skip. 70% CIFAR-10 accuracy, bit-exact with simulation, using 1% of the chip. The other 99% is free.

Target Platform

Board: AMD Kria KV260
SoC: Zynq UltraScale+ MPSoC
Logic Cells: 256K
DSP Slices: 1,248

v11 - HLS Generated

First deployment. Proved ternary AI runs on FPGA.

Approach

Vitis HLS auto-generated RTL from C++. Ternary weights with float ReLU activations. It worked, but the tool unrolled loops into parallel hardware, burning resources.

Accuracy (CIFAR-10): 77.60%
Flip-Flops: 66,141
LUTs: 67,916 (58.0%)
DSP Slices: 280 (22.4%)
Block RAM: 83 (57.6%)
Inference Latency: 147ms / image
PL Power: 40mW overhead
Deployed: March 16, 2026
Datapath
Weights: ternary {-1, 0, +1} - zero multipliers
Activations: float32 (BN + ReLU)
500 test images, exact match with C-sim

True TGN - Hand-Written RTL

The breakthrough. Fully ternary. 69x smaller.

Approach

Hand-written Verilog. One compute unit, time-multiplexed across all layers, like a keyboard controller scanning a key matrix. The same hardware is reused for every layer.

Accuracy (CIFAR-10): 70.00%
Flip-Flops: 955
LUTs: 1,217
DSP Slices: 4
Block RAM: 15
Verification: Bit-exact with C-sim
Utilization: 1% of the KV260
Deployed: March 18, 2026
Fully Ternary Datapath
Weights: ternary {-1, 0, +1} - zero multipliers
Activations: ternary {-1, 0, +1} - zero float
100 test images, bit-exact with C-sim
66,141 FF (HLS generated) vs 955 FF (hand-written RTL): 69x smaller.

How Ternary Computation Works

+1 (Dopamine): acc += activation - excitatory, reinforce pathway
0 (Acetylcholine): skip, no hardware - neutral, conserve energy
-1 (Serotonin): acc -= activation - inhibitory, suppress pathway

No multiply instruction exists in the datapath. A ternary weight of +1 means add. A weight of -1 means subtract. A weight of 0 generates no hardware at all, eliminated during synthesis. At runtime, approximately 45-50% of weights are zero, meaning half the network is physically absent from the chip.
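The add/subtract/skip datapath can be sketched in a few lines. This is a behavioral illustration only (the function name `ternary_mac` is ours, not from the RTL): each weight in {-1, 0, +1} selects an operation instead of feeding a multiplier.

```python
def ternary_mac(weights, activations):
    """Ternary multiply-accumulate without a multiplier.

    Each weight in {-1, 0, +1} selects add, subtract, or skip.
    """
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a        # excitatory: reinforce pathway
        elif w == -1:
            acc -= a        # inhibitory: suppress pathway
        # w == 0: nothing happens; in synthesized hardware this
        # term generates no logic at all
    return acc
```

In software the zero branch is a skipped iteration; in synthesis it is an absent wire, which is why roughly half the network costs nothing on chip.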

The RTL Core

Two Verilog modules handle all inference. One compute unit processes every layer sequentially, like a keyboard controller scanning a key matrix. The same hardware runs any model size by swapping weights between passes.

tgn_block_core
3x3 ternary convolution + threshold. State machine walks each output pixel, accumulates via add/subtract/skip, applies ternary threshold. ~150 FF.
tgn_gate_apply
Per-channel neuromodulation. Applies a ternary attention gate: +1 = pass, 0 = silence (clock-gated), -1 = invert. ~50 FF.
farscape_ternary_core.v
Complete shareable Verilog file. Both modules with full documentation. No vendor-specific primitives. Synthesizes on Xilinx, Intel, Lattice, or ASIC.
Contact for access to the full RTL source.
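A behavioral Python model of what the two modules compute, under our own assumptions (function names, the single `thresh` parameter, and valid-only 3x3 windows are ours; the RTL's exact thresholding and padding may differ):

```python
def ternary_conv3x3(image, kernel, thresh):
    """Behavioral model of a 3x3 ternary convolution + ternary threshold,
    in the spirit of tgn_block_core: walk each output pixel, accumulate
    via add/subtract/skip, then map the accumulator back to {-1, 0, +1}.

    image: 2D list of ints; kernel: 3x3 ternary weights; thresh: positive int.
    """
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    wgt = kernel[ky][kx]
                    if wgt == 1:
                        acc += image[y + ky][x + kx]
                    elif wgt == -1:
                        acc -= image[y + ky][x + kx]
            # ternary threshold keeps activations in {-1, 0, +1}
            out[y][x] = 1 if acc > thresh else (-1 if acc < -thresh else 0)
    return out

def gate_apply(channel, gate):
    """Behavioral model of tgn_gate_apply: +1 pass, 0 silence, -1 invert."""
    if gate == 0:
        return [0] * len(channel)  # channel silenced (clock-gated in RTL)
    return list(channel) if gate == 1 else [-v for v in channel]
```

Because both input activations and outputs stay ternary, every layer of the network can be expressed as repeated calls into these two operations.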

Road Ahead

1
Scale the RTL core
99% of KV260 is unused. Larger models with more channels fit the same core, requiring only more weight ROM.
2
Tape-out to silicon
TinyTapeout for a proof chip. ChipIgnite (SkyWater 130nm) for a full ASIC. The Verilog is vendor-agnostic and ports directly.
3
SPI peripheral
At 955 FF, the core is smaller than most I/O controllers. Wrap it with an SPI interface and it becomes a ternary AI coprocessor any microcontroller can talk to.

Verification and Methodology

Resource Numbers
All flip-flop, LUT, BRAM, and DSP numbers come from Vivado 2025.2 post-place utilization reports (report_utilization). Device: xck26-sfvc784-2LV-c (AMD Kria KV260). The v11 HLS report shows 66,141 CLB Registers (28.24% of 234,240 available). The True TGN RTL report shows 955 FF, 1,217 LUT, 15 BRAM, 4 DSP.
Accuracy Testing
v11 HLS: 388 correct out of 500 CIFAR-10 test images = 77.60%. Tested on KV260 ARM PS using kv260_full_test_fast.py with libflush.so for DMA cache coherency. Results match Vitis HLS C-simulation exactly (bit-identical float32 outputs).

True TGN RTL: 70 correct out of 100 CIFAR-10 test images = 70.00%. Tested on KV260 via AXI-Lite register interface using kv260_true_tgn_v2.py (pure numpy, no TensorFlow dependency). Results match iverilog RTL simulation bit-for-bit (integer outputs).
Inference Pipeline
Both deployments use a hybrid ARM + PL architecture. The ARM Cortex-A53 runs the float stem (2x Conv2D + BN + ReLU) and the final classifier (Global Average Pool + Dense). The programmable logic (PL) runs all ternary convolution blocks. Data is transferred between PS and PL via DMA (v11 HLS) or AXI-Lite register writes (True TGN RTL).
69x Reduction
66,141 FF (v11 HLS) / 955 FF (True TGN RTL) = 69.26x. Both designs perform CIFAR-10 inference with ternary weights on the same KV260 board. The reduction comes from time-multiplexing: one compute unit processes all layers sequentially instead of instantiating parallel hardware per layer.
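The time-multiplexing idea reduces to a control loop. A minimal sketch (the names `run_network`, `layer_weight_roms`, and `compute_unit` are illustrative; the real sequencing lives in the Verilog state machine):

```python
def run_network(activations, layer_weight_roms, compute_unit):
    """One compute unit, reused for every layer.

    Instead of instantiating parallel hardware per layer (the HLS
    approach), make sequential passes over the same unit, swapping
    in each layer's weight ROM between passes.
    """
    for rom in layer_weight_roms:   # one pass per layer
        activations = compute_unit(activations, rom)
    return activations
```

Latency grows with layer count, but the flip-flop cost stays fixed at one compute unit plus control, which is the entire source of the 69x reduction.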
All numbers from Vivado 2025.2 synthesis reports on AMD Kria KV260 (xck26-sfvc784-2LV-c). Project FARSCAPE.