Hardware Implementation

Ternary neural network inference on an AMD Kria KV260 FPGA. From HLS-generated bloat to hand-written RTL, achieving 69x fewer flip-flops for the same inference task.

The result: A fully ternary neural network running on an FPGA in 955 flip-flops. Zero multipliers. The ternary operation is just add, subtract, or skip. 70% CIFAR-10 accuracy, bit-exact with simulation, using 1% of the chip. The other 99% is free.

Target Platform

Board: AMD Kria KV260
SoC: Zynq UltraScale+ MPSoC
Logic Cells: 256K
DSP Slices: 1,248

v11 - HLS Generated

First deployment. Proved ternary AI runs on FPGA.

Approach

Vitis HLS auto-generated RTL from C++. Ternary weights with float ReLU activations. It worked, but the tool unrolled loops into parallel hardware, burning resources.

Accuracy (CIFAR-10): 77.60%
Flip-Flops: 66,141
LUTs: 67,916 (58.0%)
DSP Slices: 280 (22.4%)
Block RAM: 83 (57.6%)
Inference Latency: 147ms / image
PL Power: 40mW overhead
Deployed: March 16, 2026
Datapath
Weights: ternary {-1, 0, +1} - zero multipliers
Activations: float32 (BN + ReLU)
500 test images, exact match with C-sim

True TGN - Hand-Written RTL

The breakthrough. Fully ternary. 69x smaller.

Approach

Hand-written Verilog. One compute unit, time-multiplexed across all layers, like a keyboard controller scanning a key matrix. The same hardware is reused for every layer.

Accuracy (CIFAR-10): 70.00%
Flip-Flops: 955
LUTs: 1,217
DSP Slices: 4
Block RAM: 15
Verification: Bit-exact with C-sim
Utilization: 1% of the KV260
Deployed: March 18, 2026
Fully Ternary Datapath
Weights: ternary {-1, 0, +1} - zero multipliers
Activations: ternary {-1, 0, +1} - zero float
100 test images, bit-exact with C-sim
66,141 FF (HLS generated) vs 955 FF (hand-written RTL): 69x smaller.

How Ternary Computation Works

+1 (Dopamine): acc += activation - excitatory, reinforce pathway
0 (Acetylcholine): skip, no hardware - neutral, conserve energy
-1 (Serotonin): acc -= activation - inhibitory, suppress pathway

No multiply instruction exists in the datapath. A ternary weight of +1 means add. A weight of -1 means subtract. A weight of 0 generates no hardware at all, eliminated during synthesis. At runtime, approximately 45-50% of weights are zero, meaning half the network is physically absent from the chip.
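The add/subtract/skip datapath can be sketched in a few lines. This is a behavioral illustration only (the function name `ternary_mac` is ours, not from the RTL): each weight in {-1, 0, +1} selects an operation instead of feeding a multiplier.

```python
def ternary_mac(weights, activations):
    """Ternary multiply-accumulate without a multiplier.

    Each weight in {-1, 0, +1} selects add, subtract, or skip.
    """
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a        # excitatory: reinforce pathway
        elif w == -1:
            acc -= a        # inhibitory: suppress pathway
        # w == 0: nothing happens; in synthesized hardware this
        # term generates no logic at all
    return acc
```

In software the zero branch is a skipped iteration; in synthesis it is an absent wire, which is why roughly half the network costs nothing on chip.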

The RTL Core

Two Verilog modules handle all inference. One compute unit processes every layer sequentially, like a keyboard controller scanning a key matrix. The same hardware runs any model size by swapping weights between passes.

tgn_block_core
3x3 ternary convolution + threshold. State machine walks each output pixel, accumulates via add/subtract/skip, applies ternary threshold. ~150 FF.
tgn_gate_apply
Per-channel neuromodulation. Applies a ternary attention gate: +1 = pass, 0 = silence (clock-gated), -1 = invert. ~50 FF.
farscape_ternary_core.v
Complete shareable Verilog file. Both modules with full documentation. No vendor-specific primitives. Synthesizes on Xilinx, Intel, Lattice, or ASIC.
Contact for access to the full RTL source.
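A behavioral Python model of what the two modules compute, under our own assumptions (function names, the single `thresh` parameter, and valid-only 3x3 windows are ours; the RTL's exact thresholding and padding may differ):

```python
def ternary_conv3x3(image, kernel, thresh):
    """Behavioral model of a 3x3 ternary convolution + ternary threshold,
    in the spirit of tgn_block_core: walk each output pixel, accumulate
    via add/subtract/skip, then map the accumulator back to {-1, 0, +1}.

    image: 2D list of ints; kernel: 3x3 ternary weights; thresh: positive int.
    """
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    wgt = kernel[ky][kx]
                    if wgt == 1:
                        acc += image[y + ky][x + kx]
                    elif wgt == -1:
                        acc -= image[y + ky][x + kx]
            # ternary threshold keeps activations in {-1, 0, +1}
            out[y][x] = 1 if acc > thresh else (-1 if acc < -thresh else 0)
    return out

def gate_apply(channel, gate):
    """Behavioral model of tgn_gate_apply: +1 pass, 0 silence, -1 invert."""
    if gate == 0:
        return [0] * len(channel)  # channel silenced (clock-gated in RTL)
    return list(channel) if gate == 1 else [-v for v in channel]
```

Because both input activations and outputs stay ternary, every layer of the network can be expressed as repeated calls into these two operations.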

Road Ahead

1
Scale the RTL core
99% of KV260 is unused. Larger models with more channels fit the same core, requiring only more weight ROM.
2
Tape-out to silicon
TinyTapeout for a proof chip. ChipIgnite (SkyWater 130nm) for a full ASIC. The Verilog is vendor-agnostic and ports directly.
3
SPI peripheral
At 955 FF, the core is smaller than most I/O controllers. Wrap it with an SPI interface and it becomes a ternary AI coprocessor any microcontroller can talk to.

Verification and Methodology

Resource Numbers
All flip-flop, LUT, BRAM, and DSP numbers come from Vivado 2025.2 post-place utilization reports (report_utilization). Device: xck26-sfvc784-2LV-c (AMD Kria KV260). The v11 HLS report shows 66,141 CLB Registers (28.24% of 234,240 available). The True TGN RTL report shows 955 FF, 1,217 LUT, 15 BRAM, 4 DSP.
Accuracy Testing
v11 HLS: 388 correct out of 500 CIFAR-10 test images = 77.60%. Tested on KV260 ARM PS using kv260_full_test_fast.py with libflush.so for DMA cache coherency. Results match Vitis HLS C-simulation exactly (bit-identical float32 outputs).

True TGN RTL: 70 correct out of 100 CIFAR-10 test images = 70.00%. Tested on KV260 via AXI-Lite register interface using kv260_true_tgn_v2.py (pure numpy, no TensorFlow dependency). Results match iverilog RTL simulation bit-for-bit (integer outputs).
Inference Pipeline
Both deployments use a hybrid ARM + PL architecture. The ARM Cortex-A53 runs the float stem (2x Conv2D + BN + ReLU) and the final classifier (Global Average Pool + Dense). The programmable logic (PL) runs all ternary convolution blocks. Data is transferred between PS and PL via DMA (v11 HLS) or AXI-Lite register writes (True TGN RTL).
69x Reduction
66,141 FF (v11 HLS) / 955 FF (True TGN RTL) = 69.26x. Both designs perform CIFAR-10 inference with ternary weights on the same KV260 board. The reduction comes from time-multiplexing: one compute unit processes all layers sequentially instead of instantiating parallel hardware per layer.
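The time-multiplexing idea reduces to a control loop. A minimal sketch (the names `run_network`, `layer_weight_roms`, and `compute_unit` are illustrative; the real sequencing lives in the Verilog state machine):

```python
def run_network(activations, layer_weight_roms, compute_unit):
    """One compute unit, reused for every layer.

    Instead of instantiating parallel hardware per layer (the HLS
    approach), make sequential passes over the same unit, swapping
    in each layer's weight ROM between passes.
    """
    for rom in layer_weight_roms:   # one pass per layer
        activations = compute_unit(activations, rom)
    return activations
```

Latency grows with layer count, but the flip-flop cost stays fixed at one compute unit plus control, which is the entire source of the 69x reduction.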
All numbers from Vivado 2025.2 synthesis reports on AMD Kria KV260 (xck26-sfvc784-2LV-c). Project FARSCAPE.