
Building Efficient Neural Networks for FPGAs: Part 1

A practical guide for embedded engineers stepping into ML acceleration.


Edge inference is having a moment for one simple reason: latency. If your model needs to react at the sensing level, sending data to the cloud or even to a remote GPU is often too slow, too power-hungry, or too unreliable.

This post walks through an end-to-end path: convert a trained neural network into synthesizable HLS code using hls4ml, simulate it, validate accuracy under fixed-point quantization, and explore the precision-versus-resource trade-offs of FPGA deployment.

All code for this post is available on GitHub: oluseyivictor/edge-nn-fpga-hls4ml


Why FPGA for ML Inference?

If you’ve spent time deploying models on microcontrollers (TFLite Micro, Edge Impulse), you’ve hit the ceiling: limited parallelism, no pipelining, and models that either fit in flash or don’t. GPUs solve the compute problem but bring power budgets and latency profiles that don’t suit many edge applications.

An FPGA sits in a different part of the design space. You’re not executing instructions; you’re building a custom dataflow machine for your specific network:

  • Parallel evaluation: multiple neurons and MAC operations execute simultaneously in dedicated hardware.
  • Pipelining: each layer operates as a streaming pipeline stage, which means layer 2 starts processing while layer 1 handles the next input.
  • Deterministic latency: no cache misses, no OS scheduling jitter. Inference takes the same number of clock cycles every time.
  • Power efficiency: for fixed-function inference workloads, FPGAs typically consume a fraction of what a GPU draws.

The catch has always been the development effort. Writing RTL for a neural network by hand is tedious and error-prone. That’s what hls4ml solves.


What hls4ml Does

hls4ml is an open-source Python library (developed at CERN and collaborating institutions) that converts trained neural network models into HLS C++ code. The generated code can be synthesized by Xilinx Vitis HLS (or Intel HLS) into an RTL IP block.


The Example Project: Weather Classification at the Edge

The Dataset

We’ll use a public Kaggle weather dataset covering weather history from Leeds, England (2005–2016). Features include temperature, apparent temperature, humidity, wind speed, visibility, cloud cover, and precipitation type.

Why this dataset works well for an FPGA demo:

  • Moderate feature count, which suits compact multi-layer perceptrons (MLPs) that synthesize cleanly.
  • Clear classification targets with physical meaning.
  • Small enough that you can see the entire pipeline without getting lost in data engineering.

Feature Selection: Less is More on an FPGA

You can throw all features into a model, but on hardware you pay for every input:

  • More inputs → more weights → more multipliers/adders → more DSP/LUT usage.
  • Bigger network → higher latency, or lower clock frequency, or lower throughput.

A practical approach is correlation-based feature selection: compute the correlation (or mutual information) between each feature and the target label, then keep only the most predictive features. For this demo, we’ll start with just Humidity and Temperature (°C), the two features that most strongly predict our target classes while keeping the hardware footprint minimal. You can expand to wind speed, cloud cover, and others once the pipeline is working.
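
To make this concrete, here’s a minimal sketch of correlation-based selection with pandas. The tiny inline DataFrame is just a stand-in for the Kaggle CSV (the real column names and values may differ):

```python
import pandas as pd

# Toy stand-in for the Kaggle weather CSV (real column names may differ).
df = pd.DataFrame({
    "Temperature (C)": [25.1, 3.2, 18.4, -1.0, 22.7, 7.5],
    "Humidity":        [0.30, 0.95, 0.55, 0.90, 0.40, 0.85],
    "Wind Speed":      [10.2, 3.1, 14.8, 2.0, 9.9, 5.4],
    "label":           [0, 1, 0, 1, 0, 1],  # e.g. 0 = clear, 1 = rain
})

# Rank features by absolute correlation with the target label.
corr = df.corr(numeric_only=True)["label"].drop("label").abs()
ranked = corr.sort_values(ascending=False)
print(ranked)

# Keep only the top-k most predictive features.
selected = list(ranked.head(2).index)
```

The same pattern scales to the full dataset: rank once, then trim until the hardware budget is met.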


Step 1: Environment Setup

I used a local Anaconda environment on my machine for development. For more intensive training and a faster start, Google Colab is a great option since it provides GPU access. A local environment, however, lets me keep both hls4ml and Vitis HLS in one place. You need Vitis HLS or the Vitis Unified IDE installed on your machine to run synthesis.

conda create -n edge-hls4ml python=3.10 -y
conda activate edge-hls4ml
pip install tensorflow hls4ml scikit-learn pandas numpy matplotlib

Note: hls4ml generates HLS C++ code. You don’t need Vitis HLS installed to generate the project, only to synthesize it. This means you can train and convert on Colab, then move the generated project to your synthesis workstation.


Step 2: Train a Compact Neural Network

The full training script is at python/train.py. Here’s what matters and why.

Architecture: 2 → 16 → 8 → 3. This gives us roughly 2×16 + 16×8 + 8×3 = 184 weights (plus biases). Fully parallelized, that’s 184 multipliers, well within the DSP budget of even small FPGAs.

Every architecture choice is driven by what synthesizes well:

  • ReLU activations: trivial in hardware. A ReLU is just a comparator and a multiplexer, not a lookup table. Sigmoid and tanh require approximation circuits that cost resources and add latency.
  • StandardScaler normalization: Bounded, normalized inputs map cleanly to fixed-point ap_fixed representations. Without normalization, you’d need wider bit-widths to handle the dynamic range of raw sensor values, wasting resources.
  • BatchNormalization: hls4ml folds BN into the preceding Dense layer’s weights during conversion. No extra hardware cost, but it improves the model’s tolerance to quantization.
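
As a sketch, the architecture and choices above might look like the following in Keras. Layer names and exact ordering are illustrative; the actual script lives in python/train.py:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sketch of the compact MLP described above: 2 -> 16 -> 8 -> 3.
# BatchNormalization sits before each ReLU so hls4ml can fold it into the
# preceding Dense layer's weights during conversion.
model = keras.Sequential([
    keras.Input(shape=(2,), name="input_layer"),
    layers.Dense(16, name="dense_1"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(8, name="dense_2"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(3, activation="softmax", name="output_layer"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Train it on the StandardScaler-normalized features, then save it as weather_classifier.h5 for the conversion step.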

Step 3: Convert to HLS with hls4ml

This is where the ML world meets the FPGA world.

The key decisions happen in the configuration:

Precision: ap_fixed<16,6> means 16 total bits, 6 integer bits (including sign), 10 fractional bits. This is a good starting point, aggressive enough to save resources vs. float32, conservative enough to preserve accuracy for most small MLPs.
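
To build intuition for what ap_fixed<16,6> can hold, here’s a simplified software model of the format. It truncates like the AP_TRN default but saturates for readability, whereas real ap_fixed defaults to wrap-around (AP_WRAP) on overflow:

```python
import math

def quantize_ap_fixed(x, total_bits=16, int_bits=6):
    """Simplified software model of ap_fixed<total_bits, int_bits>.

    Truncates like the AP_TRN default, but saturates on overflow for
    clarity; the real ap_fixed default overflow mode is wrap (AP_WRAP).
    """
    frac_bits = total_bits - int_bits            # 10 fractional bits here
    scale = 1 << frac_bits                       # 1 LSB = 2**-frac_bits
    lo = -(1 << (int_bits - 1))                  # -32 for 6 integer bits
    hi = (1 << (int_bits - 1)) - 1.0 / scale     # +31.9990234375
    q = math.floor(x * scale) / scale            # truncate toward -inf
    return max(lo, min(hi, q))

print(quantize_ap_fixed(0.1234))   # resolution is 2**-10 ~= 0.00098
print(quantize_ap_fixed(100.0))    # saturates at ~31.999
```

The ±32 range is exactly why StandardScaler normalization matters: normalized inputs sit comfortably inside it, while raw sensor values would not.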

Reuse factor: 1 means full parallelism, every multiply gets its own hardware. Lowest latency, highest resource use.

I/O type: io_stream generates AXI-Stream-friendly interfaces with lower BRAM usage than io_parallel, which instead exposes the model’s inputs and outputs as flat arrays. For simplicity, this project uses io_parallel.

Backend: the Vitis backend is the right choice for this project. It generates a clean HLS project that you can C-simulate, synthesize, and co-simulate without any board-specific AXI wrapper overhead.


Step 4: Synthesize and Export the IP

Vitis does three things:

  1. Synthesizes the HLS C++ into RTL (Verilog/VHDL), scheduling operations across clock cycles and mapping them to DSPs, LUTs, and FFs.
  2. Reports resource utilization estimates and timing analysis.
  3. Exports the design as a packaged IP (.zip or .xci) that Vivado can import.

Understanding the Synthesis Report

For a 2→16→8→3 MLP with ap_fixed<16,6> and reuse factor 1, expect numbers in this ballpark on a Zynq-7020 (xc7z020):

Resource | Used (approx.) | Available (7020) | Utilization
---------|----------------|------------------|------------
DSP48    | 20–40          | 220              | ~10–18%
LUT      | 2,000–5,000    | 53,200           | ~4–9%
FF       | 1,500–4,000    | 106,400          | ~1–4%
BRAM     | 2–8            | 140              | ~1–6%
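
A quick back-of-envelope helps sanity-check these numbers. Each Dense layer contributes inputs × outputs multiplies, and the ReuseFactor time-shares multipliers. The helper below is a rough model only: the tools map many small fixed-point multiplies to LUTs instead of DSPs, which is why real DSP usage comes in well under the multiplier count:

```python
import math

def mac_count(layer_sizes):
    """Multiply-accumulate (weight) count for a fully connected MLP."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def multiplier_estimate(macs, reuse_factor=1):
    """Rough upper bound on multipliers instantiated for a ReuseFactor.
    Actual DSP usage is lower: narrow fixed-point multiplies often map
    to LUTs, and the tools share hardware where they can."""
    return math.ceil(macs / reuse_factor)

macs = mac_count([2, 16, 8, 3])
print(macs)                          # 184 weights/multiplies
print(multiplier_estimate(macs, 1))  # fully parallel: one multiplier each
print(multiplier_estimate(macs, 8))  # reuse 8: ~23 time-shared multipliers
```

Raising the reuse factor is the standard lever when utilization is too high: latency grows roughly linearly while multiplier count shrinks by the same factor.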

Step 5: Create the Vivado Project and Import the IP

Now we leave the HLS world and enter Vivado proper.

  1. Launch Vivado and create a new RTL project. Select the Basys 3 board (or manually choose the xc7a35tcpg236-1 Artix-7 device). Even though we’re only simulating in this post, targeting a specific device ensures the synthesis estimates are realistic.
  2. Add the hls4ml IP repository. Go to Settings → IP → Repository and click the + button. Navigate to the impl/ip/ directory from the HLS export. Vivado will detect the packaged IP and make it available in the IP catalog.
  3. Create a Block Design. In the Flow Navigator, click Create Block Design. Give it a name like weather_nn_sim.
  4. Add the hls4ml IP. In the block design, click Add IP (the + icon) and search for your module name (e.g., myproject). Add it to the canvas.
  5. Inspect the ports. With io_parallel, you’ll see explicit ports on the IP block:
  • ap_clk — clock input
  • ap_rst — reset input (active high)
  • ap_start — trigger to begin inference
  • ap_done — goes high when inference completes
  • ap_idle — high when the accelerator is idle
  • ap_ready — high when ready to accept new inputs
  • input_1_V — input array port (2 elements for humidity and temperature)
  • layer7_out_V — output array port (3 elements for class scores)

The exact port names depend on your Keras model’s layer names and the hls4ml configuration. Check the block design canvas or the generated VHDL/Verilog wrapper for the exact names.

  6. Validate the block design (F6), then create the HDL wrapper (right-click the block design source → Create HDL Wrapper → let Vivado manage it).


Step 6: Write the Testbench

The testbench is a standard SystemVerilog harness that treats the hls4ml IP as the DUT (Device Under Test), driving it with known inputs and capturing the outputs. This is the file at sim/tb_weather_nn.sv in the repo.

What the Testbench Does

The testbench follows the standard HLS IP handshake protocol:

  1. Reset the IP (assert ap_rst for several cycles).
  2. Apply normalized inputs to the input_1_V ports. The apply_input task handles normalization using the same scaler parameters from training — the same values you’d embed in firmware.
  3. Pulse ap_start high for one clock cycle to trigger inference.
  4. Wait for ap_done to go high, counting clock cycles to measure latency.
  5. Read the output scores from layer7_out_V and compute the argmax to get the predicted class.

The four test cases cover different weather conditions so you can verify that the model’s decisions make physical sense.

Important: hls4ml generates port names based on your Keras model’s layer names and the io_parallel flattening convention. Check the HDL wrapper that Vivado created for the exact signal names; they’ll be in the wrapper’s port list.
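
A software golden model that mirrors these handshake steps (normalize → forward pass → argmax) is handy for generating the testbench’s expected values. The scaler parameters and weights below are made-up placeholders; in practice you’d export the real ones from training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained parameters: in the real flow these
# come from the fitted StandardScaler and the Keras weights baked into the IP.
scaler_mean = np.array([0.65, 12.0])   # humidity, temperature (made up)
scaler_std  = np.array([0.20, 9.0])
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 3)),  np.zeros(3)

def golden_inference(humidity, temperature):
    """Software reference mirroring the testbench steps: normalize the raw
    inputs, run the MLP forward pass, and return (argmax class, scores)."""
    x = (np.array([humidity, temperature]) - scaler_mean) / scaler_std
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU, same as the hardware
    h = np.maximum(h @ W2 + b2, 0.0)
    scores = h @ W3 + b3
    return int(np.argmax(scores)), scores

cls, scores = golden_inference(0.90, 5.0)
```

Run the same four test vectors through this reference and through the RTL simulation; after accounting for ap_fixed quantization, the argmax decisions should agree.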


Step 7: Run the Vivado Simulation

  1. In the Flow Navigator, click Run Simulation → Run Behavioral Simulation.
  2. Vivado compiles the testbench and the hls4ml IP’s RTL, then opens the waveform viewer.

What to Look For in the Waveform

Add the key signals to the waveform viewer and you’ll see the inference pipeline in action:

  • ap_clk — your 100 MHz clock.
  • ap_start — the single-cycle pulse that triggers each inference.
  • ap_done — goes high when the result is ready. The gap between ap_start rising and ap_done rising is your inference latency in clock cycles.
  • input_layer_6[] — the fixed-point input values. Set the radix to signed decimal or fixed-point in the waveform viewer to see human-readable numbers.
  • layer9_out[] — the three class scores. Watch them transition from X (undefined) to valid values when ap_done asserts.

The Tcl console at the bottom of Vivado will show the $display output from the testbench — the class predictions, scores, and cycle counts for each test case.


What We’ve Built So Far

At this point, you have:

  • A trained Keras model (weather_classifier.h5) optimized for hardware deployment.
  • A synthesized RTL IP core exported from Vitis HLS, using io_parallel for clean, explicit I/O ports.
  • A Vivado block design with the IP integrated and ports exposed.
  • A SystemVerilog testbench that normalizes inputs, drives the HLS handshake protocol, measures latency, and reports classification results.
  • Verified simulation results showing correct inference across multiple test cases with deterministic, cycle-accurate timing.
  • A precision sweep showing the resource–accuracy trade-space for your specific model.
