Building Efficient Neural Networks for FPGAs: Part 1
A practical guide for embedded engineers stepping into ML acceleration.
Edge inference is having a moment for one simple reason: latency. If your model needs to react at the sensing level, sending data to the cloud or even to a remote GPU is often too slow, too power-hungry, or too unreliable.
This post walks through an end-to-end path: convert a trained neural network into synthesizable HLS code using hls4ml, simulate it, validate accuracy under fixed-point quantization, and explore the precision–resource trade-off on an FPGA.
All code for this post is available on GitHub: oluseyivictor/edge-nn-fpga-hls4ml
Why FPGA for ML Inference?
If you’ve spent time deploying models on microcontrollers (TFLite Micro, Edge Impulse), you’ve hit the ceiling: limited parallelism, no pipelining, and models that either fit in flash or don’t. GPUs solve the compute problem but bring power budgets and latency profiles that don’t suit many edge applications.
An FPGA sits in a different part of the design space. You’re not executing instructions, you’re building a custom dataflow machine for your specific network:
- Parallel evaluation: multiple neurons and MAC operations execute simultaneously in dedicated hardware.
- Pipelining: each layer operates as a streaming pipeline stage, which means layer 2 starts processing while layer 1 handles the next input.
- Deterministic latency: no cache misses, no OS scheduling jitter. Inference takes the same number of clock cycles every time.
- Power efficiency: for fixed-function inference workloads, FPGAs typically consume a fraction of what a GPU draws.

The catch has always been the development effort. Writing RTL for a neural network by hand is tedious and error-prone. That’s what hls4ml solves.
What hls4ml Does
hls4ml is an open-source Python library (developed at CERN and collaborating institutions) that converts trained neural network models into HLS C++ code. The generated code can be synthesized by Xilinx Vitis HLS (or Intel HLS) into an RTL IP block.

The Example Project: Weather Classification at the Edge
The Dataset
We’ll use a public Kaggle weather dataset covering weather history from Leeds, England (2005–2016). Features include temperature, apparent temperature, humidity, wind speed, visibility, cloud cover, and precipitation type.
Why this dataset works well for an FPGA demo:
- Moderate feature count, which makes it a good fit for a compact multi-layer perceptron (MLP) that synthesizes cleanly.
- Clear classification targets with physical meaning.
- Small enough that you can see the entire pipeline without getting lost in data engineering.
Feature Selection: Less is More on an FPGA
You can throw all features into a model, but on hardware you pay for every input:
- More inputs → more weights → more multipliers/adders → more DSP/LUT usage.
- Bigger network → higher latency, or lower clock frequency, or lower throughput.
A practical approach is correlation-based feature selection: compute the correlation (or mutual information) between each feature and the target label, then keep only the most predictive features. For this demo, we’ll start with just Humidity and Temperature (°C), the two features that strongly predict our two classes and keep the hardware footprint minimal. You can expand to wind speed, cloud cover, and others once the pipeline is working.
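As a minimal sketch of that selection step (the column names here are placeholders, not necessarily the Kaggle dataset's exact headers):

```python
# Correlation-based feature ranking: keep the k features with the strongest
# absolute Pearson correlation to the label. Column names are illustrative.
import pandas as pd

def top_k_features(df: pd.DataFrame, target: str, k: int = 2) -> list[str]:
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()
```

Swapping `df.corr()` for `sklearn.feature_selection.mutual_info_classif` gives the mutual-information variant of the same idea.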
Step 1: Environment Setup
I used a local Anaconda environment on my machine for development. For more intensive training and a faster start, Google Colab is a great option because it provides GPU access for training. A local environment, however, lets me keep both hls4ml and Vitis HLS in one place. You need Vitis HLS or the Vitis Unified IDE installed on your machine to synthesize.
```shell
conda create -n edge-hls4ml python=3.10 -y
conda activate edge-hls4ml
pip install tensorflow hls4ml scikit-learn pandas numpy matplotlib
```
Note: hls4ml generates HLS C++ code, so you don't need Vitis HLS installed to generate the project, only to synthesize it. This means you can train and convert on Colab, then move the generated project to your synthesis workstation.
Step 2: Train a Compact Neural Network
The full training script is at python/train.py. Here’s what matters and why.
Architecture: 2 → 16 → 8 → 3. This gives us roughly 2×16 + 16×8 + 8×3 = 184 weights (plus biases). Fully parallelized, that's one multiplier per weight, well within the DSP budget of even small FPGAs.
Every architecture choice is driven by what synthesizes well:
- ReLU activations: trivial in hardware. It's a comparator and a multiplexer, not a lookup table. Sigmoid and tanh require approximation circuits that cost resources and add latency.
- StandardScaler normalization: bounded, normalized inputs map cleanly to fixed-point `ap_fixed` representations. Without normalization, you'd need wider bit-widths to handle the dynamic range of raw sensor values, wasting resources.
- BatchNormalization: hls4ml folds BN into the preceding Dense layer's weights during conversion. No extra hardware cost, but it improves the model's tolerance to quantization.
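Putting those choices together, the model can be sketched in Keras like this (layer names and training hyperparameters are illustrative, not the exact code in python/train.py):

```python
# Sketch of a compact, synthesis-friendly MLP: Dense -> BN -> ReLU stacks.
# Sizes follow the 2 -> 16 -> 8 -> 3 layout; details are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_features: int = 2, n_classes: int = 3) -> tf.keras.Model:
    model = tf.keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(16),
        layers.BatchNormalization(),   # folded into the Dense weights by hls4ml
        layers.Activation("relu"),     # cheap in hardware: comparator + mux
        layers.Dense(8),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Feed it StandardScaler-transformed features, and save the scaler's mean/scale alongside the model; the testbench needs them later.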
Step 3: Convert to HLS with hls4ml
This is where the ML world meets the FPGA world.
The key decisions happen in the configuration:
Precision: ap_fixed<16,6> means 16 total bits, 6 integer bits (including sign), 10 fractional bits. This is a good starting point: aggressive enough to save resources versus float32, yet conservative enough to preserve accuracy for most small MLPs.
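To build intuition, here is a rough Python model of what ap_fixed<16,6> does to a value. Note this sketch uses round-to-nearest with saturation for clarity; Vitis `ap_fixed` actually defaults to truncation (AP_TRN) and wrap-around (AP_WRAP) unless you select AP_RND/AP_SAT:

```python
# Rough model of ap_fixed<W, I>: W total bits, I integer bits (incl. sign),
# W - I fractional bits. Uses round + saturate; Vitis defaults differ (see above).
def to_fixed(x: float, total_bits: int = 16, int_bits: int = 6) -> float:
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                 # resolution: 2^-10 here
    lo = -(2.0 ** (int_bits - 1))            # -32 for 6 integer bits
    hi = (2.0 ** (int_bits - 1)) - step      # just under +32
    return min(max(round(x / step) * step, lo), hi)
```

Running your validation set through a function like this (or better, through hls4ml's bit-accurate C model) shows how much accuracy quantization costs before you ever synthesize.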
Reuse factor: 1 means full parallelism, every multiply gets its own hardware. Lowest latency, highest resource use.
I/O type: io_stream generates AXI-Stream-friendly interfaces with lower BRAM usage than io_parallel; io_parallel exposes the model's inputs and outputs as flat arrays. For simplicity, this project uses io_parallel.
Backend: Vitis backend is the right choice for the project. It generates a clean HLS project that you can C-simulate, synthesize, and co-simulate without any board-specific AXI wrapper overhead.
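Putting those decisions together, the conversion looks roughly like this (the output directory and FPGA part number are placeholders; adjust them for your setup):

```python
# Sketch of the hls4ml conversion with the choices discussed above.
# Paths and the part number are placeholders, not the repo's exact values.
import hls4ml
from tensorflow.keras.models import load_model

model = load_model("weather_classifier.h5")

config = hls4ml.utils.config_from_keras_model(model, granularity="model")
config["Model"]["Precision"] = "ap_fixed<16,6>"  # 6 integer + 10 fractional bits
config["Model"]["ReuseFactor"] = 1               # full parallelism

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend="Vitis",
    io_type="io_parallel",
    output_dir="hls_weather",
    part="xc7z020clg400-1",   # placeholder part; match your board
)
hls_model.compile()  # bit-accurate C model; lets you check accuracy in Python
```

After `compile()`, `hls_model.predict(X)` runs the fixed-point C model, so you can compare its accuracy against the float Keras model before touching the FPGA tools.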
Step 4: Synthesize and Export the IP
Vitis does three things:
- Synthesizes the HLS C++ into RTL (Verilog/VHDL), scheduling operations across clock cycles and mapping them to DSPs, LUTs, and FFs.
- Reports resource utilization estimates and timing analysis.
- Exports the design as a packaged IP (`.zip` or `.xci`) that Vivado can import.
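With the hls4ml Python API, those steps collapse into one call (assuming the `hls_model` object from the conversion step and `vitis_hls` on your PATH; this can take several minutes even for a tiny MLP):

```python
# Run C simulation, synthesis, co-simulation, and IP export in one call.
# Requires vitis_hls on PATH; `hls_model` is the converted model from Step 3.
hls_model.build(csim=True, synth=True, cosim=True, export=True)

# Parse and print the latency/utilization numbers from the generated reports.
hls4ml.report.read_vivado_report("hls_weather")
```

Alternatively, open the generated project directory and run the produced Tcl build script directly from the Vitis HLS GUI or command line.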
Understanding the Synthesis Report
For a 2→16→8→3 MLP with ap_fixed<16,6> and reuse factor 1, expect numbers in this ballpark on a Zynq-7020 (xc7z020):
| Resource | Used (approx.) | Available (7020) | Utilization |
|---|---|---|---|
| DSP48 | 20–40 | 220 | ~10–18% |
| LUT | 2,000–5,000 | 53,200 | ~4–9% |
| FF | 1,500–4,000 | 106,400 | ~1–4% |
| BRAM | 2–8 | 140 | ~1–6% |
Step 5: Create the Vivado Project and Import the IP
Now we leave the HLS world and enter Vivado proper.
1. Launch Vivado and create a new RTL project. Select the Basys 3 board (or manually choose the `xc7a35tcpg236-1` Artix-7 device). Even though we're only simulating in this post, targeting a specific device ensures the synthesis estimates are realistic.
2. Add the hls4ml IP repository. Go to Settings → IP → Repository and click the `+` button. Navigate to the `impl/ip/` directory from the HLS export. Vivado will detect the packaged IP and make it available in the IP catalog.
3. Create a Block Design. In the Flow Navigator, click Create Block Design. Give it a name like `weather_nn_sim`.
4. Add the hls4ml IP. In the block design, click Add IP (the `+` icon) and search for your module name (e.g., `myproject`). Add it to the canvas.
5. Inspect the ports. With `io_parallel`, you'll see explicit ports on the IP block:
   - `ap_clk` — clock input
   - `ap_rst` — reset input (active high)
   - `ap_start` — trigger to begin inference
   - `ap_done` — goes high when inference completes
   - `ap_idle` — high when the accelerator is idle
   - `ap_ready` — high when ready to accept new inputs
   - `input_1_V` — input array port (2 elements for humidity and temperature)
   - `layer7_out_V` — output array port (3 elements for class scores)

   The exact port names depend on your Keras model's layer names and the hls4ml configuration. Check the block design canvas or the generated VHDL/Verilog wrapper for the exact names.
6. Validate the block design (F6) and Create HDL Wrapper (right-click the block design source → Create HDL Wrapper → let Vivado manage it).

Step 6: Write the Testbench
The testbench is a standard SystemVerilog module that treats the hls4ml IP as the DUT (Device Under Test): it drives the IP with known inputs and captures the outputs. This is the file at sim/tb_weather_nn.sv in the repo.
What the Testbench Does
The testbench follows the standard HLS IP handshake protocol:
- Reset the IP (assert `ap_rst` for several cycles).
- Apply normalized inputs to the `input_1_V` ports. The `apply_input` task handles normalization using the same scaler parameters from training — the same values you'd embed in firmware.
- Pulse `ap_start` high for one clock cycle to trigger inference.
- Wait for `ap_done` to go high, counting clock cycles to measure latency.
- Read the output scores from `layer7_out_V` and compute the argmax to get the predicted class.
The four test cases cover different weather conditions so you can verify the model’s decisions make physical sense.
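A handy cross-check is a small Python reference for the normalization and argmax the testbench performs. The mean/scale values below are placeholders; substitute the StandardScaler parameters saved during training:

```python
# Python reference for the testbench's apply_input normalization and the
# output argmax. MEAN/SCALE are placeholders, not the trained scaler's values.
MEAN = [0.73, 11.9]    # placeholder: [humidity, temperature] feature means
SCALE = [0.20, 9.5]    # placeholder: [humidity, temperature] std-devs

def normalize(raw):
    """StandardScaler transform: (x - mean) / scale, per feature."""
    return [(x - m) / s for x, m, s in zip(raw, MEAN, SCALE)]

def argmax(scores):
    """Index of the largest class score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

If the RTL simulation's predicted class disagrees with this reference on the same raw inputs, suspect the normalization constants or the fixed-point conversion first.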
Important: hls4ml generates port names based on your Keras model's layer names and the `io_parallel` flattening convention. Check the HDL wrapper that Vivado created for the exact signal names; they'll be in the wrapper's port list.
Step 7: Run the Vivado Simulation
- In the Flow Navigator, click Run Simulation → Run Behavioral Simulation.
- Vivado compiles the testbench and the hls4ml IP’s RTL, then opens the waveform viewer.
What to Look For in the Waveform
Add the key signals to the waveform viewer and you’ll see the inference pipeline in action:
- ap_clk — your 100 MHz clock.
- ap_start — the single-cycle pulse that triggers each inference.
- ap_done — goes high when the result is ready. The gap between ap_start rising and ap_done rising is your inference latency in clock cycles.
- input_layer_6[] — the fixed-point input values. Set the radix to signed decimal or fixed-point in the waveform viewer to see human-readable numbers.
- layer9_out[] — the three class scores. Watch them transition from X (undefined) to valid values when ap_done asserts.
The Tcl console at the bottom of Vivado will show the $display output from the testbench — the class predictions, scores, and cycle counts for each test case.

What We’ve Built So Far
At this point, you have:
- A trained Keras model (weather_classifier.h5) optimized for hardware deployment.
- A synthesized RTL IP core exported from Vitis HLS, using io_parallel for clean, explicit I/O ports.
- A Vivado block design with the IP integrated and ports exposed.
- A SystemVerilog testbench that normalizes inputs, drives the HLS handshake protocol, measures latency, and reports classification results.
- Verified simulation results showing correct inference across multiple test cases with deterministic, cycle-accurate timing.
- A precision sweep showing the resource–accuracy trade-space for your specific model.