Building Efficient Neural Networks for FPGAs: Part 2

08 Mar 2026 8 min read Victor Oloniyo

Running ML inference on an Artix-7 or Kintex FPGA without a hard processor, using a MicroBlaze soft CPU to drive an hls4ml accelerator. Part 2 of 2.

In Part 1, we trained a compact weather classifier, converted it to HLS using hls4ml, synthesized it, validated it in Xilinx Vivado, and explored its simulation behaviour. Now it needs to live inside a system that can feed it data and read results.

This post covers integrating the hls4ml IP core with a MicroBlaze soft processor on a pure FPGA, building the block design in Vivado, writing bare-metal firmware to drive inference, and handling the resource and performance trade-offs unique to this approach.

All code for this series is available on GitHub: oluseyivictor/edge-nn-fpga-hls4ml

Why MicroBlaze?

When working with an FPGA without a hard processor, you have a few options for orchestrating inference:

MicroBlaze — Xilinx’s mature 32-bit soft processor. Well-supported BSP, mature driver infrastructure, full Vitis/SDK toolchain integration. The pragmatic default.
RISC-V soft cores (VexRiscv, PicoRV32, NEORV32) — Open-source, no licensing concerns, growing ecosystem. More work to integrate but increasingly viable.
Pure RTL state machine — No CPU at all; a custom FSM feeds data to the accelerator and reads results. Maximum efficiency, minimum flexibility. Appropriate when the inference pipeline is fixed and never changes.

MicroBlaze hits the sweet spot for me in both prototyping and production scenarios: you get a C programming environment, standard peripheral drivers (UART, SPI, I2C, GPIO), and well-documented AXI integration, all without paying for a Zynq device.

The Cost of a Soft CPU

MicroBlaze is not free. It consumes fabric resources that would otherwise be available for your neural network or other logic:

MicroBlaze Config	LUT	FF	BRAM	DSP
Minimal (no cache, no FPU)	~1,000–1,500	~800–1,200	2–4	0–3
Typical (8KB I/D cache)	~2,500–3,500	~2,000–2,800	8–16	0–3
Full (caches, FPU, MMU)	~4,000–6,000	~3,500–5,000	16–32	3–6

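For our weather classifier (which uses ~2,000–5,000 LUTs and 20–40 DSPs), a minimal MicroBlaze adds roughly 20–75% to the LUT footprint, before counting the AXI interconnect and peripherals. On an Artix-7 35T (20,800 LUTs; the 33,280 figure often quoted for this part is logic cells), that’s still very manageable. On a smaller Spartan-7 such as the 7S6 or 7S15, you’d need to be more careful.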
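The budget check above is simple arithmetic, and it can be handy to keep it as a helper while you iterate on configurations. The sketch below is illustrative only: the function name is hypothetical, and the device LUT count should come from the datasheet of the part you actually target.

```c
#include <stdint.h>

/* Combined LUT use of the network and the soft CPU as a percentage of
 * the device total, rounded down. device_luts must be non-zero. */
unsigned lut_utilization_pct(uint32_t nn_luts, uint32_t cpu_luts,
                             uint32_t device_luts)
{
    uint64_t used = (uint64_t)nn_luts + cpu_luts;
    return (unsigned)((used * 100u) / device_luts);
}
```

For instance, 5,000 network LUTs plus 1,500 MicroBlaze LUTs on a hypothetical 20,000-LUT device comes to 32%.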

System Architecture: Bridging AXI and Parallel Ports

The hls4ml IP from Part 1 uses io_parallel: it exposes flat input/output ports and HLS handshake signals (ap_start, ap_done, etc.). These are raw wires, not AXI bus interfaces. MicroBlaze, on the other hand, communicates with peripherals exclusively through AXI.

The bridge is AXI GPIO. Each AXI GPIO block acts as a translator: MicroBlaze writes a 32-bit value to a GPIO register over AXI, and that value appears on the GPIO output pins as parallel wires which connect directly to the hls4ml IP’s ports. Reading works the same way in reverse.

We need three AXI GPIO blocks:

GPIO Block	Direction	Width	What it carries
GPIO_INPUT	MicroBlaze → IP	32 bits (1 channel)	`{temp_fp[15:0], hum_fp[15:0]}` — both inputs packed into one word
GPIO_OUTPUT	IP → MicroBlaze	2 channels × 32 bits	Ch1: `{score1[15:0], score0[15:0]}`, Ch2: `score2[15:0]`
GPIO_CTRL	Bidirectional	2 channels	Ch1 (out): `ap_start`, Ch2 (in): `{ap_ready, ap_idle, ap_done}`
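On the firmware side, the GPIO_INPUT layout in the table comes down to a fixed-point conversion and a shift-and-OR. The following is a minimal sketch, not the code from the repo. It assumes the model inputs are 16-bit signed fixed point with 10 fractional bits (ap_fixed<16,6>); check that against the precision you chose in Part 1. The names `to_fixed16` and `pack_inputs` are hypothetical.

```c
#include <stdint.h>

/* Assumption: 16-bit signed fixed point with 10 fractional bits
 * (ap_fixed<16,6>). Adjust to match the precision used in Part 1. */
#define FRAC_BITS 10

/* Convert a normalized float to fixed point, rounding to nearest and
 * saturating at the int16 range. */
int16_t to_fixed16(float x)
{
    float scaled = x * (float)(1 << FRAC_BITS);
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    if (scaled > 32767.0f)  return INT16_MAX;
    if (scaled < -32768.0f) return INT16_MIN;
    return (int16_t)scaled;
}

/* Build {temp_fp[15:0], hum_fp[15:0]}: temperature in the upper half,
 * humidity in the lower half, as in the GPIO_INPUT row of the table. */
uint32_t pack_inputs(int16_t temp_fp, int16_t hum_fp)
{
    return ((uint32_t)(uint16_t)temp_fp << 16) | (uint16_t)hum_fp;
}
```

Casting through `uint16_t` before shifting keeps a negative humidity value from sign-extending into the temperature half of the word.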

Why AXI GPIO and not AXI-Lite registers on the IP itself? Because io_parallel doesn’t generate AXI interfaces — it generates plain wires. You’d need to either re-run hls4ml with a different backend (adding AXI wrapper overhead), or use GPIO as a lightweight bridge. For a 2-input, 3-output model, GPIO is the simpler and more resource-efficient option.

Step 1: Create the Vivado Block Design

Open Vivado and create a new project targeting the Basys 3 board (xc7a35tcpg236-1).

Using the Same IP from Part 1

No need to re-run hls4ml. The io_parallel IP you exported in Part 1 is exactly what we need here. The same IP repository path works — just add it to this new Vivado project.

Add MicroBlaze

Create a new Block Design.
Click Add IP → search for MicroBlaze → add it.
Run Block Automation. Configure:

Local Memory: 32 KB (sufficient for our firmware).
Cache: Enable instruction cache (8 KB) for performance. Data cache is optional for this application.
Debug Module: Enable if you want JTAG debugging (recommended during development).
Peripheral AXI port: Enable — this is how MicroBlaze talks to GPIO and other peripherals.

Block Automation will add: clock wizard, processor system reset, local memory controller, and AXI interconnect.

Add the hls4ml IP

Go to Settings → IP → Repository and add the path to your hls4ml project’s impl/ip/ directory (same path from Part 1). Vivado will detect the packaged IP.
Add IP → search for your hls4ml module name (e.g., myproject) → add it to the block design.
You’ll see the familiar parallel ports: ap_clk, ap_rst, ap_start, ap_done, ap_idle, ap_ready, input_1_V_0, input_1_V_1, layer7_out_0, layer7_out_1, layer7_out_2.

Add AXI GPIO Blocks

Now add three AXI GPIO blocks to bridge MicroBlaze to the IP’s parallel ports.

GPIO_INPUT — feeding data to the IP:

Add IP → search for AXI GPIO → add it. Double-click to configure:

GPIO: Enable. Width = 32 bits. Check All Outputs (MicroBlaze writes, IP reads).
GPIO2: Disable (we only need one channel — both 16-bit inputs pack into 32 bits).

Rename it to gpio_input for clarity (right-click → Rename).

GPIO_OUTPUT — reading results from the IP:

Add IP → AXI GPIO again. Configure:
- GPIO: Enable. Width = 32 bits. Check All Inputs (IP writes, MicroBlaze reads). This carries {score1, score0}.
- GPIO2: Enable. Width = 16 bits. Check All Inputs. This carries score2.
Rename to gpio_output.

GPIO_CTRL — handshake signals:

Add IP → AXI GPIO again. Configure:
- GPIO: Enable. Width = 1 bit. Check All Outputs. This drives ap_start.
- GPIO2: Enable. Width = 3 bits. Check All Inputs. This reads {ap_ready, ap_idle, ap_done}.
Rename to gpio_ctrl.

Wire the Connections

This is the part that requires manual work in the block design — Connection Automation won’t know how to wire GPIO pins to hls4ml ports.

Clock and reset for the hls4ml IP:

Connect ap_clk on the hls4ml IP to the same clock driving MicroBlaze (typically clk_wiz_0/clk_out1).
Connect ap_rst to the processor system reset’s peripheral_reset output (active-high reset).

Data inputs — GPIO_INPUT to hls4ml:

The gpio_input block’s GPIO output port is a 32-bit bus. You need to slice it into two 16-bit signals:
- Add a Slice IP (xlslice). Configure: Din Width = 32, Din From = 15, Din Down To = 0. This extracts bits [15:0] → connect to input_1_V_0 (humidity).
- Add another Slice IP. Configure: Din Width = 32, Din From = 31, Din Down To = 16. This extracts bits [31:16] → connect to input_1_V_1 (temperature).
- Connect both slices’ Din to gpio_input’s GPIO output.

Data outputs — hls4ml to GPIO_OUTPUT:

Use a Concat IP (xlconcat) to pack two output scores into 32 bits:
- Add a Concat IP. Configure: Number of Ports = 2, In0 Width = 16, In1 Width = 16.
- Connect layer7_out_0 → In0, layer7_out_1 → In1.
- Connect Dout → gpio_output’s GPIO channel 1 input.
Connect layer7_out_2 directly to gpio_output’s GPIO2 channel input.
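With this wiring, firmware sees score0 in the low half of channel 1, score1 in the high half, and score2 in the low 16 bits of channel 2. A minimal sketch of the matching unpacking step follows. It assumes the scores are signed 16-bit values and that the predicted class is simply the largest score; both are assumptions for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

/* Split the two gpio_output channel words into three scores.
 * ch1 = {score1[15:0], score0[15:0]}, ch2 = score2[15:0].
 * Assumption: scores are signed 16-bit values (two's complement). */
void unpack_scores(uint32_t ch1, uint32_t ch2, int16_t scores[3])
{
    scores[0] = (int16_t)(uint16_t)(ch1 & 0xFFFFu);
    scores[1] = (int16_t)(uint16_t)(ch1 >> 16);
    scores[2] = (int16_t)(uint16_t)(ch2 & 0xFFFFu);
}

/* Index of the largest score; ties resolve to the lowest index. */
int argmax3(const int16_t scores[3])
{
    int best = 0;
    for (int i = 1; i < 3; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}
```

Masking channel 2 matters because only 16 of its bits are driven by the IP.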

Control signals — GPIO_CTRL to/from hls4ml:

Connect gpio_ctrl’s GPIO channel 1 output → ap_start on the hls4ml IP.
Use a Concat IP (3 ports, 1 bit each) to pack ap_done, ap_idle, ap_ready into 3 bits:
- ap_done → In0 (bit 0), ap_idle → In1 (bit 1), ap_ready → In2 (bit 2).
- Connect Dout → gpio_ctrl’s GPIO2 channel input.
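The Concat ordering above fixes the bit positions firmware has to test when it reads gpio_ctrl channel 2. As a small sketch (the mask and type names are hypothetical, chosen for this example), the status word can be decoded like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the Concat wiring: In0 = ap_done, In1 = ap_idle,
 * In2 = ap_ready. */
#define AP_DONE_MASK  (1u << 0)
#define AP_IDLE_MASK  (1u << 1)
#define AP_READY_MASK (1u << 2)

typedef struct {
    bool done;
    bool idle;
    bool ready;
} ap_status_t;

/* Decode the 3-bit word read from gpio_ctrl channel 2. */
ap_status_t decode_status(uint32_t gpio2_word)
{
    ap_status_t s = {
        .done  = (gpio2_word & AP_DONE_MASK)  != 0,
        .idle  = (gpio2_word & AP_IDLE_MASK)  != 0,
        .ready = (gpio2_word & AP_READY_MASK) != 0,
    };
    return s;
}
```

If you change the Concat port order in the block design, these masks have to change with it.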

Add UART and Run Connection Automation

Add AXI UART Lite for serial output (debugging, result reporting).
Run Connection Automation — this wires all three GPIO blocks and UART to the AXI interconnect, and connects clocks and resets.
Verify the Address Editor — each GPIO block and UART should have unique, non-overlapping base addresses.

Validate and Generate

Click Validate Design (F6) — fix any critical warnings. Common issues at this stage: unconnected ports, width mismatches on slice/concat blocks.
Create the HDL wrapper: right-click the block design in the Sources panel → Create HDL Wrapper.
Add the Basys 3 constraints file (.xdc) for pin assignments (UART TX/RX on the USB-UART bridge, clock, reset button).
Generate Bitstream.

Step 2: Export Hardware and Create the Firmware Project

After bitstream generation:

File → Export → Export Hardware — include the bitstream.
Tools → Launch Vitis
Create a new Component targeting the exported hardware platform.
Choose the Empty Application template.

Step 3: Write the Firmware

The firmware drives the entire inference pipeline through GPIO: it reads sensor data from the UART and normalizes it, packs it into a 32-bit word, writes it to GPIO_INPUT, pulses ap_start via GPIO_CTRL, polls ap_done, reads the output scores from GPIO_OUTPUT, and prints them over the UART.

The firmware source is at firmware/weather_nn.c in the repo.
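To make the sequence concrete, here is a minimal sketch of one inference call. It is not a copy of the repo's firmware. It accesses the AXI GPIO data registers directly (channel 1 data at byte offset 0x0, channel 2 data at 0x8) through pointers the caller supplies; on hardware those would be the base addresses from the generated xparameters.h. The function name, the poll limit, and the assumption that scores are signed 16-bit values are all hypothetical. Because ap_done may be asserted only briefly under the HLS handshake, the sketch also accepts ap_idle as a completion signal; that is a choice made here, not something the original firmware is known to do.

```c
#include <stdint.h>

/* AXI GPIO data registers as 32-bit word indices from the base address
 * (byte offsets 0x0 and 0x8). */
#define GPIO_DATA   0u
#define GPIO2_DATA  2u

#define AP_DONE_MASK (1u << 0)
#define AP_IDLE_MASK (1u << 1)

/* Run one inference. Returns 0 on success, -1 if the accelerator did not
 * report completion within max_polls status reads. */
int run_inference(volatile uint32_t *gpio_input,
                  volatile uint32_t *gpio_output,
                  volatile uint32_t *gpio_ctrl,
                  uint32_t packed_inputs,
                  uint32_t max_polls,
                  int16_t scores[3])
{
    gpio_input[GPIO_DATA] = packed_inputs;  /* present both inputs */
    gpio_ctrl[GPIO_DATA]  = 1u;             /* assert ap_start */
    gpio_ctrl[GPIO_DATA]  = 0u;             /* release ap_start */

    for (uint32_t i = 0; i < max_polls; i++) {
        uint32_t status = gpio_ctrl[GPIO2_DATA];
        if (status & (AP_DONE_MASK | AP_IDLE_MASK)) {
            uint32_t ch1 = gpio_output[GPIO_DATA];
            uint32_t ch2 = gpio_output[GPIO2_DATA];
            scores[0] = (int16_t)(uint16_t)(ch1 & 0xFFFFu);
            scores[1] = (int16_t)(uint16_t)(ch1 >> 16);
            scores[2] = (int16_t)(uint16_t)(ch2 & 0xFFFFu);
            return 0;
        }
    }
    return -1;
}
```

Taking the register pointers as parameters keeps the routine testable on a host machine with plain arrays standing in for the GPIO blocks, and the poll limit means a miswired handshake shows up as an error code rather than a hang.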

Connecting Real Sensors

For a complete edge inference pipeline, connect a sensor to MicroBlaze via SPI or I2C through the Basys 3’s PMOD headers.

Example: BME280 (temperature + humidity + pressure) over SPI

Add AXI Quad SPI to your block design.
Configure for standard SPI mode. In the constraints file, map the SPI signals to a PMOD header.
In firmware, use the Xilinx SPI driver to read the BME280.
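As a rough illustration of the firmware side, the sketch below builds the SPI frame for a burst read of the BME280 data registers (0xF7 through 0xFE) and parses the bytes that come back. The SPI transfer itself is left out: on hardware you would pass `tx` and `rx` to the driver's transfer call (for the Xilinx driver, `XSpi_Transfer`) with chip select asserted for the whole frame. The function and type names are hypothetical. The values produced are raw ADC readings; they still need the sensor's calibration-based compensation and your model's normalization before they can be fed to the network, and the sensor must already be configured to take measurements.

```c
#include <stdint.h>
#include <string.h>

#define BME280_REG_DATA   0xF7u  /* press_msb; burst runs through hum_lsb */
#define BME280_SPI_READ   0x80u  /* bit 7 set selects a read in SPI mode */
#define BME280_FRAME_LEN  9u     /* 1 address byte + 8 data bytes */

typedef struct {
    uint32_t raw_press;  /* 20-bit raw pressure */
    uint32_t raw_temp;   /* 20-bit raw temperature */
    uint16_t raw_hum;    /* 16-bit raw humidity */
} bme280_raw_t;

/* Fill the transmit buffer: read command first, then dummy bytes that
 * clock the eight data bytes out of the sensor. */
void bme280_fill_burst_read(uint8_t tx[BME280_FRAME_LEN])
{
    memset(tx, 0, BME280_FRAME_LEN);
    tx[0] = (uint8_t)(BME280_REG_DATA | BME280_SPI_READ);
}

/* Parse the receive buffer. rx[0] is clocked in while the address byte
 * goes out and carries no data; measurements start at rx[1]. */
void bme280_parse_burst(const uint8_t rx[BME280_FRAME_LEN], bme280_raw_t *out)
{
    out->raw_press = ((uint32_t)rx[1] << 12) | ((uint32_t)rx[2] << 4)
                   | ((uint32_t)rx[3] >> 4);
    out->raw_temp  = ((uint32_t)rx[4] << 12) | ((uint32_t)rx[5] << 4)
                   | ((uint32_t)rx[6] >> 4);
    out->raw_hum   = (uint16_t)(((uint32_t)rx[7] << 8) | rx[8]);
}
```

Splitting frame construction and parsing from the transfer keeps the sensor-specific logic independent of whichever SPI driver you end up using.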

What We’ve Built

We now have a complete, self-contained inference system on a Basys 3:

MicroBlaze soft processor running bare-metal firmware.
hls4ml neural network accelerator with io_parallel ports.
AXI GPIO bridge connecting the AXI world to the parallel port world.
Sensor interface (SPI/I2C via PMOD) for real-world data input.
UART output for results and debugging.

No hard processor, no external CPU, no cloud: just fabric and firmware. The AXI GPIO approach keeps the integration simple and avoids the complexity of DMA or AXI-Stream for a model this size.
