‘Computing on Programmable Logic’ at Microsoft Research Faculty Summit 2016

Yesterday I had the privilege of speaking on Computing on Programmable Logic (slides) in the ‘Computing with Exotic Technologies and Platforms’ session at the Microsoft Research Faculty Summit 2016.

Abstract: “We have seen the birth of many exotic architectures in recent years, from a quantum computer that promises to achieve exponential speed-ups over conventional computers, to DNA computation that performs disease diagnostics and therapy, to Field Programmable Gate Arrays (FPGAs) that provide a flexible toolkit for implementing architectures such as Microsoft’s Catapult fabric for large-scale datacenters. Each of these exotic technologies enable novel solutions to challenging problems and require equally novel methods to program and design them. We will highlight the advances in their applications and the challenges behind developing their toolchains and programming environments.”

GRVI Phalanx Update

An update on the work-in-progress GRVI Phalanx.


An extended abstract and brief talk on GRVI Phalanx was presented at the 2nd International Workshop on Overlay Architectures (OLAF-2) at FPGA 2016.

GRVI Phalanx was discussed in the short talk Software-First, Software Mostly: Fast Starting with Parallel Programming for Processor Array Overlays at the Arduino-like Fast-Start for FPGAs pre-conference workshop at FCCM 2016. [Slides]

The first refereed paper on GRVI Phalanx was presented yesterday at the 24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2016): GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator and received the FCCM 2016 Best Short Paper Award. [PDF]

Hardware Changes: Version 0.2

Here are some of the changes made to the GRVI Phalanx design since it was first described at the 3rd RISC-V Workshop. This is now version 0.2.


  • LB/LBU/LH/LHU/SB/SH: Load/store byte and halfword alignment functionality is now configured OFF in the GRVI PEs. The LdMux and StMux units have been factored out of GRVI and into the GRVI cluster, each set of muxes shared by a pair of cores.
  • MUL/MULH/MULHU/MULHSU: The multiply instructions from the RISC-V “M” extension are now enabled by default and are implemented in the GRVI cluster. Each pair of processors shares one DSP-based multiplier. This consumes 200 DSP48s in the 400 PE GRVI Phalanx for Kintex UltraScale 040, leaving 1720 DSP48s for use by accelerators.
  • SL*/SR*: By default, fast left and right shift instructions are also implemented in these DSP-based multipliers.
  • LR/SC: These atomic instructions from the RISC-V “A” extension are now enabled by default. Part of the implementation is in the GRVI core and part in the GRVI cluster memory arbiters. The implementation considerations were discussed on the RISC-V mailing lists here.


  • A Phalanx system may be configured to replace the cluster at (NX-1,NY-1) with a character mode VGA cluster with a 32 KB text frame buffer.
  • Hoplite multicast message routing is now enabled by default. An agent can sent a message to every cluster on a given row, given column, or to every cluster on the NOC. If desired, all IRAMs in all clusters in a Phalanx may be updated with a single burst of 1024 XY-multicast messages.

Introducing GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator

GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand technology mapped and floorplanned for best performance/area as a processing element (PE) in a parallel processor. GRVI implements a 2 or 3 stage single issue pipeline, typically consumes 320 6-LUTS in a Xilinx UltraScale FPGA, and currently runs at 300-375 MHz in a Kintex UltraScale (-2) in a standalone configuration with most favorable placement of local BRAMs.

Phalanx is massively parallel FPGA accelerator framework, designed to reduce the effort and cost of developing and maintaining FPGA accelerators. A Phalanx is a composition of many clusters of soft processors and accelerator cores with extreme bandwidth memory and I/O interfaces on a Hoplite NOC.

GRVI Phalanx was introduced today at the 3rd RISC-V Workshop at Redwood Shores, CA.

A work-in-progress 5x10x8 = 400 processor configuration in a KU040 in a Xilinx KCU105 and a 2x2x8 = 32 processor configuration in a Xilinx Artix-7 35T in an Digilent Arty were demonstrated in the demo/poster session.

A 10x5x8 = 400 processor GRVI Phalanx

For more information please visit the GRVI Phalanx page.

Introducing Hoplite: An FPGA-Optimal Router for Extreme Bandwidth NOCs

Hoplite is a configurable, general purpose, FPGA-optimal 2D router and tools for implementation of efficient network on chip (NOC) interconnection of diverse processors, accelerators, other client cores, and extreme bandwidth (100+ Gb/s) interfaces.

The first paper on Hoplite is presented today at FPL 2015:

Hoplite: Building Austere Overlay NoCs for FPGAs, Nachiket Kapre, Jan Gray.
25th International Conference on Field-Programmable Logic and Applications, Sept. 2015. Received the Michael Servit Best Paper Award. [PDF]

For more information please visit the Hoplite page.

Stop Everything, We’re Doing 8-LUTs

[With apologies/thanks to The Onion]

Commentary • Opinion • business • ISSUE 15•03 • March 6, 2015
By Josh Gane, CEO and President, XilTera

Would someone tell me how this happened? We were the vanguard of programmable logic. The XilTera StraTex was the cool device to design in. Then the other guy came out with a 6-LUT FPGA. Were we scared? Hell, no. Because we hit back with a little thing called the CLM. That’s a 6-LUT with a dual ported synchronous LUT RAM. For register files and so much more. But you know what happened next? Shut up, I’m telling you what happened—the bastards went to superfracturable LUTs. Now we’re standing around looking foolish selling plain 6-LUTs and a LUT RAM. RAM or not, suddenly we’re the chumps. Well, screw it. We’re going to 8-LUTs.

Sure, we could go to 7-LUTs next, like the competition. That seems like the logical thing to do. After all, 6-LUTs worked out pretty well, and seven is the next number after six. So let’s play it safe. Let’s make a 7-input 6,6-LUT and call it the UltraLUT. Why innovate when we can follow? Oh, I know why: Because we’re a business, that’s why!

You think it’s crazy? It is crazy. But I don’t give a damn. From now on, we’re the ones who have the edge in the LUT inputs game. Are they the best programmable logic a designer can get? Hell, no. XilTera is the most programmable programmable logic around.

What part of this don’t you understand? If 4-LUTs are good, and 6-LUTs are better, obviously 8-LUTs would make us the best freaking FPGA that ever existed. Comprende? We didn’t claw our way to the top of the FPGA game by clinging to the 4-LUT industry standard. We got here by taking chances. Well, 8-LUTs are the biggest chance of all.

Here’s the report from Engineering. I don’t want to hear those three damnable letters V.P.R. ever again! They don’t tell me what to invent—I tell them. And I’m telling them to stick two more LUT inputs in there. I don’t care how. Make the mux tree FinFETs so thin they’re invisible. Put some LUT configuration cells in the CLM cluster switchbox. Partially populate it. I don’t care if they have to cram the seventh and eighth input in perpendicular to the other six, just do it!

You’re taking the “balanced” part of “balanced architecture” too literally, grandma. Cut the strings and soar. Let’s hit it. Let’s roll. This is our chance to make programmable logic history. Let’s dream big. All you have to do is say that 8-LUTs can happen, and it will happen. If you aren’t on board, then screw you. And if you’re on the board, then screw you and your father. Hey, if I’m the only one who’ll take risks, I’m sure as hell happy to hog all the glory when the 8-LUT FPGA becomes the device for the U.S. of “this is how we reconfigurably compute now” A.

People said we couldn’t go to 6-LUTs. It’ll blow up the size of the input muxes and configuration bitstream, the LUT mux tree wiring will be too long, they said. Well, we did it. Now some egghead in a lab is screaming “8-LUTs are crazy?” Well, perhaps he’d be more comfortable in the labs at Lactel, working on antifuse crossbars. ONO, my ass!

Maybe I’m wrong. Maybe we should just ride in AltLinx’s wake and make voltage regulators. Ha! Not on your damn life! The day I shadow a penny-ante outfit like AltLinx is the day I leave the FPGA game for good, and that won’t happen until the day I die! The market? Listen, we make the market. All we have to do is put her out there with a little jingle. It’s as easy as, “Hey, technology mapping complex datapaths with anything less than 8-input LUTs is like playing pushing-on-a-rope whack-a-mole with some bad-approximation-timing-model physical synthesis CAD.” Or “There’ll be so few logic levels between the hard logic blocks, I could put Intel out of business.”

I know what you’re thinking now: What’ll people say? Mew mew mew. Oh, no, what will people say?! Grow up. When you’re on top, people talk. That’s the price you pay for being on top. Which XilTera is, always has been, and forever shall be, Amen, 8-LUTs, sweet Jesus in heaven.

Wait. I just had a stroke of genius. Are you ready? Open your mouth, baby birds, cause Mama’s about to drop you one sweet, fat nightcrawler. Here she comes: Put another carry chain on that sucker, too. And hell, why stop there? Two write ports. That’s right. 8-LUTs, fracturable into four 6-LUTs, four clock enables on the flops, two carry chains, 256b LUT RAM, and two write ports. You heard me—two write ports. It’s a whole new way to think about memory bandwidth intensive massively parallel accelerators. Don’t question it. Don’t say a word. Just key the music, and call the chorus girls, because we’re on the edge—the rising clock edge—and I feel like dancing.

Welcome Xilinx UltraScale+ and Zynq UltraScale+

This week Xilinx announced UltraScale+ and Zynq UltraScale+, its new family of 16 nm TSMC 16FF+ FinFET based FPGA and FPGA-MPSoC products. First tape out in 2Q15, first product ship 4Q15.

In the old days one could read a new FPGA’s ~30 page data sheet, digest it for an hour, and write a concise summary of all the new capabilities. But these UltraScale+ devices and tools, designed to appeal to diverse markets and applications, are so comprehensive and complex as to defy such a short take. Instead you ought to set aside some quality time to review these overview documents and pages:

In lieu of an overview summary, here are three new UltraScale+ highlights that are most significant and enabling from my perspective:

  1. Energy efficiency: the 16 nm FinFETs bring a one technology node reprieve of classic Dennard Scaling. In particular “The -1L and -2L speed grades in the UltraScale+ families can run at one of two different Vccint operating voltages. At 0.72V, they operate at similar performance to the Kintex UltraScale and Virtex UltraScale devices with up to 30% reduction in power consumption. At 0.85V, they consume similar power to the Kintex UltraScale and Virtex UltraScale devices, but operate over 30% faster.” The process advance and numerous architectural and IP/tool advances will provide welcome and enabling power savings.
  2. SRAM memory capacity: the new UltraRAM tier of block RAM, dual port 4Kx72 (32 KB), complements good old RAMB36 block RAM, dual port 1Kx36 (~4 KB). The new devices include 320-1536 UltraRAM blocks (90-432 Mb, 10-49 MB) of high bandwidth integrated SRAM. Just as an example, in the largest announced device, XCVU13P, there is SRAM enough for 1024 soft processor cores to each have a private 4 KB L1 I$, 4 KB L1 D$, and 32 KB L2$, and also share a many-banked 16 MB L3$ (wow!), not including leftover BRAM and all the distributed LUT RAM.
  3. MPSoC: the four 64b ARM Cortex-A53 cores, MALI GPU, 64b/72b DDR4 DRAM, much greater PS-PL interconnect, and many, many other features, in this “Zynq v2” line, reposition it from limited embedded system roles (<= 1 GB DRAM) towards hosting almost any application scenario you can imagine, from Android ultrasound tablets to driver assist (can we say “self driving cars” yet?) to data center accelerators.

I congratulate and thank the thousands of staff at Xilinx, TSMC, ARM, and their partners for another stupendous engineering tour de force that will enable us to develop myriad new applications and change the world.

The Past and Future of FPGA Soft Processors

Earlier this month I had the privilege of giving a keynote at Reconfig 2014. I decided to speak on the past and future of FPGA soft processors. This is my twentieth anniversary of working (on and off) in this field so this seemed an apt time and opportunity to share my perspective on where FPGA soft processors came from and what their continuing utility and prospects might be in the decade ahead — the autumn of Moore’s Law, the winter of Dennard Scaling.

Design productivity is still a challenge for reconfigurable computing. It is expensive to port a software workload to RTL, to maintain the RTL as the workload evolves, and to wait for hours to recompile a bitstream after each design change. Soft processors can help mitigate these costs, and provide new pathways to application acceleration. A mid-range FPGA can now host hundreds of soft CPUs and their interconnection network, and such heterogeneous massively parallel processor and accelerator arrays can sustain hundreds of operations, memory accesses, and branches per cycle.

This talk will look back on the history and diversity of soft processor cores for FPGAs, and their continuing relevance for the decade ahead. What new tools, IP, and infrastructure will help us to exploit the coming million LUT, 10 TFLOPS FPGAs? Along the way we will revisit an austere design esthetic and an implementation methodology for crafting FPGA-optimized soft cores, and see how the lessons of mapping one processor into one 1995 FPGA can inform us how to design massively parallel programmable accelerators going forward.

Here are the slides.

Quick FPGA Hacks: Population count

Presented at FPL2014 the paper “Pipelined Compressor Tree Optimization using Integer Linear Programming” by Martin Kumm and Peter Zipf, University of Kassel uses clever (Xilinx) slice-wide (four 6-LUTs + CARRY4) based compressors to reduce as many as 12 input bits to 5 sum bits for an efficiency of up to (12-5) / 4 = 1.75 bits/LUT. This reminded me of a nice application of compressors in FPGAs — population count a.k.a. bit count, e.g. determine the number of one bits set in a word — which uses simpler LUT-based (no carry chain) compressors to good effect. I first learned about this approach in some Altera documentation. Using modern FPGAs’ 6-LUTs it is easy to build a 6-bit wide population count from a bitvector [5:0] input in one LUT delay. This is a “6:3 compressor”:

// 6:3 compressor as a 64x3b ROM -- three 6-LUTs
module c63(
    input [5:0] i,
    output reg [2:0] o);    // # of 1's in i

    always @* begin
        case (i)
        6'h00: o = 0; 6'h01: o = 1; 6'h02: o = 1; 6'h03: o = 2;
        6'h04: o = 1; 6'h05: o = 2; 6'h06: o = 2; 6'h07: o = 3;
        6'h08: o = 1; 6'h09: o = 2; 6'h0A: o = 2; 6'h0B: o = 3;
        6'h0C: o = 2; 6'h0D: o = 3; 6'h0E: o = 3; 6'h0F: o = 4;
        6'h10: o = 1; 6'h11: o = 2; 6'h12: o = 2; 6'h13: o = 3;
        6'h14: o = 2; 6'h15: o = 3; 6'h16: o = 3; 6'h17: o = 4;
        6'h18: o = 2; 6'h19: o = 3; 6'h1A: o = 3; 6'h1B: o = 4;
        6'h1C: o = 3; 6'h1D: o = 4; 6'h1E: o = 4; 6'h1F: o = 5;
        6'h20: o = 1; 6'h21: o = 2; 6'h22: o = 2; 6'h23: o = 3;
        6'h24: o = 2; 6'h25: o = 3; 6'h26: o = 3; 6'h27: o = 4;
        6'h28: o = 2; 6'h29: o = 3; 6'h2A: o = 3; 6'h2B: o = 4;
        6'h2C: o = 3; 6'h2D: o = 4; 6'h2E: o = 4; 6'h2F: o = 5;
        6'h30: o = 2; 6'h31: o = 3; 6'h32: o = 3; 6'h33: o = 4;
        6'h34: o = 3; 6'h35: o = 4; 6'h36: o = 4; 6'h37: o = 5;
        6'h38: o = 3; 6'h39: o = 4; 6'h3A: o = 4; 6'h3B: o = 5;
        6'h3C: o = 4; 6'h3D: o = 5; 6'h3E: o = 5; 6'h3F: o = 6;

A 36b popcount may be implemented using nine such 6:3 compressors and a small ternary adder. The first six compressors determine the no. of ones in each bundle of 6 input bits:

// 36-bit bitcount
module pop36(
    input [35:0] i,
    output [5:0] sum);  // # of 1's in i

    wire [2:0] c0500, c1106, c1712, c2318, c2924, c3530, c0, c1, c2;

    // add six bundles of six bits
    c63 c0500_(i[ 5: 0], c0500);
    c63 c1106_(i[11: 6], c1106);
    c63 c1712_(i[17:12], c1712);
    c63 c2318_(i[23:18], c2318);
    c63 c2924_(i[29:24], c2924);
    c63 c3530_(i[35:30], c3530);

These transform the problem of adding 36 bits to that of adding six 3b vectors. Surprisingly these can also be added with another round of compressors, by focusing on the [0], [1], and [2] bit positions separately.

    // sum the bits set in the [0], [1], and [2] bit positions separately
    c63 c0_({c0500[0],c1106[0],c1712[0],c2318[0],c2924[0],c3530[0]}, c0);
    c63 c1_({c0500[1],c1106[1],c1712[1],c2318[1],c2924[1],c3530[1]}, c1);
    c63 c2_({c0500[2],c1106[2],c1712[2],c2318[2],c2924[2],c3530[2]}, c2);

Now the total number of bits is (c0 * 2^0) + (c1 * 2^1) + (c2 * 2^2):

    assign sum = {2'b0,c0} + {1'b0,c1,1'b0} + {c2,2'b0};

The ternary adder is sufficiently simple that synthesis maps it to two levels of pure LUTs (no carry chain). All totaled, this has a delay of four LUTs and runs at 370 MHz in a Kintex-7 -1 speed grade.  By adding a pipeline register between the compressors and the ternary add, the circuit is reduced to two LUT delays, and runs at 550 MHz.  By adding a pipeline register between each level of LUTs, the critical path is under 1.2 ns, which could run at 860 MHz. However, the minimum clock period for this device is 1.6 ns (625 MHz). Happy hacking!

Microsoft Catapult at ISCA 2014, In the News

This week at ISCA 2014 Andrew Putnam presented A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (PDF).

Some of the Catapult team members (Microsoft Research and Bing)

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the largescale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.



How to Design and Access a Memory-Mapped Device in Programmable Logic from Linaro Ubuntu Linux on Xilinx Zynq on the ZedBoard, Without Writing a Device Driver — Part Two


Following part one, this is the second half of a two part tutorial series on how access a memory-mapped device implemented in Zynq’s programmable logic fabric.


So far we’ve built a new ZedBoard project from scratch. It has a pair-of-32-bit-counters peripheral in the programmable logic. Thanks to the XPS Base System Builder Wizard, its processing system is preconfigured with support for UART, GPIO, SD card, Quad SPI, USB, Ethernet, and 512 MB of DRAM. We’ve built an FSBL that sets this all up, loads the PL, and loads and launches u-boot; and we’re going to reuse the same good old u-boot.elf Linux boot loader from last time.

Notice the system we have just built from scratch does not include the nice ADI HDMI display controller. Certainly we could have added our counters to that system, but to focus on the essentials I thought a brand new, minimalist design would be better. Not to worry, we can still use desktop Ubuntu over the network from another computer.

Headless desktop Linaro Ubuntu Linux

This recapitulates headless operation from last time.

  • Boot your Linaro Ubuntu system from SD card.
  • Install the VNC and RDP servers. From serial port console or Terminal, $ sudo apt-get install xrdp . Then you will be able to boot and run headless. You may still need to use the serial port console to determine the DHCP-assigned IP address ($ ifconfig).
  • On Windows 7, run Remote Desktop Connection (a.k.a. mstsc.exe). Specify your ZedBoard’s IP address, and log in. Voila, desktop Ubuntu over the network. HDMI monitor not required.

Now let’s build a devicetree.dtb that does not probe for not configure the HDMI display controller or any other PL peripherals.

  • $ cd linux/arch/arm/boot/dts; cp zynq-zed-adv7511.dts zynq-zed-no-pl.dts
  • Edit it to delete all programmable logic peripherals. It will look like this:

/include/ "zynq-zed.dtsi"
  • Back up your SD card.
  • Build your new empty PL devicetree blob: $ cd linux; make zynq-zed-no-pl.dtb
  • Copy it to the SD card. cp arch/arm/boot/zynq-zed-no-pl.dtb /media/BOOT/devicetree.dtb
  • Boot Linaro Ubunto Linux on your ZedBoard. Notice your HDMI monitor is now blank!
  • Fear not. On your TeraTerm serial port console, get the IP address ($ ifconfig), and connect and login to your ZedBoard via RDP or VNC. All should work as before, even though you are not using any of the programmable logic fabric.

Device access via /dev/mem

The first, simplest, most elemental, most hacky, most dangerous way to access our peripheral from Linux is by opening /dev/mem and using mmap() to map a view of the device’s physical address space into our process’s virtual address space. Sven Andersson’s UIO blog entry summarizes these considerations perfectly so I’ll just quote him extensively here:

“… Here are some of the characteristics:

  • Userspace interface to system address space
  • Accessed via mmap() system call
  • Must be root or have appropriate permissions
  • Quite a blunt tool – must be used carefully
  • Can bypass protections provided by the MMU Possible to corrupt kernel, device or other processes memory


  • Very simple – no kernel module or code
  • Good for quick prototyping / IP verification
  • peek/poke utilities
  • Portable (in a very basic sense)


  • No interrupt handling possible
  • No protection against simultaneous access
  • Need to know physical address of IP Hard-code?”

Continuing with our zed_counters project, with headless Ubuntu set up, let’s reboot our Linux system, configured to use the zed_counters’ design BOOT.BIN image we built earlier.

  • Copy it to your SD card $ cp …/zed_fsbl/bootimage/u-boot.bin /media/BOOT/BOOT.BIN

The SD card BOOT partition now contains our new BOOT.BIN with the zed_counters design; a new devicetree.dtb blog which denies any devices in the PL fabric, and the uImage Linux kernel we built last time. Safe-eject it, insert it into your ZedBoard, and reboot. Start a remote desktop connection. Open a Terminal. Fetch this /dev/mem access-based test application from Andersson’s blog entry.

	/* Open /dev/mem file */
	fd = open ("/dev/mem", O_RDWR);
	if (fd < 1) {
		return -1;

	/* mmap the device into memory */
	page_addr = (gpio_addr & (~(page_size-1)));
	page_offset = gpio_addr - page_addr;
	ptr = mmap(NULL, page_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, page_addr);
  • There’s nothing GPIO specific about it, so it can be used to test our counters device.
  • Build it. $ cc -o gpio-dev-mem-test gpio-dev-mem-test.c
  • Run it. $ ./gpio-dev-mem-test -g 0x70000000 -i =>  “GPIO access through /dev/mem. ./gpio-dev-mem-test: Permission denied”
  • Oops! This highlights a shortcoming of the /dev/mem approach. /dev/mem is a character special file owned by root and unreadable by regular users:
    $ ls -l /dev/mem => crw-r—– 1 root kmem 1, 1 Dec 31 1969 /dev/mem
  • Make the test program SETUID root so it can open the file. $ sudo chown root gpio-dev-mem-test; sudo chmod u+s gpio-dev-mem-test Congratulations, you have now added a massive security and robustness hole to your Linux system.
  • Try again. Read (write) the up-counter at address 0x70000000:
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000000
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000001
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000002
  • $ ./gpio-dev-mem-test -g 0x70000000 -o 7
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000007
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000008
  • It counts! Read the free-running counter at address 0x70000004:
  • $ ./gpio-dev-mem-test -g 0x70000004 -i => … input: 577c9e79
  • $ ./gpio-dev-mem-test -g 0x70000004 -i => … input: 62502c56
  • $ ./gpio-dev-mem-test -g 0x70000004 -i => … input: 6b764783
  • It counts at 100 MHz.

It’s great that we can kick the tires on our new device without any device drivers or kernel configuration, but as Mr. Andersson writes, direct /dev/mem access is “OK for prototyping – not recommended for production.” Next we’ll see how to employ user-mode I/O to access just the counters device — with better safety and security.

Device Access via UIO

In this final section, we will see how to configure, modify, and rebuild your Linux kernel so you can configure UIO devices in your device tree, and how to access them from your application program.

Your ADI or Xilinx Linux source tree already contains the source to two useful UIO drivers, uio_pdrv (“UIO platform driver”) and uio_pdrv_genirq (“UIO platform driver with generic interrupts”). Unfortunately the default ADI and Xilinx build configurations do not build or include these drivers in the kernel (nor as loadable driver modules).

To enable these drivers run $ cd linux; make menuconfig. This pops a character-mode GUI titled .config — … Kernel Configuration.

  • In Kernel Configuration, select Device Drivers —>
  • In Device Drivers, select Userspace I/O drivers —>
  • In Userspace I/O drivers, select <*> as built-in both Userspace I/O platform driver and Userspace I/O platform driver with generic IRQ handling.
  • Exit. Exit. Exit. Save. Check: $ grep UIO .config =>

Now (before we rebuild the kernel) it is time to briefly review the UIO drivers kernel source. This won’t hurt much.

  • $ cd linux/drivers/uio
  • Review uio.c, uio_pdrv.c, and uio_pdrv_genirq.c.
  • Note that uio.c contains common support routines that are used by the uio_pdrv and uio_pdrv_genirq drivers.
  • Note that only uio_pdrv_genirq defines a devicetree/”open firmware device id” compatible key:  … { .compatible = “generic-uio”, },
  • Reviewing the driver change history in github, we see this line was added in 11/2012 for “microblaze: UIO setup compatible property” to “Setup compatible property which matches petalinux”.
  • So “out of the box”, after enabling UIO drivers in menuconfig, the only UIO driver you can use that will properly initialize with a device tree blob configuration is uio_pdrv_genirq, and to use that, you must describe your peripheral as a “generic-uio”.
  • Furthermore, the uio_pdrv_genirq requires the device tree blob to also define the device interrupt number and the interrupt parent device.
  • To make this work for our interrupt-less counters device, we can lie, pick a free interrupt number, and pretend our counters are wired up to the Zynq GIC interrupt controller, just like interrupt-issuing Zynq peripherals do. Interrupt numbers are biased by -32 for some reason. Here is what the ensuing DTS device tree specification looks like:

/include/ "zynq-zed.dtsi"

/ {
    counters@70000000 {
        compatible = "generic-uio";
        reg = < 0x70000000 0x1000 >;
        interrupts = < 0 57 0 >;
        interrupt-parent = <&gic>;
  • We declare this device is a “generic-uio”, which will match uio_pdrv_genirq;
  • Its registers are at physical address 0x70000000 and are 0x1000 (4 KB) in size.
  • It generates interrupt 57+32 = 89. These are issued to the GIC. Not really.
  • Where did the fake interrupt number 57+32=89 come from? I reviewed the interrupt assignments in various DTS files that may intersect this project, such as arch/arm/boot/dts/{zynq.dtsi, zynq-zed.dtsi, zynq-zed-adv7511.dts}, and picked one that wasn’t in use there.

Now we can (and I have) built a devicetree blob from this DTS file, booted Linux, and my device has probed, configured, initialized, and set up a /dev/uio0 (major device 251, minor device 0), and (as we’ll see below) I have accessed this from a user application.

In fact if we already had a peripheral with both memory-mapped I/O and interrupts, this existing driver uio_gen_pdrv would be ideal, we’d be done, and this seemingly endless tutorial would be over already.

But since example device counters doesn’t have interrupts, I am still not satisfied. What does it take to get the interrupt-less uio_pdrv working with device tree configuration?

  • We have to modify drivers/uio/uio_pdrv.c to add an open firmware device id compatible string for it to match a device specification in the device tree. We’ll use the unremarkable name “uio_pdrv”.
  • We also have to bring some code forward (and now, since I’m not a Linux kernel developer, this is getting dangerously close to cargo cult programming) into the uio_pdrv_probe() function to dynamically allocate a uio_info structure for the dynamically discovered uio_pdrv instance.
  • For symmetry, we’ll add the compatible name “uio_pdrv_genirq” to the uio_pdrv_genirq device; it may then be used interchangeably with the existing name “generic-uio”.
  • Here are the context diffs:
diff --git a/drivers/uio/uio_pdrv.c b/drivers/uio/uio_pdrv.c
index 72d3646..cc50312 100644
--- a/drivers/uio/uio_pdrv.c
+++ b/drivers/uio/uio_pdrv.c
@@ -14,6 +14,9 @@
 #include <linux/module.h> 
 #include <linux/slab.h>

+#include <linux/of.h>
+#include <linux/of_platform.h>
 #define DRIVER_NAME "uio_pdrv"

 struct uio_platdata {
@@ -28,6 +31,19 @@ static int uio_pdrv_probe(struct platform_device *pdev)
    int ret = -ENODEV;
    int i;

+   if (!uioinfo) {
+       /* devicetree -- alloc uioinfo for one device */
+       uioinfo = kzalloc(sizeof(*uioinfo), GFP_KERNEL);
+       if (!uioinfo) {
+           ret = -ENOMEM;
+           dev_err(&pdev->dev, "unable to kmallocn");
+           goto err_uioinfo;
+       }
+       uioinfo->name = pdev->dev.of_node->name;
+       uioinfo->version = "devicetree";
+       uioinfo->irq = UIO_IRQ_NONE;
+   }
    if (!uioinfo || !uioinfo->name || !uioinfo->version) {
        dev_dbg(&pdev->dev, "%s: err_uioinfon", __func__);
        goto err_uioinfo;
@@ -95,12 +111,23 @@ static int uio_pdrv_remove(struct platform_device *pdev)
    return 0;

+#ifdef CONFIG_OF
+static const struct of_device_id uio_pdrv_of_match[] = {
+   { .compatible = "uio_pdrv", },
+   { },
+MODULE_DEVICE_TABLE(of, uio_pdrv_of_match);
+# define uio_pdrv_of_match NULL
 static struct platform_driver uio_pdrv = {
    .probe = uio_pdrv_probe,
    .remove = uio_pdrv_remove,
    .driver = {
        .name = DRIVER_NAME,
        .owner = THIS_MODULE,
+       .of_match_table = uio_pdrv_of_match,

diff --git a/drivers/uio/uio_pdrv_genirq.c b/drivers/uio/uio_pdrv_genirq.c
index 1f5ec28..3c7d8de 100644
--- a/drivers/uio/uio_pdrv_genirq.c
+++ b/drivers/uio/uio_pdrv_genirq.c
@@ -264,7 +264,8 @@ static const struct dev_pm_ops uio_pdrv_genirq_dev_pm_ops = {
 #ifdef CONFIG_OF
 static const struct of_device_id uio_of_genirq_match[] = {
    { .compatible = "generic-uio", },
-   { /* empty for now */ },
+   { .compatible = "uio_pdrv_genirq", },
+   { },
 MODULE_DEVICE_TABLE(of, uio_of_genirq_match);

Now that we have configured UIO and added devicetree support to uio_pdrv, we are ready to rebuild the kernel and copy it to the SD card.

  • $ cd linux; make uImage LOADADDR=0x00008000
  • $ cp arch/arm/boot/uImage /media/BOOT

We can now use a nice simple devicetree .DTS for our zed_counters project:


/include/ "zynq-zed.dtsi"

/ {
    counters@70000000 {
        compatible = "uio_pdrv";
        reg = < 0x70000000 0x1000 >;
  • Counters is now a uio_pdrv device with physical addresses 0x70000000-0x70000fff.
  • No interrupt fakery required!
  • Put that in arch/arm/boot/dts/zynq-zed-counters.dts, rebuild the device tree blob, and copy it to the SD card. $ make zynq-zed-counters.dtb;
    $ cp arch/arm/boot/zynq-zed-counters.dtb /media/BOOT/devicetree.dtb
  • Together with the BOOT.BIN for our zed-counters design, we’re all set.
  • Safe-eject the SD card, insert it into the ZedBoard, and boot Linux!
  • There is no helpful console message confirming our counters device is initialized. That’s OK, we can see for ourselves.
  • $ ls -l /dev/uio0 => crw——- 1 root root 251, 0 Dec 31  1969 /dev/uio0
  • $ grep 251 /proc/devices => 251 uio
  • $ cd /sys/devices/70000000.counters/uio/uio0; cat name version uevent =>

Now to test UIO access to our counters. Andersson’s GPIO UIO test application is worthy of your study, but is not ideal to exercise our two counters. Instead, below, I have modified it (quickly hacked it) into a trivial counters-specific test. First we check counters[0] increments on every read. Next we do a few timing tests with our free running counters[1].

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <sys/mman.h>
#include <fcntl.h>

#define MAP_SIZE 0x1000

void usage(void)
    printf("usage: test_counters -d <UIO_DEV_FILE>n");

int main(int argc, char *argv[])
    int c;
    int i;
    int fd = 0;
    char *uiod = 0;
    int value = 0;
    unsigned start, end;

    volatile unsigned *counters;

    while ((c = getopt(argc, argv, "d:io:h")) != -1) {
        switch(c) {
        case 'd':
            uiod = optarg;
        case 'h':
            return 0;
            printf("invalid option: %cn", c);
            return -1;

    if (!uiod) {
        return -1;

    /* Open the UIO device file */
    fd = open(uiod, O_RDWR);
    if (fd < 1) {
        printf("Invalid UIO device file: '%s'n", uiod);
        return -1;

    /* mmap the UIO device */
    counters = (volatile unsigned *)mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (!counters) {
        return -1;

    counters[0] = 1;
    assert(counters[0] == 1);
    assert(counters[0] == 2);
    assert(counters[0] == 3);

    counters[0] = 100;
    assert(counters[0] == 100);
    assert(counters[0] == 101);
    assert(counters[0] == 102);

    printf("wr-rd delay on free-running counter: ");
    for (i = 0; i < 20; ++i) {
        counters[1] = start = 0;
        end = counters[1];
        printf("%u ", end - start);

    printf("rd-rd delay on free-running counter: ");
    for (i = 0; i < 20; ++i) {
        counters[1] = 0;
        start = counters[1];
        end = counters[1];
        printf("%u ", end - start);

    munmap((void*)counters, MAP_SIZE);
    return 0;

Now let’s run it!

  • $ cc -o counters counters.c
  • $ ./counters -d /dev/uio0 => ./counters: Permission denied
  • $ ls -l /dev/uio0 => crw——- 1 root root 251, 0 Dec 31  1969 /dev/uio0 Oh, right. The uio_pdrv driver created the device file but made it inaccessible to non-superusers.
  • Fix that. $ sudo chmod 666 /dev/uio0
  • $ ./counters -d /dev/uio0 =>
    wr-rd delay on free-running counter: 13 14 13 14 13 14 14 13 13 13 14 14 13 14 13 14 14 …
    rd-rd delay on free-running counter: 14 14 14 15 14 14 15 14 15 15 14 14 14 15 14 14 15 …

It works! Here we see that counter[0] correctly increments on every read, and successive read accesses to the free-running counter[1] each take 13-15 cycles e.g. 130-150 ns.

Unfortunately each time I reboot, permissions on /dev/uio0 revert to -rw——. I don’t know how to set them from within the device driver. Any suggestions? Thank you.

This ends the tutorial. Thank you for reading along. I hope it is of help in your Zynq Linux work. Please leave me your comments here or via twitter @jangray. Good luck and happy hacking.