Stop Everything, We’re Doing 8-LUTs

[With apologies/thanks to The Onion]

Commentary • Opinion • business • ISSUE 15•03 • March 6, 2015
By Josh Gane, CEO and President, XilTera

Would someone tell me how this happened? We were the vanguard of programmable logic. The XilTera StraTex was the cool device to design in. Then the other guy came out with a 6-LUT FPGA. Were we scared? Hell, no. Because we hit back with a little thing called the CLM. That’s a 6-LUT with a dual ported synchronous LUT RAM. For register files and so much more. But you know what happened next? Shut up, I’m telling you what happened—the bastards went to superfracturable LUTs. Now we’re standing around looking foolish selling plain 6-LUTs and a LUT RAM. RAM or not, suddenly we’re the chumps. Well, screw it. We’re going to 8-LUTs.

Sure, we could go to 7-LUTs next, like the competition. That seems like the logical thing to do. After all, 6-LUTs worked out pretty well, and seven is the next number after six. So let’s play it safe. Let’s make a 7-input 6,6-LUT and call it the UltraLUT. Why innovate when we can follow? Oh, I know why: Because we’re a business, that’s why!

You think it’s crazy? It is crazy. But I don’t give a damn. From now on, we’re the ones who have the edge in the LUT inputs game. Are they the best programmable logic a designer can get? Hell, no. XilTera is the most programmable programmable logic around.

What part of this don’t you understand? If 4-LUTs are good, and 6-LUTs are better, obviously 8-LUTs would make us the best freaking FPGA that ever existed. Comprende? We didn’t claw our way to the top of the FPGA game by clinging to the 4-LUT industry standard. We got here by taking chances. Well, 8-LUTs are the biggest chance of all.

Here’s the report from Engineering. I don’t want to hear those three damnable letters V.P.R. ever again! They don’t tell me what to invent—I tell them. And I’m telling them to stick two more LUT inputs in there. I don’t care how. Make the mux tree FinFETs so thin they’re invisible. Put some LUT configuration cells in the CLM cluster switchbox. Partially populate it. I don’t care if they have to cram the seventh and eighth input in perpendicular to the other six, just do it!

You’re taking the “balanced” part of “balanced architecture” too literally, grandma. Cut the strings and soar. Let’s hit it. Let’s roll. This is our chance to make programmable logic history. Let’s dream big. All you have to do is say that 8-LUTs can happen, and it will happen. If you aren’t on board, then screw you. And if you’re on the board, then screw you and your father. Hey, if I’m the only one who’ll take risks, I’m sure as hell happy to hog all the glory when the 8-LUT FPGA becomes the device for the U.S. of “this is how we reconfigurably compute now” A.

People said we couldn’t go to 6-LUTs. It’ll blow up the size of the input muxes and configuration bitstream, the LUT mux tree wiring will be too long, they said. Well, we did it. Now some egghead in a lab is screaming “8-LUTs are crazy?” Well, perhaps he’d be more comfortable in the labs at Lactel, working on antifuse crossbars. ONO, my ass!

Maybe I’m wrong. Maybe we should just ride in AltLinx’s wake and make voltage regulators. Ha! Not on your damn life! The day I shadow a penny-ante outfit like AltLinx is the day I leave the FPGA game for good, and that won’t happen until the day I die! The market? Listen, we make the market. All we have to do is put her out there with a little jingle. It’s as easy as, “Hey, technology mapping complex datapaths with anything less than 8-input LUTs is like playing pushing-on-a-rope whack-a-mole with some bad-approximation-timing-model physical synthesis CAD.” Or “There’ll be so few logic levels between the hard logic blocks, I could put Intel out of business.”

I know what you’re thinking now: What’ll people say? Mew mew mew. Oh, no, what will people say?! Grow up. When you’re on top, people talk. That’s the price you pay for being on top. Which XilTera is, always has been, and forever shall be, Amen, 8-LUTs, sweet Jesus in heaven.

Wait. I just had a stroke of genius. Are you ready? Open your mouth, baby birds, cause Mama’s about to drop you one sweet, fat nightcrawler. Here she comes: Put another carry chain on that sucker, too. And hell, why stop there? Two write ports. That’s right. 8-LUTs, fracturable into four 6-LUTs, four clock enables on the flops, two carry chains, 256b LUT RAM, and two write ports. You heard me—two write ports. It’s a whole new way to think about memory bandwidth intensive massively parallel accelerators. Don’t question it. Don’t say a word. Just key the music, and call the chorus girls, because we’re on the edge—the rising clock edge—and I feel like dancing.

Welcome Xilinx UltraScale+ and Zynq UltraScale+

This week Xilinx announced UltraScale+ and Zynq UltraScale+, its new family of 16 nm TSMC 16FF+ FinFET based FPGA and FPGA-MPSoC products. First tape out in 2Q15, first product ship 4Q15.

In the old days one could read a new FPGA’s ~30 page data sheet, digest it for an hour, and write a concise summary of all the new capabilities. But these UltraScale+ devices and tools, designed to appeal to diverse markets and applications, are so comprehensive and complex as to defy such a short take. Instead you ought to set aside some quality time to review these overview documents and pages:

In lieu of an overview summary, here are three new UltraScale+ highlights that are most significant and enabling from my perspective:

  1. Energy efficiency: the 16 nm FinFETs bring a one technology node reprieve of classic Dennard Scaling. In particular “The -1L and -2L speed grades in the UltraScale+ families can run at one of two different Vccint operating voltages. At 0.72V, they operate at similar performance to the Kintex UltraScale and Virtex UltraScale devices with up to 30% reduction in power consumption. At 0.85V, they consume similar power to the Kintex UltraScale and Virtex UltraScale devices, but operate over 30% faster.” The process advance and numerous architectural and IP/tool advances will provide welcome and enabling power savings.
  2. SRAM memory capacity: the new UltraRAM tier of block RAM, dual port 4Kx72 (32 KB), complements good old RAMB36 block RAM, dual port 1Kx36 (~4 KB). The new devices include 320-1536 UltraRAM blocks (90-432 Mb, 10-49 MB) of high bandwidth integrated SRAM. Just as an example, in the largest announced device, XCVU13P, there is SRAM enough for 1024 soft processor cores to each have a private 4 KB L1 I$, 4 KB L1 D$, and 32 KB L2$, and also share a many-banked 16 MB L3$ (wow!), not including leftover BRAM and all the distributed LUT RAM.
  3. MPSoC: the four 64b ARM Cortex-A53 cores, MALI GPU, 64b/72b DDR4 DRAM, much greater PS-PL interconnect, and many, many other features, in this “Zynq v2” line, reposition it from limited embedded system roles (<= 1 GB DRAM) towards hosting almost any application scenario you can imagine, from Android ultrasound tablets to driver assist (can we say “self driving cars” yet?) to data center accelerators.

I congratulate and thank the thousands of staff at Xilinx, TSMC, ARM, and their partners for another stupendous engineering tour de force that will enable us to develop myriad new applications and change the world.

The Past and Future of FPGA Soft Processors

Earlier this month I had the privilege of giving a keynote at Reconfig 2014. I decided to speak on the past and future of FPGA soft processors. This is my twentieth anniversary of working (on and off) in this field so this seemed an apt time and opportunity to share my perspective on where FPGA soft processors came from and what their continuing utility and prospects might be in the decade ahead — the autumn of Moore’s Law, the winter of Dennard Scaling.

Abstract
Design productivity is still a challenge for reconfigurable computing. It is expensive to port a software workload to RTL, to maintain the RTL as the workload evolves, and to wait for hours to recompile a bitstream after each design change. Soft processors can help mitigate these costs, and provide new pathways to application acceleration. A mid-range FPGA can now host hundreds of soft CPUs and their interconnection network, and such heterogeneous massively parallel processor and accelerator arrays can sustain hundreds of operations, memory accesses, and branches per cycle.

This talk will look back on the history and diversity of soft processor cores for FPGAs, and their continuing relevance for the decade ahead. What new tools, IP, and infrastructure will help us to exploit the coming million LUT, 10 TFLOPS FPGAs? Along the way we will revisit an austere design esthetic and an implementation methodology for crafting FPGA-optimized soft cores, and see how the lessons of mapping one processor into one 1995 FPGA can inform us how to design massively parallel programmable accelerators going forward.

Here are the slides.

Quick FPGA Hacks: Population count

Presented at FPL2014 the paper “Pipelined Compressor Tree Optimization using Integer Linear Programming” by Martin Kumm and Peter Zipf, University of Kassel uses clever (Xilinx) slice-wide (four 6-LUTs + CARRY4) based compressors to reduce as many as 12 input bits to 5 sum bits for an efficiency of up to (12-5) / 4 = 1.75 bits/LUT. This reminded me of a nice application of compressors in FPGAs — population count a.k.a. bit count, e.g. determine the number of one bits set in a word — which uses simpler LUT-based (no carry chain) compressors to good effect. I first learned about this approach in some Altera documentation. Using modern FPGAs’ 6-LUTs it is easy to build a 6-bit wide population count from a bitvector [5:0] input in one LUT delay. This is a “6:3 compressor”:

//////////////////////////////////////////////////////////////////////////////
// 6:3 compressor as a 64x3b ROM -- three 6-LUTs
//
module c63(
    input [5:0] i,
    output reg [2:0] o);    // # of 1's in i

    always @* begin
        case (i)
        6'h00: o = 0; 6'h01: o = 1; 6'h02: o = 1; 6'h03: o = 2;
        6'h04: o = 1; 6'h05: o = 2; 6'h06: o = 2; 6'h07: o = 3;
        6'h08: o = 1; 6'h09: o = 2; 6'h0A: o = 2; 6'h0B: o = 3;
        6'h0C: o = 2; 6'h0D: o = 3; 6'h0E: o = 3; 6'h0F: o = 4;
        6'h10: o = 1; 6'h11: o = 2; 6'h12: o = 2; 6'h13: o = 3;
        6'h14: o = 2; 6'h15: o = 3; 6'h16: o = 3; 6'h17: o = 4;
        6'h18: o = 2; 6'h19: o = 3; 6'h1A: o = 3; 6'h1B: o = 4;
        6'h1C: o = 3; 6'h1D: o = 4; 6'h1E: o = 4; 6'h1F: o = 5;
        6'h20: o = 1; 6'h21: o = 2; 6'h22: o = 2; 6'h23: o = 3;
        6'h24: o = 2; 6'h25: o = 3; 6'h26: o = 3; 6'h27: o = 4;
        6'h28: o = 2; 6'h29: o = 3; 6'h2A: o = 3; 6'h2B: o = 4;
        6'h2C: o = 3; 6'h2D: o = 4; 6'h2E: o = 4; 6'h2F: o = 5;
        6'h30: o = 2; 6'h31: o = 3; 6'h32: o = 3; 6'h33: o = 4;
        6'h34: o = 3; 6'h35: o = 4; 6'h36: o = 4; 6'h37: o = 5;
        6'h38: o = 3; 6'h39: o = 4; 6'h3A: o = 4; 6'h3B: o = 5;
        6'h3C: o = 4; 6'h3D: o = 5; 6'h3E: o = 5; 6'h3F: o = 6;
        endcase
    end
endmodule

A 36b popcount may be implemented using nine such 6:3 compressors and a small ternary adder. The first six compressors determine the no. of ones in each bundle of 6 input bits:

//////////////////////////////////////////////////////////////////////////////
// 36-bit bitcount
//
module pop36(
    input [35:0] i,
    output [5:0] sum);  // # of 1's in i

    wire [2:0] c0500, c1106, c1712, c2318, c2924, c3530, c0, c1, c2;

    // add six bundles of six bits
    c63 c0500_(i[ 5: 0], c0500);
    c63 c1106_(i[11: 6], c1106);
    c63 c1712_(i[17:12], c1712);
    c63 c2318_(i[23:18], c2318);
    c63 c2924_(i[29:24], c2924);
    c63 c3530_(i[35:30], c3530);

These transform the problem of adding 36 bits to that of adding six 3b vectors. Surprisingly these can also be added with another round of compressors, by focusing on the [0], [1], and [2] bit positions separately.

    // sum the bits set in the [0], [1], and [2] bit positions separately
    c63 c0_({c0500[0],c1106[0],c1712[0],c2318[0],c2924[0],c3530[0]}, c0);
    c63 c1_({c0500[1],c1106[1],c1712[1],c2318[1],c2924[1],c3530[1]}, c1);
    c63 c2_({c0500[2],c1106[2],c1712[2],c2318[2],c2924[2],c3530[2]}, c2);

Now the total number of bits is (c0 * 2^0) + (c1 * 2^1) + (c2 * 2^2):

    assign sum = {2'b0,c0} + {1'b0,c1,1'b0} + {c2,2'b0};
endmodule

The ternary adder is sufficiently simple that synthesis maps it to two levels of pure LUTs (no carry chain). All totaled, this has a delay of four LUTs and runs at 370 MHz in a Kintex-7 -1 speed grade.  By adding a pipeline register between the compressors and the ternary add, the circuit is reduced to two LUT delays, and runs at 550 MHz.  By adding a pipeline register between each level of LUTs, the critical path is under 1.2 ns, which could run at 860 MHz. However, the minimum clock period for this device is 1.6 ns (625 MHz). Happy hacking!

Microsoft Catapult at ISCA 2014, In the News

This week at ISCA 2014 Andrew Putnam presented A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (PDF).

Some of the Catapult team members (Microsoft Research and Bing)

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the largescale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.

Coverage:

 

How to Design and Access a Memory-Mapped Device in Programmable Logic from Linaro Ubuntu Linux on Xilinx Zynq on the ZedBoard, Without Writing a Device Driver — Part Two

Introduction

Following part one, this is the second half of a two part tutorial series on how access a memory-mapped device implemented in Zynq’s programmable logic fabric.

Recap

So far we’ve built a new ZedBoard project from scratch. It has a pair-of-32-bit-counters peripheral in the programmable logic. Thanks to the XPS Base System Builder Wizard, its processing system is preconfigured with support for UART, GPIO, SD card, Quad SPI, USB, Ethernet, and 512 MB of DRAM. We’ve built an FSBL that sets this all up, loads the PL, and loads and launches u-boot; and we’re going to reuse the same good old u-boot.elf Linux boot loader from last time.

Notice the system we have just built from scratch does not include the nice ADI HDMI display controller. Certainly we could have added our counters to that system, but to focus on the essentials I thought a brand new, minimalist design would be better. Not to worry, we can still use desktop Ubuntu over the network from another computer.

Headless desktop Linaro Ubuntu Linux

This recapitulates headless operation from last time.

  • Boot your Linaro Ubuntu system from SD card.
  • Install the VNC and RDP servers. From serial port console or Terminal, $ sudo apt-get install xrdp . Then you will be able to boot and run headless. You may still need to use the serial port console to determine the DHCP-assigned IP address ($ ifconfig).
  • On Windows 7, run Remote Desktop Connection (a.k.a. mstsc.exe). Specify your ZedBoard’s IP address, and log in. Voila, desktop Ubuntu over the network. HDMI monitor not required.

Now let’s build a devicetree.dtb that does not probe for not configure the HDMI display controller or any other PL peripherals.

  • $ cd linux/arch/arm/boot/dts; cp zynq-zed-adv7511.dts zynq-zed-no-pl.dts
  • Edit it to delete all programmable logic peripherals. It will look like this:
/dts-v1/;

/include/ "zynq-zed.dtsi"
  • Back up your SD card.
  • Build your new empty PL devicetree blob: $ cd linux; make zynq-zed-no-pl.dtb
  • Copy it to the SD card. cp arch/arm/boot/zynq-zed-no-pl.dtb /media/BOOT/devicetree.dtb
  • Boot Linaro Ubunto Linux on your ZedBoard. Notice your HDMI monitor is now blank!
  • Fear not. On your TeraTerm serial port console, get the IP address ($ ifconfig), and connect and login to your ZedBoard via RDP or VNC. All should work as before, even though you are not using any of the programmable logic fabric.

Device access via /dev/mem

The first, simplest, most elemental, most hacky, most dangerous way to access our peripheral from Linux is by opening /dev/mem and using mmap() to map a view of the device’s physical address space into our process’s virtual address space. Sven Andersson’s UIO blog entry summarizes these considerations perfectly so I’ll just quote him extensively here:

“… Here are some of the characteristics:

  • Userspace interface to system address space
  • Accessed via mmap() system call
  • Must be root or have appropriate permissions
  • Quite a blunt tool – must be used carefully
  • Can bypass protections provided by the MMU Possible to corrupt kernel, device or other processes memory

Pro

  • Very simple – no kernel module or code
  • Good for quick prototyping / IP verification
  • peek/poke utilities
  • Portable (in a very basic sense)

Con

  • No interrupt handling possible
  • No protection against simultaneous access
  • Need to know physical address of IP Hard-code?”

Continuing with our zed_counters project, with headless Ubuntu set up, let’s reboot our Linux system, configured to use the zed_counters’ design BOOT.BIN image we built earlier.

  • Copy it to your SD card $ cp …/zed_fsbl/bootimage/u-boot.bin /media/BOOT/BOOT.BIN

The SD card BOOT partition now contains our new BOOT.BIN with the zed_counters design; a new devicetree.dtb blog which denies any devices in the PL fabric, and the uImage Linux kernel we built last time. Safe-eject it, insert it into your ZedBoard, and reboot. Start a remote desktop connection. Open a Terminal. Fetch this /dev/mem access-based test application from Andersson’s blog entry.

	/* Open /dev/mem file */
	fd = open ("/dev/mem", O_RDWR);
	if (fd < 1) {
		perror(argv[0]);
		return -1;
	}

	/* mmap the device into memory */
	page_addr = (gpio_addr & (~(page_size-1)));
	page_offset = gpio_addr - page_addr;
	ptr = mmap(NULL, page_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, page_addr);
  • There’s nothing GPIO specific about it, so it can be used to test our counters device.
  • Build it. $ cc -o gpio-dev-mem-test gpio-dev-mem-test.c
  • Run it. $ ./gpio-dev-mem-test -g 0x70000000 -i =>  “GPIO access through /dev/mem. ./gpio-dev-mem-test: Permission denied”
  • Oops! This highlights a shortcoming of the /dev/mem approach. /dev/mem is a character special file owned by root and unreadable by regular users:
    $ ls -l /dev/mem => crw-r—– 1 root kmem 1, 1 Dec 31 1969 /dev/mem
  • Make the test program SETUID root so it can open the file. $ sudo chown root gpio-dev-mem-test; sudo chmod u+s gpio-dev-mem-test Congratulations, you have now added a massive security and robustness hole to your Linux system.
  • Try again. Read (write) the up-counter at address 0x70000000:
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000000
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000001
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000002
  • $ ./gpio-dev-mem-test -g 0x70000000 -o 7
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000007
  • $ ./gpio-dev-mem-test -g 0x70000000 -i => … input: 00000008
  • It counts! Read the free-running counter at address 0x70000004:
  • $ ./gpio-dev-mem-test -g 0x70000004 -i => … input: 577c9e79
  • $ ./gpio-dev-mem-test -g 0x70000004 -i => … input: 62502c56
  • $ ./gpio-dev-mem-test -g 0x70000004 -i => … input: 6b764783
  • It counts at 100 MHz.

It’s great that we can kick the tires on our new device without any device drivers or kernel configuration, but as Mr. Andersson writes, direct /dev/mem access is “OK for prototyping – not recommended for production.” Next we’ll see how to employ user-mode I/O to access just the counters device — with better safety and security.

Device Access via UIO

In this final section, we will see how to configure, modify, and rebuild your Linux kernel so you can configure UIO devices in your device tree, and how to access them from your application program.

Your ADI or Xilinx Linux source tree already contains the source to two useful UIO drivers, uio_pdrv (“UIO platform driver”) and uio_pdrv_genirq (“UIO platform driver with generic interrupts”). Unfortunately the default ADI and Xilinx build configurations do not build or include these drivers in the kernel (nor as loadable driver modules).

To enable these drivers run $ cd linux; make menuconfig. This pops a character-mode GUI titled .config — … Kernel Configuration.

  • In Kernel Configuration, select Device Drivers —>
  • In Device Drivers, select Userspace I/O drivers —>
  • In Userspace I/O drivers, select <*> as built-in both Userspace I/O platform driver and Userspace I/O platform driver with generic IRQ handling.
  • Exit. Exit. Exit. Save. Check: $ grep UIO .config =>
    CONFIG_UIO=y
    CONFIG_UIO_PDRV=y
    CONFIG_UIO_PDRV_GENIRQ=y

Now (before we rebuild the kernel) it is time to briefly review the UIO drivers kernel source. This won’t hurt much.

  • $ cd linux/drivers/uio
  • Review uio.c, uio_pdrv.c, and uio_pdrv_genirq.c.
  • Note that uio.c contains common support routines that are used by the uio_pdrv and uio_pdrv_genirq drivers.
  • Note that only uio_pdrv_genirq defines a devicetree/”open firmware device id” compatible key:  … { .compatible = “generic-uio”, },
  • Reviewing the driver change history in github, we see this line was added in 11/2012 for “microblaze: UIO setup compatible property” to “Setup compatible property which matches petalinux”.
  • So “out of the box”, after enabling UIO drivers in menuconfig, the only UIO driver you can use that will properly initialize with a device tree blob configuration is uio_pdrv_genirq, and to use that, you must describe your peripheral as a “generic-uio”.
  • Furthermore, the uio_pdrv_genirq requires the device tree blob to also define the device interrupt number and the interrupt parent device.
  • To make this work for our interrupt-less counters device, we can lie, pick a free interrupt number, and pretend our counters are wired up to the Zynq GIC interrupt controller, just like interrupt-issuing Zynq peripherals do. Interrupt numbers are biased by -32 for some reason. Here is what the ensuing DTS device tree specification looks like:
/dts-v1/;

/include/ "zynq-zed.dtsi"

/ {
    counters@70000000 {
        compatible = "generic-uio";
        reg = < 0x70000000 0x1000 >;
        interrupts = < 0 57 0 >;
        interrupt-parent = <&gic>;
    };
};
  • We declare this device is a “generic-uio”, which will match uio_pdrv_genirq;
  • Its registers are at physical address 0x70000000 and are 0x1000 (4 KB) in size.
  • It generates interrupt 57+32 = 89. These are issued to the GIC. Not really.
  • Where did the fake interrupt number 57+32=89 come from? I reviewed the interrupt assignments in various DTS files that may intersect this project, such as arch/arm/boot/dts/{zynq.dtsi, zynq-zed.dtsi, zynq-zed-adv7511.dts}, and picked one that wasn’t in use there.

Now we can (and I have) built a devicetree blob from this DTS file, booted Linux, and my device has probed, configured, initialized, and set up a /dev/uio0 (major device 251, minor device 0), and (as we’ll see below) I have accessed this from a user application.

In fact if we already had a peripheral with both memory-mapped I/O and interrupts, this existing driver uio_gen_pdrv would be ideal, we’d be done, and this seemingly endless tutorial would be over already.

But since example device counters doesn’t have interrupts, I am still not satisfied. What does it take to get the interrupt-less uio_pdrv working with device tree configuration?

  • We have to modify drivers/uio/uio_pdrv.c to add an open firmware device id compatible string for it to match a device specification in the device tree. We’ll use the unremarkable name “uio_pdrv”.
  • We also have to bring some code forward (and now, since I’m not a Linux kernel developer, this is getting dangerously close to cargo cult programming) into the uio_pdrv_probe() function to dynamically allocate a uio_info structure for the dynamically discovered uio_pdrv instance.
  • For symmetry, we’ll add the compatible name “uio_pdrv_genirq” to the uio_pdrv_genirq device; it may then be used interchangeably with the existing name “generic-uio”.
  • Here are the context diffs:
diff --git a/drivers/uio/uio_pdrv.c b/drivers/uio/uio_pdrv.c
index 72d3646..cc50312 100644
--- a/drivers/uio/uio_pdrv.c
+++ b/drivers/uio/uio_pdrv.c
@@ -14,6 +14,9 @@
 #include <linux/module.h> 
 #include <linux/slab.h>

+#include <linux/of.h>
+#include <linux/of_platform.h>
+
 #define DRIVER_NAME "uio_pdrv"

 struct uio_platdata {
@@ -28,6 +31,19 @@ static int uio_pdrv_probe(struct platform_device *pdev)
    int ret = -ENODEV;
    int i;

+   if (!uioinfo) {
+       /* devicetree -- alloc uioinfo for one device */
+       uioinfo = kzalloc(sizeof(*uioinfo), GFP_KERNEL);
+       if (!uioinfo) {
+           ret = -ENOMEM;
+           dev_err(&pdev->dev, "unable to kmallocn");
+           goto err_uioinfo;
+       }
+       uioinfo->name = pdev->dev.of_node->name;
+       uioinfo->version = "devicetree";
+       uioinfo->irq = UIO_IRQ_NONE;
+   }
+
    if (!uioinfo || !uioinfo->name || !uioinfo->version) {
        dev_dbg(&pdev->dev, "%s: err_uioinfon", __func__);
        goto err_uioinfo;
@@ -95,12 +111,23 @@ static int uio_pdrv_remove(struct platform_device *pdev)
    return 0;
 }

+#ifdef CONFIG_OF
+static const struct of_device_id uio_pdrv_of_match[] = {
+   { .compatible = "uio_pdrv", },
+   { },
+};
+MODULE_DEVICE_TABLE(of, uio_pdrv_of_match);
+#else
+# define uio_pdrv_of_match NULL
+#endif
+
 static struct platform_driver uio_pdrv = {
    .probe = uio_pdrv_probe,
    .remove = uio_pdrv_remove,
    .driver = {
        .name = DRIVER_NAME,
        .owner = THIS_MODULE,
+       .of_match_table = uio_pdrv_of_match,
    },
 };

diff --git a/drivers/uio/uio_pdrv_genirq.c b/drivers/uio/uio_pdrv_genirq.c
index 1f5ec28..3c7d8de 100644
--- a/drivers/uio/uio_pdrv_genirq.c
+++ b/drivers/uio/uio_pdrv_genirq.c
@@ -264,7 +264,8 @@ static const struct dev_pm_ops uio_pdrv_genirq_dev_pm_ops = {
 #ifdef CONFIG_OF
 static const struct of_device_id uio_of_genirq_match[] = {
    { .compatible = "generic-uio", },
-   { /* empty for now */ },
+   { .compatible = "uio_pdrv_genirq", },
+   { },
 };
 MODULE_DEVICE_TABLE(of, uio_of_genirq_match);
 #else

Now that we have configured UIO and added devicetree support to uio_pdrv, we are ready to rebuild the kernel and copy it to the SD card.

  • $ cd linux; make uImage LOADADDR=0x00008000
  • $ cp arch/arm/boot/uImage /media/BOOT

We can now use a nice simple devicetree .DTS for our zed_counters project:

/dts-v1/;

/include/ "zynq-zed.dtsi"

/ {
    counters@70000000 {
        compatible = "uio_pdrv";
        reg = < 0x70000000 0x1000 >;
    };
};
  • Counters is now a uio_pdrv device with physical addresses 0x70000000-0x70000fff.
  • No interrupt fakery required!
  • Put that in arch/arm/boot/dts/zynq-zed-counters.dts, rebuild the device tree blob, and copy it to the SD card. $ make zynq-zed-counters.dtb;
    $ cp arch/arm/boot/zynq-zed-counters.dtb /media/BOOT/devicetree.dtb
  • Together with the BOOT.BIN for our zed-counters design, we’re all set.
  • Safe-eject the SD card, insert it into the ZedBoard, and boot Linux!
  • There is no helpful console message confirming our counters device is initialized. That’s OK, we can see for ourselves.
  • $ ls -l /dev/uio0 => crw——- 1 root root 251, 0 Dec 31  1969 /dev/uio0
  • $ grep 251 /proc/devices => 251 uio
  • $ cd /sys/devices/70000000.counters/uio/uio0; cat name version uevent =>
    counters
    devicetree
    MAJOR=251
    MINOR=0
    DEVNAME=uio0

Now to test UIO access to our counters. Andersson’s GPIO UIO test application is worthy of your study, but is not ideal to exercise our two counters. Instead, below, I have modified it (quickly hacked it) into a trivial counters-specific test. First we check counters[0] increments on every read. Next we do a few timing tests with our free running counters[1].

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <sys/mman.h>
#include <fcntl.h>

#define MAP_SIZE 0x1000

void usage(void)
{
    printf("usage: test_counters -d <UIO_DEV_FILE>n");
}

int main(int argc, char *argv[])
{
    int c;
    int i;
    int fd = 0;
    char *uiod = 0;
    int value = 0;
    unsigned start, end;

    volatile unsigned *counters;

    while ((c = getopt(argc, argv, "d:io:h")) != -1) {
        switch(c) {
        case 'd':
            uiod = optarg;
            break;
        case 'h':
            usage();
            return 0;
        default:
            printf("invalid option: %cn", c);
            usage();
            return -1;
        }
    }

    if (!uiod) {
        usage();
        return -1;
    }

    /* Open the UIO device file */
    fd = open(uiod, O_RDWR);
    if (fd < 1) {
        perror(argv[0]);
        printf("Invalid UIO device file: '%s'n", uiod);
        return -1;
    }

    /* mmap the UIO device */
    counters = (volatile unsigned *)mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (!counters) {
        perror(argv[0]);
        printf("mmapn");
        return -1;
    }

    counters[0] = 1;
    assert(counters[0] == 1);
    assert(counters[0] == 2);
    assert(counters[0] == 3);

    counters[0] = 100;
    assert(counters[0] == 100);
    assert(counters[0] == 101);
    assert(counters[0] == 102);

    printf("wr-rd delay on free-running counter: ");
    for (i = 0; i < 20; ++i) {
        counters[1] = start = 0;
        end = counters[1];
        printf("%u ", end - start);
    }
    printf("n");

    printf("rd-rd delay on free-running counter: ");
    for (i = 0; i < 20; ++i) {
        counters[1] = 0;
        start = counters[1];
        end = counters[1];
        printf("%u ", end - start);
    }
    printf("n");

    munmap((void*)counters, MAP_SIZE);
    return 0;
}

Now let’s run it!

  • $ cc -o counters counters.c
  • $ ./counters -d /dev/uio0 => ./counters: Permission denied
  • $ ls -l /dev/uio0 => crw——- 1 root root 251, 0 Dec 31  1969 /dev/uio0 Oh, right. The uio_pdrv driver created the device file but made it inaccessible to non-superusers.
  • Fix that. $ sudo chmod 666 /dev/uio0
  • $ ./counters -d /dev/uio0 =>
    wr-rd delay on free-running counter: 13 14 13 14 13 14 14 13 13 13 14 14 13 14 13 14 14 …
    rd-rd delay on free-running counter: 14 14 14 15 14 14 15 14 15 15 14 14 14 15 14 14 15 …

It works! Here we see that counter[0] correctly increments on every read, and successive read accesses to the free-running counter[1] each take 13-15 cycles e.g. 130-150 ns.

Unfortunately each time I reboot, permissions on /dev/uio0 revert to -rw——. I don’t know how to set them from within the device driver. Any suggestions? Thank you.

This ends the tutorial. Thank you for reading along. I hope it is of help in your Zynq Linux work. Please leave me your comments here or via twitter @jangray. Good luck and happy hacking.

How to Design and Access a Memory-Mapped Device in Programmable Logic from Linaro Ubuntu Linux on Xilinx Zynq on the ZedBoard, Without Writing a Device Driver – Part One

Introduction

Last time we discussed how to run desktop Linaro Ubuntu Linux on the ZedBoard. In this post, and part two that follows, we’ll cover two different ways for application software to access a memory-mapped device implemented in Zynq’s programmable logic fabric.

Memory-mapped device access is straightforward in a “standalone” “bare-metal” application. You initialize a (volatile) pointer with the physical address of the memory-mapped device control/status register(s) and simply load and store to your device registers through that pointer.

In contrast, under Linux, user-mode processes run in a virtual memory environment in which all addresses are virtual addresses, mapped to physical addresses by the processor’s memory management unit. The Linux kernel owns the mapping of virtual addresses to physical addresses, and by default it provides no access (no valid mapping) to your device registers.

So to access your device under Linux, you must open it its device file and then establish a valid virtual memory mapping from some address range in your process to the underlying physical address range of your device registers. Optionally, if your device generates interrupts, your software must await the next interrupt and handle it as it occurs.

In this tutorial series we’ll discuss two ways to do this, neither of which require your writing or debugging Linux device drivers.

  1. Open /dev/mem (as root) and mmap a valid virtual address view onto your memory-mapped device registers.
  2. Configure a user-mode I/O device driver for your device, open /dev/your-device, and again mmap a valid virtual address view onto your memory-mapped device registers.

Again with the disclaimers: I am sharing what works for me. It may not work for you or it may fail over time. You may suffer data loss or worse. I disclaim all warranties and representations. I am not supporting this. I am not a Linux kernel hacker. This is not necessarily the best way, nor does it comprise best practices; for example a proper engineering methodology would include extensive bus functional simulation of our peripheral core and so forth. Your mileage may vary.

Key URLs

Cut to the chase: UIO: key lessons learned

Please review Andersson’s page. Here is what I have subsequently puzzled out to enable UIO on my ZedBoard (with device tree configuration and the ADI / Xilinx Linux git tree).

  • Of the various UIO drivers in linux/drivers/uio/*.c in the Xilinx Linux git tree, only uio_pdrv_genirq can work out of the box with device tree configuration.
  • It is possible, if inelegant, to use that driver with a device that does not issue interrupts.
  • Device tree interrupt assignments are a little wonky.
  • It is also possible to use the non-interrupt version uio_pdrv if you bring forward some code from uio_pdrv_genirq.c, and add a “compatible” key for it to probe (initialize) from device tree configuration.
  • The UIO drivers are not built by default. You have to explicitly configure the Linux kernel build to build them.

The details are in the part two of this series.

Building a new system with a loadable up-counters device

So to demonstrate all of this we’ll build a memory-mapped up-counters device in the PL fabric and access it from Linux. First we’ll build the hardware. Start with the prerequisites, system, and tools we used earlier.

  • Run Xilinx Platform Studio. Select Create New Project Using BSB Wizard. Project file “zed_counters”. AXI system. OK.
  • In Base System Builder — AXI flow — Create a System for the Following Development Board, Select Board Vendor Avnet. Select Board Name ZedBoard. Next.
  • In Base System Builder — AXI flow — Peripheral Configuration, Remove the preselected buttons, LEDs, and switches. Remove. Remove. Remove. Finish.
  • In XPS, select the Bus Interfaces tab. All you have is the processing_system7_0. What a mouthful. Select its name, rename it “ps”. That’s better.

Now let’s create a loadable up-counters peripheral.

  • In the IP Catalog pane to the left, click on the rightmost button of eight called “Create and Import Peripheral” to invoke the Create and Import Peripheral Wizard. Next.
  • In Create and Import Peripheral Wizard — Peripheral Flow, choose default radio button Create templates for a new peripheral. Next.
  • In Create Peripheral — Repository or Project, I recommend you use or start a repository for your core(s) that you can share across projects. Click the first button and chose a new repository location. Next.
  • In Create Peripheral — Name and Version, chose a name such as “counters”. Next.
  • In Create Peripheral — Bus Interface, chose AXI4-Lite. Next.
  • In Create Peripheral — IPIF (IP Interface) Services, select only User logic software register. Next.
  • In Create Peripheral — User S/W Register, specify 2 software accessible registers. Next.
  • In Create Peripheral — IP Interconnect (IPIC), leave all defaults. Next.
  • In Create Peripheral — (OPTIONAL) Peripheral Simulation Support, leave Generate unselected. Next.
  • In Create Peripheral — (OPTIONAL) Peripheral Implementation Support, select Generate stub ‘user_logic’ in Verilog. Next.
  • In Create Peripheral — Congratulations!, Finish.
  • You now have a peripheral named counters which is a pair of simple read-write 32-bit registers.
  • Let’s modify that to make it more interesting. We’ll make the first register, at offset 0, automatically up-count every time it is read; and we’ll make the second register, at offset 4, a free-running counter, which up-counts every bus clock cycle, which is (by default) 100 MHz.
  • Edit Repository/MyProcessorIPLib/pcores/counters_v1_00_a/hdl/verilog/user_logic.v and change these two lines of code:
*** user_logic.v.0
--- user_logic.v
***************
*** 163,170 ****
                if ( Bus2IP_BE[byte_index] == 1 )
                  slv_reg1[(byte_index*8) +: 8] <= Bus2IP_Data[(byte_index*8) +: 8];
            default : begin
!             slv_reg0 <= slv_reg0;
!             slv_reg1 <= slv_reg1;
                      end
          endcase

--- 163,170 ----
                if ( Bus2IP_BE[byte_index] == 1 )
                  slv_reg1[(byte_index*8) +: 8] <= Bus2IP_Data[(byte_index*8) +: 8];
            default : begin
!             slv_reg0 <= slv_reg0 + (slv_reg_read_sel == 2'b10); // incr. each read
!             slv_reg1 <= slv_reg1 + 1;                           // incr. each cycle
                       end
          endcase

Now let’s add these counters to our zed_counters project.

  • In the IP Catalog pane, under Peripheral Project Repository, click (expand) USER, and double-click COUNTERS.
  • The Add IP Instance to Design modal dialog appears. Select Yes.
  • In XPS Core Config, leave all defaults as-is and close the dialog.
  • The Instantiate and Connect IP modal dialog appears. Your “ps” processing system is preselected. OK.
  • In the Bus Interfaces pane you now have an axi_interconnect_1, a ps processing system, and a counters_0.
  • Select the Ports tab. Expand the design hierarchy. Observe all is wired up correctly. For example, the counters_0.S_AXI is connect to BUS axi_interconnect_1, and the counters_0.S_AXI.S_AXI_ACLK is connect to the ps::FCLK_CLK0. Great.
  • Select the Addresses tab. Change the counters_0 Base Address to 0x70000000 and the Size to 4K. Lock it. (Why? Just to keep all subsequent instructions in sync.)
  • We’re almost done. Let’s build it and ship it. Click Project >> Export Hardware Design to SDK.
  • The Export to SDK modal dialog appears. Ensure default “Include bistream and BMM file” is selected. Select Export & Launch SDK.
  • The XPS build of the design starts.
  • Celebrate the long cascade of chatty warnings. With Platform Studio, Xilinx continues its grand tradition of angst-inducing warnings which challenge the engineer to determine which are expected and benign, which are serious, and which are new since last build.
  • If all is well, the build finishes. Spend a moment to review the output. Notice our pair of 32-bit counters incurred >190 DFFs and >300 LUTs, much of which (I assume) is in AXI4 and IPIF interconnect overheads.

Aside: “standalone” application testing

Before we turn to building the software infrastructure needed to boot and access our counters from software under Linux, I should point out that you may wish to take a more deliberate crawl, walk, run approach to brining up your new PL hardware. In my case, back in October 2012, the first thing I did was to try to access my memory mapped hardware design by building a “bare metal” standalone C test application in the SDK. It used (physically addressed) pointer dereferences, and printf, to read and write to my device register and verify in this simple environment that all was working as intended.

Building the First Stage Boot Loader (FSBL) and BOOT.BIN

We’ll use the same process and tools as last time.

  • Once the XPS build finishes, it launches the Xilinx SDK. Its Workspace Launcher modal dialog appears. Browse to your zed_counters project and select its SDK directory. OK.
  • In Project Explorer tab, select zed_counters_hw_platform.
  • Double click system.xml. Review the zed_counters_hw_platform Address Map. There’s your counters_0, at address range 0x70000000-0x70000fff, as desired!
  • Select zed_counters_hw_platform. Run File >> New >> Application Project.
  • In New Project — Application Project, enter Project Name zed_fsbl. Keep defaults (e.g. standalone, C). Next.
  • In New Project — Templates, select Zynq FSBL. Finish. It builds automatically.
  • In Project Explorer tab, select zed_fsbl. Run Xilinx Tools >> Create Zynq Boot Image.
  • In Create Zynq Boot Image, the list of partitions in the boot image will already contain your zed_fsbl.elf and your system.bit PL configuration bitstream. Now you must add your u-boot.elf that you built last time. Click Add.
  • In Select a partition image file, enter the path to your u-boot.elf file. Open. It will be added as the third partition file in the BOOT.BIN image.
  • In Create Zynq Boot Image, note the output folder path, then click Create Image. Bootgen runs and builds you a BOOT.BIN, unfortunately called u-boot.bin.

Recap and to be continued…

So far we’ve built a new ZedBoard project from scratch. It has a pair-of-32-bit-counters peripheral in the programmable logic. Thanks to the XPS Base System Builder Wizard, its processing system is preconfigured with support for UART, GPIO, SD card, Quad SPI, USB, Ethernet, and 512 MB of DRAM. We’ve built an FSBL that sets this all up, loads the PL, and loads and launches u-boot; and we’re going to reuse the same good old u-boot.elf Linux boot loader from last time.

The tutorial continues in part two. Thank you for reading along. I hope it is of help in your Zynq Linux work. Please leave me your comments here or via twitter @jangray. Good luck and happy hacking.

FPGAs, Then and Now

On the left, from 1995, J32, one 32-bit RISC SoC in an XC4010. It had 20x20x2=800 4-LUTs (and 400 3-LUTs).

On the right, from 2013, 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T (which provides over 433,000 6-LUTs and 1470 BRAMs). A work in progress.

In other words, in the past 18 years Moore’s Law has taken us from 1K LUTs per FPGA to 1K 32-bit CPUs per FPGA.

1995: One 32-bit RISC SoC in an XC4010 --- 2013: 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T.

FCCM 2013 Panel: Reconfigurable Computing in the Era of Dark Silicon

At FCCM 2013, I was on a panel to discuss Reconfigurable Computing in the Era of Dark Silicon. If you haven’t heard of the Dark Silicon meme in the computer architecture community, I recommend you review Michael Taylor (UCSD)‘s slides from DaSi 2012.

It’s difficult to take these things out of context, but here for posterity’s sake are my position slides: Gray-Dark Silicon and HeMPPAAs. I emphasize that orders of magnitude energy efficiency improvements might be achieved by building workload-optimized computers in FPGAs using a HeMPPAA (heterogeneous massively parallel processor and accelerator arrays) architecture. I also propose infrastructure investments so that FPGA design in the large is much more like the software development experience.

Yet Another Guide to Running Linaro Ubuntu Linux Desktop on Xilinx Zynq on the ZedBoard

Disclaimers: I am sharing what works for me. It may not work for you or it may fail over time. You may suffer data loss or worse. I disclaim all warranties and representations. I am not supporting this. I am not a Linux kernel hacker.

Update (Sept 18, 2013): In the 8th comment to this article, commentor eactor notes these months-old instructions may be getting stale. He/she says “The kernel has evolved over the past month, I successfully used your instructions to build a running linaro (Kernel 3.8) with git checkout xcomm_zynq_3_8” — but not using git checkout xcomm_zynq. So until I can revisit this article and verify this, you may wish to try git checkout xcomm_zynq_3_8 below rather than git checkout xcomm_zynq.

Why run Ubuntu Desktop on ZedBoard?

After all, isn’t Zynq for embedded systems? Most don’t need desktop UIs. Isn’t cross development faster? And if you need a UI surely Android is a better choice?

Yes, it’s not fast. But it’s fast enough. I don’t trust it. But it seems stable. Certainly it stays up all day and runs complex software builds (e.g. OpenCV).

Mostly it’s the pinch myself factor. It is remarkable that using an FPGA you can boot, have a UI, get on the internet, easily take advantage of the universe of web services and local open source software. Also by living on ZedBoard you will learn the vagaries and limitations of the platform. Eating your own dogfood.

What we’re going to build

  • An XPS design for ZedBoard, including a PL bitstream system.bit for the HDMI interface, which we’ll export to the SDK.
  • An FSBL (first stage boot loader) using the SDK.
  • A u-boot.elf (Linux boot loader).
  • A BOOT.BIN which is the catenation of the FSBL, system.bit, and u-boot.elf using the SDK.
  • A Linux kernel named uImage.
  • A devicetree blob named devicetree.dtb.
  • A FAT32 partition on our SD card that comprises these files BOOT.BIN, uImage, and devicetree.dtb.
  • An ext4 partition on our SD card with the pre-built Linaro Ubuntu userland .
  • Some post-boot tweaks to the Ubuntu system.

Key URLs

Of related interest (feel free to skip this section). Just a history of things I’ve looked into.

Prerequisities

Prepare the SD card

  • I recommend you set aside the SD card that came with your ZedBoard and use another one for this project.
  • I’m using Sandisk 32 GB Extreme U1 45 MB/s SDHC card. Probably overkill. But some other people have reported weird I/O errors to lesser cards.
  • Follow exactly the instructions on the ADI page.

“Preparing the SD Card

To boot the system on the ZED or ZC702 board you’ll need a SD memory card. The SD card should have at least 4 GB of storage and it is recommended to use a card with speed-grade 6 or higher to achieve optimal file transfer performance. … The SD card needs to be partitioned with two partitions. The first one should be about 40MB in size and the second one should take up the remaining space. For optimal performance make sure that the partitions are 4MB aligned. The first partition needs to be formatted with a FAT filesystem. It will hold the bootloader, devicetree and kernel images. The second partition needs to be formatted with a ext4 filesystem. It will store the systems root filesystem.”

  • I use $ sudo gparted. Then I safe-eject the SD card and reinsert it just to be safe.

Building the programmable logic hardware

  • I built this on Windows 7 x64 with ISE 14.4 and EDK.
    • Aside: I am tempted to try installing Xilinx tools on Ubuntu, but they are not supported. I already run Ubuntu, so I am disinclined to also run a CentOS VM just for this purpose. I could probably “make Ubuntu work” but if I have to file a webcase I don’t want to hear “oh you’re hosting on an unsupported OS.” It is lamentable that despite Ubuntu’s popularity and Long Term Support options, Xilinx does not support it. It is also lamentable that Xilinx tools do not yet support Windows 8. Windows Build Conference 2011 was 18 months ago. Windows 8 betas have been available since then. Windows 8 shipped 6+ months ago and there are more than 100 million Windows 8 seats out there. For some reason Xilinx is always slow to track Windows releases. It’s almost as if each release they want to “wait and see if this new Windows thing will catch on”.
  • Download and build the reference design for ZedBoard. From http://wiki.analog.com/resources/fpga/xilinx/kc705/adv7511 download http://wiki.analog.com/_media/resources/fpga/xilinx/kc705/cf_adv7511_zed_edk_14_4_2013_02_05.tar.gz
  • Unpack it $ tar xzvf …
  • Double click the system.xmp file in cf_adv7511_zed. This launches Xilinx Platform Studio.
  • Click Generate BitStream and grab a coffee head out for lunch work-ahead on the u-boot and linux kernel steps below, while you check back on progress on this step. Marvel at all the warnings. Wonder how an FPGA designer ever separates important warnings from ignored warnings. Lament how long it takes core generation and synthesis to get to MAP/PAR of the design!
  • Click Export Design. Select Export and Launch SDK. (Continued below.)

Build u-boot, the Linux boot-loader

  • Read http://www.wiki.xilinx.com/U-boot
  • Read http://www.wiki.xilinx.com/Build+U-Boot
  • Read http://www.wiki.xilinx.com/Fetch+Sources
  • Read http://www.wiki.xilinx.com/Install+Xilinx+Tools
  • Install x86 based ARM cross compile tools somewhere. I have /home/jan/CodeSourcery/Sourcery_CodeBench_Lite_for_Xilinx_GNU_Linux installed and at the front of my $PATH. I don’t remember where it came from and Xilinx has changed their wiki so it doesn’t refer to it, rather to Xilinx’s own cross-compiler tools. It may be if you have Linux based ISE 14.4 with EDK you already have them installed.
  • Fetch the source: $ git clone git://git.xilinx.com/u-boot-xlnx.git; cd u-boot-xlnx
  • Git-checkout the 14.4 version branch. Adapteva did this to preclude subsequent trunk checkins breaking things. Good idea. $ git checkout -b xilinx-v14.4 xilinx-v14.4
  • Set up cross-compile prefix: $ export CROSS_COMPILE=arm-xilinx-linux-gnueabi- (and ensure your ARM tools are on your PATH).
  • Edit include/configs/zynq_common.h to tweak the u-boot boot script so that it doesn’t try to load a ramdisk image from SD and doesn’t try to pass it to “bootm”:
+              "sdboot=echo Copying Linux from SD to RAM...;" 
+                              "mmcinfo;" 
+                              "fatload mmc 0 0x3000000 ${kernel_image};" 
+                              "fatload mmc 0 0x2A00000 ${devicetree_image};" 
+                              "bootm 0x3000000 - 0x2A00000" 
  • U-boot will therefore launch the kernel with bootm 0x3000000 – 0x2A00000 The all-important ‘‘ tells u-boot (and the kernel) there is no ramdisk filesystem (so don’t load one!) Instead it will use the bootargs from the devicetree.dtb and will use the Linaro ext4 filesystem /dev/mmcblk0p2 from the SD card, as desired. (See below.)
  • Configure: $ make zynq_zed_config
  • Build: $ make
  • The output file u-boot should be copied to your cf_adv7511_zed’s SDK project workspace directory, renamed u-boot.elf.

Build the boot image BOOT.BIN

  • This consists of the FSBL (first stage boot loader), the system.bit configuration bitstream, and the U-boot Linux boot-loader u-boot.elf. Follow exactly the instructions on the first ADI page:

“Build the boot image

To complete this step you need to have a u-boot image for the Zynq platform. Please refer to the Xilinx wiki on how to build such an image.   [[Rather — see above instructions.]]

The bootloader can be build with Xilinx SDK. In order to do so it is necessary to first export the HDL design from the Xilinx Platform Studio to the SDK, this is done by clicking the “Export to SDK” button in the Platform Studio GUI.

Export project to SDK: …

Once the project has been exported create a new FSBL project in the SDK. To do this right-click on the newly exported hardware platform specification in left “Project Explorer” panel and select “New > Project” from the popup menu. Select “Xilinx – Application Project” on first dialog page. On the second dialog page choose a name for the project (zynq_fsbl for example) and on the third page select “Zynq FSBL” template.

The project should build automatically. If not a manual build can be started by right clicking the newly created project in the left “Project Explorer” pane and selecting “Build Project” from the popup menu. After the project has been build it is time to generate the boot image. This is done by right clicking on the project in the left “Project Explorer” pane and selecting “Create Boot Image”. This will open up the bootgen wizard. The bootgen wizard needs three files:

The freshly build zynq_fsbl.elf binary
The system.bit bitstream
The u-boot.elf binary

Add these files to partitions list in the dialog, then select an output folder.

Clicking “Create Image” will now generate in the chosen location a new boot image for the target platform. The output *.bin file should be renamed “BOOT.BIN” and needs to be saved on the first partition of the SD-card .”  [[e.g. the BOOT partition mounted at /media/BOOT]]

Build the linux kernel, mostly according to the ADI instructions.

  • Fetch it. $ git clone https://github.com/analogdevicesinc/linux.git ; cd linux # and get another coffee.
  • Checkout a specific branch. $ git checkout xcomm_zynq
  • (Note: see Update at top of article. You may wish to use $ git checkout xcomm_zynq_3_8 instead.)
  • $ export ARCH=arm
  • $ export CROSS_COMPILE=arm-xilinx-linux-gnueabi-
  • Configure it. $ make zync_xcomm_adv7511_defconfig # yes, ‘zync’ with a ‘c’
  • To build the kernel uImage (e.g. a u-boot-image-prefixed kernel), the path to the (just built) U-boot tool ‘mkimage’ must be added to the $PATH variable, so that the kernel compilation process can find it.: $ export PATH=/……../u-boot-xlnx/tools:$PATH
  • Build it. $ make uImage LOADADDR=0x00008000
  • Copy it to your BOOT partition on your SD card. $ cp arch/arm/boot/uImage /media/BOOT

Build the device tree

  • Note your DTS file arch/arm/boot/dts/zynq-zed-adv7511.dts includes arch/arm/boot/dts/zynq-zed.dtsi with these bootargs:
// bootargs = "console=ttyPS0,115200 root=/dev/ram rw initrd=0x1100000,33M ip=:::::eth0:dhcp earlyprintk";
bootargs = "console=ttyPS0,115200 root=/dev/mmcblk0p2 rw earlyprintk rootfstype=ext4 rootwait devtmpfs.mount=0";
  • That’s good — Linux will mount your root file system from the second partition on your SD card.
  • Make the device tree .DTS file for the ZedBoard with ADV7511 support. $ make zynq-zed-adv7511.dtb
  • Copy it to the SD card as devicetree.dtb: $ cp arch/arm/boot/zynq-zed-adv7511.dtb /media/BOOT/devicetree.dtb

Install the root file system

  • We’ll basically follow the ADI instructions to fetch the Linaro Ubuntu root filesystem and install it on the ‘rootfs’ partition on your SD card. The only change is we will pick up a five months’ newer and fresher build of Ubuntu Precise. This helps minimize the number of post-boot package updates.
  • Download Linaro Ubtunu ARM rootfs  archive:
    $ wget https://releases.linaro.org/12.11/ubuntu/precise-images/ubuntu-desktop/linaro-precise-ubuntu-desktop-20121124-560.tar.gz
  • Extract the root filesystem onto the SD card. $ sudo tar –strip-components=3 -C /media/rootfs -xzpf linaro-precise-ubuntu-desktop-20121124-560.tar.gz binary/boot/filesystem.dir
  • At this point you should be all set. Your /media/BOOT partition contains BOOT.BIN, uImage, and devicetree.dtb. (Recall BOOT.BIN itself contains the FSBL, the system.bit config bistream, and u-boot.elf.) Your /media/rootfs contains a recent Linaro Ubuntu root filesystem.
  • Safe-eject the SD card. Install it in Zynq SD slot. Follow the rest of the ADI instructions:

Testing the system

“Once all of the previous tasks have been completed it is time to test the system. To do this inserted the SD-card into the board and power-up the board. After a few seconds the blue “DONE” LED should light up. This means that the bitstream has been successfully loaded and the system will now start to boot. It is also possible to connect to the serial console by using the on-board UART-to-USB bridge, this allows to monitor the boot process and view debug messages.

After another few seconds the monitor connected to the system will turn on and display the Linux mascot in the top left corner, after that the Ubuntu Desktop system will appear on the screen. The system is now ready to be used.”

  • I would wait an additional 30 seconds until booting finishes including checking packages.
  • You can also add a USB keyboard/mouse to use the desktop on the HDMI monitor.
  • I like to shut the system down cleanly: $ poweroff

Some post-boot tweaks to the Zynq Ubuntu system:

  • Follow the ADI instructions “Enable xf86-video-modesetting Xorg driver
  • Install the VNC and RDP servers. From serial port console or Terminal, $ sudo apt-get install xrdp . Then you will be able to boot and run headless. (You will still need to use the serial port console to interrogate the DHCP-assigned IP address ($ ifconfig) in order to connect via VNC or RDP to the ZedBoard).)
  • Add yourself as a user instead of using the linaro account. Unfortunately, using the desktop System Settings >> User Accounts always crashes on me. Instead I open a Terminal and use useradd.

Adding some “swap space”

  • The board “only” has 512 MB of DRAM and this can be a bit tight. For more headroom I always add a 1 GB rootfs swap file. Note: this may be a very bad idea in the long term! (I don’t think there is wear-leveling in the SD card interface.)
  • $ sudo bash
  • $ mkdir /var/cache/swap
  • $ dd if=/dev/zero of=/var/cache/swap/swapfile bs=1M count=1024
  • $ echo “/var/cache/swap/swapfile        none    swap    sw      0       0” >> /etc/fstab
  • ^D
  • Reboot.
  • Verify it’s working: $ swapon –s

Running headless — maximizing your available programmable logic

  • Here are the steps I took to build a new design with empty PL, and boot to desktop Ubuntu via RDP (Windows Terminal server, mstsc.exe) or VNC.
  • First build a new device tree source file. Copy the original device tree (arch/arm/boot/dts/zynq-zed-adv7511.dts) elsewhere and edit that to delete all the entries which pertain to PL-based control/status registers. Add it to arch/arm/boot/dts/Makefile. (TODO: provide an example DTS file.)
  • Rebuild it: $ make zynq-zed-no-pl.dtb; cp arch/arm/boot/zynq-zed-no-pl.dtb /media/BOOT/devicetree.dtb
  • Boot your ZedBoard. Notice your HDMI monitor is blank!
  • On your TeraTerm serial port console, get the DHCP IP address ($ ifconfig), and connect and login to your ZedBoard via RDP or VNC from your PC
  • Notice desktop Ubuntu works fine headless, even though you’re currently not using any of the configured I/O device cores (e.g. HDMI, audio, DMA) in the PL fabric.
  • (Caution: I last did this months ago, this description is from memory, I haven’t bothered to repeat it for this blog:)
    Now build an empty PL design. Go into XPS. copy your original ADI reference design elsewhere, then modify it to delete all PL-based cores without disturbing any settings/configurations in the PS configuration area. (The PS7 subsystem remains configured with DRAM, serial, USB, Ethernet MAC, etc.). Create a new Avnet ZedBoard XPS project with Base System Builder. Remove all the LEDs and switch peripherals you are offered. Don’t add any PL peripherals. Build. Export to SDK. Run SDK, build FSBL, then build a new u-boot.bin a.k.a. BOOT.BIN, with the new FSBL, this (empty PL fabric) system.bit, and your u-boot.elf.
  • Copy that to your SD card $ cp u-boot.bin /media/BOOT/BOOT.BIN. Reboot.
  • On your TeraTerm serial port console, get the IP address ($ ifconfig), and connect and login to your ZedBoard via RDP or VNC. All should work as before, even though you are not using any of the programmable logic fabric.

Next steps

  • In a later blog post, we’ll try this: Use XPS to build a design with some simple AXI pcores in the PL, export to SDK, build new FSBL, new BOOT.BIN. Copy to BOOT partition. Note the 32-bit address(es) of the memory-mapped control/status registers (CSRs) of the cores. Boot Ubuntu. Then directly read and write the core’s CSRs by opening /dev/mem (as root!) and using mmap to map the physical device address or region into the user processor. This is a hack, however.
  • Then we’ll see how we can configure and use UIO (user-mode I/O) to access these same CSRs in a cleaner way.
Visual C++ on Windows NT on Bochs on Linaro Ubuntu Linux on Xilinx Zynq on ZedBoard

Visual C++ on Windows NT on Bochs on Linaro Ubuntu Linux on Xilinx Zynq on ZedBoard