
Welcome, AMD Embedded MicroBlaze V

Congratulations to AMD Embedded on its debut of MicroBlaze V.

With RISC-V support now from AMD, Intel, Lattice, Microsemi, and others, FPGA vendors’ transition to RISC-V is (almost) complete. Still a few stragglers.

Related, here for your possible amusement is my IEEE FCCM 2019 Soft Processors Panel presentation, A Game of Soft Processors:


This panel, held the same night as Game of Thrones’ doomed Battle of Winterfell, anticipated a similar wipeout: all FPGA vendors’ proprietary soft processors would eventually fall to the RISC-V juggernaut. This has come to pass. This is good.

The past two decades have been an era of siloed, proprietary, fragmented, duplicative soft processor ecosystems. Use MicroBlaze (or Nios) IP and you were locked into Xilinx (or Altera) devices.

Now, looking ahead, FPGA vendors should still compete vigorously on the best, fastest, smallest processor cores and IP and the most productive development environments, but also (shocker!) they should work together to advance RISC-V standards and profiles for FPGA embedded systems SoCs, so that customer designs are more reusable and portable across platforms. This will be great for customers and great for the vendors, because overall there will be more IP and more solutions ready to run on their latest devices.

RISC-V-standards-based reuse and interop is a major theme and objective of the RISC-V International Soft Processor SIG. Please join us and let’s build a community to advance these standards.

Also, a RISC-V FPGA vendor and user community, together, might speak with a common voice, the better to be heard by a RISC-V consortium focused on ASIC design considerations. As RISC-V International undertakes new task groups to shape new standards that address new applications and market segments, the FPGA implementers’ perspective will help ensure that these new standards remain relatively feasible, and hopefully economical, to implement in FPGAs’ LUTs, BRAMs, and DSPs. This is another objective of the SIG:

Background and Motivation

4. FPGA RISC-V systems bring new opportunities and challenges, such as late customizability / fine-grained subsetting, novel memory systems and interconnects, accelerator integration, partial reconfiguration, and alternative arithmetic systems, that may not be a priority or relevant to ASIC implementations.

5. Proposed RISC-V ISA extensions may inadvertently induce circuit structures that are prohibitively expensive in certain FPGA devices or use cases. …

Goals and Scope

2. Represent FPGA implementation considerations within RISC-V TGs, acting as a resource for consultations and to monitor progress of RISC-V standards from the perspective of Soft CPUs.

Our poster presentation on Composable Custom Extensions and Custom Function Units for RISC-V

Next week, at the Spring 2022 RISC-V Week in Paris, we have a poster presentation on our ongoing work on Composable Custom Extensions and Custom Function Units for RISC-V.

Note: unfortunately I will not attend; other contributors will be on site to present and discuss the work.

Here is the poster (PDF) and here is the video narration of the poster (MP4).

Poster: Composable Custom Extensions and Custom Function Units for RISC-V

2GRVI Phalanx at Hot Chips 31 (2019): The First Kilocore RISC-V RV64I with High Bandwidth Memory

This week at Hot Chips 31 (2019) I am presenting a status update poster on the work-in-progress GRVI Phalanx Accelerator Kit: 2GRVI Phalanx: Towards Kilocore RISC-V FPGA Accelerators with HBM2 DRAM (PDF).

This is the debut of the FPGA-efficient 2GRVI (“too groovy”) RV64I processing element (PE) core, and of Phalanx support for FPGAs with HBM2 high bandwidth DRAM, first discussed last month.

The poster tells the story of the version-two redesign of GRVI Phalanx to take best advantage of HBM2 DRAM. It explains some V1 limitations, particularly FPGAs’ relatively low DRAM bandwidth, and shows how the advent of HBM2 FPGAs, such as the Xilinx VU37P and VU35P in the Alveo U280 and U50 accelerator cards, potentially with over 400 GB/s of memory bandwidth, fundamentally changes the utility and competitiveness of FPGA accelerators.

However, the Niagara of data that 30+ HBM2 memory channels can pour down on your head required changes to the PE and to the Phalanx SoC architecture to request and receive all that sweet sweet bandwidth. These changes include:

  • New 2GRVI latency-tolerant RV64I PE
  • New 64b cluster interconnect, 64b UltraRAM banks
  • New 32B/cycle split transaction pipelined NoC-AXI RDMA bridges
  • Add PCIe XDMA mastering (to an AXI-HBM channel)
  • Add many more Hoplite NoC ring columns

We discuss some of these below, others in another blog post to follow.

New 2GRVI latency-tolerant RV64I 64-bit RISC-V processing element

At just 320 LUTs/PE, the good old 2016-era 32-bit RV32I GRVI PE still has leading soft processor throughput per area. Its frugality made possible the first kilocore 32b RISC processor SoCs, but GRVI has three shortcomings:

  • its 32-bit address and data width, an awkward match to AWS F1’s up to 1.5 TB DRAM and to OpenCL kernels, which need to pass 64-bit pointers to global memory buffers, and one which wastes half of the bandwidth of 64-bit wide UltraRAM memory banks;
  • its 300-400 MHz Fmax — fast, but not fast enough; and
  • its too-simple scalar RISC microarchitecture, with blocking in-order loads.

Blocking loads are fine in a one-PE system with a tightly coupled BRAM memory, but in an 8-PE GRVI cluster a load can take five cycles there and back through the cluster interconnect to the UltraRAM cluster memory banks (which can be two long trips across one fifth of the width of the die). This is especially painful in a function epilog, reloading n callee-save registers, each load taking five cycles. Ugh.

The new RV64I 2GRVI PE tackles these problems: it provides 64-bit addresses and data, up to 550 MHz pipelined execution, and latency tolerance for loads and multi-cycle function units.

Using a busy-register scoreboard, a load does not stall the pipeline until/unless a subsequent instruction uses its still-busy destination register — so in a function epilog’s register reloads, or an unrolled block-copy loop, 2GRVI issues one load each cycle. The same mechanism enables concurrent execution and out-of-order completion of long-latency function units, using a to-be-proposed open Custom Function Unit interface.
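To make the mechanism concrete, here is a minimal busy-register scoreboard sketch (my illustration with invented signal names, not the actual 2GRVI RTL): one busy bit per architectural register, set when a load issues, cleared when its data returns, stalling decode only when an instruction reads or writes a still-busy register.

//////////////////////////////////////////////////////////////////////////////
// Minimal busy-register scoreboard sketch (illustrative only)
//
module scoreboard(
    input clk, rst,
    input load_issue,               // a load issues this cycle
    input [4:0] load_rd,            // its destination register
    input load_retire,              // load data writes back this cycle
    input [4:0] retire_rd,          // the register receiving the data
    input [4:0] rs1, rs2, rd,       // registers of the decoding instruction
    output stall);                  // stall decode on any still-busy register

    reg [31:0] busy;                // one busy bit per architectural register
    always @(posedge clk) begin
        if (rst)
            busy <= 32'b0;
        else begin
            if (load_retire)
                busy[retire_rd] <= 1'b0;
            if (load_issue && load_rd != 5'd0)  // x0 is never busy
                busy[load_rd] <= 1'b1;
        end
    end
    assign stall = busy[rs1] | busy[rs2] | busy[rd];
endmodule

Back-to-back loads to different registers never assert stall, so a function epilog issues one reload per cycle, exactly as described above.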

As with GRVI, the 64b 2GRVI PE optionally generates RTL that is obsessively and exquisitely technology mapped for Xilinx 6-LUT FPGAs. It also embraces Jan’s Razor: “In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.” This leads to a deconstructed PE architecture in which functions such as shifts, multiplies, even byte-aligning load/store memory ports, are factored out of the PE core so that multiple PEs share these occasional-use resources. This gets the 64-bit 2GRVI PE core down to just 400 LUTs, and the total area of a PE plus its share of a six-PE cluster’s function units, cluster interconnect, and 300b Hoplite router is about 700 LUTs.
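As a cartoon of this deconstruction (my sketch, not the actual 2GRVI cluster interconnect), two PEs might share one 64-bit barrel shifter through a trivial fixed-priority arbiter, with the losing PE stalling a cycle:

//////////////////////////////////////////////////////////////////////////////
// Jan's Razor, illustrated: two PEs share one occasional-use barrel shifter
//
module shared_shifter(
    input req0, req1,               // PE0 / PE1 request the shifter
    input [63:0] a0, a1,            // operands
    input [5:0] sh0, sh1,           // shift amounts
    input right0, right1,           // 1 = shift right, 0 = shift left
    output grant0, grant1,          // the ungranted PE stalls this cycle
    output [63:0] y);               // shared result bus

    assign grant0 = req0;           // fixed priority: PE0 wins ties
    assign grant1 = req1 & ~req0;

    wire [63:0] a  = grant0 ? a0  : a1;
    wire [5:0] sh  = grant0 ? sh0 : sh1;
    wire r         = grant0 ? right0 : right1;
    assign y = r ? (a >> sh) : (a << sh);
endmodule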

For its highest Fmax of 550 MHz, 2GRVI can implement a 4-stage pipeline with an initiation interval of one instruction/cycle, but a minimum ALU result latency of two cycles. This enables higher frequency SoC designs, but impairs CPI by 25% or so. To mitigate ALU result-use stalls and four-cycle taken branches, I’m also exploring two-way hardware multithreading. This will cost ~100 LUTs, ~80 of which double the physical register file to 64x64b, so it remains to be seen whether this is a net win in total throughput per area. We’ll see.
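Here is a minimal sketch of the two-thread idea (hypothetical; the actual microarchitecture may differ): issue the two threads in strict alternation, so each thread’s instructions are at least two cycles apart, hiding the two-cycle ALU result latency and softening taken-branch bubbles.

//////////////////////////////////////////////////////////////////////////////
// Two-way fine-grained multithreading front end sketch (illustrative only)
//
module thread_select(
    input clk, rst,
    input [31:0] pc0, pc1,          // per-thread program counters
    output reg tid,                 // thread issuing this cycle
    output [31:0] fetch_pc);        // PC to fetch this cycle

    always @(posedge clk)
        if (rst) tid <= 1'b0;
        else     tid <= ~tid;       // strict round-robin, one thread/cycle

    assign fetch_pc = tid ? pc1 : pc0;
endmodule

(The real cost is elsewhere: duplicating PCs and doubling the physical register file to 64x64b, hence the ~100 LUT estimate above.)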

In all, 2GRVI’s XLEN doubling, load latency tolerance, and higher Fmax mean that 2GRVI PE clusters have double or triple the total bandwidth to the cluster data RAMs of the older GRVI PEs in a GRVI cluster, using the same LUTs and UltraRAMs.

The following table compares and contrasts the two cores.

                                          GRVI                  2GRVI
Year                                      2015 Q4               2019 Q2
FPGA Target                               20 nm UltraScale      16 nm UltraScale+
RTL                                       Verilog               System Verilog
ISA                                       RV32I + mul* + lr/sc  RV64I + lr/sc (mul WIP; RV32I to come)
Area                                      320 LUTs              400 LUTs (not including barrel shifter)
Fmax / congested                          400 / 300 MHz         550 MHz / TBD MHz
Pipeline stages                           2 / 3                 2 / 3 / 4 (superpipelined)
Latency tolerance: out-of-order retire    —                     typical but optional
Latency tolerance: two hardware threads   —                     optional (WIP) (+100 LUTs)
Cluster, load initiation interval         5 cycles              1 / cycle
Cluster, load-to-use                      5 cycles              6 cycles / 3 thread-cycles (WIP)
Cluster, peak cluster RAM bandwidth       4.8 GB/s (300 MHz)    12.8 GB/s (400 MHz (WIP))

Phalanx redesign for HBM2 memory

The Phalanx “array of clusters, exchanging messages on a NoC” architecture has been redesigned for Xilinx UltraScale+ HBM2 devices such as the VU37P FPGA, with 32 256b @ 450 MHz hardened AXI-HBM controllers coupled to the two stacks (8 GB) of HBM2.

It is rather tricky to move data at up to 3.7 Tb/s to/from the AXI-HBM controllers at the base of the FPGA, from/to the various cores across the length and breadth of the device. A very fast, very wide soft NoC is the way forward, although at FPGA SoC frequencies (300-600 MHz) this requires many thousands of northbound and southbound nets. (The faster the NoC clock, the fewer nets required.)
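Back-of-envelope, to see why (my arithmetic, not from the poster):

    32 channels x 256b x 450 MHz ≈ 3.7 Tb/s ≈ 460 GB/s, each direction
    NoC at 450 MHz: 32 x 256 = 8192 die-spanning data nets each way
    NoC at 300 MHz: 8192 x 450/300 ≈ 12,288 nets each way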

Then other clock constraints must be considered. The older 32-bit GRVI PEs are too slow; the Hoplite NoC and UltraRAMs can run at 600 MHz, but the AXI-HBM controllers’ Fmax is 450 MHz. To avoid clock domain crossings (for now) we aim to run each component at 450 MHz. (It’s a work-in-progress, we’re not there yet.) Then a 15x15x256b Hoplite NoC will carry ~200 GB/s of read data and ~200 GB/s of write data between the HBM controllers and any FPGA clusters or I/O controllers. While not yet full peak VU37P HBM2 bandwidth, it is nevertheless a giant leap ahead for RISC-V multiprocessors and for FPGA accelerators.
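Sanity-checking that figure (again my arithmetic):

    15 bridge columns x 256b x 450 MHz ≈ 1.73 Tb/s ≈ 216 GB/s of read data, and as much again for writes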

So this redesign depends on three advances: 1) modifying the NoC’s X rings x Y rings topology to include at least twice as many die-spanning vertical Y rings; 2) designing a wide, deeply pipelined NoC-AXI RDMA bridge that can sustain writes and burst reads on back-to-back clock cycles, 256 bits per bridge per cycle, all day long; and 3) generally increasing the Fmax of every element of the SoC from 300 MHz towards 450 MHz.

At present the first two have been achieved. The 30×7 NoC of the 2017 Hot Chips demonstration is replaced here with a 16×15 NoC with an array of 15×15 PE clusters and a row of 15×1 NoC-AXI RDMA bridges, each coupled to two AXI-HBM bridges. This doubles the NoC bandwidth to the HBM2 bridges. Here’s the new system topology:

The poster presents two FPGA SoC design chip plots.

The first is a 1776 PE GRVI Phalanx, with (15×15-3) x 8 32-bit GRVI PEs. (It depopulates three clusters in the bottom right of SLR0, freeing up some LUTs needed for the ~15000 LUT PCIe XDMA logic.)

A 1776 PE GRVI Phalanx, comprising a 15×15-3 array of clusters of eight RISC-V RV32I GRVI PEs, 128 KB cluster RAM, and Hoplite router, plus 15 NoC-AXI RDMA bridges and 30 AXI-HBM bridges.

The second is a 1332 PE 2GRVI Phalanx, with 222 clusters of six 2GRVI RV64I PEs. To our knowledge this is the first operational kilocore 64-bit RISC SoC in any technology, and the first with HBM memory.

A 1332 PE 2GRVI Phalanx, comprising a 15×15-3 array of clusters of six RISC-V RV64I 2GRVI PEs, 128 KB cluster RAM, and Hoplite router, plus 15 NoC-AXI RDMA bridges and 30 AXI-HBM bridges.

A later blog post will drill down into this design, how the memory system works overall, and experiences working with the Xilinx AXI-HBM bridges.

Welcome Xilinx Project Everest

Xilinx Everest block diagram

Everest: A New Adaptive Compute Acceleration Stack

Start with this Xilinx presentation from Victor Peng, Xilinx CEO: Xilinx Vision and Strategy for the Adaptable World. (Dear Xilinx: please share a recording of this presentation.)

Blog and press roundup:

My take: Everest is a bold bet on Xilinx’s “data center first” strategy. I see Everest as Xilinx’s response to the present situation: its FPGAs beat GPUs on energy efficiency and integrated data center networking, but not on raw compute, and they significantly trail CPUs and GPUs in developer productivity, adoption, and appeal. Just “more of the same” FPGA device scaling and integration was never going to change that.

I think the key ideas and challenges for Everest are to:

  1. add or harden the compute resources for which GPUs have a competitive edge today (software programmable “engines”, interconnect, memory system);
  2. keep it scaling up throughout the 2020s;
  3. make it all much easier for software developers to use and to love.

If Xilinx succeeds, it stands to win share from rival computing platforms, enable and grow new markets, and capture value beyond mere device sales. It is thrilling to see the bantamweight Xilinx innovating furiously versus the Intel+Altera behemoth, with its potential advantages of scale and of platform and tools integration.

(Back in the heyday of Microsoft’s Parallel Computing Platform, circa 2008, our mission was “to deliver lovable parallel programming models and infrastructure” — that is, “to provide models, languages, tools, libraries, and frameworks that make it easier for mainstream software developers to successfully develop and ship software that scales up on new parallel hardware”. Here Xilinx’s job one is similarly to make their top-to-bottom technology stack lovable to cloud services developers and cloud operators — competitive with/superior to CPUs, MICs, GPUs, and ASICs, on criteria including throughput, efficiency, cost, developer appeal, agility, and time to market.)

With VU37P HBM2-in-package memory and its CCIX interface, Xilinx catches GPUs/APUs in DRAM-tier memory bandwidth and coherent shared memory host integration. But that doesn’t address FPGAs’ raw compute and productivity shortcomings. SDAccel, i.e. OpenCL-based software defined accelerator hardware, is a leap forward, but with each mind-numbing multi-hour-build design iteration it loses the hearts and minds of high performance software developers.

Software-first, software-mostly, massively parallel compute and accelerator FPGA overlays, such as GRVI Phalanx (and its Hoplite NoC), provide the minimum developer table stakes: the rapid turnaround and NDRange data parallel programming model of recompile-and-go GPUs. But even obsessively FPGA-optimized soft processor array overlays, with custom function units and accelerator cores, often cannot compete with full custom (e.g. GPU streaming multiprocessor) processing elements on cost, throughput, or energy efficiency. The FPGA cost of bit-granularity programmable gates and wires is too high when you are instantiating e.g. 1680 processors.

So to compete and win in data center acceleration in the next decade, Xilinx has no alternative but to complement its leadership in programmable logic + SRAM + DSP + SERDES, with massive throughput software programmable “engines” and the requisite scalable interconnect NoC and memory system. (“20x” 20 TOPS INT8 ML inference is indeed “massive”.)

The concept of a bit-oriented FPGA hybridized with word-oriented, massive-throughput software programmability is not new. The many thousands of Xilinx DSP blocks (*), tailored for efficient FIR filters and such, were always tantalizingly close to (but yet so far from) software programmability. Projects like iDEA show the promise and the disappointment of running C code on DSP blocks + BRAMs. ASICs such as the Ambric Am2045 MPPA, the Adapteva Epiphany, the Kalray MPPA, Picochip’s picoArray, and many other MPPA, CGRA, RAW, etc. projects, albeit more for embedded systems than the data center, were more than competitive with CPUs and GPUs, but ultimately did not disrupt the CPU-GPU-FPGA-ASIC status quo. Why? In part, they lacked the synergistic SoC integration with the rest of a heterogeneous ARM-MPSOC-FPGA that e.g. Zynq / Zynq UltraScale+ MPSOC enjoy. In part they did not sufficiently deliver the developer love. In part these technologies were advanced by companies that did not have the requisite breadth or scale or sales channels or deep pockets. I think Project Everest can succeed where they did not.

( (*) You know, for a few gates more, the DSP48 block could have been our generation’s Am2900 bitslice processor kit. The poor DSP block just needs a register file and a better result-forwarding mux network. For want of a “Mick and Brick” we must build our phalanxes out of LUTs better spent on custom accelerators.)

Until the recent emergence of massive data center workloads (data analytics, AI/ML/vision, video) and data center scale FPGA farms (Catapult v2) there was no compelling value proposition to move Xilinx or Altera to gamble expensive FPGA die area, power budgets, and huge tools and libraries investments on massive throughput programmability. But now Xilinx sees “data center first” as their business imperative for the 2020s.

It’s a Heterogeneous, Specialized, Networked, Adaptive Computing World

While we await answers to questions such as “so what are these HW-SW programmable engines?”, “how do you program them?”, “will my current workload run on them?”, “what is the memory system?”, “how do software and hardware elements compose?”, and “so what is the difference between UltraScale programmable logic and ‘next generation’ programmable logic?”, it is clear that “Xilinx FPGAs” of the future will complement programmable logic with diverse programmable engines and application processors.

And what is the “right” mix of hardened processing elements and programmable logic? It depends. In the wake of its Zynq MPSOC-FPGA offerings, Xilinx is set to combine many types of computational resources in one ultra heterogeneous device, combining (surely ARM) app processors and real-time processors with the new programmable engines and programmable logic.

Note that Xilinx’s mastery of multi-die “2.5D” packaging enables a flexible product mix of processing elements and programmable logic, composed on a hard network-on-a-chip (NoC) which presumably will span dies.

As we work to advance FPGA-efficient soft NoCs like Hoplite, we feel that hard NoCs complement, but do not replace, soft NoCs. For years to come, the installed base (including UltraScale+ VU9P, VU2xP, VU37P, Arria-10, Stratix-10, etc.) will require soft NoCs. Especially VU37P. And in a hard NoC device, there are even more resources for the fabric to reach and connect to: much more many-to-many communication. Metcalfe’s Law in the small.

A Leap Ahead on Performance and Efficiency

Xilinx heralds several Everest breakthroughs including an astounding 20x boost on AI compute and 4x on 5G communications. Some of this scale-up certainly comes from the transition from TSMC 16 nm to 7 nm technology nodes, but this time much of the improvement must come from architecture, in particular those new, mysterious programmable hardware engines.

In footnotes, Xilinx states the Everest 20x speedup is on an ML image recognition inference workload, versus a VU9P with 7000 DSPs at max performance. At max performance, on INT8-optimized ML inference, the latter can approach 20 TOPS, at 200+ GOPS/W. Does Everest achieve 400 TOPS? At 4 TOPS/W?? We’ll see.

Looking Ahead to 2020, and the Software Stack Challenge

On the hardware front, Xilinx’s new-device product engineering is executing well, quickly mastering new technology nodes and packaging innovations to ship new devices and tools. (For example, I found zero device or tools bugs in porting a dense, complex 1680-core GRVI Phalanx to VU9P ES1 silicon in 12/2016 — it just worked.) This bodes well for a rapid and successful development and rollout of these new 7 nm Everest devices.

But the impact and uptake of Everest depend to the greatest extent on the software stack. Xilinx has targeted several vertical domains — AI, video, genomics — with a familiar, successful model of prebuilt high-level frameworks, libraries, and tools stacks. Now Xilinx will have to prime the pump for Everest themselves. Xilinx, once The Programmable Logic Company, then The All Programmable Company: can they become a great software and software developer tools company too? Will they have the will and the scale to invest in new languages, compilers, debuggers, profilers, runtimes, libraries, and yet more libraries to compete for developer mindshare at the level of NVIDIA, Intel, or Microsoft?

Xilinx can, must, and will enable software developers in key market segments to harness these new programmable engines with turnkey software stacks. Developers will bring TensorFlow, ONNX, etc. models to Everest-enabled frameworks and run them instantly, without ever spinning a bitstream or editing a line of XDC or Verilog.

It follows that Xilinx and its domain partners will be very busy providing prebuilt solution stacks to their new customers. (Intel too can, must, and will pursue this strategy.) This also affords Xilinx and its partners an opportunity to accrue IP value up the software stack, selling accelerated software solutions priced at the value they bring to customers, no longer just selling silicon devices on LUTs/$ versus Intel and others.

My mission is to make it easier to compute with FPGAs. More than ever, that’s Xilinx’s mission too. It’s an exciting time in the FPGA world; once again the sky is the limit.

(For a stale but fun 2013 take on FPGAs in the data center, check out Reconfigurable Computing in the Era of Dark Silicon.)

Pynq as a High Productivity Platform for FPGA Design and Exploration

This spring, en route to bringing GRVI Phalanx to Amazon AWS EC2 F1, I had to develop new AXI4 Stream and AXI4 Memory Mapped interface bridges to GRVI Phalanx. Both the Xilinx IP catalog and the F1 shell extensively use AXI4 interfaces.

Fortunately for me it was also time to try out the terrific Pynq (Python on Zynq) development platform and its first board, Digilent PYNQ-Z1. (Just $65 academic!) Pynq is a brilliant synthesis of the Zynq FPGA-SOC, Ubuntu Linux for Zynq, Python, the massive Python library ecology, Jupyter notebooks, and new Python classes, IP, and software for bridging the Python world and the programmable logic fabric world.

I expected that Pynq and its browser-hosted Jupyter notebooks interface would be apt for learning and teaching modern FPGAs and FPGA applications, and for documenting and sharing work. I hadn’t expected that Pynq would also be a high productivity tool for experienced FPGA designers to more rapidly explore, evaluate, discover, and play. Using Pynq to explore, develop, and prototype saved me weeks of effort. It was particularly good for interactively writing tests and exploring new hardware corner cases in Python with <1 second turnaround.

Back in May I attended Pynq training presented by Xilinx’s Patrick Lysaght at FCCM 2017. In the morning session I put together this My Pynq experience slide deck (PDF) and Patrick was kind enough to let me present it as the impromptu lunch entertainment.

I know I have only scratched the surface of the Pynq platform, and I am fascinated by, and look forward to, the idea of delivering “Live, Runs Interactively in Hardware, Live User’s Guides” of my work as Jupyter notebooks.

Idea: the two platforms I am focusing upon this year are Pynq / PYNQ-Z1 and AWS EC2 F1. It would be wonderful to see a port of Pynq to F1, so I can also provide Live User’s Guides on F1.

Fellow FPGA designers, try Pynq. You’ll like it. Pynq makes exploring new FPGA ideas lightweight, fresh, fast, easy, fun again.

Thank you Patrick, Cathal McCabe, and team for bringing us an unanticipated but essential new tool.

GRVI Phalanx joins The Kilocore Club

The work-in-progress GRVI Phalanx massively parallel accelerator framework has been ported to the Xilinx Virtex UltraScale+ XCVU9P.

On Dec. 30, 2016, a design with 30 rows by 7 columns of clusters of 8 GRVI RISC-V cores + 128 KB CRAM (cluster RAM) + a 300-bit Hoplite NOC router — a total of 1680 cores and 26 MB of SRAM — booted up and tested successfully, running a message passing matrix multiply workload on all 1680 cores, in a XCVU9P-FLGA2104-2L-E-ES1 device in a Xilinx VCU118 evaluation kit.

This 1680 core GRVI Phalanx is the first operational kilocore RISC-V, the first kilocore 32b RISC in an FPGA, and the most 32b RISC cores on a chip in any technology.

1 core, 32 cores, 1680 cores — RISC-V scales up! A 1-core SiFive HiFive1, a 2x2x8=32-core GRVI Phalanx in a Digilent Arty / XC7A35T, and a 30x7x8=1680-core GRVI Phalanx in a Xilinx VCU118 / XCVU9P.

Here is the basic cluster tile architecture redesigned for UltraScale+ and its new 288 Kb UltraRAM jumbo-SRAM blocks. The present design includes 210 instances of this tile.

A GRVI cluster tile with 8 GRVI RISC-V cores, 128 KB multiported bank-interleaved shared cluster RAM, optional accelerators (here, none), message passing NOC interface, and a 300-bit wide Hoplite NOC router.

An example 1680 GRVI system implemented in a Xilinx Virtex UltraScale+ VU9P. This GRVI Phalanx comprises NX=7 x NY=30 = 210 clusters, each cluster with 8 GRVI cores and an 8-ported 128 KB cluster shared memory. The clusters are interconnected on a Hoplite NOC, with the Hoplite routers configured with 290b data payloads (including 32b address and 256b data), achieving a bandwidth of about 70 Gb/s/link and a NOC bisection bandwidth of 900 Gb/s. Each cluster can send or receive 32 B per cycle into the NOC. The GRVI Phalanx architecture anticipates a variety of configurable accelerators coupled to the processors, the cluster shared RAM, or the NOC.

An extended abstract with additional detail on this work has been submitted to, and hopefully will be presented at, the OLAF’17 workshop at FPGA’17.

Quick FPGA Hacks: Population count

The paper “Pipelined Compressor Tree Optimization using Integer Linear Programming” by Martin Kumm and Peter Zipf (University of Kassel), presented at FPL 2014, uses clever (Xilinx) slice-wide (four 6-LUTs + CARRY4) compressors to reduce as many as 12 input bits to 5 sum bits, for an efficiency of up to (12-5) / 4 = 1.75 bits/LUT. This reminded me of a nice application of compressors in FPGAs — population count, a.k.a. bit count, i.e. determining the number of one bits set in a word — which uses simpler LUT-based (no carry chain) compressors to good effect. I first learned of this approach in some Altera documentation. Using modern FPGAs’ 6-LUTs it is easy to build a population count of a 6-bit input bitvector [5:0] in one LUT delay. This is a “6:3 compressor”:

//////////////////////////////////////////////////////////////////////////////
// 6:3 compressor as a 64x3b ROM -- three 6-LUTs
//
module c63(
    input [5:0] i,
    output reg [2:0] o);    // # of 1's in i

    always @* begin
        case (i)
        6'h00: o = 0; 6'h01: o = 1; 6'h02: o = 1; 6'h03: o = 2;
        6'h04: o = 1; 6'h05: o = 2; 6'h06: o = 2; 6'h07: o = 3;
        6'h08: o = 1; 6'h09: o = 2; 6'h0A: o = 2; 6'h0B: o = 3;
        6'h0C: o = 2; 6'h0D: o = 3; 6'h0E: o = 3; 6'h0F: o = 4;
        6'h10: o = 1; 6'h11: o = 2; 6'h12: o = 2; 6'h13: o = 3;
        6'h14: o = 2; 6'h15: o = 3; 6'h16: o = 3; 6'h17: o = 4;
        6'h18: o = 2; 6'h19: o = 3; 6'h1A: o = 3; 6'h1B: o = 4;
        6'h1C: o = 3; 6'h1D: o = 4; 6'h1E: o = 4; 6'h1F: o = 5;
        6'h20: o = 1; 6'h21: o = 2; 6'h22: o = 2; 6'h23: o = 3;
        6'h24: o = 2; 6'h25: o = 3; 6'h26: o = 3; 6'h27: o = 4;
        6'h28: o = 2; 6'h29: o = 3; 6'h2A: o = 3; 6'h2B: o = 4;
        6'h2C: o = 3; 6'h2D: o = 4; 6'h2E: o = 4; 6'h2F: o = 5;
        6'h30: o = 2; 6'h31: o = 3; 6'h32: o = 3; 6'h33: o = 4;
        6'h34: o = 3; 6'h35: o = 4; 6'h36: o = 4; 6'h37: o = 5;
        6'h38: o = 3; 6'h39: o = 4; 6'h3A: o = 4; 6'h3B: o = 5;
        6'h3C: o = 4; 6'h3D: o = 5; 6'h3E: o = 5; 6'h3F: o = 6;
        endcase
    end
endmodule

A 36b popcount may be implemented using nine such 6:3 compressors and a small ternary adder. The first six compressors count the ones in each bundle of 6 input bits:

//////////////////////////////////////////////////////////////////////////////
// 36-bit bitcount
//
module pop36(
    input [35:0] i,
    output [5:0] sum);  // # of 1's in i

    wire [2:0] c0500, c1106, c1712, c2318, c2924, c3530, c0, c1, c2;

    // add six bundles of six bits
    c63 c0500_(i[ 5: 0], c0500);
    c63 c1106_(i[11: 6], c1106);
    c63 c1712_(i[17:12], c1712);
    c63 c2318_(i[23:18], c2318);
    c63 c2924_(i[29:24], c2924);
    c63 c3530_(i[35:30], c3530);

These transform the problem of adding 36 bits into that of adding six 3b vectors. Surprisingly, these too can be added with another round of compressors, by summing the [0], [1], and [2] bit positions separately.

    // sum the bits set in the [0], [1], and [2] bit positions separately
    c63 c0_({c0500[0],c1106[0],c1712[0],c2318[0],c2924[0],c3530[0]}, c0);
    c63 c1_({c0500[1],c1106[1],c1712[1],c2318[1],c2924[1],c3530[1]}, c1);
    c63 c2_({c0500[2],c1106[2],c1712[2],c2318[2],c2924[2],c3530[2]}, c2);

Now the total number of bits is (c0 * 2^0) + (c1 * 2^1) + (c2 * 2^2):

    assign sum = {2'b0,c0} + {1'b0,c1,1'b0} + {c2,2'b0};
endmodule

The ternary adder is sufficiently simple that synthesis maps it to two levels of pure LUTs (no carry chain). All totaled, this has a delay of four LUTs and runs at 370 MHz in a Kintex-7 -1 speed grade.  By adding a pipeline register between the compressors and the ternary add, the circuit is reduced to two LUT delays, and runs at 550 MHz.  By adding a pipeline register between each level of LUTs, the critical path is under 1.2 ns, which could run at 860 MHz. However, the minimum clock period for this device is 1.6 ns (625 MHz). Happy hacking!