Category Archives: FPGAs

About field programmable gate arrays.

Welcome, AMD Embedded MicroBlaze V

Congratulations to AMD Embedded on its debut of MicroBlaze V.

With RISC-V support now from AMD, Intel, Lattice, MicroSemi, and others, FPGA vendors’ transition to RISC-V is (almost) complete. Still a few stragglers.

Related, here for your possible amusement is my IEEE FCCM 2019 Soft Processors Panel presentation, A Game of Soft Processors:


This panel, the same night as the doomed Game of Thrones’ Battle of Winterfell, anticipated a similar wipe out: all FPGA vendors’ proprietary soft processors would eventually fall to the RISC-V juggernaut. This has come to pass. This is good.

The past two decades have been an era of siloed, proprietary, fragmented, duplicative soft processor ecosystems. Use MicroBlaze (or Nios) IP and you were locked in to Xilinx (or Altera) devices.

Now, looking ahead, FPGA vendors should still compete vigorously on best, fastest, smallest processor cores and IP, and most productive development environments, but also, shocker!, they should work together to advance RISC-V standards and profiles for FPGA embedded systems SoCs, so that customer designs are more reusable and portable across platforms. This will be great for customers and great for the vendors, because overall there will be more IP and more solutions ready to run on their latest devices.

RISC-V-standards-based reuse and interop is a major theme and objective of the RISC-V International Soft Processor SIG. Please join us and let’s build a community to advance these standards.

Also, a RISC-V FPGA vendor and user community, together, might speak with a common voice, better to be heard by a RISC-V consortium focused on ASIC design considerations. As RISC-V International undertakes new task groups to shape new standards that address new applications and market segments, the FPGA implementers’ perspective will help ensure that these new standards remain relatively feasible and hopefully economical to implement in FPGAs’ LUTs and BRAMs and DSPs. This is another objective of the SIG:

Background and Motivation

4. FPGA RISC-V systems bring new opportunities and challenges, such as late customizability / fine-grained subsetting, novel memory systems and interconnects, accelerator integration, partial reconfiguration, and alternative arithmetic systems, that may not be a priority or relevant to ASIC implementations.

5. Proposed RISC-V ISA extensions may inadvertently induce circuit structures that are prohibitively expensive in certain FPGA devices or use cases. …

Goals and Scope

2. Represent FPGA implementation considerations within RISC-V TGs, acting as a resource for consultations and to monitor progress of RISC-V standards from the perspective of Soft CPUs.

Summer 2023 RISC-V Composable Custom Extensions Update

In winter 2022 I reworked our 2019-2022 work-in-progress materials on RISC-V Composable Custom Extensions into the Draft Proposed RISC-V Composable Custom Extensions Specification.

We presented a poster (narration) on this work at the 2022 Paris RISC-V Summit. Then we stepped back, to work elsewhere, and waited to see if there were new interest or uptake in the work. Some but no groundswell!

I did some development work towards that first still-pending end-to-end composition demo. I added some example CFUs to the nascent CFU Zoo, including implementations of the three CFU-LI level adapters and a CFU Switch.

In June 2023, my colleague, Prof. Guy Lemieux of UBC, chair of the RISC-V International SIG-SOFT-CPU group, and another co-designer of the composable extensions work, presented a new talk From CCX to CIX: A Modest Proposal for (Custom) Composable Instruction eXtensions at the 2023 Barcelona RISC-V Summit.

SIG-SOFT-CPU was one of the original RISC-V Foundation SIGs. In 2019 our focus (and teleconf meetings) gradually narrowed to refining the composable custom extensions work, and so the other SIG charter items received short shrift. With the transition to RISC-V International, SIG-SOFT-CPU went dormant, but now in summer 2023, Guy is working to reboot the SIG. Currently he is driving a process to ratify a new Charter for the SIG.

The great big rename: goodbye Custom Interfaces and Custom Function Units (CFUs), hello Composable Extensions and Composable Extension Units (CXUs)!

For better or worse, the Custom Extensions spec has always used the invented, defined term Custom Interface to mean the abstract immutable interface contract of a composable custom extension. From the beginning I used this term because 1) it shone a spotlight on the all-important interface contract aspects of a custom extension, and 2) as homage to its inspiration, Microsoft COM Interfaces. Unfortunately it has only sowed confusion, between terms Custom Interface and Composable Extension, particularly because the spec also uses the (separate) concept interoperation interface extensively.

At last week’s online meeting I gave a 45 minute deeper dive into Composable Extensions for the SIG-SOFT-CPU members. When prepping for this talk, and giving it, I was frustrated that these particular monikers were not helping to convey concepts and understanding. It was time to say goodbye to custom interface. We needed a new term (not custom extension) to distinguish the category of composable custom extensions from current non-composable custom extensions and I chose composable extension for short.

This anticipates a RISC-V ISA world with standard extensions, composable extensions, and custom extensions; CX ::= composable extension.

Similarly we need a crisp term for the reusable modular hardware unit that implements a composable extension. Years ago I coined the term custom function unit for this type of tightly CPU-coupled synchronous hardware unit, because it implements the custom function instructions of composable custom extensions. But CFU is no longer apt and ideal. For one thing, an early fork of our group’s CFU and CFU logic interface design has become extensively used in the CFU Playground work. It seems tiring, hopeless, and pointless to try to wrest back the CFU moniker imprimatur from the CFU Playground folks for the present composition focused work.

Thus throughout the spec custom function unit (CFU) becomes composable extension unit (CXU). This is appealing because it captures its relationship to CXs: a composable extension unit is a core that implements a composable extension. Perfect. Also, and I can’t explain why, for me, CXU suggests a core that is different from, more flexible than, and more powerful than a mere CFU. (Beyond older CFUs, proposed CXUs enable uniform/automatic: composition, n state contexts, OS context switching, multicore CPU complexes, extension versioning, …)

Now with the spec thoroughly reworked it was also time to rework the talk.

New talk: Design and Rationale of the Draft Proposed RISC-V Composable Custom Extensions Specification

Here is our new talk Design and Rationale of the Draft Proposed RISC-V Composable Custom Extensions Specification (slides PDF) which explains why RISC-V needs standards-based composable extensions to bring order and reuse to the custom extensions wild-west. The talk details the design of the various interop interface standards proposed in the spec, and importantly, explains why they are the way they are.

I hope you enjoy it, and find it a helpful complement to the spec.

The last slide, our Call to Action, proposes the next step in the work, which is, working with the RISC-V International framework, that the SIG-SOFT-CPU SIG should sponsor two new RVI Task Groups, an ISA TG to standardize -Zicx (custom extension multiplexing) and a non-ISA TG to standardize CXU-LI (custom extension unit logic interface).

Onwards!

S3GA: Part 2: Implementation: RTL v1

“An FPGA is just a clever heap of multiplexers and mux select memory”

S3GA RTL v1 is now available on our S3GA github repo.

A 2000 LUT configuration, implemented with Openlane

Here is the first implementation of the S3GA<N=2048,M=8,K=4> design, which fits in the efabless Open Shuttle Program’s ~10 mm² caravel_user_project area (2.92mm x 3.52mm) for the 130nm Skywater PDK, as produced by the wonderful OpenLane tools.

GDS-II plot of the first version of the S3GA<2048> design, implemented in 130nm sky130 process.

Implementation plus signoff checks take about 7 hours. At peak the tools used 60 GB of RAM. There are 2 wee Magic design rule check failures to investigate. Some design stats: 141,000 DFFs, 14,000 mux4s and 19,000 mux2s. This first cut has a 9 ns clock period, so one full M=8-cycle pass takes 72 ns, for a logical clock rate of about 14 MHz.

Next: DFFRAM?

This design uses 256-4=252 eight 4-LUT logic blocks (LB8s) (see part one), each of which uses an 8x48b configuration memory — an 8-entry x 48b circular shift register.

I hope to replace the hundreds of 8x configuration memories with bespoke, placed (floorplanned) instances of DFFRAM macros. This should also make it more practical to add a LUT-RAM mode to LB8s, so they may also be used as true dual port 32×8 SRAMs and possibly as 2R1W-64×4 SRAMs. Thus each X32 tile could implement a 32×32 register file.

It should also be possible to replace some X128 tiles with 100% DFFRAM or OpenRAM based block RAM (BRAM) cells at >3x the density of S3GA LUT-RAM.

Onwards!

S3GA: A Simple Scalable Serial FPGA: Part 1: Beginnings

Note: this blog post is under construction. This notice will be removed when it is complete.

Introduction: a look back at S4GA

In winter 2021 I took the first edition of Matthew Venn’s superb Zero to ASIC course. My 300x300um project was S4GA: “a small simple slow serial” FPGA core, targeting ~0.1 mm² of the 130nm Skywater ASIC PDK, using the efabless Caravel harness and the Zero-to-ASIC multiproject framework. This core fit an N=40 4-LUT (four input lookup table) FPGA arranged in 5 clusters of M=8 LUTs, with a full cluster cycle time of M=8 cycles. A family tragedy interrupted this work and I did not finish it nor submit to MPW-1.

Block diagram of the N=40 M=8 K=4-LUT S4GA FPGA designed for the Zero-to-ASIC course (2021)
GDS plot (300um x 300um) of the Skywater130 implementation of the S4GA core, N=40 M=8 K=4-LUTs

Scaling way up with S3GA

Now I am reworking S4GA into a much larger (N>1000 LUTs) FPGA for the upcoming efabless MPW-8 run, closing Dec. 31, 2022. A full efabless Caravel user project area is over 10 mm², approximately 100x larger than this earlier N=40 design.

As this new FPGA core won’t be small, nor (in the spatial computation sense) slow, it is reborn as S3GA — simple scalable serial FPGA. (Pronounced “see-gah”, not “say-gah”.)

Follow the work in progress here: S3GA github repo.

“What do you mean by serial FPGA?” The FPGA evaluates N logical LUTs using only N/M real LUTs, over M cycles. A serial FPGA trades off latency (e.g., logical clock period) for greater capacity by sharing (amortizing) the gate and wiring area of lookup table input and output multiplexers (muxes).

For S3GA, one LUT Block of M=8 logical K=4-LUTs (“LB8”) contains eight 4-LUT contexts (each about 40b, see below), and one physical 4-LUT (including one or more logic block input muxes, four LUT input muxes, one 16:1 LUT output mux, and a flip-flop output mux).

The README of the old N=40 S4GA design is instructive. Take a look. For that design:

  1. N=40 and M=8, so across the whole design, this S4GA configuration evaluates 5 LUTs per cycle, or 40 LUTs per 8 cycles.
  2. The global interconnect is just a flat 64b bus that passes each cluster. Each cluster has G=2 64:1 muxes to select two global input nets per LUT per clock.

Contemplating S3GA, with perhaps 50x more LB8s and LUT outputs, this flat global interconnect architecture will not scale. For example, it could mean running 1000-bit buses past hundreds of LB8s and many hundreds of 1000:1 input muxes, and so forth. Such a flat design would also be over-provisioned with so many mm² of unneeded global wiring and super wide LUT input muxes.

Rent’s Rule, $T = t g^{p}$, estimates T (the number of external IOs) as a function of g (the number of gates), where t and p are constants, p<1, sometimes p≈0.5. It reflects that most of a circuit’s sub-circuits’ nets remain local to that sub-circuit, recursively so in a hierarchical design. For an FPGA architecture that hosts real world N-LUT digital circuits, $t N^{p}$ global nets, instead of N global nets, will usually suffice. Also, within any subset of M k-LUTs of a circuit, there are never more than Mk inputs and M outputs.
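As a rough numeric illustration of Rent's rule, the sketch below evaluates T = t·g^p for a few cluster sizes. The constants t=4 and p=0.5 are assumptions picked only for illustration; they are not measured S3GA parameters, and the function name is hypothetical.

```python
def rent_terminals(gates: int, t: float, p: float) -> float:
    """Rent's rule estimate of external terminals: T = t * g**p."""
    return t * gates ** p

# Illustrative constants only (assumed, not taken from the S3GA design).
T_CONST, P_CONST = 4.0, 0.5

for n_luts in (8, 32, 128, 512, 2048):
    estimate = rent_terminals(n_luts, T_CONST, P_CONST)
    print(f"{n_luts:5d} LUTs -> about {estimate:.0f} external nets")
```

With p below 1 the estimate grows far more slowly than the LUT count, which is the argument for not routing every LUT output globally.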

Connecting a hierarchy of clusters

Instead of a classic island-style FPGA of rows and columns of M k-LUT clusters and switchboxes, S3GA will be a hierarchy of LB8s, clusters of LB8s, etc., interconnected with a hierarchy of switches in a fat tree like interconnect. This should allow a relatively simple place-and-route CAD flow to transform synthesized, technology mapped K-LUT netlists into S3GA configuration bitstreams.

Anticipating a 2D physical design layout, we arbitrarily pick a sub-cluster branch factor of four, so that sub-clusters may be (recursively) floorplanned to NW, NE, SW, SE quadrants of each cluster.

Here’s a sketch of a N=512 logical LUT quadrant of such a design:

A diagram illustrating the hierarchical layout of a S3GA FPGA, comprising 8-LUT clusters at the leaves, 4 8-clusters composed via an X32 switch, four X32 switches composed by an X128 switch, four of those composed with an X512 switch.
A hierarchical N=512 LUT quadrant of a large S3GA FPGA

At the leaves of this quadrant are sixty-four LB8s, each labeled 8. LB8s (hence LUTs) only appear at the leaves of the graph.

Each cluster of four LB8s is composed by an X32 switch. The X32 switch routes some subset of its 32 LUTs’ outputs up to its parent X128 switch, accepts some inputs from its X128 switch, and routes them down to its four LB8s as appropriate.

At the next level up, the X128 switch accepts subsets of LUT outputs from its four child X32 switches, as well as other input nets from its parent X512 switch, and routes input subsets of all these nets down to its four child X32 switches.

At the top level of this diagram, the X512 switch accepts subsets of LUT outputs from its four child X128 switches, as well as other input nets from its parent X2048 switch (not shown), and routes input subsets of all these nets down to its four child X128 switches.

With this architecture, global LUT placement is (mostly) finding the minimum cut 4-partition of the netlist hypergraph, and repeating min cut 4-partitions recursively until each subpartition fits in an LB8. A design “fits” into the device if its LUTs fit into the total LB8 capacity of the device, of course, and if the size of each cut is not greater than the capacity of the corresponding inter-switch bus. Following a successful hierarchical partitioning, routing requires scheduling which LUTs to evaluate in which cycles, and filling in so many input and output mux selection tables.
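The "fits" test can be sketched in a few lines, assuming a placement is already given. This is not a min-cut partitioner: it only counts, for each cluster at each level, the nets that must leave that cluster, and compares the count with an upward capacity. The function names, the LB8 numbering (consecutive indices share a parent), and the capacity values are all assumptions for illustration.

```python
from collections import defaultdict

def cut_sizes(nets, leaf_of, levels=3, branch=4):
    """Count nets leaving each cluster.

    nets: list of (driver_lut, [sink_luts]); leaf_of: LUT name -> LB8 index.
    Level 0 clusters are LB8s; a level-k cluster holds branch**k LB8s.
    """
    cuts = defaultdict(int)
    for driver, sinks in nets:
        for level in range(levels + 1):
            span = branch ** level
            home = leaf_of[driver] // span
            # The net must route up if any sink lies outside the driver's cluster.
            if any(leaf_of[s] // span != home for s in sinks):
                cuts[(level, home)] += 1
    return dict(cuts)

def fits(nets, leaf_of, up_capacity, levels=3, branch=4):
    """True if no cluster drives more outbound nets than its assumed capacity."""
    cuts = cut_sizes(nets, leaf_of, levels, branch)
    return all(count <= up_capacity[level] for (level, _), count in cuts.items())

# Toy netlist: a->b stays inside LB8 #0; c->d goes from LB8 #0 to LB8 #5.
nets = [("a", ["b"]), ("c", ["d"])]
leaf_of = {"a": 0, "b": 0, "c": 0, "d": 5}
capacity = {0: 8, 1: 32, 2: 64, 3: 128}  # nets per level, illustrative only
print(cut_sizes(nets, leaf_of), fits(nets, leaf_of, capacity))
```

A real flow would also have to check inbound ("route down") capacity and then schedule LUT evaluations, which this sketch leaves out.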

Reflecting Rent’s rule, the switch hierarchy need not “route up” every LUT output. For example, an LB8 might have 8 outputs to its parent switch, and the X32 switch might have 32 outputs, but the X128 switch might output only 64 of those 128, and the X512 switch might output only 128 of those 512.

Similarly, for input routing, an X128 switch might “route down” a (different) 48-net subset of its input nets to each of its four child X32 switches. (Specific switch I/O bus width parameters are TBD.)

Floorplan

Here is an example floorplan for a N=2048 LUT S3GA device, reflecting the proposed recursive 2D 4-partitioning of the circuit. The first leaf cluster of four LB8s (32 LUTs) is replaced with a 32b IO block (“IOB32”).

Floorplan of a 2K LUT S3GA

Observe that a LUT output net from one LB8 to an adjacent LB8 at the same X32 switch need never leave that domain, whereas a LUT output in some X512 quadrant may be received as a LUT input in some other X512 quadrant, by ascending an X32, X128, and X512 switch, up to the top-level X2048 switch, then descending an X512, X128, and X32 switch, down to the receiving LB8.
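The up-then-down route described above is a walk to the lowest common ancestor in a 4-ary tree. Here is a small sketch, assuming the 256 LB8s of an N=2048 device are numbered 0..255 so that consecutive groups of four share an X32, groups of 16 share an X128, and so on. That numbering and the function name are assumptions for illustration.

```python
SWITCH_NAMES = ["X32", "X128", "X512", "X2048"]

def switch_path(src_lb8: int, dst_lb8: int, branch: int = 4) -> list[str]:
    """List the switches a net traverses from one LB8 to another."""
    if not (0 <= src_lb8 < 256 and 0 <= dst_lb8 < 256):
        raise ValueError("LB8 index out of range for an N=2048 device")
    if src_lb8 == dst_lb8:
        return []  # the net never leaves its own logic block
    up = []
    s, d = src_lb8, dst_lb8
    for name in SWITCH_NAMES:
        s //= branch
        d //= branch
        up.append(name)
        if s == d:  # reached the lowest common ancestor
            break
    # Descend through the same levels in reverse, excluding the apex.
    return up + up[-2::-1]

print(switch_path(0, 1))    # neighbours under one X32
print(switch_path(0, 255))  # opposite X512 quadrants
```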

A serial interconnect fabric

S3GA’s LB8’s bit serial nature economizes standard cells per logical LUT. More significantly, it enables a remarkably frugal programmable interconnect fabric.

Referring to the older N=40 S4GA block diagram figure above, the five M=8 LUT clusters are passed by a bus of 64 nets (24 FPGA inputs and 40 LUT outputs). But each of these 40 LUT output wires changes only once per M=8 cycles. These 40 LUT outputs could be conveyed over five wires by serially streaming out LUT outputs into the interconnect fabric. Now the LUT input muxes that select input values from these LUT outputs can be just a (5-1):1 mux instead of a (40-8):1 mux. Now transmitting all 128 LUT outputs from a X128 switch quadrant requires only 16 wires (over 8 clock cycles). Now selecting one of these 128 outputs requires a 16:1 mux instead of a 128:1 mux (i.e., five mux4_1s cells instead of 42+ cells).
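The wire and mux arithmetic above can be reproduced with two small helpers. The mux4 estimate used here, ceil((n-1)/3), is an assumption inferred from the cell counts quoted in this post (each mux4 eliminates three candidates); it is not taken from the S3GA RTL.

```python
def serial_wires(n_nets: int, m: int = 8) -> int:
    """Wires needed to stream n_nets values over m cycles."""
    return -(-n_nets // m)  # ceiling division

def mux4_count(n_inputs: int) -> int:
    """Approximate number of 4:1 mux cells in an n:1 mux tree."""
    if n_inputs <= 1:
        return 0
    return -(-(n_inputs - 1) // 3)

for nets in (40, 128):
    wires = serial_wires(nets)
    print(f"{nets} nets: {wires} serial wires, "
          f"{mux4_count(wires)} vs {mux4_count(nets)} mux4 cells")
```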

However, serial transmission of LUT outputs on shared wires significantly complicates LUT cycle scheduling. If LUT A in some LB8 has an output net that is an input of LUT B in an adjacent LB8, it is necessary to schedule the LUT B evaluation one cycle after LUT A, because that is the only clock cycle during which the A value is available on a LUT output wire. Use it or lose it! But if LUT B has a second input, LUT C, evaluated during a different clock cycle than LUT A, there is no clock cycle during which LUT A’s and LUT C’s outputs are simultaneously available as LUT B inputs. Drat. We’ll address this problem momentarily.

So across the entire hierarchical interconnect fabric, S3GA conveys all LUT output nets in the M=8 serial time domain, eliminating (M-1)/M of all wires and their mux trees.

The serial interconnect fabric hierarchy

This figure revisits the N=512 LUT quadrant hierarchy figure above, labeling the switch input and output ports with strawman port widths, for one vertical slice of elements through the hierarchy. Remember these port widths are for bit serial ports, and convey, across eight clock cycles, eight times as many nets as port wires. For example, the four 16-wire ports that are the inputs of the top level X2048 switch, carry 4x16x8 = 512 output nets from the four X512 quadrants.

A slice through the switch hierarchy, labeled with serial interconnect I/O bus widths.

This table summarizes the strawman switch port width parameters.

| Parameter | Value | Description |
|---|---|---|
| N | 2048 | no. of LUTs (incl. IOB32 as LUTs) |
| M | 8 | no. of logical LUTs per LUT (or, per logic block) |
| B | 4 | branching factor: no. of sub-clusters per cluster |
| O0 | 1 | level 0: LB8: no. of serial outputs |
| O1 | 4 | level 1: X32: no. of serial outputs |
| O2 | 8 | level 2: X128: no. of serial outputs |
| O3 | 16 | level 3: X512: no. of serial outputs |
| N4 | 64 | level 4: X2048: no. of serial nets |
| I3 | 24 | level 3: X512: no. of serial inputs |
| I2 | 12 | level 2: X128: no. of serial inputs |
| I1 | 6 | level 1: X32: no. of serial inputs |
| I0 | 9 | level 0: LB8: no. of serial inputs |

LB8 input deserialization

When a serial LUT output net is routed to and arrives at some LB8, it is selected as an LB8 input by one of I >=1 logic block input muxes, whether or not the net is an input for the logical LUT evaluation that cycle. This input net is buffered in one of I (M-deep) LB8 input buffers (serial-in parallel-out shift registers). This input net value may then be used as a LUT input during the next M LUT eval cycles of that LB8.
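To make the deserialization concrete, here is a minimal behavioral sketch of one M-deep input buffer as a serial-in parallel-out shift register. The class and method names are hypothetical, and which end of the register holds the newest bit is an assumption.

```python
class InputBuffer:
    """M-deep serial-in parallel-out shift register (behavioral model)."""

    def __init__(self, depth: int = 8):
        self.bits = [0] * depth

    def shift_in(self, bit: int) -> None:
        # The newest bit enters at index 0; the oldest falls off the end.
        self.bits = [bit & 1] + self.bits[:-1]

    def tap(self, age: int) -> int:
        """Read the bit captured `age` cycles ago."""
        return self.bits[age]

buf = InputBuffer()
buf.shift_in(1)              # a net value arrives on the serial wire
for _ in range(7):
    buf.shift_in(0)          # seven more cycles pass
print(buf.tap(7))            # the value is still readable, at age 7
buf.shift_in(0)
print(buf.bits)              # after M further shifts it has been shifted out
```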

In the baseline (bit parallel interconnect) S4GA N=40 M=8 LUTs design, there are G=2 global inputs per LUT cluster, using two 40:1 muxes (each 13 mux4_1s). Adopting a bit serial interconnect and LB8 input deserialization, there are two 8:1 LB8 input muxes, two 8x1b LB8 input buffers, and two (8+8+…):1 LUT input muxes (5+… mux4_1s), each selecting a LUT input from the LB8 input buffers.

The base S4GA design requires M=8 x G=2 x clg(40) bits = 96b of configuration data (where clg(n) is the ceiling of log2 n) to select two LUT inputs from the 40 bit parallel LUT outputs, whereas the bit serial version requires M=8 x I=2 x clg(5) bits = 48b of configuration data (LB8 inputs) plus M=8 x I=2 x clg(8+8+…) = 80b of config data for two LUT input selectors (LUT inputs), or 128b in total. That is a 33% increase over 96b, which is worth it to mitigate LUT cycle scheduling constraints.
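The configuration bit counts above can be checked with a few lines. The LUT input selector width is taken as 5 bits because that is what the 80b figure implies; the exact candidate count is elided in the text, so that constant is an assumption.

```python
def clg(n: int) -> int:
    """Ceiling of log2(n) for n >= 1, using integer arithmetic."""
    return (n - 1).bit_length()

M = 8  # logical LUTs per logic block

# Bit parallel baseline: two global input selectors over 40 LUT outputs.
parallel_bits = M * 2 * clg(40)

# Bit serial version: LB8 input selectors plus LUT input selectors.
lb8_input_bits = M * 2 * clg(5)
LUT_INPUT_SEL_BITS = 5  # assumed: implied by the 80b figure quoted above
lut_input_bits = M * 2 * LUT_INPUT_SEL_BITS
serial_bits = lb8_input_bits + lut_input_bits

print(parallel_bits, lb8_input_bits, lut_input_bits, serial_bits)
```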

Level 0: M=8 logic block (LB8)

This figure shows the general structure of the M=8 I=3 K=4 logic block (“LB8”), now with block input muxes and buffers.

Architecture of LB8: M=8 I=3 K=4 LUT block

(Not shown: configuration logic, half-LUT cascade, D-FF clock enables and set/reset, and 32×8/64×4/128×2/256×1 true dual port RAM mode.)

At left, the block is passed by six global wires and four local wires. Global wires are serial input nets from afar, received via the block’s X32 switch. Local wires are the serial output nets of the four LB8s in this X32 cluster.

Each cycle the block inputs three of these nets, as selected by three 3b LB-IMUX selector fields. These three inputs are buffered in three 8b IBUF shift registers. Together these 24b plus the most recent 8 LUT outputs (the OBUF shift register) are the 32 possible input nets for the 4-LUT.

Next, four 5b IMUX selector fields select four of these 32 inputs as the 4-bit LUT input. This selects one of the 16 bits of the LUT mask as the LUT combinational output.

Then, the final output of the LUT net is determined by combinational output, the previous value of the LUT output (i.e., the M=8 shift-out bit of OBUF), the global reset input (not shown), the LB8 clock enable (CE) input (not shown), and the LUT’s 3b FF control field.

The new value of the LUT output net is captured in the OBUF shift register, and is output via a local wire to the other LB8s in this cluster, and to the X32 switch.
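The combinational part of one LUT evaluation can be modeled as below: gather the 32 candidate nets, let four 5b selectors pick four of them, and index the 16b LUT mask. The ordering of candidates (three IBUFs, then OBUF) and the bit order of the LUT index are assumptions, and the FF control, reset, and clock enable logic is deliberately left out.

```python
def eval_lut(mask: int, imux_sel, ibufs, obuf) -> int:
    """Return the combinational output of one 4-LUT evaluation.

    mask: 16-bit LUT truth table; imux_sel: four selectors in 0..31;
    ibufs: three 8-bit input buffers; obuf: the 8 most recent LUT outputs.
    """
    candidates = [bit for buf in ibufs for bit in buf] + list(obuf)
    assert len(candidates) == 32 and len(imux_sel) == 4
    index = 0
    for position, sel in enumerate(imux_sel):
        index |= candidates[sel] << position  # selector 0 drives index bit 0
    return (mask >> index) & 1

AND4 = 0x8000  # only index 15 (all four inputs high) yields 1
ibufs = [[1, 0, 0, 0, 0, 0, 0, 0], [0] * 8, [0] * 8]
obuf = [1, 0, 0, 0, 0, 0, 0, 0]
print(eval_lut(AND4, [0, 0, 24, 24], ibufs, obuf))
print(eval_lut(AND4, [0, 1, 24, 24], ibufs, obuf))
```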

In an N=2048 LUT S3GA, 256 LB8s – 4 LB8s (IOB32) require about 252 x (8×48 + 32 + 4) = 252 x 420 = 105,840 configuration bits.

LB8 half-LUT cascade

Half-LUT cascade is a simple local optimization to implement adders efficiently. An n-bit ripple carry adder comprises a vector of n full adder cells. Each cell adds a[i], b[i], and carry[i-1], producing sum[i] and carry[i]. When this is technology mapped to a pure 4-LUT FPGA, it requires 2n LUTs for the 2n LUT outputs (sum[n-1:0] and carry[n-1:0]).

S3GA LB8s provide half-LUT cascade to implement n-bit adders in n LUTs. S3GA provides a simple fracturable LUT, treating a k-LUT as two (k-1)-half-LUTs of the same k-1 inputs. So in addition to the usual full k-LUT output, which can be routed anywhere in the S3GA, an additional half-LUT (i.e., (k-1)-LUT) output, using the k-1 LSBs of the LUT input index, is also evaluated and registered in a “half_q” register for possible use as the half-LUT cascade-in in the next tick.

For example, when k=4, LUT inputs { 1’b1, half_q, b[i], a[i] } can index an adder LUT mask that produces sum[i] as the output of the full LUT (i.e., the upper half LUT) and carry[i] as the output of the lower half LUT. To select a half-LUT cascade input and half-LUT evaluation, a LUT’s k-1 and k-2 LUT input selectors are set to all-ones. The first special selector selects a constant 1; the second selects the current value of half_q.

To facilitate wide adders (up to n=32b in 32 LUTs / 4 LB8s), starting in any LB8, the half-LUT output of LUT #7 in each LB8 cascades into the half-LUT input of LUT #0 in the next LB8 in the four LB8-cluster. For those of you (properly) thinking serially, that means, in tick M-1=7, that the half-LUT output of each LB8 is registered as the next half-LUT input, for (pending) tick 0, in each next LB8 in the cycle (LB8 #0, #1, #2, #3, #0, …). Of course, unless adder sub-segments are pipelined, a ripple-carry 32b add still has a latency and an initiation interval of 32 ticks = 4 tocks.
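Here is a behavioral sketch of the half-LUT cascade adder, assuming the LUT index is { 1, half_q, b[i], a[i] } with a[i] as the least significant bit, that the lower half of the mask holds carry and the upper half holds sum, and that half_q starts at 0 as the carry-in. Function names are hypothetical; this models the arithmetic only, not the S3GA RTL.

```python
def adder_mask() -> int:
    """Build a 16-bit LUT mask: bits 0..7 = carry, bits 8..15 = sum."""
    mask = 0
    for idx in range(8):
        a, b, cin = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        carry = (a & b) | (a & cin) | (b & cin)
        total = a ^ b ^ cin
        mask |= carry << idx          # lower half-LUT output
        mask |= total << (8 + idx)    # upper half, reached when the MSB input is 1
    return mask

def serial_add(a: int, b: int, n: int):
    """Add two n-bit values one LUT evaluation (tick) per bit."""
    mask = adder_mask()
    half_q = 0  # assumed carry-in of 0
    result = 0
    for i in range(n):
        low = (half_q << 2) | (((b >> i) & 1) << 1) | ((a >> i) & 1)
        result |= ((mask >> (8 | low)) & 1) << i  # full LUT output: sum[i]
        half_q = (mask >> low) & 1                # half-LUT output: carry[i]
    return result, half_q

def latency_tocks(n_bits: int, m: int = 8) -> int:
    """Tocks (M-tick passes) needed for an unpipelined n-bit ripple add."""
    return -(-n_bits // m)

print(hex(adder_mask()), serial_add(13, 7, 8), latency_tocks(32))
```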

TODO: illustrate half-LUT carry adders with a figure and LUT netlist.

Level 1: Composing four LB8s with a (transparent) X32 switch

Four LB8s composed by a (transparent) X32 switch

In an N=2048 LUT S3GA, 64 X32s require zero configuration bits and zero muxes.

Level 1: IO block (IOB32)

The first (and only, for the time being) “X32 switch cluster” of four LB8s is replaced with an IOB32 block. This interfaces the serial interconnect S3GA with the greater parallel interconnect SoC. It consists of a configurable input crossbar selecting up to 32 inputs into a 4b serial output bus, plus a configurable output crossbar, receiving a 6b serial input bus and from that selecting and registering up to 48 parallel outputs.

Level 2: Composing four X32s with an X128 switch

Architecture of an X128 switch

In an N=2048 LUT S3GA, sixteen X128s require 16 x (8x8x4 + 4x8x6x5) = 16 x 1,216 = 19,456 configuration bits and 16 x (8×16:1 + 4x6x24:1) muxes = 16 x (8×5 + 24×8) mux4_1s = 16 x 232 mux4_1s = 3,712 mux4_1s.

Level 3: Composing four X128s with a X512 switch

Architecture of an X512 switch

In an N=2048 LUT S3GA, four X512s require 4x (8x16x5 + 4x8x12x6) = 4 x 2,944 = 11,776 configuration bits and 4 x (16×32:1 + 4x12x48:1) muxes = 4 x (16×11 + 48×16) mux4_1s = 4 x 944 mux4_1s = 3,776 mux4_1s.

Level 4: Composing four X512s with a X2048 switch

Architecture of an X2048 switch

In an N=2048 LUT S3GA, one X2048 requires 4x8x24x6 = 4,608 configuration bits and 4x24x48:1 muxes = 96×16 mux4_1s = 1,536 mux4_1s.
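The per-level configuration and mux budgets quoted in these sections can be tallied with a short script. The mux counts and widths are copied from the formulas above; the helper names and the ceil((n-1)/3) mux4 estimate are assumptions chosen because they reproduce the quoted figures. IOB32 configuration is not counted.

```python
M = 8  # serial cycles per pass

def clg(n: int) -> int:
    return (n - 1).bit_length()  # ceiling of log2(n)

def mux4_count(n: int) -> int:
    return -(-(n - 1) // 3)      # assumed estimate of mux4 cells in an n:1 mux

def switch_cost(up_muxes, up_fanin, down_muxes, down_fanin):
    """Return (config bits, mux4 cells) for one switch."""
    bits = M * (up_muxes * clg(up_fanin) + down_muxes * clg(down_fanin))
    cells = (up_muxes * mux4_count(up_fanin)
             + down_muxes * mux4_count(down_fanin))
    return bits, cells

# name -> (instances, per-switch cost); widths are taken from the text above
SWITCHES = {
    "X128": (16, switch_cost(8, 16, 4 * 6, 24)),
    "X512": (4, switch_cost(16, 32, 4 * 12, 48)),
    "X2048": (1, switch_cost(0, 1, 4 * 24, 48)),
}

lb8_bits = 252 * (8 * 48 + 32 + 4)
switch_bits = sum(n * cost[0] for n, cost in SWITCHES.values())
switch_cells = sum(n * cost[1] for n, cost in SWITCHES.values())

for name, (count, (bits, cells)) in SWITCHES.items():
    print(f"{name}: {count} x ({bits} bits, {cells} mux4s)")
print("total config bits:", lb8_bits + switch_bits)
print("total switch mux4s:", switch_cells)
```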

TO BE CONTINUED…

Our poster presentation on Composable Custom Extensions and Custom Function Units for RISC-V

Next week we have a poster presentation on our ongoing work on Composable Custom Extensions and Custom Function Units for RISC-V at Spring 2022 RISC-V Week in Paris.

Note, unfortunately I will not attend. Other contributors will be on site to present and discuss the work.

Here is the poster (PDF) and here is the video narration of the poster (MP4).

Poster: Composable Custom Extensions and Custom Function Units for RISC-V

Introducing Composable Custom Extensions and Custom Function Units for RISC-V

Today’s The Register article by Agam Shah, RISC-V takes steps to minimize fragmentation, discusses RISC-V International’s efforts to grapple with the RISC-V instruction set architecture’s growing pains as its diverse community strives to apply and optimize RISC-V to many different use cases.

There is a fundamental tension between development of new RISC-V optional standard extensions which must be non-proprietary, of broad interest and general utility, which may take years to reach consensus and ratification, and which consume a shared, limited resource (RISC-V encoding space and overall complexity), versus development of a custom extension, which may be the work of one party, in house, in one day, narrowly targeted, and/or proprietary. Both are valuable and necessary. However, RISC-V does not currently provide means to make these custom extensions, their hardware implementations, and their software libraries, reusable and interoperable, which silos solutions and fragments the ecosystem.

Imagine you could combine the guaranteed correct composition of the optional standard instruction set extensions, and the agility of custom extensions. For several years, a small group of RISC-V FPGA soft processor developers and users have met informally, on and off, working towards this vision.

We (see Preface for contributors) now have a Draft Proposed RISC-V Composable Custom Extensions Specification ready for public review and discussion. Even though it is still a work-in-progress we hope it helps inform discussion of this issue of RISC-V custom extensions and interoperation vs. fragmentation.

This first edition of the spec focuses on the prerequisite common HW-HW and HW-SW interfaces, formats, and metadata required to achieve robust automatic composition of custom extensions. Notably the spec includes a chapter on the Custom Function Unit Logic Interface, a HW-HW interface for composable custom function units that plug-and-play into different processors and systems. Beyond this spec, much work remains to define the software stack and software tooling above these interfaces, flesh out the Runtime, etc.

We just submitted a one page abstract for a poster which we hope may be presented and discussed at the upcoming 2022 Spring RISC-V Week in Paris. If the 50 page spec is daunting, this one page TLDR abstract attempts to provide a brief overview of some objectives and contributions of the work.

I recapitulate the poster abstract below. For more detail, see the spec.

UPDATE: the poster was accepted. Here is the poster (PDF) and here is a video narration of the poster: Composable Custom Extensions and Custom Function Units for RISC-V.

To get involved with this work, stay tuned, we will set up a public mailing list for discussions momentarily (and update this paragraph). Also please raise your spec concerns and suggestions in the Issues list. Thank you for your interest. Onwards!

Whenever we specify a new custom extension, implement it as a custom function unit, or target it as an accelerated library, let us do so using common, standardized interoperation interfaces so that it may “just work” with all RISC-V CPUs and the other standard and custom extensions.


Composable Custom Extensions and Custom Function Units for RISC-V (Poster Abstract submitted for 2022 Spring RISC-V Week)

Jan Gray (Gray Research), Tim Vogt (Lattice Semiconductor), Tim Callahan (Google), Charles Papon (SpinalHDL), Guy Lemieux (University of British Columbia), Maciej Kurc (Antmicro), Karol Gugala (Antmicro)

This poster introduces a draft specification for composable custom instruction extensions in RISC-V. The RISC-V custom instruction encoding space is unmanaged, leading to potential conflicts when combining different accelerators and their libraries into one system. This specification defines interop interfaces including a physical logic interface and CSRs that manage the composition of multiple, independently developed custom instruction extensions. Contributions include custom interface multiplexing and stateful but isolated state for multiple harts sharing multiple custom function units (CFUs).

Today, custom extensions don’t interoperate

SoCs may use app-specific hardware accelerators to improve performance and energy – particularly so with FPGA SoCs that offer plasticity and abundant spatial parallelism. The RISC-V ISA explicitly supports domain-specific custom extensions.

There are many RISC-V processors with custom instruction extensions, and now some vendor tooling. But the accelerated libraries that use these extensions and the cores that implement them are authored by different organizations, using different tools, and may not work together. Different custom extensions may conflict in use of opcodes, or their implementations may require different CPU cores, pipeline structures, logic interfaces, models of computation, means of discovery, context switching, or error reporting. Composition is difficult, impairing reuse of hardware and software, and fragmenting the RISC-V ecosystem.

Unleashing innovation in interoperable custom extensions

RISC-V International uses a community process to define a new standard extension to the RISC-V ISA. New extensions must be of broad interest and utility to merit allocation of precious RISC-V opcode space, CSR space, and generally to add to the enduring complexity of the platform. New extensions typically require years to reach consensus and ratification. Each coexists with all other extensions. Might any new custom extension also safely coexist (compose) with all extensions? Might there be a rich ecosystem of plug-and-play custom extensions? Yes!

Our proposed interop interfaces allow any party to rapidly define, develop, and use:

  • a custom interface (CI): a custom extension consisting of a set of custom function (CF) instructions,
  • a custom function unit (CFU): a composable hardware core that implements a custom interface,
  • an accelerated CI library that issues custom instructions,
  • a processor that can mix and match any CFUs (plural), and
  • tools to create and compose these elements into systems.
Composing packages of custom interfaces, CPU and CFU cores, and accelerated libraries, into systems

Custom interfaces, their CFUs and libraries, may be open or proprietary, even of narrow interest. Anyone can mint a new one. A new CPU core can use existing CFUs and libraries. A new interface, CFU, or library can be used by existing CPUs and systems. Many CFUs may implement a given custom interface, and many libraries may issue instructions of a custom interface.

Such composition requires routine integration of separately authored, separately versioned elements into stable systems that just work together, now, and over time, as elements evolve. To ensure composition does not change the behavior of any interface, interfaces’ state contexts are isolated: a CF instruction only accesses its source operands and its current state context.

Custom interface multiplexing

Custom interface multiplexing provides an inexhaustible, collision-free opcode space for custom instructions without any central opcode authority. Every new interface can use any or all of the custom-0/-1 opcode space. Each accelerated CI library, prior to issuing any custom instructions, calls a runtime to obtain that interface’s (CFU,state) selector value and write it to a new mcfu_selector CSR. This selects the hart’s current interface (and CFU core) and its current interface state context. Like the vector extension’s vsetvl instruction, an mcfu_selector write configures the behavior of custom instructions that follow.
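The selector mechanism can be pictured with a small software model. This is only a sketch: the `Hart` class, the `runtime_get_selector` helper, the two toy interfaces, and the tuple form of the selector are all hypothetical names and encodings chosen for illustration, not those of the draft specification.

```python
from dataclasses import dataclass, field

# Two hypothetical custom interfaces. Both use custom function ID 0,
# which would collide without multiplexing.
def popcount_cfu(cf_id, rs1, rs2, state):
    return bin(rs1).count("1")

def accumulate_cfu(cf_id, rs1, rs2, state):
    state["acc"] = state.get("acc", 0) + rs1 + rs2
    return state["acc"]

@dataclass
class Hart:
    cfus: dict                       # cfu_id -> CFU model
    mcfu_selector: tuple = (0, 0)    # (cfu_id, state_id); tuple form is an assumption
    states: dict = field(default_factory=dict)

    def write_mcfu_selector(self, selector):
        # Models a CSR write: configures the custom instructions that follow.
        self.mcfu_selector = selector

    def custom_instruction(self, cf_id, rs1, rs2):
        cfu_id, state_id = self.mcfu_selector
        # Each (CFU, state) pair gets its own isolated state context.
        state = self.states.setdefault((cfu_id, state_id), {})
        return self.cfus[cfu_id](cf_id, rs1, rs2, state)

def runtime_get_selector(registry, interface_name, state_id=0):
    # Hypothetical runtime call a library makes before issuing custom instructions.
    return (registry[interface_name], state_id)

registry = {"popcount": 0, "accumulate": 1}
hart = Hart(cfus={0: popcount_cfu, 1: accumulate_cfu})

hart.write_mcfu_selector(runtime_get_selector(registry, "popcount"))
popcount_result = hart.custom_instruction(0, 0b1011, 0)

hart.write_mcfu_selector(runtime_get_selector(registry, "accumulate"))
accumulate_result = hart.custom_instruction(0, 5, 7)
```

The same custom function ID is issued twice, yet it reaches a different CFU each time because the selector changed in between.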

HW-SW interface: issuing a custom function instruction using custom interface multiplexing ⇒ CFU logic interface

Custom function unit logic interface (CFU-LI)

A CPU executes a CF instruction by sending a CFU request to a CFU, carrying context IDs and operands. The CFU processes the request, may update its state, and sends a CFU response, which updates a destination register and the cfu_status CSR.

The CFU-LI defines standard signaling and metadata for combinational, fixed-latency, and variable-latency CFUs, so that CPU and CFU packages may be automatically composed.
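As a rough behavioral model of that request/response exchange, consider the sketch below. The field names, the status encoding (0 for success, nonzero for error), the OR-accumulation into `cfu_status`, and the `MacCfu` multiply-accumulate unit are assumptions made for illustration; the real CFU-LI is a hardware signaling interface.

```python
from dataclasses import dataclass

@dataclass
class CfuRequest:
    cfu_id: int     # context ID: which CFU
    state_id: int   # context ID: which state context
    cf_id: int      # which custom function
    rs1: int        # source operand values
    rs2: int

@dataclass
class CfuResponse:
    data: int
    status: int     # 0 = ok, nonzero = error (assumed encoding)

class MacCfu:
    """Hypothetical stateful CFU: one accumulator per state context."""
    def __init__(self, n_states):
        self.acc = [0] * n_states

    def handle(self, req):
        if req.state_id >= len(self.acc):
            return CfuResponse(0, 1)
        if req.cf_id == 0:      # multiply-accumulate
            self.acc[req.state_id] += req.rs1 * req.rs2
        elif req.cf_id == 1:    # clear accumulator
            self.acc[req.state_id] = 0
        else:                   # unknown custom function
            return CfuResponse(0, 1)
        return CfuResponse(self.acc[req.state_id], 0)

class Cpu:
    def __init__(self, cfu):
        self.cfu = cfu
        self.regs = [0] * 32
        self.cfu_status = 0

    def exec_cf(self, rd, cf_id, rs1, rs2, state_id):
        req = CfuRequest(0, state_id, cf_id, self.regs[rs1], self.regs[rs2])
        resp = self.cfu.handle(req)
        if rd != 0:                     # x0 stays hardwired to zero
            self.regs[rd] = resp.data
        self.cfu_status |= resp.status  # assumed: errors accumulate in the CSR
        return resp

cpu = Cpu(MacCfu(n_states=2))
cpu.regs[1], cpu.regs[2] = 3, 4
cpu.exec_cf(rd=5, cf_id=0, rs1=1, rs2=2, state_id=0)
cpu.exec_cf(rd=6, cf_id=0, rs1=1, rs2=2, state_id=1)  # separate state context
cpu.exec_cf(rd=7, cf_id=0, rs1=1, rs2=2, state_id=0)
cpu.exec_cf(rd=8, cf_id=9, rs1=1, rs2=2, state_id=0)  # unknown function
```

The second request uses a different state context, so it does not see the first request's accumulator. That is the isolation property composition relies on.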

Example variable-latency, flow controlled CFU-L2 transactions

2GRVI Phalanx at Hot Chips 31 (2019): The First Kilocore RISC-V RV64I with High Bandwidth Memory

This week at Hot Chips 31 (2019) I am presenting a status update poster on the work-in-progress GRVI Phalanx Accelerator Kit: 2GRVI Phalanx: Towards Kilocore RISC-V FPGA Accelerators with HBM2 DRAM (PDF).

This is the debut of the FPGA-efficient 2GRVI (“too groovy”) RV64I processing element (PE) core, and of Phalanx support for FPGAs with HBM2 high bandwidth DRAM, first discussed last month.

The poster tells the story of the version two redesign of GRVI Phalanx to take best advantage of HBM2 DRAM. It explains some V1 limitations, particularly FPGAs’ relatively low DRAM bandwidth, and shows how the advent of HBM2 FPGAs, such as the Xilinx VU37P and VU35P in the Alveo U280 and U50 accelerator cards, potentially with over 400 GB/s of memory bandwidth, fundamentally changes the utility and competitiveness of FPGA accelerators.

However, the Niagara of data that 30+ HBM2 memory channels can pour down on your head required changes to the PE and to the Phalanx SoC architecture to request and receive all that sweet sweet bandwidth. These changes include:

  • New 2GRVI latency-tolerant RV64I PE
  • New 64b cluster interconnect, 64b UltraRAM banks
  • New 32B/cycle split transaction pipelined NoC-AXI RDMA bridges
  • Add PCIe XDMA mastering (to an AXI-HBM channel)
  • Add many more Hoplite NoC ring columns

We discuss some of these below, others in another blog post to follow.

New 2GRVI latency-tolerant RV64I 64-bit RISC-V processing element

At just 320 LUTs/PE, the good old 2016-era 32-bit RV32I GRVI PE still has leading soft processor throughput per area. Its frugality made possible the first kilocore 32b RISC processor SoCs, but GRVI’s shortcomings include: 1) its 32-bit address and data width, which is an awkward match to AWS F1’s up to 1.5 TB DRAM, to OpenCL kernels which need to pass 64-bit pointers to global memory buffers, and which wastes half of the bandwidth of 64-bit wide UltraRAM memory banks; 2) its 300-400 MHz Fmax — fast, but not fast enough; and 3) its too-simple scalar RISC microarchitecture, with blocking in-order loads. Blocking loads are fine in a one PE system with a tightly coupled BRAM memory, but in an 8 PE GRVI cluster setting a load can take five cycles there and back through the cluster interconnect to the UltraRAM cluster memory banks (which can be two long trips across one fifth of the width of the die). This is especially painful in a function epilog, reloading n callee save registers, each load taking five cycles. Ugh.

The new RV64I 2GRVI PE tackles these problems: it provides 64-bit addresses and data, up to 550 MHz pipelined execution, and latency tolerance for loads and multi-cycle function units.

Using a busy-register scoreboard, loads do not stall the pipeline until/unless subsequent use of a still busy register — so in a function epilog’s register reloads, or an unrolled block copy loop, 2GRVI issues one load each cycle. The same mechanism enables concurrent execution and out-of-order completion of long latency function units, using a to-be-proposed open Custom Function Unit interface.
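A toy issue model makes the difference concrete. It is a sketch under simplifying assumptions (in-order issue of at most one instruction per cycle, a fixed five-cycle load latency for both cases, single-cycle ALU results, and a made-up three-field instruction format), not a model of the actual GRVI or 2GRVI pipelines.

```python
def run(program, load_latency, blocking):
    """Return the cycle count after the last instruction issues.

    program: list of (op, dest, srcs) tuples; op is "load" or anything else.
    blocking=True models a pipeline that waits for each load's data.
    blocking=False models a busy-register scoreboard.
    """
    ready_at = {}   # register -> cycle at which it stops being busy
    cycle = 0
    for op, dest, srcs in program:
        hazards = list(srcs) + ([dest] if dest else [])
        # Stall only while a register this instruction touches is still busy.
        issue = max([cycle] + [ready_at.get(r, 0) for r in hazards])
        if op == "load":
            ready_at[dest] = issue + load_latency
            cycle = issue + load_latency if blocking else issue + 1
        else:
            if dest:
                ready_at[dest] = issue + 1
            cycle = issue + 1
    return cycle

# Function epilog: reload eight callee-save registers and ra, pop the frame, return.
epilog = [("load", f"s{i}", ["sp"]) for i in range(8)]
epilog += [("load", "ra", ["sp"]), ("alu", "sp", ["sp"]), ("jump", None, ["ra"])]

blocking_cycles = run(epilog, load_latency=5, blocking=True)
scoreboard_cycles = run(epilog, load_latency=5, blocking=False)
```

In this model the epilog takes 47 cycles with blocking loads and 14 with the scoreboard; the only remaining stall is the return waiting for `ra`.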

As with GRVI, the 64b 2GRVI PE optionally generates RTL obsessively and exquisitely technology mapped for Xilinx 6-LUT FPGAs. It also embraces Jan’s Razor: “In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.” This leads to a deconstructed PE architecture where functions such as shifts, multiplies, even byte-aligning load/store memory ports, are factored out of the PE core such that multiple PEs share those occasional-use resources. This gets the 64-bit 2GRVI PE core down to just 400 LUTs, and the total area overhead of the PE and its share of a six PE cluster, function units, cluster interconnect, and 300b Hoplite router, is about 700 LUTs.

For its highest Fmax of 550 MHz, 2GRVI can implement a 4-stage pipeline with an initiation interval of one instruction/cycle, but a minimum ALU result latency of two cycles. This enables higher frequency SoC designs, but impairs CPI by 25% or so. To mitigate ALU result-use stalls and four cycle taken branches, I’m also exploring introducing two-way hardware multithreading. This will cost ~100 LUTs, 80 of which are needed to double the physical register file to 64x64b, so it remains to be seen if this is a net win from the perspective of total throughput / area. We’ll see.

In all, 2GRVI’s XLEN width doubling, load latency tolerance, and higher Fmax mean 2GRVI PE clusters have double or triple the total bandwidth to the cluster data RAMs vs. the older GRVI PEs in a GRVI cluster, using the same LUTs and UltraRAMs.

The following table compares and contrasts the two cores.

| | GRVI | 2GRVI |
|---|---|---|
| Year | 2015 Q4 | 2019 Q2 |
| FPGA Target | 20 nm UltraScale | 16 nm UltraScale+ |
| RTL | Verilog | System Verilog |
| ISA | RV32I + mul* + lr/sc | RV64I + lr/sc (mul WIP); RV32I to come |
| Area | 320 LUTs | 400 LUTs (not including barrel shifter) |
| Fmax / congested | 400 / 300 MHz | 550 MHz / TBD MHz |
| Pipeline stages | 2 / 3 | 2 / 3 / 4 (superpipelined) |
| Latency tolerance: out-of-order retire | - | typical but optional |
| Latency tolerance: two hardware threads | - | optional (WIP) (+100 LUTs) |
| Cluster, load initiation interval | 5 cycles | 1 / cycle |
| Cluster, load-to-use | 5 cycles | 6 cycles / 3 thread-cycles (WIP) |
| Cluster, peak cluster RAM bandwidth | 4.8 GB/s (300 MHz) | 12.8 GB/s (400 MHz (WIP)) |
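
The peak cluster RAM bandwidth figures can be cross-checked with a little arithmetic. The port count of four is not stated in the post; it is an inference from 4.8 GB/s = ports x 4 B x 300 MHz, so treat it as an assumption.

```python
def peak_bandwidth_gb_per_s(ports, bytes_per_port, mhz):
    # One access per port per clock cycle.
    return ports * bytes_per_port * mhz * 1e6 / 1e9

PORTS = 4  # assumption, inferred from the table's figures

grvi_bw = peak_bandwidth_gb_per_s(PORTS, 4, 300)    # 32-bit ports at 300 MHz
grvi2_bw = peak_bandwidth_gb_per_s(PORTS, 8, 400)   # 64-bit ports at 400 MHz
ratio = grvi2_bw / grvi_bw

print(f"GRVI {grvi_bw:.1f} GB/s, 2GRVI {grvi2_bw:.1f} GB/s, ratio {ratio:.2f}x")
```

The ratio is independent of the assumed port count: doubling the width and raising the clock from 300 to 400 MHz gives about 2.7x, in line with the "double or triple" estimate.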

Phalanx redesign for HBM2 memory

The Phalanx “array of clusters, exchanging messages on a NoC” architecture has been redesigned for Xilinx UltraScale+ HBM2 devices such as the VU37P FPGA, with 32 256b @ 450 MHz hardened AXI-HBM controllers coupled to the two stacks (8 GB) of HBM2.

It is rather tricky to move data at up to 3.7 Tb/s to/from the AXI-HBM controllers at the base of the FPGA, from/to the various cores across the length and breadth of the device. A very fast, very wide soft NoC is the way forward, although at FPGA SoC frequencies (300-600 MHz) this requires many thousands of northbound and southbound nets. (The faster the NoC clock, the fewer nets required.)
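The 3.7 Tb/s figure and the net counts follow from the controller parameters given above. The sketch below counts data bits only (no address, control, or flow-control nets) and assumes the full peak flows in one direction, which is a simplification.

```python
CONTROLLERS = 32
WIDTH_BITS = 256
HBM_CLOCK_HZ = 450_000_000

peak_bits_per_s = CONTROLLERS * WIDTH_BITS * HBM_CLOCK_HZ
peak_tb_per_s = peak_bits_per_s / 1e12          # terabits per second
peak_gbytes_per_s = peak_bits_per_s / 8 / 1e9   # gigabytes per second

def data_nets_needed(bits_per_s, noc_clock_hz):
    # Each net carries one bit per NoC clock cycle; round up.
    return -(-bits_per_s // noc_clock_hz)

nets = {mhz: data_nets_needed(peak_bits_per_s, mhz * 1_000_000)
        for mhz in (300, 450, 600)}

print(f"{peak_tb_per_s:.2f} Tb/s = {peak_gbytes_per_s:.1f} GB/s")
print(nets)
```

Doubling the NoC clock from 300 to 600 MHz halves the number of data nets needed for the same bandwidth.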

Then other clock constraints must be considered. The older 32-bit GRVI PEs are too slow; the Hoplite NoC and UltraRAMs can run at 600 MHz, but the AXI-HBM controllers’ Fmax is 450 MHz. To avoid clock domain crossings (for now) we aim to run each component at 450 MHz. (It’s a work-in-progress, we’re not there yet.) Then a 15x15x256b Hoplite NoC will carry ~200 GB/s of read data and ~200 GB/s of write data between the HBM controllers and any FPGA clusters or I/O controllers. While not yet full peak VU37P HBM2 bandwidth, it is nevertheless a giant leap ahead for RISC-V multiprocessors and for FPGA accelerators.
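A back-of-envelope check of the ~200 GB/s figures, assuming each of the 15 NoC columns moves one 256b payload per 450 MHz cycle toward the clusters (read data) and, separately, one toward the HBM bridges (write data). That per-column, per-direction assumption is mine, not a statement of how the Hoplite links are actually scheduled.

```python
COLUMNS = 15
PAYLOAD_BITS = 256
NOC_CLOCK_HZ = 450_000_000

bytes_per_cycle = COLUMNS * PAYLOAD_BITS // 8
per_direction_gb_per_s = bytes_per_cycle * NOC_CLOCK_HZ / 1e9

print(f"{bytes_per_cycle} B/cycle -> {per_direction_gb_per_s:.0f} GB/s per direction")
```

This gives 216 GB/s of raw capacity per direction, consistent with roughly 200 GB/s once headers and other overheads are allowed for.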

So this redesign depends on three advances: 1) modifying the NoC’s X rings x Y rings topology to include at least twice as many die-spanning vertical Y rings; 2) designing a wide, deeply pipelined NoC-AXI RDMA bridge that can sustain writes and burst reads on back to back clock cycles, 256 bits per bridge per cycle, all day long; and 3) generally increasing the Fmax of every element of the SoC from 300 MHz towards 450 MHz.

At present the first two have been achieved. The 30×7 NoC of the 2017 Hot Chips demonstration is replaced here with a 16×15 NoC with an array of 15×15 PE clusters and a row of 15×1 NoC-AXI RDMA bridges, each coupled to two AXI-HBM bridges. This doubles the NoC bandwidth to the HBM2 bridges. Here’s the new system topology:

The poster presents two different FPGA SoC design chip plots.

The first is a 1776 PE GRVI Phalanx, with (15×15-3) x 8 32-bit GRVI PEs. (It depopulates three clusters in the bottom right of SLR0, freeing up some LUTs needed for the ~15000 LUT PCIe XDMA logic.)

A 1776 PE GRVI Phalanx, comprising a 15×15-3 array of clusters of eight RISC-V RV32I GRVI PEs, 128 KB cluster RAM, and Hoplite router, plus 15 NoC-AXI RDMA bridges and 30 AXI-HBM bridges.

The second is a 1332 PE 2GRVI Phalanx, with 222 clusters of six 2GRVI RV64I PEs. To our knowledge this is the first operational kilocore 64-bit RISC SoC in any technology, and the first with HBM memory.

A 1332 PE 2GRVI Phalanx, comprising a 15×15-3 array of clusters of six RISC-V RV64I 2GRVI PEs, 128 KB cluster RAM, and Hoplite router, plus 15 NoC-AXI RDMA bridges and 30 AXI-HBM bridges.
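
The PE counts of the two designs follow directly from the cluster array dimensions. A short sketch of the arithmetic (the function name is just for illustration):

```python
def pe_count(cols, rows, depopulated_clusters, pes_per_cluster):
    clusters = cols * rows - depopulated_clusters
    return clusters, clusters * pes_per_cluster

grvi_clusters, grvi_pes = pe_count(15, 15, 3, 8)     # 32-bit GRVI design
grvi2_clusters, grvi2_pes = pe_count(15, 15, 3, 6)   # 64-bit 2GRVI design

# Total cluster RAM, at 128 KB per cluster.
cluster_ram_mb = grvi_clusters * 128 / 1024

print(grvi_clusters, grvi_pes, grvi2_pes, cluster_ram_mb)
```

Both designs have 222 clusters; only the number of PEs per cluster differs.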

A later blog post will drill down into this design, how the memory system works overall, and experiences working with the Xilinx AXI-HBM bridges.

Welcome Xilinx Alveo U50!

Today Xilinx announced the new Alveo U50 Data Center Accelerator Card. Press release. Launch presentation. U50 Home. Product Brief. Data Sheet. User Guide.

I usually don’t blog about FPGA card announcements but this is a big deal. Finally a vendor FPGA card streamlined and focused on pure data + network compute acceleration, with massive bandwidth (PCIe gen4x8 or gen3x16, QSFP28 for 100 GbE, ~7 TB/s to 5 MB of BRAM, ~6 TB/s to 20 MB of UltraRAM, and 460 GB/s to 8 GB of HBM2 DRAM), in an optimized form factor.

(In particular, it doesn’t have conventional DRAM DIMMs inside, and I think that’s fine. Doesn’t need them, won’t miss them. The key external RAM is the 8 GB of high bandwidth DRAM, right there behind the 32 AXI-HBM controllers. If greater RAM capacity is required, the host has tens or hundreds of GB that can be streamed in/out across PCIe. And no more sprawling soft DDR4 DRAM controllers in your design.)

Now FPGA uptake as mainstream data center accelerator platforms really depends upon their performance and cost competitiveness vs. multicore CPUs and GPUs. GPUs, with GDDRx and HBM2 DRAM memory systems, have always enjoyed a big lead in peak external memory bandwidth vs. FPGAs. This advantage has limited the types of workloads for which FPGAs are faster, or at least performance competitive. But the advent of Xilinx Virtex UltraScale+ VU3xP and Intel Stratix 10 MX devices, with HBM2 DRAM in package, now gives FPGAs CPU-beating, GPU-competitive memory bandwidth. The next frontier is cost. So far, HBM2-powered FPGA cards have been expensive, many times more expensive than a GPU card with comparable bandwidth. I hope U50 will move the needle on price competitiveness, a prerequisite for FPGA accelerators to reach high volume economies of scale and support a thriving solution provider ecosystem.

Under the hood

The User Guide and Data Sheet describe the FPGA as an UltraScale+ XCU50, with 872K 6-LUTs, 5952 DSPs, 1344 BRAMs, 640 UltraRAMs, and two stacks of 4 GB HBM2 DRAM. While the XCU50 is not in the UltraScale+ Product Tables, these resources exactly match those of the XCVU35P, as does this floorplan figure:

XCU50 FPGA floorplan

Assuming this is the same silicon as the VU35P, that’s fantastic news — this part is extremely capable. For example, here is another kilocore RISC-V GRVI Phalanx with HBM2, for VU35P:

1176 RISC-V PE GRVI Phalanx with 30 HBM DRAM channels
An 1176 RISC-V PE implementation of the GRVI Phalanx massively parallel accelerator framework in a VU35P.
10×15 -3 clusters of { 8 PE, 128 KB SRAM, 300b Hoplite NoC router }, 30 HBM DRAM channels, PCIe DMA controller.

I look forward to an exciting future of mainstream FPGA+HBM2 accelerator cards, as common as GPU accelerator cards, deployed across the industry, there and just waiting for all of our problems, ingenuity, workloads, and bitstreams. Today’s Alveo U50 launch is a big milestone in this march to the mainstream. Congratulations to Xilinx, its staff, and partners.

GRVI Phalanx: The First Kilocore RISC-V with High Bandwidth Memory

A kilocore processor with a few DDR4 DRAM channels has never made much sense, and so today I am happy to announce that the GRVI Phalanx massively parallel RISC-V accelerator framework is now running on a Xilinx UltraScale+ VU37P FPGA with 8 GB of integrated in-package HBM2 DRAM, on a Xilinx Alveo U280 accelerator card.

This new FPGA SoC overlay is configured with a 15×15 array of clusters of 8 GRVI RISC-V PEs, 128 KB of SRAM, and a 300b Hoplite NoC router. In total it has 1800 PEs, 28 MB of SRAM, 8 GB of HBM2, 240 Hoplite NoC routers, 30 256b Hoplite-AXI RDMA bridges, and 31 AXI-HBM channels.

An FPGA device view (chip plot) of an 1800 RISC-V PE implementation of the GRVI Phalanx massively parallel accelerator framework.
15x15 clusters of { 8 PE, 128 KB SRAM, and a 300b Hoplite NoC router }. The die plot consists of 45 rows of 5 columns of variously colored regions, with two High Bandwidth Memory die stacks at the bottom.
An 1800 RISC-V PE implementation of the GRVI Phalanx massively parallel accelerator framework.
15×15 clusters of { 8 PE, 128 KB SRAM, 300b Hoplite NoC router }.

We’ll have more to say about this new design in the coming weeks. Thank you for your interest.

Welcome Xilinx Project Everest

Xilinx Everest block diagram

Everest: A New Adaptive Compute Acceleration Stack

Start with this Xilinx presentation from Victor Peng, Xilinx CEO: Xilinx Vision and Strategy for the Adaptable World. (Dear Xilinx: please share a recording of this presentation.)

My take: Everest is a bold bet on Xilinx’s “data center first” strategy. I see Everest as Xilinx’s response to the present situation that its FPGAs beat GPUs on energy efficiency and integrated data center networking, but not raw compute, and they significantly trail CPUs and GPUs in developer productivity, adoption, and appeal. Just “more of the same” FPGA device scaling and integration was never going to change that.

I think the key ideas and challenges for Everest are to:

  1. add or harden the compute resources for which GPUs have a competitive edge today (software programmable “engines”, interconnect, memory system);
  2. keep it scaling up throughout the 2020s;
  3. make it all much easier for software developers to use and to love.

If Xilinx succeeds, it stands to win share from rival computing platforms, enable and grow new markets, and capture value beyond mere device sales. It is thrilling to see the bantamweight Xilinx innovating furiously versus the Intel+Altera behemoth, with its potential advantages of scale and of platform and tools integration.

(Back in the heyday of Microsoft’s Parallel Computing Platform, circa 2008, our mission was “to deliver lovable parallel programming models and infrastructure” — that is, “to provide models, languages, tools, libraries, and frameworks that make it easier for mainstream software developers to successfully develop and ship software that scales up on new parallel hardware”. Here Xilinx’s job one is similarly to make their top-to-bottom technology stack lovable to cloud services developers and cloud operators — competitive with/superior to CPUs, MICs, GPUs, and ASICs, on criteria including throughput, efficiency, cost, developer appeal, agility, and time to market.)

With VU37P HBM2-in-package memory and its CCIX interface, Xilinx catches GPUs/APUs in DRAM-tier memory bandwidth and coherent shared memory host integration. But that doesn’t address FPGAs’ raw compute and productivity shortcomings. SDAccel, i.e. OpenCL-based software defined accelerator hardware, is a leap forward, but with each mind-numbing multi-hour-build design iteration it loses the hearts and minds of high performance software developers.

Software-first, software-mostly, massively parallel compute and accelerator FPGA overlays, such as GRVI Phalanx (and its Hoplite NoC) provide minimum developer table-stakes: the rapid turnaround and NDRange data parallel programming model of recompile-and-go GPUs. But even obsessively-FPGA-optimized soft processor array overlays, with custom function units and accelerator cores, often cannot compete with full custom (e.g. GPU streaming multiprocessor) processing elements for cost, throughput, or energy efficiency. The FPGA cost, of bit-granularity programmable gates and wires, is too high when you are instantiating e.g. 1680 processors.

So to compete and win in data center acceleration in the next decade, Xilinx has no alternative but to complement its leadership in programmable logic + SRAM + DSP + SERDES, with massive throughput software programmable “engines” and the requisite scalable interconnect NoC and memory system. (“20x” 20 TOPS INT8 ML inference is indeed “massive”.)

The concept of a bit-oriented FPGA hybridized with word oriented, massive throughput software programmability is not new. The many thousands of Xilinx DSP blocks (*), tailored for efficient FIR filters and such, were always tantalizingly close to (but yet so far from) software programmability. Projects like iDEA show the promise and the disappointment of running C code on DSP blocks + BRAMs. ASICs such as the Ambric Am2045 MPPA, the Adapteva Epiphany, the Kalray MPPA, Picochip’s picoArray, and many other MPPA, CGRA, RAW, etc. projects, albeit more for embedded systems than data center, were more than competitive with CPUs and GPUs, but ultimately did not disrupt the CPU-GPU-FPGA-ASIC status quo. Why? In part, they lacked synergistic SOC integration with the rest of the heterogeneous ARM-MPSOC-FPGA that e.g. Zynq / Zynq UltraScale+ MPSOC enjoy. In part they did not sufficiently deliver the developer love. In part these technologies were advanced by companies that did not have the requisite breadth or scale or sales channels or deep pockets. I think Project Everest can succeed where they did not.

( (*) You know, for a few gates more, the DSP48 block could have been our generation’s AMD Am2900 bitslice processor-kit. The poor DSP block just needs a register file and a better result forwarding mux network. For want of a “Mick and Brick” we must build our phalanxes out of LUTs better spent on custom accelerators.)

Until the recent emergence of massive data center workloads (data analytics, AI/ML/vision, video) and data center scale FPGA farms (Catapult v2) there was no compelling value proposition to move Xilinx or Altera to gamble expensive FPGA die area, power budgets, and huge tools and libraries investments on massive throughput programmability. But now Xilinx sees “data center first” as their business imperative for the 2020s.

It’s a Heterogeneous, Specialized, Networked, Adaptive Computing World

While we await answers to questions such as “so what are these HW-SW programmable engines?”, “how do you program them?”, “will my current workload run on them?”, “what is the memory system?”, “how do software and hardware elements compose?”, and “so what is the difference between UltraScale programmable logic and ‘next generation’ programmable logic?”, it is clear that “Xilinx FPGAs” of the future will complement programmable logic with diverse programmable engines and application processors.

And what is the “right” mix of hardened processing elements and programmable logic? It depends. In the wake of its Zynq MPSOC-FPGA offerings, Xilinx is set to combine many types of computational resources in one ultra heterogeneous device, combining (surely ARM) app processors and real-time processors with the new programmable engines and programmable logic.

Note Xilinx’s mastery of multi-die “2.5D” packaging enables a flexible product mix of processing elements and programmable logic, composed on a hard network-on-a-chip (NoC) which presumably will span dies.

As we work to advance FPGA-efficient soft NoCs like Hoplite, we feel that hard NoCs complement, but do not replace, soft NoCs. For years to come, the installed base (including UltraScale+ VU9P, VU2xP, VU37P, Arria-10, Stratix-10, etc.) will require soft NoCs. Especially VU37P. And in a hard NoC device, there are even more resources for the fabric to reach and connect to. Much more many-to-many communication. Metcalfe’s Law in the small.

A Leap Ahead on Performance and Efficiency

Xilinx heralds several Everest breakthroughs including an astounding 20x boost on AI compute and 4x on 5G communications. Some of this scale up certainly comes from the transition from TSMC 16 nm to 7 nm technology nodes, but this time much of the improvement must come from architecture, and in particular those new, mysterious programmable hardware engines.

In footnotes, Xilinx states the Everest 20x speedup is on an ML image recognition inference workload, versus a VU9P with 7000 DSPs at max performance. At max performance, on INT8-optimized ML inference, the latter can approach 20 TOPS, at 200+ GOPS/W. Does Everest achieve 400 TOPS? At 4 TOPS/W?? We’ll see.
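
The 400 TOPS and 4 TOPS/W questions come from scaling the VU9P baseline by 20x. The sketch below makes the implicit assumption explicit: both efficiency figures only line up if Everest runs in the same power envelope as the baseline, which is my reading rather than anything Xilinx has stated.

```python
SPEEDUP = 20
BASELINE_TOPS = 20.0          # VU9P, INT8-optimized inference, from the text
BASELINE_GOPS_PER_W = 200.0   # "200+ GOPS/W", taken at its lower bound

everest_tops = SPEEDUP * BASELINE_TOPS
baseline_watts = BASELINE_TOPS * 1000 / BASELINE_GOPS_PER_W
# Assumption: same power envelope as the baseline.
everest_tops_per_w = everest_tops / baseline_watts

print(everest_tops, baseline_watts, everest_tops_per_w)
```

Under that assumption the baseline draws about 100 W, and a 20x faster part in the same envelope would indeed need about 4 TOPS/W.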

Looking Ahead to 2020, and the Software Stack Challenge

On the hardware front, Xilinx’s new device products engineering is executing well, quickly mastering new technology nodes and packaging innovations to ship new devices and tools. (For example, I found zero device or tools bugs in porting a dense, complex 1680-core GRVI Phalanx to VU9P ES1 silicon in 12/2016 — it just worked.) This bodes well for a rapid and successful development and roll out of these new 7nm Everest devices.

But the impact and uptake of Everest depends to the greatest extent on the software stack. Xilinx has targeted several vertical domains — AI, video, genomics — with a familiar, successful model of prebuilt high level frameworks, libraries, and tools stacks. Now Xilinx will have to prime the pump for Everest themselves. Xilinx, once The Programmable Logic Company and now The All Programmable Company — can they become a great software and software developer tools company too? Will they have the will and the scale to invest in new languages, compilers, debuggers, profilers, runtimes, libraries, and yet more libraries to compete for developer mindshare at the level of NVIDIA, Intel, or Microsoft?

Xilinx can, must, and will enable software developers in key market segments to harness these new programmable engines with turnkey software stacks. Developers will bring TensorFlow, ONNX, etc. models to Everest-enabled frameworks and run them instantly, without ever spinning a bitstream or editing a line of XDC or Verilog.

It follows that Xilinx and its domain partners will be very busy providing to their new customers prebuilt solutions stacks. (Intel too, can, must, and will pursue this strategy.) This also affords Xilinx and its partners an opportunity to accrue IP value up the software stack, selling accelerated software solutions priced at the value proposition they bring to customers, no longer just selling silicon devices at whatever LUTs/$ vs. Intel and others.

My mission is to make it easier to compute with FPGAs. More than ever, that’s Xilinx’s mission too. It’s an exciting time in the FPGA world; once again the sky is the limit.

(For a stale but fun 2013 take on FPGAs in the data center, check out Reconfigurable Computing in the Era of Dark Silicon.)