Category Archives: Parallel Computing

About parallel computing, programming models, infrastructure, tools, libraries, parallel computer architecture, etc.

GRVI Phalanx joins The Kilocore Club

The work-in-progress GRVI Phalanx massively parallel accelerator framework has been ported to the Xilinx Virtex UltraScale+ XCVU9P.

On Dec. 30, 2016, a design with 30 rows by 7 columns of clusters of 8 GRVI RISC-V cores + 128 KB CRAM (cluster RAM) + a 300-bit Hoplite NOC router — a total of 1680 cores and 26 MB of SRAM — booted up and tested successfully, running a message passing matrix multiply workload on all 1680 cores, in a XCVU9P-FLGA2104-2L-E-ES1 device in a Xilinx VCU118 evaluation kit.
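
For readers who want to check the arithmetic, here is a small stand-alone sketch (illustrative only, not code from the GRVI Phalanx sources) that recomputes the cluster, core, and cluster-RAM totals from the configuration above:

```c
/* Sanity check of the 30x7 cluster configuration described above.
 * Illustrative only; not code from the GRVI Phalanx project. */
#include <stdio.h>

int main(void) {
    const int rows = 30, cols = 7;        /* grid of clusters            */
    const int cores_per_cluster = 8;      /* GRVI RISC-V cores per tile  */
    const int cram_kb_per_cluster = 128;  /* cluster RAM per tile, in KB */

    const int clusters = rows * cols;                    /* 210     */
    const int cores    = clusters * cores_per_cluster;   /* 1680    */
    const int cram_kb  = clusters * cram_kb_per_cluster; /* 26880 KB */

    printf("%d clusters, %d cores, %d KB (~%d MB) of cluster RAM\n",
           clusters, cores, cram_kb, cram_kb / 1024);    /* ~26 MB  */
    return 0;
}
```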

This 1680-core GRVI Phalanx is the first operational kilocore RISC-V, the first kilocore 32b RISC in an FPGA, and the chip with the most 32b RISC cores in any technology.

1 core, 32 cores, 1680 cores — RISC-V scales up! A 1-core SiFive HiFive1, a 2x2x8=32-core GRVI Phalanx in a Digilent Arty / XC7A35T, and a 30x7x8=1680-core GRVI Phalanx in a Xilinx VCU118 / XCVU9P.

Here is the basic cluster tile architecture redesigned for UltraScale+ and its new 288 Kb UltraRAM jumbo-SRAM blocks. The present design includes 210 instances of this tile.

A GRVI cluster tile with 8 GRVI RISC-V cores, 128 KB multiported bank interleaved shared cluster RAM, optional accelerators (here, none), message passing NOC interface, and a 300-bit wide Hoplite NOC router.

An example 1680 GRVI system implemented in a Xilinx Virtex UltraScale+ VU9P. This GRVI Phalanx comprises NX=7 x NY=30 = 210 clusters, each cluster with 8 GRVI cores and an 8-ported 128 KB cluster shared memory. The clusters are interconnected on a Hoplite NOC, with the Hoplite routers configured with 290b data payloads (including 32b address and 256b data), achieving a bandwidth of about 70 Gb/s/link and a NOC bisection bandwidth of 900 Gb/s. Each cluster can send or receive 32 B per cycle into the NOC. The GRVI Phalanx architecture anticipates a variety of configurable accelerators coupled to the processors, the cluster shared RAM, or the NOC.

An example 1680 GRVI system implemented in a Xilinx Virtex UltraScale+ VU9P. This GRVI Phalanx comprises NX=7 x NY=30 = 210 clusters, each cluster with 8 GRVI cores and a 8-ported 128 KB cluster shared memory. The clusters are interconnected on a Hoplite NOC, with the Hoplite routers configured with 290b data payloads (including 32b address and 256b data), achieving a bandwidth of about 70 Gb/s/link and a NOC bisection bandwidth of 900 Gb/s. Each cluster can send or receive 32 B per cycle into the NOC. The GRVI Phalanx architecture anticipates a variety of configurable accelerators coupled to the processors, the cluster shared RAM, or the NOC.
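
As a rough sanity check of those figures (my own back-of-the-envelope arithmetic, using an assumed NOC clock of about 250 MHz that is not stated above): a 290-bit payload delivered each cycle is 290 b x 250 MHz ≈ 72 Gb/s, consistent with the ~70 Gb/s/link figure, and 32 B/cycle of cluster ingress or egress is 256 b x 250 MHz ≈ 64 Gb/s.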

An extended abstract with additional detail on this work has been submitted to, and hopefully will be presented at, the OLAF’17 workshop at FPGA’17.

‘Computing on Programmable Logic’ at Microsoft Research Faculty Summit 2016

Yesterday I had the privilege of speaking on Computing on Programmable Logic (slides) in the ‘Computing with Exotic Technologies and Platforms’ session at the Microsoft Research Faculty Summit 2016.

Abstract: “We have seen the birth of many exotic architectures in recent years, from a quantum computer that promises to achieve exponential speed-ups over conventional computers, to DNA computation that performs disease diagnostics and therapy, to Field Programmable Gate Arrays (FPGAs) that provide a flexible toolkit for implementing architectures such as Microsoft’s Catapult fabric for large-scale datacenters. Each of these exotic technologies enable novel solutions to challenging problems and require equally novel methods to program and design them. We will highlight the advances in their applications and the challenges behind developing their toolchains and programming environments.”

GRVI Phalanx Update

An update on the work-in-progress GRVI Phalanx.

Conferences

An extended abstract and brief talk on GRVI Phalanx was presented at the 2nd International Workshop on Overlay Architectures (OLAF-2) at FPGA 2016.

GRVI Phalanx was discussed in the short talk Software-First, Software Mostly: Fast Starting with Parallel Programming for Processor Array Overlays at the Arduino-like Fast-Start for FPGAs pre-conference workshop at FCCM 2016. [Slides]

The first refereed paper on GRVI Phalanx was presented yesterday at the 24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2016): GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator and received the FCCM 2016 Best Short Paper Award. [PDF]

Hardware Changes: Version 0.2

Here are some of the changes made to the GRVI Phalanx design since it was first described at the 3rd RISC-V Workshop. This is now version 0.2.

GRVI

  • LB/LBU/LH/LHU/SB/SH: Load/store byte and halfword alignment functionality is now configured OFF in the GRVI PEs. The LdMux and StMux units have been factored out of GRVI and into the GRVI cluster, each set of muxes shared by a pair of cores.
  • MUL/MULH/MULHU/MULHSU: The multiply instructions from the RISC-V “M” extension are now enabled by default and are implemented in the GRVI cluster. Each pair of processors shares one DSP-based multiplier. This consumes 200 DSP48s in the 400 PE GRVI Phalanx for Kintex UltraScale 040, leaving 1720 DSP48s for use by accelerators.
  • SL*/SR*: By default, fast left and right shift instructions are also implemented in these DSP-based multipliers.
  • LR/SC: These atomic instructions from the RISC-V “A” extension are now enabled by default. Part of the implementation is in the GRVI core and part in the GRVI cluster memory arbiters. The implementation considerations were discussed on the RISC-V mailing lists here; a small usage sketch follows this list.
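
To make the LR/SC support concrete, here is a minimal usage sketch in C with RV32IA inline assembly: the standard load-reserved / store-conditional retry loop. It only illustrates the instruction pair; it is not code from the GRVI Phalanx distribution, and the function name is mine.

```c
/* Minimal LR/SC retry loop (RV32IA inline assembly).
 * Illustrative only; not code from the GRVI Phalanx distribution. */
#include <stdint.h>

/* Atomically add 'inc' to *addr and return the previous value. */
static inline int32_t atomic_fetch_add(volatile int32_t *addr, int32_t inc)
{
    int32_t old, sum, fail;
    __asm__ volatile(
        "1: lr.w  %0, (%3)\n"      /* load-reserved the current value      */
        "   add   %1, %0, %4\n"    /* compute the new value                */
        "   sc.w  %2, %1, (%3)\n"  /* store-conditional; %2 = 0 on success */
        "   bnez  %2, 1b\n"        /* reservation lost: retry              */
        : "=&r"(old), "=&r"(sum), "=&r"(fail)
        : "r"(addr), "r"(inc)
        : "memory");
    return old;
}
```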

Phalanx

  • A Phalanx system may be configured to replace the cluster at (NX-1,NY-1) with a character mode VGA cluster with a 32 KB text frame buffer.
  • Hoplite multicast message routing is now enabled by default. An agent can send a message to every cluster on a given row, to every cluster on a given column, or to every cluster on the NOC. If desired, all IRAMs in all clusters in a Phalanx may be updated with a single burst of 1024 XY-multicast messages. (A small software model of these delivery semantics follows this list.)
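
As a rough illustration of what row, column, and full-NOC multicast delivery means, here is a small software model. It is a sketch under my own naming: BCAST and deliver_to() are hypothetical and do not reflect Hoplite's actual message encoding or API.

```c
/* Software model of the multicast delivery semantics described above.
 * Illustrative only: BCAST and deliver_to() are hypothetical, not the
 * Hoplite NOC's actual message format or API. */
#define NX 7      /* columns of clusters (example size)  */
#define NY 30     /* rows of clusters                    */
#define BCAST -1  /* "all columns" / "all rows" selector */

typedef void (*deliver_fn)(int x, int y, const void *msg);

/* Deliver one multicast: dx/dy name a column/row, or BCAST selects every
 * column/row, so (BCAST, BCAST) reaches all NX*NY clusters. */
static void multicast(int dx, int dy, const void *msg, deliver_fn deliver_to)
{
    for (int x = 0; x < NX; x++) {
        if (dx != BCAST && dx != x) continue;
        for (int y = 0; y < NY; y++) {
            if (dy != BCAST && dy != y) continue;
            deliver_to(x, y, msg);  /* each selected cluster gets a copy */
        }
    }
}
```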

Introducing GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator

GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand technology mapped and floorplanned for best performance/area as a processing element (PE) in a parallel processor. GRVI implements a 2- or 3-stage single issue pipeline, typically consumes 320 6-LUTs in a Xilinx UltraScale FPGA, and currently runs at 300-375 MHz in a Kintex UltraScale (-2) in a standalone configuration with most favorable placement of local BRAMs.

Phalanx is a massively parallel FPGA accelerator framework, designed to reduce the effort and cost of developing and maintaining FPGA accelerators. A Phalanx is a composition of many clusters of soft processors and accelerator cores with extreme bandwidth memory and I/O interfaces on a Hoplite NOC.

GRVI Phalanx was introduced today at the 3rd RISC-V Workshop at Redwood Shores, CA.

A work-in-progress 5x10x8 = 400 processor configuration in a KU040 in a Xilinx KCU105 and a 2x2x8 = 32 processor configuration in a Xilinx Artix-7 35T in a Digilent Arty were demonstrated in the demo/poster session.

A 10x5x8 = 400 processor GRVI Phalanx

For more information please visit the GRVI Phalanx page.

Introducing Hoplite: An FPGA-Optimal Router for Extreme Bandwidth NOCs

Hoplite is a configurable, general-purpose, FPGA-optimal 2D router, and a set of tools, for implementing efficient network-on-chip (NOC) interconnection of diverse processors, accelerators, other client cores, and extreme bandwidth (100+ Gb/s) interfaces.
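
To give a flavor of the kind of routing a 2D torus NOC performs, here is a tiny software model of dimension-ordered (X-then-Y) routing on a unidirectional torus. It is only my simplified illustration of the general idea, not Hoplite's actual router logic, and the grid size is an arbitrary example.

```c
/* Simplified model of dimension-ordered routing on a unidirectional 2D torus.
 * Illustrative only; this is NOT Hoplite's actual router implementation. */
#include <stdio.h>

#define NX 7   /* columns (arbitrary example size) */
#define NY 30  /* rows                             */

typedef struct { int x, y; } node;

/* Next hop for a message at 'here' destined for 'dest': travel east (+x)
 * until the destination column is reached, then south (+y) until the
 * destination row is reached. */
static node next_hop(node here, node dest) {
    node n = here;
    if (here.x != dest.x)      n.x = (here.x + 1) % NX;  /* X first */
    else if (here.y != dest.y) n.y = (here.y + 1) % NY;  /* then Y  */
    return n;                                            /* else: arrived */
}

int main(void) {
    node src = {0, 0}, dst = {5, 3}, cur = src;
    int hops = 0;
    while (cur.x != dst.x || cur.y != dst.y) {
        cur = next_hop(cur, dst);
        hops++;
    }
    printf("delivered in %d hops\n", hops);  /* 5 + 3 = 8 hops here */
    return 0;
}
```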

The first paper on Hoplite is presented today at FPL 2015:

Hoplite: Building Austere Overlay NoCs for FPGAs, Nachiket Kapre, Jan Gray.
25th International Conference on Field-Programmable Logic and Applications, Sept. 2015. Received the Michael Servit Best Paper Award. [PDF]

For more information please visit the Hoplite page.

The Past and Future of FPGA Soft Processors

Earlier this month I had the privilege of giving a keynote at Reconfig 2014. I decided to speak on the past and future of FPGA soft processors. This year marks my twentieth anniversary of working (on and off) in this field, so it seemed an apt time and opportunity to share my perspective on where FPGA soft processors came from and what their continuing utility and prospects might be in the decade ahead — the autumn of Moore’s Law, the winter of Dennard Scaling.

Abstract
Design productivity is still a challenge for reconfigurable computing. It is expensive to port a software workload to RTL, to maintain the RTL as the workload evolves, and to wait for hours to recompile a bitstream after each design change. Soft processors can help mitigate these costs, and provide new pathways to application acceleration. A mid-range FPGA can now host hundreds of soft CPUs and their interconnection network, and such heterogeneous massively parallel processor and accelerator arrays can sustain hundreds of operations, memory accesses, and branches per cycle.

This talk will look back on the history and diversity of soft processor cores for FPGAs, and their continuing relevance for the decade ahead. What new tools, IP, and infrastructure will help us to exploit the coming million LUT, 10 TFLOPS FPGAs? Along the way we will revisit an austere design esthetic and an implementation methodology for crafting FPGA-optimized soft cores, and see how the lessons of mapping one processor into one 1995 FPGA can inform us how to design massively parallel programmable accelerators going forward.

Here are the slides.

Microsoft Catapult at ISCA 2014, In the News

This week at ISCA 2014 Andrew Putnam presented A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (PDF).

Some of the Catapult team members (Microsoft Research and Bing)

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.

FPGAs, Then and Now

On the left, from 1995, J32, one 32-bit RISC SoC in an XC4010, a device with 20x20x2=800 4-LUTs (and 400 3-LUTs).

On the right, from 2013, 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T (which provides over 433,000 6-LUTs and 1470 BRAMs). A work in progress.

In other words, in the past 18 years Moore’s Law has taken us from 1K LUTs per FPGA to 1K 32-bit CPUs per FPGA.
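
To put rough numbers on that: going from the XC4010’s ~800 4-LUTs to the XC7VX690T’s ~433,000 6-LUTs is roughly a 540x increase in LUT count (and each 6-LUT is more capable), while a compact 32-bit RISC datapath plus its share of the interconnect needs only a few hundred LUTs — which is how on the order of a thousand of them fit.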

1995: One 32-bit RISC SoC in an XC4010 --- 2013: 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T.

FCCM 2013 Panel: Reconfigurable Computing in the Era of Dark Silicon

At FCCM 2013, I was on a panel to discuss Reconfigurable Computing in the Era of Dark Silicon. If you haven’t heard of the Dark Silicon meme in the computer architecture community, I recommend you review the slides from DaSi 2012 by Michael Taylor (UCSD).

It’s difficult to take these things out of context, but here, for posterity’s sake, are my position slides: Gray-Dark Silicon and HeMPPAAs. I emphasize that orders-of-magnitude energy efficiency improvements might be achieved by building workload-optimized computers in FPGAs using a HeMPPAA (heterogeneous massively parallel processor and accelerator arrays) architecture. I also propose infrastructure investments so that FPGA design in the large becomes much more like the software development experience.

The Autumn of Moore’s Law: Scaling Up Computer Performance, 2011-2020

In 2010 and 2011 I gave this survey talk on prospects for continued exponential scaling of computer performance for the Singularity University Graduate Studies Program, in Mountain View, CA.

It is in three parts: prospects for continued transistor scaling; the transition to parallel computer architecture; and the challenges of writing mainstream software for parallel computers.