Category Archives: Parallel Computing

About parallel computing, programming models, infrastructure, tools, libraries, parallel computer architecture, etc.

Introducing GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator

GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand technology-mapped and floorplanned for best performance/area as a processing element (PE) in a parallel processor. GRVI implements a 2- or 3-stage single-issue pipeline, typically consumes 320 6-LUTs in a Xilinx UltraScale FPGA, and currently runs at 300-375 MHz in a Kintex UltraScale (-2) in a standalone configuration with most favorable placement of local BRAMs.
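For readers who want a feel for what each PE executes, here is a minimal C sketch of an RV32I interpreter step covering a few representative instructions. This is an illustrative software model only, not the GRVI hardware design; the type and function names are mine.

```c
#include <stdint.h>

/* Minimal, illustrative C model of a few RV32I instructions.
   A software sketch for intuition only, not the GRVI hardware;
   the type and function names here are hypothetical. */
typedef struct {
    uint32_t x[32];   /* integer registers; x0 always reads as zero */
    uint32_t pc;      /* program counter */
} Hart;

static void rv32i_step(Hart *h, uint32_t insn) {
    uint32_t opcode  = insn & 0x7f;
    uint32_t rd      = (insn >> 7)  & 0x1f;
    uint32_t rs1     = (insn >> 15) & 0x1f;
    uint32_t rs2     = (insn >> 20) & 0x1f;
    int32_t  imm_i   = (int32_t)insn >> 20;   /* sign-extended I-immediate */
    uint32_t next_pc = h->pc + 4;

    switch (opcode) {
    case 0x13:   /* OP-IMM: ADDI shown; other funct3 cases omitted */
        h->x[rd] = h->x[rs1] + (uint32_t)imm_i;
        break;
    case 0x33:   /* OP: ADD shown; other funct3/funct7 cases omitted */
        h->x[rd] = h->x[rs1] + h->x[rs2];
        break;
    case 0x63:   /* BRANCH: BEQ shown; B-immediate bits are scattered */
        if (h->x[rs1] == h->x[rs2]) {
            int32_t imm_b = ((int32_t)(insn & 0x80000000) >> 19)  /* imm[12]   */
                          | ((insn & 0x00000080) << 4)            /* imm[11]   */
                          | ((insn >> 20) & 0x7e0)                /* imm[10:5] */
                          | ((insn >> 7)  & 0x1e);                /* imm[4:1]  */
            next_pc = h->pc + (uint32_t)imm_b;
        }
        break;
    default:     /* loads, stores, LUI, AUIPC, JAL(R), etc. omitted */
        break;
    }
    h->x[0] = 0;        /* keep x0 hardwired to zero */
    h->pc = next_pc;
}
```

The regular, fixed-position register fields visible above are one reason an RV32I decoder can be kept so small in LUTs.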

Phalanx is a massively parallel FPGA accelerator framework, designed to reduce the effort and cost of developing and maintaining FPGA accelerators. A Phalanx is a composition of many clusters of soft processors and accelerator cores with extreme-bandwidth memory and I/O interfaces on a Hoplite NOC.

GRVI Phalanx was introduced today at the 3rd RISC-V Workshop at Redwood Shores, CA.

A work-in-progress 5x10x8 = 400 processor configuration in a KU040 in a Xilinx KCU105 and a 2x2x8 = 32 processor configuration in a Xilinx Artix-7 35T in a Digilent Arty were demonstrated in the demo/poster session.

A 10x5x8 = 400 processor GRVI Phalanx

For more information please visit the GRVI Phalanx page.

Introducing Hoplite: An FPGA-Optimal Router for Extreme Bandwidth NOCs

Hoplite is a configurable, general-purpose, FPGA-optimal 2D router, plus tools, for the implementation of efficient network-on-chip (NOC) interconnection of diverse processors, accelerators, other client cores, and extreme-bandwidth (100+ Gb/s) interfaces.
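To give a flavor of the per-cycle decision such a router makes, here is a simplified C model of dimension-ordered routing with deflection on a unidirectional 2D torus, the class of network Hoplite targets. It is a sketch under my own simplifications; the port and type names are illustrative, not Hoplite's actual interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of one routing decision in a dimension-ordered
   deflection router on a unidirectional 2D torus. Illustrative only;
   not Hoplite's actual interface or arbitration policy. */
typedef struct {
    bool     valid;
    uint8_t  dst_x, dst_y;
    uint32_t payload;
} Flit;

typedef struct {
    Flit east, south, client;   /* the three router outputs */
} Outputs;

/* Route at node (x, y). Inputs arrive from the west (X ring), from the
   north (Y ring), and from the local client core. */
Outputs route(uint8_t x, uint8_t y, Flit w_in, Flit n_in, Flit c_in) {
    Outputs o = {0};

    /* The north input is already in its destination column (X first,
       then Y): deliver it here or keep it moving south. */
    if (n_in.valid) {
        if (n_in.dst_y == y) o.client = n_in;
        else                 o.south  = n_in;
    }

    /* The west input turns south at its destination column if the
       south port is free; otherwise it deflects east, around the ring. */
    if (w_in.valid) {
        if (w_in.dst_x == x && w_in.dst_y == y && !o.client.valid)
            o.client = w_in;
        else if (w_in.dst_x == x && !o.south.valid)
            o.south = w_in;
        else
            o.east = w_in;
    }

    /* The local client may inject only into an idle east port;
       otherwise it holds the flit and retries next cycle. */
    if (c_in.valid && !o.east.valid)
        o.east = c_in;

    return o;
}
```

Since a packet that loses arbitration is deflected onward rather than buffered, a router in this style needs no FIFOs or credit-based flow control, which is what lets it stay so small in FPGA LUTs.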

The first paper on Hoplite is presented today at FPL 2015:

Hoplite: Building Austere Overlay NoCs for FPGAs, Nachiket Kapre, Jan Gray.
25th International Conference on Field-Programmable Logic and Applications, Sept. 2015. Received the Michael Servit Best Paper Award. [PDF]

For more information please visit the Hoplite page.

The Past and Future of FPGA Soft Processors

Earlier this month I had the privilege of giving a keynote at Reconfig 2014. I decided to speak on the past and future of FPGA soft processors. This month marks twenty years of my working (on and off) in this field, so it seemed an apt time and opportunity to share my perspective on where FPGA soft processors came from and what their continuing utility and prospects might be in the decade ahead — the autumn of Moore’s Law, the winter of Dennard Scaling.

Abstract
Design productivity is still a challenge for reconfigurable computing. It is expensive to port a software workload to RTL, to maintain the RTL as the workload evolves, and to wait for hours to recompile a bitstream after each design change. Soft processors can help mitigate these costs, and provide new pathways to application acceleration. A mid-range FPGA can now host hundreds of soft CPUs and their interconnection network, and such heterogeneous massively parallel processor and accelerator arrays can sustain hundreds of operations, memory accesses, and branches per cycle.

This talk will look back on the history and diversity of soft processor cores for FPGAs, and their continuing relevance for the decade ahead. What new tools, IP, and infrastructure will help us to exploit the coming million LUT, 10 TFLOPS FPGAs? Along the way we will revisit an austere design esthetic and an implementation methodology for crafting FPGA-optimized soft cores, and see how the lessons of mapping one processor into one 1995 FPGA can inform us how to design massively parallel programmable accelerators going forward.

Here are the slides.

Microsoft Catapult at ISCA 2014, In the News

This week at ISCA 2014 Andrew Putnam presented A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (PDF).

Some of the Catapult team members (Microsoft Research and Bing)

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.
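As an aside on the topology: in a 6×8 torus the links wrap around at the edges of the half-rack, so every FPGA has exactly four directly wired neighbors. A minimal C sketch of that neighbor addressing (coordinates, names, and layout are hypothetical, for illustration only, not from the paper):

```c
#include <stdio.h>

/* Neighbor addressing in a 6x8 2-D torus of FPGAs. Wraparound means
   every node has exactly four neighbors; there are no edge cases.
   Names and layout here are illustrative, not the paper's. */
enum { DIM_X = 6, DIM_Y = 8 };

typedef struct { int x, y; } Node;

Node north(Node n) { return (Node){ n.x, (n.y + DIM_Y - 1) % DIM_Y }; }
Node south(Node n) { return (Node){ n.x, (n.y + 1) % DIM_Y }; }
Node west (Node n) { return (Node){ (n.x + DIM_X - 1) % DIM_X, n.y }; }
Node east (Node n) { return (Node){ (n.x + 1) % DIM_X, n.y }; }

int main(void) {
    Node corner = { 0, 0 };
    Node w = west(corner);    /* wraps to column 5: the torus has no edges */
    printf("west of (0,0) is (%d,%d)\n", w.x, w.y);
    return 0;
}
```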


FPGAs, Then and Now

On the left, from 1995, J32, one 32-bit RISC SoC in an XC4010. Its 20x20 array of CLBs provided 20x20x2 = 800 4-LUTs (and 400 3-LUTs).

On the right, from 2013, 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T (which provides over 433,000 6-LUTs and 1470 BRAMs). A work in progress.

In other words, in the past 18 years Moore’s Law has taken us from 1K LUTs per FPGA to 1K 32-bit CPUs per FPGA.

1995: One 32-bit RISC SoC in an XC4010. 2013: 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T.

FCCM 2013 Panel: Reconfigurable Computing in the Era of Dark Silicon

At FCCM 2013, I was on a panel discussing Reconfigurable Computing in the Era of Dark Silicon. If you haven’t heard of the Dark Silicon meme in the computer architecture community, I recommend you review the DaSi 2012 slides of Michael Taylor (UCSD).

It’s difficult to take these things out of context, but here for posterity’s sake are my position slides: Gray-Dark Silicon and HeMPPAAs. I emphasize that orders-of-magnitude energy-efficiency improvements might be achieved by building workload-optimized computers in FPGAs using a HeMPPAA (heterogeneous massively parallel processor and accelerator arrays) architecture. I also propose infrastructure investments so that FPGA design in the large becomes much more like the software development experience.

The Autumn of Moore’s Law: Scaling Up Computer Performance, 2011-2020

In 2010 and 2011 I gave this survey talk on prospects for continued exponential scaling of computer performance for the Singularity University Graduate Studies Program in Mountain View, CA.

It is in three parts: prospects for continued transistor scaling; the transition to parallel computer architecture; and the challenges of writing mainstream software for parallel computers.