GRVI Phalanx Update Presentation at the 7th RISC-V Workshop

On Nov. 29, 2017, I gave a talk titled GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens (PDF) at the 7th RISC-V Workshop.

In this talk for the RISC-V community I recap the purpose, design, and implementation of the GRVI Phalanx Accelerator Kit, recent work, and present work in progress to deliver an SDK for AWS EC2 F1 and PYNQ-Z1, including an OpenCL-like programming model built upon Xilinx SDAccel.

GRVI Phalanx on AWS F1 — die plots of various work-in-progress XCVU9P F1 designs including: 0 cores with 4 DDR4 DRAM channels, 884 cores with 3 channels, 1240 cores with 1 channel, and 9920 cores (8 FPGA slots, on AWS F1.16xlarge).

Pynq as a High Productivity Platform for FPGA Design and Exploration

This spring, en route to bringing GRVI Phalanx to Amazon AWS EC2 F1, I had to develop new AXI4 Stream and AXI4 Memory Mapped interface bridges to GRVI Phalanx. Both the Xilinx IP catalog and the F1 shell extensively use AXI4 interfaces.

Fortunately for me it was also time to try out the terrific Pynq (Python on Zynq) development platform and its first board, Digilent PYNQ-Z1. (Just $65 academic!) Pynq is a brilliant synthesis of the Zynq FPGA-SOC, Ubuntu Linux for Zynq, Python, the massive Python library ecology, Jupyter notebooks, and new Python classes, IP, and software for bridging the Python world and the programmable logic fabric world.

I expected Pynq and its browser-hosted Jupyter notebooks interface would be apt for learning and teaching modern FPGAs and FPGA applications, for documenting and sharing work. I hadn’t expected that Pynq would also be a high productivity tool for experienced FPGA designers to more raplidly explore, evaluate, discover, and play. Using Pynq to explore, develop, prototype saved me weeks of effort. It was particularly good for interactively writing tests and exploring new hardware corner cases in Python with <1 second turnaround.

Back in May I attended Pynq training presented by Xilinx’s Patrick Lysaght at FCCM 2017. In the morning session I put together this My Pynq experience slide deck (PDF) and Patrick was kind enough to let me present it as the impromptu lunch entertainment.

I know I have only scratched the surface of the Pynq platform, and I am fascinated by, and look forward to, the idea of delivering “Live, Runs Interactively in Hardware, Live User’s Guides” of my work as Jupyter notebooks.

Idea: The two platforms I am focusing upon this year are Pynq / PYNQ-Z1 and AWS EC2 F1, it would be wonderful to see a port of Pynq to F1, so I can also provide Live User’s Guides on F1.

Fellow FPGA designers, try Pynq. You’ll like it. Pynq makes exploring new FPGA ideas lightweight, fresh, fast, easy, fun again.

Thank you Patrick, Cathal McCabe, and team for bringing us an unanticipated but essential new tool.

GRVI Phalanx at Hot Chips 29 (2017)

Yesterday at Hot Chips 29 (2017) I presented a poster GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Framework: A 1680-core, 26 MB SRAM Parallel Processor Overlay on Xilinx UltraScale+ VU9P (PDF) and some hardware demos. Extended abstract (PDF). The poster focuses on the Dec. 2016 1680 core milestone but also describes plans and ideas for programming models and tools, and recent work towards AWS F1 and PYNQ-Z1 general availability, including work on AXI4-MM and AXI4-Stream bridges to the Hoplite NoC fabric, enabling AXI4 DRAM / Phalanx-RDMA interface support for Zynq 7000 (hard DRAM controllers behind the HP[0-3] ports) and AXI4 64b DDR4-2400 MIG DRAM interfaces on KU040 and VU9P.

In the adjacent RISC-V Foundation booth, I set up two demos:

  • A 1680-core, 26 MB GRVI Phalanx on VU9P on VCU118, with a 7×30×300b Hoplite NoC and 7×30 clusters of { 8 RISC-V cores + 128 KB }, running a message passing, bulk synchronous integer matrix multiplies demo, and
  • An 80-core GRVI Phalanx on 7Z020 on PYNQ-Z1, with a 4×4×300b Hoplite NoC and 10 clusters of { 8 RISC-V cores + 32 KB }, running an AXI4 DRAM/RDMA bridge test of 80×256B×2^28 reads. (Some of the 80 cores’ blue and white subwindows of the console are visible in the photo below.)

This just underscores that I need to invest in better demos.

Photo of demos of GRVI Phalanx on PYNQ-Z1 (80 cores) and VCU118 (1680 cores) at the RISC-V table at Hot Chips 29

Demos of GRVI Phalanx on PYNQ-Z1 (80 cores) and VCU118 (1680 cores) at the RISC-V table at Hot Chips 29

This event was a special occasion for me. I’ve attended Hot Chips conferences since the late 1980s. Then, as a software engineer fascinated by computer architecture and following USENET’s comp.arch gang, it was a thrill for me to head to Stanford and meet my heroes, microprocessor architects, and learn more about how their new parts worked, and I’d return with new insights that made me a better software engineer.

Back in the ’80s, only chip design teams had the EDA tools and fabs necessary to build microprocessors. But starting in the ’90s, larger and more capable FPGAs, with increasingly comprehensive tools and infrastructure, enabled anyone to develop FPGA CPUs and now parallel computers. FPGAs democratize access to high performance digital design, and yesterday to bring this full circle, I demonstrated a parallel computer system on a chip integrating the greatest number of 32-bit RISC processors ever. With this kind of work, and with the Microsoft Brainwave announcement, FPGA designers are emphatically not second class, second best to inflexible ASICs. Rather FPGA platforms are coequal and indeed are the vanguard of agile computer architecture.

Intel’s New 10 nm Process: The Wind in our Sails

Welcome to the 10 nm CMOS era! In its first Technology and Manufacturing Day event (press kit, all presentations), Intel unveiled and detailed the highlights of their forthcoming 10 nm process technology node. It’s better than I expected. Intel’s new 10 nm process almost triples the capacity of new integrated circuits, so that the performance and capabilities of our systems can again “leap ahead” and deliver new platforms and experiences. It’s like a three year contract extension. Laissez les bon temps rouler!

The Autumn of Moore’s Law

Computer performance necessarily surfs transistor technology scaling trends. (See The Autumn of Moore’s Law: Scaling up Computing Performance 2011-2020.) For the past 50 years, transistor scaling has been the wind in the sails of the computer industry. Every couple of years transistors per chip double and cost per transistor halves, and this powers disruptive innovations like iPhones, wireless broadband, datacenters with 100 Gbps networking, self-driving cars and self-piloting drones, mixed reality, and deep learning.

These scaling trends have slowed somewhat in the 2010s. We have left the Dennard scaling era. Transistor performance (area times delay) and energy efficiency (gates switched per unit energy) no longer double as feature widths and heights each scale down by a factor of 1/√2 ≈ 0.7. And now each such lithographic feature shrink is taking more than two years.

This decade, Intel’s innovations in high volume manufacturing with high-K metal gates, FinFETs and SADP (self-aligned double patterning) lithography have kept Moore’s Law ticking along. But now Intel has been stuck on the 14 nm technology node for at least one year longer than we expected. The familiar yearly tick, tock of process and architecture advances has been a tick, tock, tock, tock. Has Intel manufacturing lost its mojo?

Scaling Down

Emphatically: no. As introduced by EVP Stacy Smith and detailed in Intel Senior Fellow Mark Bohr’s presentation and CVP Kaizad Mistry’s presentation, Intel’s new 10 nm process, announced 32 months after the 14 nm process launch, achieves a remarkable 2.7× transistor density improvement over its predecessor. This is achieved by combining pure lithography scaling with new transistor topology and circuit layout advances that together scale the area of an average logic cell, not at 0.5× but at 0.37×. (This compounds with the impressive 0.37× scaling Intel achieved moving from 22 nm to 14 nm.) Unfortunately SRAM cell scaling at ~0.6× from 14 nm is less dramatic.

Intel 10 nm lithography now requires self-aligned quad patterning (SAQP) at least for 36 nm pitch metal interconnect. Apparently soft x-ray Extreme UV litho is still not ready for prime time. So the challenge is to pattern very narrow rows of lines on the die to lay out FET structures and wires. How can you image such narrow lines using fuzzy 193 nm deep UV laser light? You first pattern the finest lines you can optically using every trick in the book, high numerical aperture immersion, phase shift masks, and computational lithography, then etch/deposit sidewall spacers on the lines that (after more processing steps) become masks, to pattern twice as many lines at half the line pitch. (Multiple patterning (“sidewall image transfer”). I understand that SAQP is SADP and then SADP again. These ultra fine features are achieved only with many extra (expensive) lithographic processing steps.

Beyond lithography, the third generation FinFET transistors themselves are taller (54 nm) and narrower (34 nm pitch), as compared to 42×42 nm in 14 nm, and 34×60 nm in 22 nm; and now the gate contact can be placed atop the gate (“contact over active gate”) that achieves a 10% area savings versus prior nodes’ alongside-the-gate. Unspecified process innovations also enable a new standard cell layout with a shared single dummy gate per cell border which achieves a ~20% area savings. Together with litho scaling these one-time “hyper scaling” improvements boost density scaling from 2.0× to 2.7×. This slide summarizes the improvements.

In the context of an exemplary microprocessor with its mix of logic and regular SRAM, Intel expects overall area scaling at 0.43×, reducing a (complex) processor + cache tile from 17.7 mm2 to 7.6 mm2. This portends a feasible doubling in core counts and cache areas in both client processors and Xeon servers, and a doubling in programmable logic and embedded SRAM resources in future 10 nm FPGAs.

As transistors scale down, the big challenge in spending them to scale up system performance is energy. If you keep doubling transistors per die without doubling gate energy efficiency, eventually you can’t afford to power or cool your integrated circuit, or you have to run it at a lower frequency that it is capable of. This is the dark silicon problem (and for FPGAs, dark fabric). Here too Intel’s 10 nm process makes great strides. Compared to 14 nm, you can get 25% faster switching, or get same performance for 0.55× the power. (I’ll take the latter, thank you.) Furthermore, Intel anticipates follow on nodes 10+ and 10++ with additional performance or power savings. This is welcome news and just significant as the headline 2.7× density scaling.

Despite good progress on gate switching energy scaling, the best way forward is still to selectively run serial bottlenecks at higher voltages/frequencies but devote most of the computation to slower, but more energy efficient, parallel compute fabrics. For a perfectly parallel workload, for the same power, you can run the same cores 25% faster, or spend some of your new transistor budget windfall on more processing elements, in pursuit of 81% (1/0.55) greater throughput.

Apples to Apples

Intel underscores their process technology lead versus competing fabs, who are also underway on so-called 10 nm and even 7 nm nodes. Much like FPGA industry’s “marketing system logic cells” (of which there are zero in any FPGA – go open an FPGA device view and see for yourself – none) vs. real delivered 6-LUTs, in process technology specs one-upsmanship there is Intel 10 nm and then there is everybody else’s 10 nm.

In his editorial Let’s Clear Up the Node Naming Mess, Mark Bohr proposes a benchmark of transistors per square millimeter implementing logic standard cells of 60% NAND2 and 40% SFF (scan flip-flop).

Using this metric, the new 10 nm process achieves 100.8 M transistors per square millimeter. This compares to 37.5 MTr/mm2 in today’s 14 nm and just 7.5 MTr/mm2 in 2010’s 32 nm process. That’s a big leap forward that underscores that Moore’s Law is not dead – not yet.

Agility and Heterogeneous Integration: More than Moore

Slides 37-42 of Bohr’s slides underscore Intel’s EMIB (Embedded Multi-Die Interconnect Bridge) technology, which enables cost effective, high bandwidth, low latency composition of heterogeneous dice in an SiP (system in package). EMIB enables the forthcoming Stratix-10 MX FPGA with HBM2 DRAM die stacks in package, targeting up to 1 TB/s of DRAM bandwidth.

At FPGA 2017, Andrew Putnam of Microsoft Research pointed out that if you already have an EMIB- or SSI-interposer- FPGA, it’s straightforward to build a new FPGA + ASIC (or CPU + ASIC) SiP. Better than a standalone ASIC, a SiP-ASIC doesn’t need PCIe or QPI interface to the FPGA/CPU, doesn’t need high powered 10-28 Gb/s multi-gigabit serial transceivers with clock-data recovery, but rather will employ many hundreds of ~2 GHz low-voltage-swing nets to the FPGA/CPU.

EMIB SiP enables a new kind of agilty that Intel should leverage, both in Xeon-ASIC and Xeon-FPGA-ASIC SiP solutions. For example if a particular binary weight neural network machine learning platform catches on, Intel’s Altera asset enables them to 1) rapidly develop and ship an acceleration solution on a CPU+FPGA SiP (plus a software library version for down level systems); and concurrently develop a BNN-ASIC bare die then 2) assemble and ship a CPU+FPGA+BNN-ASIC SiP, without impact to CPU or FPGA dies, costs, or schedules. Compared to a four year product cycle of new feature pathfinding and value-proposition-proving and architecture review, and so forth — finally achieving production silicon but typically missing first mover advantage — an EMIB-powered ASIC-SiP methodology could cut two years from the process, capturing new business, and providing new work for older fab lines.

What Intel Didn’t Say

The elephant in the room is cost per transistor (CPT) scaling. As per-fab equipment costs rise and per-wafer processing costs rise with multiple patterning, CPT no longer halves as transistors per mm2 doubles. A few years back, NVIDIA complained that transitioning to partner TSMC’s then-new 20 nm planar process could see negligible CPT improvement due to these increased costs. Here though Intel states that their lithographic shrink plus application of one-time hyper scaling techniques (here, contact over active gate and single dummy gate standard cells) overcomes the increasing cost per mm2 to continue the trend of expoentially cheaper transistors — “hyper scaling allows the economics of Moore’s Law to continue”.

I am also curious to whether and what extent (as Intel’s Shekar Borkar discussed 10 years ago) transistor variability across the die and across chips has become a problem that requires (e.g. microarchitectural) attention. Are FinFETs less susceptible to dopant distribution variability than planar transistors?

It is unclear how quickly Intel will be able to ramp up high volume production in this process, what yields they expect in 2018, and how it compares with the competing TSMC 7 nm process that will power the next generation of Xilinx FPGAs.

Also, no sign of silicon photonics in the mainstream.

Into the Grand and Glorious Future

Intel is well positioned with leadership manufacturing, processors, memory, FPGAs, SOC and networking and wireless infrastructure, with its Software and Solutions Group assets, and with new business investments like machine learning and ADAS. Not content to merely fill x86 ISA sockets until oblivion, it is investing and striving to climb up the technology stack and capture more value in new markets.

Next year Intel will crank out several hundred million 10 nm processors, and soon FPGAs and other chips. As an FPGA technologist I am particularly excited about the opportunity to integrate FPGAs into processors — whether monolithic die designs or via EMIB bridges. For forty years increasing transistor budgets have brought integration and democratization of new functions into the platform. By 1989 as transistors doubled not only did Intel pipeline their 386 core but they added a paged MMU and FPU to make the 486. This quickly became a standard platform that software stacks take for granted. Similarly, rather than double the client CPU from 4 cores to 8, or a server from 16 to 32, it may make sense to spend some of the new transistor and power budget to add some FPGA fabric into the system. We’ll see.

My career was built on Intel processors, and my work today still relies upon them. Beyond that, Intel’s remarkable process scaling and manufacturing leadership has led the industry forward. When I use Xilinx Virtex UltraScale+ (16 nm TSMC FinFET) FPGAs, running at 0.72 V at 50 A, I appreciate many of the requisite lithography, process, and circuit technologies involved were invented, nurtured, or perfected at Intel first.

Thank you, Intel. Well done. Hope to see you again in 2020.

GRVI Phalanx joins The Kilocore Club

The work-in-progress GRVI Phalanx massively parallel accelerator framework has been ported to the Xilinx Virtex UltraScale+ XCVU9P.

On Dec. 30, 2016, a design with 30 rows by 7 columns of clusters of 8 GRVI RISC-V cores + 128 KB CRAM (cluster RAM) + a 300-bit Hoplite NOC router — a total of 1680 cores and 26 MB of SRAM — booted up and tested successfully, running a message passing matrix multiply workload on all 1680 cores, in a XCVU9P-FLGA2104-2L-E-ES1 device in a Xilinx VCU118 evaluation kit.

This 1680 core GRVI Phalanx is the first operational kilocore RISC-V, the first kilocore 32b RISC in an FPGA, and the most 32b RISC cores on a chip in any technology.

1 core, 32 cores, 1680 cores -- RISC-V scales up! A 1-core Si-Five HiFive-1, a 2x2x8=32-core GRVI Phalanx in a Digilent Arty / XC7A35T, and a 30x7x8=1680-core GRVI Phalanx in a Xilinx VCU118 / XCVU9P

1 core, 32 cores, 1680 cores — RISC-V scales up! A 1-core Si-Five HiFive-1, a 2x2x8=32-core GRVI Phalanx in a Digilent Arty / XC7A35T, and a 30x7x8=1680-core GRVI Phalanx in a Xilinx VCU118 / XCVU9P.

Here is the basic cluster tile architecture redesigned for UltraScale+ and its new 288 Kb UltraRAM jumbo-SRAM blocks. The present design includes 210 instances of this tile.

A GRVI Cluster tile with 8 GRVI RISC-V cores, 128 KB multiported bank interleaved shared cluster RAM, optional accelerators (here, none), and a 300-bit wide Hoplite NOC router.

A GRVI cluster tile with 8 GRVI RISC-V cores, 128 KB multiported bank interleaved shared cluster RAM, optional accelerators (here, none), message passing NOC interface, and a 300-bit wide Hoplite NOC router.

An example 1680 GRVI system implemented in a Xilinx Virtex UltraScale+ VU9P. This GRVI Phalanx comprises NX=7 x NY=30 = 210 clusters, each cluster with 8 GRVI cores and a 8-ported 128 KB cluster shared memory. The clusters are interconnected on a Hoplite NOC, with the Hoplite routers configured with 290b data payloads (including 32b address and 256b data), achieving a bandwidth of about 70 Gb/s/link and a NOC bisection bandwidth of 900 Gb/s. Each cluster can send or receive 32 B per cycle into the NOC. The GRVI Phalanx architecture anticipates a variety of configurable accelerators coupled to the processors, the cluster shared RAM, or the NOC.

An example 1680 GRVI system implemented in a Xilinx Virtex UltraScale+ VU9P. This GRVI Phalanx comprises NX=7 x NY=30 = 210 clusters, each cluster with 8 GRVI cores and a 8-ported 128 KB cluster shared memory. The clusters are interconnected on a Hoplite NOC, with the Hoplite routers configured with 290b data payloads (including 32b address and 256b data), achieving a bandwidth of about 70 Gb/s/link and a NOC bisection bandwidth of 900 Gb/s. Each cluster can send or receive 32 B per cycle into the NOC. The GRVI Phalanx architecture anticipates a variety of configurable accelerators coupled to the processors, the cluster shared RAM, or the NOC.

An extended abstract with additional detail on this work has been submitted to, and hopefully will be presented at, the OLAF’17 workshop at FPGA’17.

‘Computing on Programmable Logic’ at Microsoft Research Faculty Summit 2016

Yesterday I had the privilege of speaking on Computing on Programmable Logic (slides, video) in the ‘Computing with Exotic Technologies and Platforms’ session at the Microsoft Research Faculty Summit 2016.

Abstract: “We have seen the birth of many exotic architectures in recent years, from a quantum computer that promises to achieve exponential speed-ups over conventional computers, to DNA computation that performs disease diagnostics and therapy, to Field Programmable Gate Arrays (FPGAs) that provide a flexible toolkit for implementing architectures such as Microsoft’s Catapult fabric for large-scale datacenters. Each of these exotic technologies enable novel solutions to challenging problems and require equally novel methods to program and design them. We will highlight the advances in their applications and the challenges behind developing their toolchains and programming environments.”

GRVI Phalanx Update

An update on the work-in-progress GRVI Phalanx.

Conferences

An extended abstract and brief talk on GRVI Phalanx was presented at the 2nd International Workshop on Overlay Architectures (OLAF-2) at FPGA 2016.

GRVI Phalanx was discussed in the short talk Software-First, Software Mostly: Fast Starting with Parallel Programming for Processor Array Overlays at the Arduino-like Fast-Start for FPGAs pre-conference workshop at FCCM 2016. [Slides]

The first refereed paper on GRVI Phalanx was presented yesterday at the 24th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2016): GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator and received the FCCM 2016 Best Short Paper Award. [PDF]

Hardware Changes: Version 0.2

Here are some of the changes made to the GRVI Phalanx design since it was first described at the 3rd RISC-V Workshop. This is now version 0.2.

GRVI

  • LB/LBU/LH/LHU/SB/SH: Load/store byte and halfword alignment functionality is now configured OFF in the GRVI PEs. The LdMux and StMux units have been factored out of GRVI and into the GRVI cluster, each set of muxes shared by a pair of cores.
  • MUL/MULH/MULHU/MULHSU: The multiply instructions from the RISC-V “M” extension are now enabled by default and are implemented in the GRVI cluster. Each pair of processors shares one DSP-based multiplier. This consumes 200 DSP48s in the 400 PE GRVI Phalanx for Kintex UltraScale 040, leaving 1720 DSP48s for use by accelerators.
  • SL*/SR*: By default, fast left and right shift instructions are also implemented in these DSP-based multipliers.
  • LR/SC: These atomic instructions from the RISC-V “A” extension are now enabled by default. Part of the implementation is in the GRVI core and part in the GRVI cluster memory arbiters. The implementation considerations were discussed on the RISC-V mailing lists here.

Phalanx

  • A Phalanx system may be configured to replace the cluster at (NX-1,NY-1) with a character mode VGA cluster with a 32 KB text frame buffer.
  • Hoplite multicast message routing is now enabled by default. An agent can sent a message to every cluster on a given row, given column, or to every cluster on the NOC. If desired, all IRAMs in all clusters in a Phalanx may be updated with a single burst of 1024 XY-multicast messages.

Introducing GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator

GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand technology mapped and floorplanned for best performance/area as a processing element (PE) in a parallel processor. GRVI implements a 2 or 3 stage single issue pipeline, typically consumes 320 6-LUTS in a Xilinx UltraScale FPGA, and currently runs at 300-375 MHz in a Kintex UltraScale (-2) in a standalone configuration with most favorable placement of local BRAMs.

Phalanx is massively parallel FPGA accelerator framework, designed to reduce the effort and cost of developing and maintaining FPGA accelerators. A Phalanx is a composition of many clusters of soft processors and accelerator cores with extreme bandwidth memory and I/O interfaces on a Hoplite NOC.

GRVI Phalanx was introduced today at the 3rd RISC-V Workshop at Redwood Shores, CA.

A work-in-progress 5x10x8 = 400 processor configuration in a KU040 in a Xilinx KCU105 and a 2x2x8 = 32 processor configuration in a Xilinx Artix-7 35T in an Digilent Arty were demonstrated in the demo/poster session.

A 10x5x8 = 400 processor GRVI Phalanx

For more information please visit the GRVI Phalanx page.

Introducing Hoplite: An FPGA-Optimal Router for Extreme Bandwidth NOCs

Hoplite is a configurable, general purpose, FPGA-optimal 2D router and tools for implementation of efficient network on chip (NOC) interconnection of diverse processors, accelerators, other client cores, and extreme bandwidth (100+ Gb/s) interfaces.

The first paper on Hoplite is presented today at FPL 2015:

Hoplite: Building Austere Overlay NoCs for FPGAs, Nachiket Kapre, Jan Gray.
25th International Conference on Field-Programmable Logic and Applications, Sept. 2015. Received the Michael Servit Best Paper Award. [PDF]

For more information please visit the Hoplite page.

Stop Everything, We’re Doing 8-LUTs

[With apologies/thanks to The Onion]

Commentary • Opinion • business • ISSUE 15•03 • March 6, 2015
By Josh Gane, CEO and President, XilTera

Would someone tell me how this happened? We were the vanguard of programmable logic. The XilTera StraTex was the cool device to design in. Then the other guy came out with a 6-LUT FPGA. Were we scared? Hell, no. Because we hit back with a little thing called the CLM. That’s a 6-LUT with a dual ported synchronous LUT RAM. For register files and so much more. But you know what happened next? Shut up, I’m telling you what happened—the bastards went to superfracturable LUTs. Now we’re standing around looking foolish selling plain 6-LUTs and a LUT RAM. RAM or not, suddenly we’re the chumps. Well, screw it. We’re going to 8-LUTs.

Sure, we could go to 7-LUTs next, like the competition. That seems like the logical thing to do. After all, 6-LUTs worked out pretty well, and seven is the next number after six. So let’s play it safe. Let’s make a 7-input 6,6-LUT and call it the UltraLUT. Why innovate when we can follow? Oh, I know why: Because we’re a business, that’s why!

You think it’s crazy? It is crazy. But I don’t give a damn. From now on, we’re the ones who have the edge in the LUT inputs game. Are they the best programmable logic a designer can get? Hell, no. XilTera is the most programmable programmable logic around.

What part of this don’t you understand? If 4-LUTs are good, and 6-LUTs are better, obviously 8-LUTs would make us the best freaking FPGA that ever existed. Comprende? We didn’t claw our way to the top of the FPGA game by clinging to the 4-LUT industry standard. We got here by taking chances. Well, 8-LUTs are the biggest chance of all.

Here’s the report from Engineering. I don’t want to hear those three damnable letters V.P.R. ever again! They don’t tell me what to invent—I tell them. And I’m telling them to stick two more LUT inputs in there. I don’t care how. Make the mux tree FinFETs so thin they’re invisible. Put some LUT configuration cells in the CLM cluster switchbox. Partially populate it. I don’t care if they have to cram the seventh and eighth input in perpendicular to the other six, just do it!

You’re taking the “balanced” part of “balanced architecture” too literally, grandma. Cut the strings and soar. Let’s hit it. Let’s roll. This is our chance to make programmable logic history. Let’s dream big. All you have to do is say that 8-LUTs can happen, and it will happen. If you aren’t on board, then screw you. And if you’re on the board, then screw you and your father. Hey, if I’m the only one who’ll take risks, I’m sure as hell happy to hog all the glory when the 8-LUT FPGA becomes the device for the U.S. of “this is how we reconfigurably compute now” A.

People said we couldn’t go to 6-LUTs. It’ll blow up the size of the input muxes and configuration bitstream, the LUT mux tree wiring will be too long, they said. Well, we did it. Now some egghead in a lab is screaming “8-LUTs are crazy?” Well, perhaps he’d be more comfortable in the labs at Lactel, working on antifuse crossbars. ONO, my ass!

Maybe I’m wrong. Maybe we should just ride in AltLinx’s wake and make voltage regulators. Ha! Not on your damn life! The day I shadow a penny-ante outfit like AltLinx is the day I leave the FPGA game for good, and that won’t happen until the day I die! The market? Listen, we make the market. All we have to do is put her out there with a little jingle. It’s as easy as, “Hey, technology mapping complex datapaths with anything less than 8-input LUTs is like playing pushing-on-a-rope whack-a-mole with some bad-approximation-timing-model physical synthesis CAD.” Or “There’ll be so few logic levels between the hard logic blocks, I could put Intel out of business.”

I know what you’re thinking now: What’ll people say? Mew mew mew. Oh, no, what will people say?! Grow up. When you’re on top, people talk. That’s the price you pay for being on top. Which XilTera is, always has been, and forever shall be, Amen, 8-LUTs, sweet Jesus in heaven.

Wait. I just had a stroke of genius. Are you ready? Open your mouth, baby birds, cause Mama’s about to drop you one sweet, fat nightcrawler. Here she comes: Put another carry chain on that sucker, too. And hell, why stop there? Two write ports. That’s right. 8-LUTs, fracturable into four 6-LUTs, four clock enables on the flops, two carry chains, 256b LUT RAM, and two write ports. You heard me—two write ports. It’s a whole new way to think about memory bandwidth intensive massively parallel accelerators. Don’t question it. Don’t say a word. Just key the music, and call the chorus girls, because we’re on the edge—the rising clock edge—and I feel like dancing.