The Plural Architecture: Shared Memory Many-cores with Hardware Scheduling

Speaker: Professor Ran Ginosar, EE & CS, Technion, Israel
Time: 10:00 ~ 11:30 AM, Sept. 22, 2011
Place: Room 1201, Institute of Computing Technology

Abstract

The Plural many-core architecture combines hundreds of simple cache-less cores, many shared cache banks, a hardware scheduler, and two custom active networks-on-chip: cores-to-shared-caches and cores-to-scheduler. A theoretical model (almost) justifies increasing the number of cores while making them smaller and slower, so as to maximize the performance-to-power ratio. Several benchmark simulations are demonstrated, showing close to linear speedup and a high performance-to-power ratio.

A de-synchronized PRAM-like task-based programming model for shared memory enables fine-grain parallelism. Plural tasks are sequential. Precedence relations among tasks are described by a task map, which is executed by the hardware scheduler. Duplicable tasks are described once and executed as multiple instances, under the hardware scheduler’s control. Tasks are not functions: they neither receive inputs nor generate outputs; data are shared only through shared memory. Control tasks (join, fork, condition) contain no code, and are executed only by the scheduler. There are no locking mechanisms; all synchronization is formulated as inter-task dependencies and managed by the scheduler.
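The task-map idea can be illustrated with a small software stand-in for the scheduler. This is only a sketch under stated assumptions, not the Plural scheduler or its task-map syntax: the names `run_task_map`, `tasks`, `deps` and `shared` are hypothetical, instances run one after another rather than on separate cores, and passing an instance index and the shared store as Python arguments is a convenience of the sketch (in the model described above, tasks exchange data only through shared memory).

```python
from collections import deque

def run_task_map(tasks, deps, shared):
    """Run tasks in dependency order (a sequential stand-in for the scheduler).

    tasks: name -> (body, instance_count); deps: name -> list of predecessors.
    Both structures are hypothetical, chosen only for this illustration.
    """
    pending = {name: len(deps.get(name, [])) for name in tasks}
    successors = {name: [] for name in tasks}
    for name, preds in deps.items():
        for pred in preds:
            successors[pred].append(name)

    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        body, instances = tasks[name]
        # A duplicable task is described once and run as several instances.
        for index in range(instances):
            body(index, shared)
        order.append(name)
        # A successor becomes ready only when all its predecessors finished,
        # so no locks are needed inside the task bodies.
        for nxt in successors[name]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    return order

def init(index, shared):
    shared["a"] = list(range(8))
    shared["b"] = [0] * 8

def square(index, shared):
    shared["b"][index] = shared["a"][index] ** 2

def total(index, shared):
    shared["total"] = sum(shared["b"])

shared = {}
tasks = {"init": (init, 1), "square": (square, 8), "total": (total, 1)}
deps = {"square": ["init"], "total": ["square"]}
order = run_task_map(tasks, deps, shared)
```

Here `square` is written once and executed as eight instances, and `total` starts only after every instance has completed, which plays the role of a join.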

The shared memory is organized as many banks, allowing all or most cores simultaneous access. A multistage interconnection network resolves address conflicts and may include a fetch-and-op facility to enhance PRAM-like concurrent read-and-write as well as unique indexing operations. Addresses are interleaved to reduce conflicts. The entire shared memory is organized as a shared L1 cache. The architecture supports an optional on-chip L2 cache.
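To see why interleaving reduces conflicts, consider a minimal sketch that assumes simple low-order word interleaving. The bank count, word size and the helper names `bank_of` and `max_conflicts` are assumptions made for illustration; the abstract does not specify the actual mapping.

```python
from collections import Counter

NUM_BANKS = 8    # assumed bank count, for illustration only
WORD_BYTES = 4   # assumed word size

def bank_of(addr):
    # Low-order interleaving: consecutive words fall in consecutive banks.
    return (addr // WORD_BYTES) % NUM_BANKS

def max_conflicts(addresses):
    # Largest number of simultaneous requests that target a single bank.
    return max(Counter(bank_of(a) for a in addresses).values())

# Eight cores reading consecutive words: each request hits a different bank.
unit_stride = [core * WORD_BYTES for core in range(8)]

# Eight cores reading with a stride of NUM_BANKS words: all hit bank 0.
bank_stride = [core * WORD_BYTES * NUM_BANKS for core in range(8)]

unit_conflicts = max_conflicts(unit_stride)
stride_conflicts = max_conflicts(bank_stride)
```

Under these assumptions the unit-stride pattern is conflict-free, while the bank-stride pattern serializes all eight requests on one bank; the latter is the kind of collision the interconnection network has to resolve.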

The Plural architecture employs standard processors; we have tried SPARC, MicroBlaze and some proprietary ones. Each core contains a small private scratch-pad memory for unshared variables. Shared co-processors include an FPU and collective support. DMA processors provide data pre-fetching.

The Plural architecture is intended for one-job-at-a-time accelerators; it is not a multitasking multicore, and there should be no OS. The architecture has been implemented as an IP core for mobile SoC and as an FPGA accelerator. It has yet to be demonstrated as a standalone IC. During the talk we will contrast it with other many-core architectures including Tiles, Rigel and XMT.

Biography

Ran Ginosar received his BSc in EE&CS (summa cum laude) from the Technion in Israel in 1978 and his PhD in EE&CS from Princeton University in the USA in 1982. He has conducted research at Bell Laboratories, at the University of Utah and at Intel Research Laboratories in Oregon, USA. He is a member of the faculty of both the EE and CS departments at the Technion, and heads the VLSI Systems Research Center. He has also co-founded several start-up companies in the area of VLSI and parallel processing. His research interests focus on VLSI and parallel processing and architectures.