Vectorization techniques for the Blue Gene/L double FPU
IBM Journal of Research and Development, Mar-May 2005 by Lorenz, J, Kral, S, Franchetti, F, Ueberhuber, C W
This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene®/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimization methods as they are incorporated in the Vienna MAP vectorizer and optimizer. The innovative technologies presented here, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and joined with the automatic performance-tuning system known as Fastest Fourier Transform in the West (FFTW). FFTW performance-optimization facilities working with the compiler technologies presented in this paper are able to produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.
Introduction
The IBM Blue Gene/L (BG/L) supercomputer [1], planned to be in operation in 2005, will be an order of magnitude faster than the Earth Simulator. BG/L will feature eight times more processors than current massively parallel systems. To tame this vast parallelism, new approaches and tools had to be developed. However, developing highly efficient numerical software has to start with optimizing the computational kernels for the nonstandard floating-point unit (FPU) of the BG/L processors. This so-called double FPU provides support for complex arithmetic as an important prerequisite to speed up large scientific codes.
The utilization of nonstandard FPUs in computational kernels, like fast Fourier transforms (FFTs), is by no means straightforward. Optimization of FFT kernels leads to complicated data dependencies of real variables that cannot easily be mapped to the elaborate BG/L FPU. This problem is particularly demanding in the context of automatic performance tuning, but it must be addressed in order to obtain high-performance FFT implementations, which are required as major building blocks for the large scientific codes planned to be run on BG/L. Most of these applications require very fast one-dimensional FFT routines to be run on a single processor for computing relatively small transforms.
This paper introduces a new FFT library, BGL/FFTW-GEL, that runs efficiently on the BG/L prototypes. This library is the first numerical library for BG/L not developed by IBM. It takes full advantage of the double FPU by means of short-vector single-instruction multiple-data (SIMD) vectorization.
BGL/FFTW-GEL is the result of a combination of FFTW with special-purpose vectorization technology in the Vienna MAP vectorizer [2-4]. FFT codes produced by BGL/FFTW-GEL run five times faster than standard nonadaptive FFT libraries [2]. On the DD2 prototype, they were up to 1.8 times faster than the best FFT code that does not use the special features of the BG/L double FPU.
The Blue Gene/L supercomputer
The initial DD1 prototype of the IBM Blue Gene/L supercomputer [1], equipped with 8,192 custom-made IBM PowerPC 440 FP2 processors running at 500 MHz, achieved a Linpack performance of R_max = 11.7 teraflops, i.e., 71% of its theoretical peak performance of R_peak = 16.4 teraflops. This performance ranks the BG/L prototype machine fourth on the Top500 list (June 2004) [5]. The prototype machine is roughly 1/20th the physical size of machines of comparable compute power that exist today, such as Linux clusters.
The 64K-processor BG/L machine currently being built for the Lawrence Livermore National Laboratory (LLNL) will be eight times larger, occupying 64 full racks. When completed in 2005, the LLNL supercomputer, featuring 360 teraflops peak performance, is expected to lead the Top500 list. Compared with the fastest supercomputers of today, it will be an order of magnitude faster, consume 1/15th of the power, and be ten times more compact.
Complex and real arithmetic
Complex arithmetic plays an important role in many areas of scientific computing, such as computational electronics, so native support for it has been integrated into the FPUs of computers devoted to such applications. Nevertheless, even algorithms using complex arithmetic may have to be reformulated in terms of real arithmetic. Only then can the optimization techniques that are indispensable for satisfactory performance of scientific codes be applied to the real and imaginary parts: common subexpression elimination, constant folding, and copy propagation.
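As a minimal sketch of this reformulation (the function names are hypothetical and chosen for illustration only, not taken from FFTW or the Vienna MAP vectorizer), the following Python code writes a complex product in terms of its real and imaginary parts. Once the product is expressed this way, a constant operand can be folded away: multiplying by the constant -i, a twiddle factor that occurs in FFT kernels, needs no multiplications at all.

```python
def cmul_real(ar, ai, br, bi):
    """Complex product (ar + i*ai) * (br + i*bi) written in real arithmetic.

    The general case costs four real multiplications and two real
    additions/subtractions.
    """
    return ar * br - ai * bi, ar * bi + ai * br


def mul_by_minus_i(ar, ai):
    """The same product for the constant b = -i after constant folding.

    Substituting br = 0 and bi = -1 into cmul_real and simplifying leaves
    only a swap of the two parts and one sign change.
    """
    return ai, -ar


# The folded version agrees with the general formula for b = -i.
general = cmul_real(3.0, 4.0, 0.0, -1.0)
folded = mul_by_minus_i(3.0, 4.0)
print(general, folded)
```

A simplification of this kind is only visible to an optimizer that works on the real and imaginary parts separately, which is why the real-arithmetic form matters even when the hardware supports complex operations.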
BG/L double FPU
The BG/L PowerPC 440 (PPC440) double FPU (FP2) was obtained by replicating the standard PPC440 FPU and adding crossover datapaths and sign-change capabilities to allow the short-vector SIMD fused multiply-add (FMA) operations to support complex multiplication. Up to four real floating-point operations (one SIMD FMA) can be issued every cycle, and efficient intermixing of scalar and vector operations is possible. Data to be processed must be naturally aligned on 16-byte boundaries in memory.
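The role of the crossover datapaths and the sign change can be illustrated with a small Python model. This is a sketch of the data flow under stated assumptions, not the actual FP2 instruction set: a two-way register is modeled as a (primary, secondary) pair, and the helpers `cross_mul` and `cross_fma_neg_primary` are hypothetical names for a SIMD multiply and a SIMD FMA that read their operands across the two halves. Python also rounds after every operation, so the fused rounding of a real FMA is not reproduced.

```python
def cross_mul(a, b):
    """Hypothetical two-way SIMD multiply with a crossover on operand a.

    The primary half of a feeds both lanes:
    result = (a.primary * b.primary, a.primary * b.secondary)
    """
    return a[0] * b[0], a[0] * b[1]


def cross_fma_neg_primary(a, b, t):
    """Hypothetical two-way SIMD FMA with crossover and sign change.

    The secondary half of a feeds both lanes, the halves of b are swapped,
    and the product in the primary lane is negated before accumulation:
    result = (t.primary - a.secondary * b.secondary,
              t.secondary + a.secondary * b.primary)
    """
    return t[0] - a[1] * b[1], t[1] + a[1] * b[0]


def complex_mul_simd(a, b):
    """Complex product of a = (re, im) and b = (re, im) in two SIMD steps."""
    t = cross_mul(a, b)                     # (ar*br, ar*bi)
    return cross_fma_neg_primary(a, b, t)   # (ar*br - ai*bi, ar*bi + ai*br)


print(complex_mul_simd((3.0, 4.0), (1.0, 2.0)))
```

In this model, a complex number held as one register pair is multiplied in two operations without shuffling data between the halves, which is the purpose the crossover and sign-change extensions are described as serving. A (primary, secondary) pair of double-precision values occupies 16 bytes, which matches the alignment requirement stated above.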


