Vectorization techniques for the Blue Gene/L double FPU

IBM Journal of Research and Development, Mar-May 2005 by Lorenz, J, Kral, S, Franchetti, F, Ueberhuber, C W

This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene®/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimization methods as they are incorporated in the Vienna MAP vectorizer and optimizer. The innovative technologies presented here, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and joined with the automatic performance-tuning system known as Fastest Fourier Transform in the West (FFTW). FFTW performance-optimization facilities working with the compiler technologies presented in this paper are able to produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.

Introduction

The IBM Blue Gene*/L (BG/L) supercomputer [1], planned to be in operation in 2005, will be an order of magnitude faster than the Earth Simulator. BG/L will feature eight times more processors than current massively parallel systems. To tame this vast parallelism, new approaches and tools have had to be developed. However, developing highly efficient numerical software has to start with optimizing the computational kernels for the nonstandard floating-point unit (FPU) of the BG/L processors. This so-called double FPU provides support for complex arithmetic as an important prerequisite to speed up large scientific codes.

The utilization of nonstandard FPUs in computational kernels, like fast Fourier transforms (FFTs), is by no means straightforward. Optimization of FFT kernels leads to complicated data dependencies of real variables that cannot easily be mapped to the elaborate BG/L FPU. This problem is particularly demanding in the context of automatic performance tuning, but it must be addressed in order to obtain high-performance FFT implementations, which are required as major building blocks for the large scientific codes planned to be run on BG/L. Most of these applications require very fast one-dimensional FFT routines to be run on a single processor for computing relatively small transforms.

This paper introduces a new FFT library, BGL/FFTW-GEL, that runs efficiently on the BG/L prototypes. This library is the first numerical library for BG/L not developed by IBM. It takes full advantage of the double FPU by means of short-vector single-instruction multiple-data (SIMD) vectorization.

BGL/FFTW-GEL combines FFTW with the special-purpose vectorization technology of the Vienna MAP vectorizer [2-4]. FFT codes produced by BGL/FFTW-GEL run five times faster than standard nonadaptive FFT libraries [2]. On the DD2 prototype, they ran up to 1.8 times faster than the best FFT code that does not use the special features of the BG/L double FPU.

The Blue Gene/L supercomputer

The initial DD1 prototype of the IBM Blue Gene/L supercomputer [1], equipped with 8,192 custom-made IBM PowerPC* 440 FP2 processors running at 500 MHz, achieved a Linpack performance of R_max = 11.7 teraflops, i.e., 71% of its theoretical peak performance of R_peak = 16.4 teraflops. This performance ranks the BG/L prototype machine at position four of the Top500 list (June 2004) [5]. The prototype machine is roughly 1/20th the physical size of machines of comparable compute power that exist today, such as Linux** clusters.

The 64K-processor BG/L machine currently being built for the Lawrence Livermore National Laboratory (LLNL) will be eight times larger, occupying 64 full racks. When completed in 2005, the LLNL supercomputer, featuring 360 teraflops peak performance, is expected to lead the Top500 list. Compared with the fastest supercomputers of today, it will be an order of magnitude faster, consume 1/15th of the power, and be ten times more compact.

Complex and real arithmetic

Complex arithmetic plays an important role in many areas of scientific computing, such as computational electronics, so native support for it has been integrated into the FPUs of computers devoted to such applications. Nevertheless, even algorithms using complex arithmetic may have to be reformulated in terms of real arithmetic. Only then can the optimization techniques that are indispensable for satisfactory performance of scientific codes be applied to the real and imaginary parts: common subexpression elimination, constant folding, and copy propagation.
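As an illustration of why the reformulation matters, the following minimal C sketch writes a complex product in real arithmetic and then specializes it for one fixed factor of the kind that occurs in FFT kernels. The type `cplx` and the functions `mul_general` and `mul_w8` are hypothetical names chosen for this example and are not taken from the paper. Once the real and imaginary parts are explicit, the known constant can be folded in and the shared factor pulled out, reducing four multiplications to two.

```c
/* Hypothetical helper type: a complex number as two real values. */
typedef struct { double re; double im; } cplx;

/* General complex product z*w in real arithmetic:
 * 4 multiplications and 2 additions/subtractions. */
static cplx mul_general(cplx z, cplx w) {
    cplx r = { z.re * w.re - z.im * w.im,
               z.re * w.im + z.im * w.re };
    return r;
}

/* The same product specialized for the fixed factor w = k - k*i,
 * with k = sqrt(1/2). Because w.re == -w.im is known in advance,
 * the constant can be folded in and factored out:
 *   re = k * (z.re + z.im),  im = k * (z.im - z.re)
 * which needs only 2 multiplications and 2 additions/subtractions. */
static cplx mul_w8(cplx z) {
    const double k = 0.7071067811865476; /* sqrt(1/2), rounded to double */
    cplx r = { k * (z.re + z.im),
               k * (z.im - z.re) };
    return r;
}
```

Calling `mul_general(z, w)` with `w = { k, -k }` and `mul_w8(z)` on the same `z` should agree up to floating-point rounding. A compiler that treats complex numbers as opaque units has no opportunity to make this simplification.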

BG/L double FPU

The BG/L PowerPC 440 (PPC440) double FPU (FP2) was obtained by replicating the standard PPC440 FPU and adding crossover datapaths and sign-change capabilities, so that short-vector SIMD fused multiply-add (FMA) operations can support complex multiplication. Up to four real floating-point operations (one SIMD FMA) can be issued every cycle, and efficient intermixing of scalar and vector operations is possible. The data to be processed has to be naturally aligned on 16-byte boundaries in memory.
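How crossover datapaths and sign changes help with complex multiplication can be sketched in portable C. This is an emulation under stated assumptions, not the actual BG/L instruction set: the type `vec2` and the functions `mul_by_primary`, `cross_fma_np`, and `complex_mul` are hypothetical names. Each function stands in for one two-way SIMD operation, so a complex product (a + bi)(c + di) = (ac - bd) + (ad + bc)i takes two such operations.

```c
/* Hypothetical model of a two-way SIMD register:
 * p is the primary half, s is the secondary half.
 * A complex number is stored as p = real part, s = imaginary part. */
typedef struct { double p; double s; } vec2;

/* Emulated SIMD multiply: both halves of x are scaled by the
 * primary half of a (a.p is replicated across both lanes). */
static vec2 mul_by_primary(vec2 a, vec2 x) {
    vec2 r = { a.p * x.p, a.p * x.s };
    return r;
}

/* Emulated cross FMA with sign change: the secondary half of a
 * multiplies the swapped halves of x, subtracting in the primary
 * lane and adding in the secondary lane. */
static vec2 cross_fma_np(vec2 acc, vec2 a, vec2 x) {
    vec2 r = { acc.p - a.s * x.s, acc.s + a.s * x.p };
    return r;
}

/* Complex product a*x built from the two emulated SIMD operations. */
static vec2 complex_mul(vec2 a, vec2 x) {
    vec2 t = mul_by_primary(a, x);  /* (a.p*x.p, a.p*x.s)           */
    return cross_fma_np(t, a, x);   /* (.. - a.s*x.s, .. + a.s*x.p) */
}
```

For example, `complex_mul` applied to 1 + 2i and 3 + 4i yields -5 + 10i. Without the crossover and sign change, the same product would need extra shuffle or negate steps between the two lanes.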

 
