Self-adapting numerical software (SANS) effort

IBM Journal of Research and Development, Mar-May 2006 by Dongarra, J, Bosilca, G, Chen, Z, Eijkhout, V, Et al

ATLAS uses an orthogonal search [13]. For an optimization problem min f(x^sub 1^, X^sub 2^, ..., X^sub n^), parameters X^sub i^ (where 1 ≤ i ≤ n) are initialized with reference values. From x^sub 1^ to x^sub n^, orthogonal search does a linear onedimensional search for the optimal value of x^sub i^, and it uses previously found optimal values for x^sub 1^, x^sub 2^, ..., x^sub n-1^.

Applying simplex search to ATLAS

We have replaced the ATLAS global search with the modified Nelder-Mead simplex search and conducted experiments on four different architectures: 2.4-GHz Intel Pentium** 4, 900-MHz Intel Itanium** 2, 1.3-GHz IBM POWER4*, and 900-MHz Sun UltraSPARC**.

Given values for a set of parameters, the ATLAS code generator generates a code variant of matrix multiply. The code is executed with randomly generated 1000 × 1000 dense matrices as input. After execution of the search heuristic, the output is a set of parameters that gives the best performance for that platform. Figure 1 compares the total time spent by each of the search methods on the search itself. The Itanium 2 search time (for all search techniques) is much longer than those for the other platforms because we are using the Intel compiler, which, in our experience, takes longer to compile the same piece of code than the GNU compiler collection (GCC) used on the other platforms. Figure 2 shows the comparison of the performance of matrix multiply on different sizes of matrices using the ATLAS libraries generated by the simplex search and the original ATLAS search.

Empirical generic code optimization

Current empirical optimization techniques such as ATLAS and FFTW can achieve good performance because the algorithms to be optimized are known ahead of time. We are addressing this limitation by extending the techniques used in ATLAS to the optimization of arbitrary code. Since the algorithm to be optimized is not known in advance, it requires compiler technology to analyze the source code and generate the candidate implementations. The ROSE project [15, 16] from the Lawrence Livermore National Laboratory provides, among other benefits, a source-to-source codetransformation tool that can produce blocked and unrolled versions of the input code. In combination with our search heuristic and hardware information, we can use ROSE to perform empirical code optimization. For example, on the basis of an automatic characterization of the hardware, we direct their compiler to perform automatic loop blocking at varying sixes, which we can then evaluate to find the best block size for that loop. To perform the evaluations, we have developed a test infrastructure that automatically generates a liming driver for the optimized routine on the basis of a simple description of the arguments.

The generic code optimization system is structured as a feedback loop. The code is fed into the loop processor for optimization and separately fed into the timing driver generator, which generates the code that actually runs the oplimized code variant to determine its execution time. The results of the timing are led back into the search engine. On the basis of these results, the search engine may adjust the parameters used to generate the next code variant. The initial set of parameters can be estimated on the basis of the characteristics of the hardware (e.g.. cache size).


 

BNET TalkbackShare your ideas and expertise on this topic

Please add your comment:

  1. You are currently: a Guest |
  2.  

Basic HTML tags that work in comments are: bold (<b></b>), italic (<i></i>), underline (<u></u>), and hyperlink (<a href></a)

advertisement
advertisement
  • Click Here
  • Click Here
  • Click Here
advertisement
Click Here

Content provided in partnership with ProQuest