Self-adapting numerical software (SANS) effort

IBM Journal of Research and Development, Mar-May 2006 by Dongarra, J, Bosilca, G, Chen, Z, Eijkhout, V, Et al

Trivially, the owner (or manager) of the system is interested in optimal resource utilization, while the user expects the shortest time to obtain the solution. Instead of aiming at the optimization of either the former (by maximizing memory utilization and sacrificing the total solution time by minimizing the number of processes involved) or the latter (by using all of the available processors), a benchmarking engineer would be interested in best floating-point performance.

Experimental results

Figure 3 illustrates how the time to solution is influenced by the aspect ratio of the logical process grid for a range of process counts. (Each processor was a 1.4-GHz AMD Athlon** with 2 GB of memory; the interconnect was Myricom Myrinet** 2000.) It is clear that sometimes it might be beneficial not to use all of the available processors for computation (the idle processors might be used, for example, for fault-tolerance reasons). This is especially true if the number of processors is a prime number, which leads to a one-dimensional process grid and thus very poor performance on many systems. It is unrealistic to expect that nonexpert users will make the correct decision in every case. It is a matter of having either expertise or relevant experimental data to guide the choice, and our experiences suggest that perhaps a combination of both is required to make good decisions consistently. As a side note, the collection of data for Figure 3 required a number of floating-point operations that would compute the LU factorization of a square dense matrix of order almost 300k. Matrices of that size are usually suitable for supercomputers (the slowest supercomputer on the TOP500** [34] list that factored such a matrix was on position 16 in November 2002).

Figure 4 shows the large extent to which the aspect ratio of the logical process grid influences another facet of numerical computation: per-processor performance of the LFC parallel solver. The plots in the figure show data for various numbers of processors (between 40 and 64) and consequently do not represent a function, because, for example, ratio 1 may be obtained with 7 × 7 and 8 × 8 process grids (within the specified range of the number of processors). The figure shows the performance of both parallel LU decomposition using the Gaussian elimination algorithm and parallel Cholesky factorization code. A side note: The data is relevant for only the LFC parallel linear solvers that are based on the solvers from ScaLAPACK. [26]; it would not be indicative of the performance of a different solver, such as HPL [35], which uses different communication patterns and consequently behaves differently with respect to the process grid aspect ratio.

SALSA

Algorithm choice, the topic of this section, is an inherently dynamic activity in which the numerical content of the user data is of prime importance. Speaking abstractly, we could say that the need for dynamic strategies arises here from the fact that any description of the input space is of a very high dimension. As a corollary, we cannot hope to search this input space exhaustively, and we have to resort to some form of modeling of the parameter space.


 

BNET TalkbackShare your ideas and expertise on this topic

Please add your comment:

  1. You are currently: a Guest |
  2.  

Basic HTML tags that work in comments are: bold (<b></b>), italic (<i></i>), underline (<u></u>), and hyperlink (<a href></a)

advertisement
advertisement
  • Click Here
  • Click Here
  • Click Here
advertisement

Content provided in partnership with ProQuest