Overview of the Blue Gene/L system architecture

IBM Journal of Research and Development, Mar-May 2005, by Gara, A., Blumrich, M. A., Chen, D., Chiu, G. L.-T., et al.

The collective network is also used for global broadcast of data, rather than transmitting it around in rings on the torus. For one-to-all communications, this is a tremendous improvement from a software point of view over the nearest-neighbor 3D torus network. The broadcast functionality is also very useful when one-to-all transfers must run concurrently with communications over the torus network. A broadcast can also be handled over the torus network, but doing so involves significant synchronization effort and has a longer latency. For large messages, however, the bandwidth of the torus can exceed that of the collective network, leading to a crossover point at which the torus becomes the more efficient network.

A global floating-point sum over the entire machine can be done in approximately 10 µs by utilizing the collective network twice. Two passes are required because the global network supports only integer reduction operations. On the first pass, the maximum of all exponents is obtained; on the second pass, all of the shifted mantissas are added. The collective network partitions in a manner akin to the torus network. When a user partition is formed, an independent collective network is formed for the partition; it includes all nodes in the partition (and no nodes in any other partition).
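The two-pass scheme can be illustrated with a minimal software sketch. This is not the Blue Gene/L implementation: the function name, the 52-bit fixed-point width, and the use of Python's built-in max and sum as stand-ins for the integer reductions are all assumptions made for illustration.

```python
import math

# Assumption: number of fractional bits kept when mantissas are shifted
# to a common exponent. The real hardware width is not stated in the text.
MANTISSA_BITS = 52


def two_pass_float_sum(values):
    """Sum floats using only an integer max and an integer add reduction."""
    if not values:
        return 0.0

    # Pass 1: integer max reduction over the binary exponents.
    # math.frexp(v) returns (m, e) with v == m * 2**e and 0.5 <= |m| < 1.
    max_exp = max(math.frexp(v)[1] for v in values)

    # Each node aligns its value to the shared exponent and converts it
    # to a fixed-point integer (the "shifted mantissa").
    shifted = [round(math.ldexp(v, MANTISSA_BITS - max_exp)) for v in values]

    # Pass 2: integer add reduction over the shifted mantissas.
    total = sum(shifted)

    # Convert the integer result back to floating point.
    return math.ldexp(total, max_exp - MANTISSA_BITS)
```

Because every contribution is aligned to the same exponent before the integer addition, the result does not depend on the order in which the values are combined, which is one reason an integer-only reduction is attractive for this purpose.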

The collective network is also used to forward file-system traffic to I/O nodes, which are identical to the compute nodes with the exception that the Gigabit Ethernet is wired out to the external switch fabric used for file-system connectivity.

The routing of the collective network is static but general in that each node contains a static routing table that is used in conjunction with a small header field in each packet to determine a class. The class is used to locally determine the routing of the packet. With this technique, multiple independent collective networks can be virtualized in a single physical network. Two standard examples of this are the class that connects a small group of compute nodes to an I/O node and a class that includes all compute nodes in the system. In addition, the hardware supports two virtual channels in the collective network, allowing for nonblocking operations between two independent communications.
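As a rough illustration of class-based routing, the sketch below models each node's static table as a mapping from class number to the neighbors a packet of that class is forwarded to. The Node type, the deliver function, the node names, and the class numbers are all hypothetical; the actual table format and header layout are not described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Hypothetical static routing table: class id -> neighbor names
    # that a packet of that class is forwarded to from this node.
    routes: dict = field(default_factory=dict)


def deliver(nodes, start, packet_class):
    """Return the set of node names reached by a packet of the given class."""
    reached, frontier = set(), [start]
    while frontier:
        current = frontier.pop()
        if current in reached:
            continue
        reached.add(current)
        # The routing decision is purely local: look up the packet's
        # class in this node's own table.
        frontier.extend(nodes[current].routes.get(packet_class, []))
    return reached


# Illustrative topology: class 0 links a small group of compute nodes to an
# I/O node; class 1 spans all compute nodes. Both share the same nodes.
EXAMPLE_NODES = {
    "io0": Node("io0", {0: ["c0", "c1"]}),
    "c0": Node("c0", {1: ["c1", "c2"]}),
    "c1": Node("c1", {1: ["c3"]}),
    "c2": Node("c2"),
    "c3": Node("c3"),
}
```

In this toy topology, a class 0 packet injected at io0 reaches only io0, c0, and c1, while a class 1 packet injected at c0 reaches all four compute nodes, showing how two independent logical networks can coexist on one set of links.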

Barrier network

As we scale applications to larger processor and node counts, the latency characteristics of global operations will have to improve considerably. We have implemented an independent barrier network to address this architectural issue. This network contains four independent channels and is effectively a global OR over all nodes. Individual signals are combined in hardware and propagate to the physical top of a combining tree. The resultant signal is then broadcast down this tree. A global AND can be achieved by using inverted logic. The AND is used as a global barrier, while the OR is a global interrupt that is used when the entire machine or partition must be stopped as soon as possible for diagnostic purposes. The barrier network is optimized for latency, having a round-trip latency of less than 1.5 µs for a system size of 64Ki nodes. This network can also be partitioned on the same midplane boundaries as the torus and collective networks.
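The relationship between the global OR and the global AND can be sketched in software. The functions below are a hypothetical model, not the hardware logic: signals are combined pairwise up a tree, the result is broadcast back to every node, and the AND is obtained by inverting inputs and output (De Morgan's law).

```python
def global_or(signals):
    """Combine per-node signals up a binary tree, then broadcast the result."""
    if not signals:
        return []
    level = [bool(s) for s in signals]
    # Combine pairs of signals level by level until one value remains
    # at the top of the tree.
    while len(level) > 1:
        level = [any(level[i:i + 2]) for i in range(0, len(level), 2)]
    # Broadcast the combined value back down to every node.
    return [level[0]] * len(signals)


def global_and(signals):
    """Global AND built from the global OR using inverted logic."""
    inverted = [not s for s in signals]
    return [not r for r in global_or(inverted)]
```

Used as a barrier, each node would raise its signal on arrival, and global_and reports True to every node only once all nodes have arrived; global_or, by contrast, reports True as soon as any single node raises its signal, which matches the interrupt use described above.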


 


Content provided in partnership with ProQuest