MPI Performance on a p690

Mark R. Fahey
September 19, 2002

This document describes two sets of tests that were conducted on a cluster of IBM p690 nodes at ORNL. These tests investigated the performance of MPI programs across multiple p690 nodes in various configurations.


Section 1: HPL tests

The first set of tests consisted of multiple High-Performance Linpack (HPL) runs in various configurations. The goal was to characterize which communication protocol, US (User Space) or IP, is best for parallel jobs spanning multiple p690 nodes. The results depend, as one might expect, on the percentage of communication in an application.

This work was first motivated by evidence of extremely poor performance across multiple 32-way nodes. Poor performance across multiple nodes was expected due to the adapter and switch technology. To elaborate, a p690 node can have only two adapters, regardless of whether it is a 4-way or a 32-way node. Thus, multiple MPI tasks running on a 32-way node can easily "overwhelm" the available adapter bandwidth. Increased bandwidth per CPU can be achieved by using logical partitions (LPARs). However, what was not expected is that (in certain cases),

In some cases, using LPAR nodes with the US protocol and the default ordering of tasks does not yield the best performance. In addition, for applications that define a process grid (e.g., ScaLAPACK), the shape of the process grid is of vital importance because it affects the communication pattern and load, which, as we will see, (implicitly) determines which configuration is best to run under.
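The adapter limit described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: it assumes one MPI task per CPU, fully packed nodes, and that every node or LPAR really has both of its two adapters available. The function name `adapter_load` is hypothetical.

```python
ADAPTERS_PER_NODE = 2  # from the text: at most two adapters per node, 4-way or 32-way


def adapter_load(total_tasks, cpus_per_node, adapters_per_node=ADAPTERS_PER_NODE):
    """Return (nodes, adapters, tasks_per_adapter) for fully packed nodes.

    Assumes one MPI task per CPU (an assumption of this sketch).
    """
    if total_tasks % cpus_per_node != 0:
        raise ValueError("total_tasks must be a multiple of cpus_per_node")
    nodes = total_tasks // cpus_per_node
    adapters = nodes * adapters_per_node
    return nodes, adapters, total_tasks / adapters


# 256 MPI tasks on full 32-way nodes versus 4-way LPARs.
for cpus in (32, 4):
    nodes, adapters, load = adapter_load(256, cpus)
    print(f"{cpus:2d}-way: {nodes:3d} nodes, {adapters:3d} adapters, "
          f"{load:.0f} tasks per adapter")
```

Under these assumptions a 256-task job shares each adapter among 16 tasks on 32-way nodes but among only 2 tasks on 4-way LPARs, which is the sense in which LPARs increase bandwidth per CPU.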

For example, consider the HPL benchmark with 256 MPI tasks. On 32-way nodes in IP mode with a 32x8 process grid:

In this case, a 32x8 process grid was used. Although test results indicate that an 8x32 grid would yield better timings, this test is representative of applications with similar communication patterns. If the communication patterns are fixed, then these results are of vital importance when running such applications.

Thus many HPL tests were run varying

The following two figures each show the results of varying the communication protocol, task ordering, and node type. The figures differ in that one is for the "x by 8" processor grid and the other is for the "8 by x" processor grid.

Figure 1 shows the results of HPL tests using an "x by 8" process grid, where x varied from 28 to 32; x is plotted on the horizontal axis.

Figure 1a: Time in seconds (figure not available)
Figure 1b: GFlops per processor (figure not available)

Figure 2 shows the results of HPL tests using an "8 by x" process grid, where x varied from 28 to 32; x is again plotted on the horizontal axis.

Figure 2a: Time in seconds (figure not available)
Figure 2b: GFlops per processor (figure not available)
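The two panels of each figure are related by a simple conversion from run time to rate. The sketch below assumes the standard HPL operation count of (2/3)N^3 + (3/2)N^2 floating-point operations for a problem of order N. The problem size used in these tests is not stated in this document, so `N_HYPOTHETICAL` is a placeholder chosen purely for illustration, and the function names are hypothetical as well.

```python
def hpl_gflops(n, seconds):
    """Aggregate GFlops for an HPL run of order n that took `seconds`.

    Assumes the standard HPL operation count (2/3)n^3 + (3/2)n^2.
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1.0e9


def gflops_per_processor(n, seconds, p, q):
    """GFlops per processor on a p by q process grid (one task per processor)."""
    return hpl_gflops(n, seconds) / (p * q)


# N_HYPOTHETICAL is NOT taken from the report; it is an illustrative value only.
N_HYPOTHETICAL = 100_000
rate = gflops_per_processor(N_HYPOTHETICAL, 1950.0, 32, 8)
print(f"32 by 8 grid, 1950 s: {rate:.3f} GFlops per processor (hypothetical N)")
```

Because the processor count changes with x (from 224 to 256 tasks), two runs with the same time in seconds do not have the same GFlops per processor, which is why both panels are useful.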

What do these tests show? Figure 1a is the most telling: using IP or cyclic ordering can have a dramatic effect on performance. Note that the best timings in Figures 1a and 2a are close, at around 1950 seconds. These similar timings were achieved in vastly different ways: cyclic ordering with a "32 by 8" process grid (Figure 1a) and block ordering with an "8 by 32" process grid (Figure 2a).

Why did the cyclic ordering "fix" the "32 by 8" process grid test? In this case, compared to the "8 by 32" process grid case, the number of messages increases significantly although the amount of data sent is nearly the same. The pattern of messages is also different; it might be considered "transposed". Thus, it makes sense that the task ordering should also be "transposed", which is what the cyclic ordering does.
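One way to picture the "transposed" remark is to count how many nodes each process row and each process column touches under the two orderings. The sketch below is illustrative only and rests on several assumptions not stated in this document: ranks are laid out on the process grid in row-major order, the job runs on full 32-way nodes, block ordering fills one node before moving to the next, and cyclic ordering deals ranks out round-robin by node. All function names are hypothetical.

```python
def grid_coords(rank, p, q):
    """Map a rank to (row, col) on a p by q grid, assuming row-major layout."""
    return divmod(rank, q)


def node_of(rank, ntasks, cpus_per_node, ordering):
    """Node index hosting `rank` under block or cyclic task ordering."""
    nodes = ntasks // cpus_per_node
    if ordering == "block":
        return rank // cpus_per_node   # fill node 0, then node 1, ...
    if ordering == "cyclic":
        return rank % nodes            # round-robin across nodes
    raise ValueError(f"unknown ordering: {ordering}")


def nodes_spanned(p, q, cpus_per_node, ordering):
    """Return (max nodes per process row, max nodes per process column)."""
    ntasks = p * q
    row_nodes = [set() for _ in range(p)]
    col_nodes = [set() for _ in range(q)]
    for rank in range(ntasks):
        row, col = grid_coords(rank, p, q)
        node = node_of(rank, ntasks, cpus_per_node, ordering)
        row_nodes[row].add(node)
        col_nodes[col].add(node)
    return max(len(s) for s in row_nodes), max(len(s) for s in col_nodes)


for p, q in ((32, 8), (8, 32)):
    for ordering in ("block", "cyclic"):
        rows, cols = nodes_spanned(p, q, 32, ordering)
        print(f"{p} by {q}, {ordering:6s}: a row spans {rows} node(s), "
              f"a column spans {cols} node(s)")
```

Under these assumptions, block ordering keeps every process row on a single node and spreads every process column over all eight nodes, while cyclic ordering does the reverse. Which of the two suits a given grid shape then depends on whether the row or the column communication dominates, and the actual placement on the system may differ from this model.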

The conclusions one can draw primarily from Figure 1a are that

Also note that in many cases IP is not that bad compared to US, and it is sometimes faster, as shown above.

Section 2: PALLAS MPI Benchmarks

These tests used 256 processors and V2.2 of the Pallas MPI Benchmark Suite. For each of the following collective communication calls, several tests were run over 256 processors, varying the communication protocol, the node type, and the ordering of the MPI tasks.

No single configuration clearly outperforms the others.

Allgather

(figures not available)

Allgatherv

(figures not available)

Alltoall

(figures not available)

Reduce

(figures not available)

Allreduce

(figures not available)

Bcast

(figures not available)

Exchange

(figures not available)

Reduce-scatter

(figures not available)