Component Overhead for Large Processor Counts
Peggy Li/JPL
November 3, 2006
Objective
The objective of this task is to evaluate the performance of the ESMF superstructure functions on large numbers of processors, i.e., more than 1000. The ESMF superstructure functions include ESMF initialization and termination, and component creation, initialization, execution, and termination.
We conducted the performance evaluation on the Cray XT3 at Oak Ridge National Laboratory, which runs the UNICOS/lc 1.4.35 operating system with the PGI 6.1.4 compilers. The ESMF tag used to run the benchmark is ESMF_3_0_1_beta_snapshot_06.
The source code of the benchmark program can be browsed or downloaded at the SourceForge ESMF contributions site. Use the following command to retrieve the code:
cvs -z3 -d:pserver:anonymous@esmfcontrib.cvs.sourceforge.net:/cvsroot/esmfcontrib co -P performance_tests/petascale/superstructure
Benchmark Program
The benchmark program is adapted from the system test program ESMF_CompCreateSTest.F90 included in ESMF 3.0.1. The test program exercises the basic ESMF superstructure functions: ESMF_Initialize(), ESMF_Finalize(), ESMF_GridCompCreate(), ESMF_GridCompInitialize(), ESMF_GridCompRun(), and ESMF_GridCompFinalize(). The benchmark also includes a simple grid component, user_model.F90, which defines an init subroutine, a run subroutine, and a finalize subroutine. To measure only the ESMF overhead, the user-defined routines are all empty.
The Results
We timed ESMF_Initialize() and ESMF_Finalize() as well as all the ESMF_GridComp subroutines using MPI_WTIME(). We placed an MPI_Barrier() before and after each subroutine call, with two exceptions. We cannot call MPI_Barrier() before ESMF_Initialize() because MPI_Init() is called inside ESMF_Initialize(), and MPI_Barrier() cannot be called before MPI_Init(). Similarly, we cannot call MPI_Barrier() after ESMF_Finalize() because MPI_Finalize() is called by ESMF_Finalize().
The benchmark results shown below were generated on the Cray XT3 at Oak Ridge National Laboratory. We ran the benchmark on 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 processors. For each configuration, we ran the benchmark 6 times and used the best (the fastest) result out of the six runs. The run routine was called three times in the program and the time for ESMF_GridCompRun() is the average of the three calls.
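The reduction described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample timings are hypothetical, and it assumes the minimum is taken over per-job averages of the three run-routine timings, which the text does not spell out.

```python
from statistics import mean

def reduce_run_timings(jobs):
    """Return the reported ESMF_GridCompRun() time for one processor count.

    jobs holds one entry per benchmark job; each entry lists the timings
    of the three run-routine calls made in that job.
    """
    per_job_average = [mean(calls) for calls in jobs]  # average of the three calls
    return min(per_job_average)                        # keep the fastest job

# Hypothetical timings in microseconds for six jobs (not measured data).
sample_jobs = [
    [21.0, 20.0, 22.0],
    [20.5, 19.5, 20.0],
    [19.0, 19.5, 20.0],
    [22.0, 21.0, 20.0],
    [20.0, 20.5, 21.0],
    [23.0, 20.0, 20.0],
]
reported = reduce_run_timings(sample_jobs)
print(f"reported run time: {reported:.2f} microseconds")
```

Taking the minimum rather than the mean across jobs filters out slowdowns caused by other activity on the machine, at the cost of reporting an optimistic figure.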
Table 1 shows the timing results for ESMF_Initialize() and ESMF_Finalize() using two different error log file type options, ESMF_LOG_SINGLE and ESMF_LOG_MULTI. All numbers are in milliseconds. When ESMF_Initialize() is called, each processor opens or creates a log file for error message logging. When the argument defaultLogType is set to ESMF_LOG_SINGLE, all processes write to a single log file named ESMF_LogFile. If defaultLogType is set to ESMF_LOG_MULTI, each processor opens its own log file, named PETnn.ESMF_LogFile.
The time to call ESMF_Initialize() and ESMF_Finalize() increases roughly linearly with the number of processors. A more detailed timing analysis reveals that over 95% of the total time is spent opening the log file in ESMF_Initialize() and flushing and closing the log file in ESMF_Finalize(). ESMF_Initialize() is faster with multiple log files at large processor counts (>1000), but the timing difference between the single and multiple log file modes is not significant. ESMF_Finalize(), however, runs much slower with a single log file than with multiple log files.
In either case, we observed a more-than-linear slowdown from 1024 to 2048 processors. In particular, the ESMF_Finalize() time with ESMF_LOG_MULTI jumped from 76.79 milliseconds to 3272.24 milliseconds when the processor count increased from 1024 to 2048. We repeated the 2048-processor run several times and the result was consistent. We suspect the cause is a parallel I/O limitation on the Cray XT3.
Table 1: The performance of ESMF_Initialize and ESMF_Finalize (times in milliseconds)
| # Processors | ESMF_Initialize, Single Log File | ESMF_Initialize, Multiple Log Files | ESMF_Finalize, Single Log File | ESMF_Finalize, Multiple Log Files |
|---|---|---|---|---|
| 4 | 45.1099 | 55.8719 | 10.6060 | 3.8490 |
| 8 | 56.6609 | 65.0341 | 22.2728 | 4.6570 |
| 16 | 78.2170 | 79.3970 | 45.4728 | 6.5839 |
| 32 | 128.3328 | 130.0080 | 96.3781 | 6.6268 |
| 64 | 262.7029 | 300.4269 | 191.8740 | 6.6380 |
| 128 | 531.5239 | 629.7939 | 387.7599 | 7.0879 |
| 256 | 1105.5291 | 1176.2240 | 803.4339 | 13.5019 |
| 512 | 2354.6889 | 2985.8060 | 1572.3071 | 25.3999 |
| 1024 | 5533.4878 | 4157.5648 | 3173.5279 | 76.7910 |
| 2048 | 15760.0679 | 12389.1499 | 6145.0428 | 3272.2380 |
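As a rough check on the scaling seen in Table 1, the growth factor for each doubling of the processor count can be computed directly from the tabulated values. The sketch below is illustrative only: the dictionary keys and the helper name are made up for this example, and a factor of 2 per doubling is taken as the reference for linear growth.

```python
# Processor counts and timings (milliseconds) copied from Table 1.
procs = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
table1_ms = {
    "init_single": [45.1099, 56.6609, 78.2170, 128.3328, 262.7029,
                    531.5239, 1105.5291, 2354.6889, 5533.4878, 15760.0679],
    "init_multi": [55.8719, 65.0341, 79.3970, 130.0080, 300.4269,
                   629.7939, 1176.2240, 2985.8060, 4157.5648, 12389.1499],
    "fin_single": [10.6060, 22.2728, 45.4728, 96.3781, 191.8740,
                   387.7599, 803.4339, 1572.3071, 3173.5279, 6145.0428],
    "fin_multi": [3.8490, 4.6570, 6.5839, 6.6268, 6.6380,
                  7.0879, 13.5019, 25.3999, 76.7910, 3272.2380],
}

def doubling_ratios(times):
    """Ratio of each timing to the previous one (processor count doubles each step)."""
    return [later / earlier for earlier, later in zip(times, times[1:])]

# Growth factor for the final doubling, 1024 -> 2048 processors.
last_step = {name: doubling_ratios(times)[-1] for name, times in table1_ms.items()}
for name, ratio in last_step.items():
    print(f"{name}: x{ratio:.2f} from {procs[-2]} to {procs[-1]} processors")
```

With the Table 1 values, the last doubling gives factors of roughly 2.8 and 3.0 for ESMF_Initialize(), about 1.9 for ESMF_Finalize() with a single log file, and about 42.6 for ESMF_Finalize() with multiple log files.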
Table 2 and Figure 1 show the timings for the ESMF component subroutines. All numbers are in microseconds. All the subroutines except ESMF_GridCompRun() are called only once by the application program, so the overhead of ESMF_GridCompRun() is the one we care about most. As Table 2 shows, a call to ESMF_GridCompRun() takes less than 20 microseconds on 1024 or fewer processors and only 29.4 microseconds on 2048 processors.
Table 2. ESMF Component Routines Timing (times in microseconds)
| # Processors | GridCompCreate | GridCompInit | GridCompRun | GridCompFinalize |
|---|---|---|---|---|
| 4 | 59.10 | 37.20 | 3.60 | 7.90 |
| 8 | 76.10 | 47.90 | 4.60 | 10.00 |
| 16 | 95.80 | 64.10 | 5.30 | 12.10 |
| 32 | 122.10 | 74.10 | 6.30 | 14.80 |
| 64 | 161.80 | 91.00 | 7.30 | 16.90 |
| 128 | 266.10 | 105.10 | 8.30 | 19.80 |
| 256 | 604.90 | 108.90 | 11.00 | 23.80 |
| 512 | 1871.80 | 115.10 | 14.00 | 27.90 |
| 1024 | 6957.10 | 430.01 | 19.60 | 35.00 |
| 2048 | 28701.00 | 998.97 | 29.40 | 47.20 |
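One way to summarize how each routine in Table 2 scales is an end-to-end scaling exponent, i.e., the value a for which the time grows like p^a between 4 and 2048 processors. The sketch below uses only the first and last rows of Table 2; the helper name is hypothetical, and a two-point estimate of this kind ignores the shape of the curve in between.

```python
import math

# Timings (microseconds) at 4 and 2048 processors, copied from Table 2.
p_lo, p_hi = 4, 2048
table2_endpoints_us = {
    "GridCompCreate": (59.10, 28701.00),
    "GridCompInit": (37.20, 998.97),
    "GridCompRun": (3.60, 29.40),
    "GridCompFinalize": (7.90, 47.20),
}

def scaling_exponent(t_lo, t_hi, procs_lo, procs_hi):
    """Two-point estimate of a in t ~ p**a (a = 1 means linear growth)."""
    return math.log(t_hi / t_lo) / math.log(procs_hi / procs_lo)

exponents = {
    name: scaling_exponent(t_lo, t_hi, p_lo, p_hi)
    for name, (t_lo, t_hi) in table2_endpoints_us.items()
}
for name, a in exponents.items():
    print(f"{name}: time grows roughly like p^{a:.2f}")
```

By this estimate ESMF_GridCompCreate() grows close to linearly overall (exponent near 1.0), while ESMF_GridCompRun() grows far more slowly (exponent near 0.3).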

Figure 1. ESMF Component Subroutine Overhead
Optimization and Conclusion
The two ESMF superstructure functions that have a scalability problem are ESMF_Initialize() and ESMF_Finalize(). As discussed above, over 95% of the time in these two functions is spent opening, flushing, and closing the log file. To make these two functions run more efficiently on petascale computers, we added another option to the ESMF error log type: ESMF_LOG_NONE. When calling ESMF_Initialize(), the user can set the argument defaultLogType to ESMF_LOG_NONE. In this mode, no log file is opened and no system error messages are written. This option is available in ESMF 3.0.1 snapshot #06 and later. Table 3 adds the timings for this new mode (all timings are in milliseconds). We observed significant improvement for all configurations. Figures 2 and 3 compare the ESMF_Initialize() and ESMF_Finalize() timings for the three options on 4 to 2048 processors.
Table 3: Timing comparison with three different log file type options (times in milliseconds)
| # Procs | ESMF_Initialize, Single Log File | ESMF_Initialize, Multiple Log Files | ESMF_Initialize, No Log File | ESMF_Finalize, Single Log File | ESMF_Finalize, Multiple Log Files | ESMF_Finalize, No Log File |
|---|---|---|---|---|---|---|
| 4 | 45.1099 | 55.8719 | 2.2139 | 10.6060 | 3.8490 | 0.0789 |
| 8 | 56.6609 | 65.0341 | 2.4030 | 22.2728 | 4.6570 | 0.1130 |
| 16 | 78.2170 | 79.3970 | 2.5470 | 45.4728 | 6.5839 | 0.1600 |
| 32 | 128.3328 | 130.0080 | 2.7969 | 96.3781 | 6.6268 | 0.2009 |
| 64 | 262.7029 | 300.4269 | 2.9039 | 191.8740 | 6.6380 | 0.2649 |
| 128 | 531.5239 | 629.7939 | 3.5570 | 387.7599 | 7.0879 | 0.3008 |
| 256 | 1105.5291 | 1176.2240 | 5.4331 | 803.4339 | 13.5019 | 0.3349 |
| 512 | 2354.6889 | 2985.8060 | 11.7719 | 1572.3071 | 25.3999 | 0.3600 |
| 1024 | 5533.4878 | 4157.5648 | 32.5110 | 3173.5279 | 76.7910 | 0.3891 |
| 2048 | 15760.0679 | 12389.1499 | 122.5300 | 6145.0428 | 3272.2380 | 0.4170 |
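The improvement from ESMF_LOG_NONE can be quantified as a speedup factor relative to the single log file mode, using the values in Table 3. The sketch below is an illustration only; the variable and function names are made up for this example.

```python
# Processor counts and timings (milliseconds) copied from Table 3.
procs = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
init_single_ms = [45.1099, 56.6609, 78.2170, 128.3328, 262.7029,
                  531.5239, 1105.5291, 2354.6889, 5533.4878, 15760.0679]
init_none_ms = [2.2139, 2.4030, 2.5470, 2.7969, 2.9039,
                3.5570, 5.4331, 11.7719, 32.5110, 122.5300]
fin_single_ms = [10.6060, 22.2728, 45.4728, 96.3781, 191.8740,
                 387.7599, 803.4339, 1572.3071, 3173.5279, 6145.0428]
fin_none_ms = [0.0789, 0.1130, 0.1600, 0.2009, 0.2649,
               0.3008, 0.3349, 0.3600, 0.3891, 0.4170]

def speedup(baseline, improved):
    """Element-wise ratio baseline/improved; values above 1 mean 'improved' is faster."""
    return [b / i for b, i in zip(baseline, improved)]

init_speedup = speedup(init_single_ms, init_none_ms)
fin_speedup = speedup(fin_single_ms, fin_none_ms)
for p, s_init, s_fin in zip(procs, init_speedup, fin_speedup):
    print(f"{p:5d} procs: ESMF_Initialize x{s_init:.1f}, ESMF_Finalize x{s_fin:.1f}")
```

Note that even with no log file, the ESMF_Initialize() time still grows with processor count, from about 2.2 milliseconds on 4 processors to about 122.5 milliseconds on 2048.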

Figure 2: ESMF_Initialize Timing on XT3

Figure 3. ESMF_Finalize Timing on XT3
In summary, the overhead of the ESMF superstructure functions does increase with the number of processors. However, the overhead remains reasonably low for all the component functions. Parallel file I/O is the major scalability problem we observed in the benchmarked ESMF functions. Disabling the error log file hides the problem but does not solve it. Using the MPI 2.0 parallel I/O APIs is a possible solution to the error log file performance problem. The MPI I/O APIs are intended to be portable and to behave consistently across platforms. Further evaluation of MPI parallel I/O performance on petascale computers will be needed.
