
Component Overhead for Large Processor Counts

Peggy Li/JPL
November 3, 2006

Objective

 

The objective of this task is to evaluate the performance of the ESMF superstructure functions on large numbers of processors, i.e., more than 1000 processors.  The ESMF superstructure functions include ESMF initialization and termination, and component creation, initialization, execution and termination.

 

We conducted the performance evaluation on the Cray XT3 at Oak Ridge National Laboratory, which runs the UNICOS/lc 1.4.35 operating system with the PGI 6.1.4 compilers.  The ESMF tag used to run the benchmark is ESMF_3_0_1_beta_snapshot_06.

 

The source code of the benchmark program can be browsed or downloaded at the SourceForge ESMF contributions site. Use the following command to retrieve the code:

 

cvs -z3 -d:pserver:anonymous@esmfcontrib.cvs.sourceforge.net:/cvsroot/esmfcontrib co -P performance_tests/petascale/superstructure

 

Benchmark Program

 

The benchmark program is adapted from the system test program ESMF_CompCreateSTest.F90 included in ESMF 3.0.1.  The test program exercises the basic ESMF superstructure functions: ESMF_Initialize(), ESMF_Finalize(), ESMF_GridCompCreate(), ESMF_GridCompInitialize(), ESMF_GridCompRun(), and ESMF_GridCompFinalize().  The benchmark program also includes a simple grid component, user_model.F90, which defines an init subroutine, a run subroutine and a finalize subroutine.  So that only the ESMF overhead is measured, the user-defined routines are all empty.

 

The Results

 

We timed ESMF_Initialize(), ESMF_Finalize() and all the ESMF_GridComp subroutines using MPI_WTIME().  We placed an MPI_Barrier() before and after each timed subroutine call, with two exceptions.  We cannot call MPI_Barrier() before ESMF_Initialize() because MPI_Init() is called inside ESMF_Initialize(), and MPI_Barrier() cannot be called before MPI_Init().  Similarly, we cannot call MPI_Barrier() after ESMF_Finalize() because MPI_Finalize() is called inside ESMF_Finalize().

 

The benchmark results shown below were generated on the Cray XT3 at Oak Ridge National Laboratory.  We ran the benchmark on 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 processors.  For each configuration, we ran the benchmark 6 times and used the best (the fastest) result out of the six runs.   The run routine was called three times in the program and the time for ESMF_GridCompRun() is the average of the three calls. 
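The selection rule described above (average the repeated calls within a run, then keep the fastest of the repeated runs) can be sketched in Python.  This is only an illustration of the rule, not the benchmark's Fortran code: the names time_call, run_once and best_of are hypothetical, and time.perf_counter stands in for MPI_WTIME().

```python
import time
from statistics import mean


def time_call(fn, clock=time.perf_counter):
    """Return the wall-clock time taken by one call to fn()."""
    start = clock()
    fn()
    return clock() - start


def run_once(fn, n_calls=3, clock=time.perf_counter):
    """One benchmark run: average the time over n_calls invocations of fn."""
    return mean(time_call(fn, clock) for _ in range(n_calls))


def best_of(fn, n_runs=6, n_calls=3, clock=time.perf_counter):
    """Repeat the benchmark n_runs times and keep the fastest averaged result."""
    return min(run_once(fn, n_calls, clock) for _ in range(n_runs))


if __name__ == "__main__":
    # An empty routine, mirroring the empty user-defined routines in the benchmark.
    print(f"best averaged time: {best_of(lambda: None):.3e} s")
```

Taking the minimum over runs filters out interference from other activity on the machine, while averaging within a run smooths the jitter of a very short call.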

 

Table 1 shows the timing results for ESMF_Initialize() and ESMF_Finalize() using two different error log file type options (ESMF_LOG_SINGLE and ESMF_LOG_MULTI).  All the numbers are in milliseconds.  When ESMF_Initialize() is called, each processor opens or creates a log file for error message logging.  When the argument defaultLogType is set to ESMF_LOG_SINGLE, all the processes write to a single log file named ESMF_LogFile.  If defaultLogType is set to ESMF_LOG_MULTI, each processor opens its own log file called PETnn.ESMF_LogFile.

The time to call ESMF_Initialize() and ESMF_Finalize() grows roughly linearly with the number of processors.  A more detailed timing analysis reveals that over 95% of the total time is spent opening the log file in ESMF_Initialize() and flushing and closing the log file in ESMF_Finalize().  ESMF_Initialize() is faster with multiple log files for large numbers of processors (>1000), but the timing difference between the single and multiple log file modes is not significant.  ESMF_Finalize(), however, runs much slower with a single log file than with multiple log files.  From 1024 to 2048 processors, ESMF_Initialize() slows down more than linearly in both modes (by a factor of about 2.8 to 3.0 for a doubling of the processor count).  In particular, the ESMF_Finalize() time with ESMF_LOG_MULTI jumps from 76.79 ms to 3272.24 ms when the processor count increases from 1024 to 2048.  We repeated the 2048-processor run several times and the result was consistent.  We suspect that this is due to a parallel I/O limitation on the Cray XT3.

 

Table 1:  The performance of ESMF_Initialize and ESMF_Finalize (times in milliseconds)

              ESMF_Initialize                      ESMF_Finalize
# Processors  Single Log File  Multiple Log Files  Single Log File  Multiple Log Files
           4          45.1099             55.8719          10.6060              3.8490
           8          56.6609             65.0341          22.2728              4.6570
          16          78.2170             79.3970          45.4728              6.5839
          32         128.3328            130.0080          96.3781              6.6268
          64         262.7029            300.4269         191.8740              6.6380
         128         531.5239            629.7939         387.7599              7.0879
         256        1105.5291           1176.2240         803.4339             13.5019
         512        2354.6889           2985.8060        1572.3071             25.3999
        1024        5533.4878           4157.5648        3173.5279             76.7910
        2048       15760.0679          12389.1499        6145.0428           3272.2380
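The scaling behavior in Table 1 can be checked by computing the slowdown factor for each doubling of the processor count, where a factor of 2.0 corresponds to linear growth.  The following Python sketch does this with the values copied from Table 1; the names TABLE1_MS and doubling_ratios are illustrative and not part of the benchmark.

```python
# Times in milliseconds for 4, 8, ..., 2048 processors, copied from Table 1.
TABLE1_MS = {
    "ESMF_Initialize, single log file": [
        45.1099, 56.6609, 78.2170, 128.3328, 262.7029,
        531.5239, 1105.5291, 2354.6889, 5533.4878, 15760.0679,
    ],
    "ESMF_Initialize, multiple log files": [
        55.8719, 65.0341, 79.3970, 130.0080, 300.4269,
        629.7939, 1176.2240, 2985.8060, 4157.5648, 12389.1499,
    ],
    "ESMF_Finalize, single log file": [
        10.6060, 22.2728, 45.4728, 96.3781, 191.8740,
        387.7599, 803.4339, 1572.3071, 3173.5279, 6145.0428,
    ],
    "ESMF_Finalize, multiple log files": [
        3.8490, 4.6570, 6.5839, 6.6268, 6.6380,
        7.0879, 13.5019, 25.3999, 76.7910, 3272.2380,
    ],
}


def doubling_ratios(times):
    """Slowdown factor for each doubling of the processor count."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for name, times in TABLE1_MS.items():
        last = doubling_ratios(times)[-1]
        print(f"{name}: x{last:.2f} from 1024 to 2048 processors")
```

For the step from 1024 to 2048 processors, the factors are about 2.85 and 2.98 for ESMF_Initialize() (single and multiple log files), about 1.94 for ESMF_Finalize() with a single log file, and about 42.6 for ESMF_Finalize() with multiple log files.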

 

 

Table 2 and Figure 1 show the timings for the ESMF component subroutines.  All the numbers are in microseconds.  All the subroutines except ESMF_GridCompRun() are called only once by the application program, so the overhead of ESMF_GridCompRun() is the one that matters most.  According to Table 2, a call to ESMF_GridCompRun() takes less than 20 microseconds on 1024 or fewer processors and only 29.4 microseconds on 2048 processors.

 

Table 2. ESMF Component Routines Timing (times in microseconds)

# Processors  GridCompCreate  GridCompInit  GridCompRun  GridCompFinalize
           4           59.10         37.20         3.60              7.90
           8           76.10         47.90         4.60             10.00
          16           95.80         64.10         5.30             12.10
          32          122.10         74.10         6.30             14.80
          64          161.80         91.00         7.30             16.90
         128          266.10        105.10         8.30             19.80
         256          604.90        108.90        11.00             23.80
         512         1871.80        115.10        14.00             27.90
        1024         6957.10        430.01        19.60             35.00
        2048        28701.00        998.97        29.40             47.20
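To see why the repeated ESMF_GridCompRun() call matters more than the one-time calls, the Table 2 values for 2048 processors can be turned into a simple overhead budget.  The Python sketch below is a back-of-the-envelope calculation only; the function names and the count of 100,000 run calls are hypothetical and do not come from the benchmark.

```python
import math

# Per-call overheads in microseconds on 2048 processors, copied from Table 2.
CREATE_US = 28701.00
INIT_US = 998.97
RUN_US = 29.40
FINALIZE_US = 47.20


def run_overhead_seconds(n_run_calls, run_us=RUN_US):
    """Total ESMF_GridCompRun() overhead, in seconds, for n_run_calls calls."""
    return n_run_calls * run_us * 1e-6


def break_even_run_calls(one_time_us, run_us=RUN_US):
    """Smallest number of run calls whose summed overhead reaches the one-time cost."""
    return math.ceil(one_time_us / run_us)


if __name__ == "__main__":
    one_time_us = CREATE_US + INIT_US + FINALIZE_US
    n_calls = 100_000  # hypothetical number of run calls in a long simulation
    print(f"one-time overhead: {one_time_us / 1000:.2f} ms")
    print(f"run overhead for {n_calls} calls: {run_overhead_seconds(n_calls):.2f} s")
    print(f"break-even run calls: {break_even_run_calls(one_time_us)}")
```

Under these assumptions, the accumulated run overhead overtakes the combined create, initialize and finalize cost after roughly a thousand run calls, so for long simulations the per-call run time is the dominant term.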

 

Figure 1. ESMF Component Subroutine Overhead

 

Optimization and Conclusion

 

The two ESMF superstructure functions that have scalability problems are ESMF_Initialize() and ESMF_Finalize().  As discussed above, over 95% of the time in these two functions is spent opening, flushing and closing the log file.  To make these two functions run more efficiently on peta-scale computers, we added another option to the ESMF error log type, ESMF_LOG_NONE.  When calling ESMF_Initialize(), the user can set the argument defaultLogType to ESMF_LOG_NONE.  In this mode, no log file is opened and no system error messages are written out.  This change is included in ESMF 3.0.1 snapshot #06 and later.  Table 3 adds the timings for this new mode (all the timings are in milliseconds).  Significant improvement was observed for all the configurations.  Figure 2 and Figure 3 compare the ESMF_Initialize() and ESMF_Finalize() timings for the three log options on 4 to 2048 processors.

 

Table 3:  Timing comparison with three different log file type options (times in milliseconds)

         ESMF_Initialize                                   ESMF_Finalize
# Procs  Single Log File  Multiple Log Files  No Log File  Single Log File  Multiple Log Files  No Log File
      4          45.1099             55.8719       2.2139          10.6060              3.8490       0.0789
      8          56.6609             65.0341       2.4030          22.2728              4.6570       0.1130
     16          78.2170             79.3970       2.5470          45.4728              6.5839       0.1600
     32         128.3328            130.0080       2.7969          96.3781              6.6268       0.2009
     64         262.7029            300.4269       2.9039         191.8740              6.6380       0.2649
    128         531.5239            629.7939       3.5570         387.7599              7.0879       0.3008
    256        1105.5291           1176.2240       5.4331         803.4339             13.5019       0.3349
    512        2354.6889           2985.8060      11.7719        1572.3071             25.3999       0.3600
   1024        5533.4878           4157.5648      32.5110        3173.5279             76.7910       0.3891
   2048       15760.0679          12389.1499     122.5300        6145.0428           3272.2380       0.4170

 

Figure 2:  ESMF_Initialize Timing on XT3

 

Figure 3. ESMF_Finalize Timing on XT3
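The size of the improvement from ESMF_LOG_NONE can be quantified as a speedup factor relative to the two logging modes.  The Python sketch below uses the 2048-processor row of Table 3; the names INIT_MS, FINALIZE_MS and speedup_vs_no_log are illustrative only.

```python
# Times in milliseconds on 2048 processors, copied from Table 3, in the order
# (single log file, multiple log files, no log file).
INIT_MS = (15760.0679, 12389.1499, 122.5300)
FINALIZE_MS = (6145.0428, 3272.2380, 0.4170)


def speedup_vs_no_log(times_ms):
    """Return (single / none, multiple / none) speedup factors."""
    single, multiple, none = times_ms
    return single / none, multiple / none


if __name__ == "__main__":
    for name, times in (("ESMF_Initialize", INIT_MS), ("ESMF_Finalize", FINALIZE_MS)):
        vs_single, vs_multiple = speedup_vs_no_log(times)
        print(f"{name}: x{vs_single:.0f} vs single log, x{vs_multiple:.0f} vs multiple logs")
```

On 2048 processors, disabling the log makes ESMF_Initialize() more than 100 times faster than either logging mode, and reduces ESMF_Finalize() from seconds to well under a millisecond.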

 

In summary, the overhead of the ESMF superstructure functions does increase with the number of processors.  However, the overhead remains reasonably low for all the component functions.  Parallel file I/O is the major scalability problem we observed in the benchmarked ESMF functions.  Disabling the error log file hides the problem but does not solve it.  Using the MPI 2.0 parallel I/O APIs is a possible solution to the error log file performance problem, since these APIs are intended to be portable and to behave consistently across platforms.  Further evaluation of MPI parallel I/O performance on peta-scale computers is needed.