About ESMF Download Users Developers Management Work Plans Metrics Impacts
Data Redistribution in the NCAR CAM and CLM

Peggy Li/JPL
September 13, 2006

 

The objective of this task is to evaluate and optimize the performance of ESMF arbitrarily distributed grid redistribution functions in preparation for performance evaluations that are part of the CCSM Stage 1 Evaluation Plan. In Stage 1 of the Plan, we focus on the data redistribution between the atmosphere model, CAM and the land model, CLM.

 

The benchmark results shown below were generated on the IBM Cluster at NCAR, bluevista, and the Cray X1E at Cray Inc., earth. Bluevista is a IBM SMP cluster runing AIX version 5.3. The benchmark program was compiled using 64-bit IBM Fortran90 (version 8.1.1) and C++ compilers (version 6.0.0). earth is a Cray X1E system running Unicos/mp operating system. The benchmark program was compiled using 64-bit Cray Fortran90 and C++ compilers (version 5.5.0.4).

 

The source code of the benchmark program can be browsed or downloaded at the SourceForge ESMF contributions site. Use the following command to retrieve the code:

 

cvs -z3 -d:pserver:anonymous@esmfcontrib.cvs.sourceforge.net:/cvsroot/esmfcontrib co -P performance_tests/ccsm/ESMF/arb2arb

 

This evaluation was performed with ESMF version 2.2.2r.

 

ESMF Benchmark Program

 

Our benchmark program contains four components: an Atmosphere Grid Component, a Land Grid Component, an Atmosphere-to-Land Coupler Component and a Land-to-Atmosphere Coupler Component.  The main goal of the benchmark program is to evaluate the overhead of the ESMF data redistribution functions.  Following is a brief description of each component:

 

         AtmGrdComp (Atmosphere Grid Component) – This component mimics the CAM model in CCSM.  It creates a 2D arbitrarily distributed global rectangular grid and a bundle of 19 floating-point fields associated with the grid.  This bundle will be exported to the A2LCplComp component for redistribution to the LndGrdComp component.  It also imports a 13 field bundle from the L2ACplComp component.

 

         LndGrdComp (Land Grid Component) – This component mimics the CLM model in CCSM.   The land grid is the same as the CAM grid but is distributed differently.  The Land model contains sparse data  - only 40% of grid points contain valid data.  The data are organized as a one-dimensional array, then distributed evenly in all the processors.  This component exports a bundle with 13 floating-point fields to the L2ACplComp component and imports a 19 field bundle  form the A2LCplComp component.

 

         A2LCplComp (Atmosphere to Land Coupler Component) – This component handles data redistribution from AtmGrdComp to LndGrdComp. 

 

         L2ACplComp (Atmosphere to Land Coupler Component) – This component handles data redistribution from LndGrdComp to AtmGrdComp.

 

We implemented direct arbitrary-to-arbitrary redistribution between the CAM grid and the CLM grid. In the main program, we run grid redistribution in a loop of 50 iterations calling the run routines of  AtmGrdComp, A2LCplComp, LndGrdComp, and L2ACplComp in sequence.   The CAM grid is first redistributed into the CLM grid by A2LCplComp, then back to the CAM grid by L2ACplComp.  The redistributed fields in the CAM grid will be compared with the original CAM fields for correctness.  The AtmGrdComp calculates and collects the total number of correct results at PET 0.  The total number of correct results should be equal to the total number of land points.  If they are not equal, an error message will be printed.

 

We measure both the time to initialize the redistribution, i.e. ESMF_BundleRedistStore() and the time to do the redistribution, i.e., ESMF_BundleRedistRun().  The initialization time is the sum of the two ESMF_BundleRedistStore() calls in A2LCplComp and L2ACplComp.  The run time is the total time to distribute CAM grid to CLM grid in A2LCplComp and to redistribute it back from CLM to CAM grid in L2ACplComp.  We report the average run time out of 50 iterations.

 

The Grids

 

In this exercise, we benchmark two sizes of grids: 128x64 elements and 256x128 elements.  For each grid, we use the data decomposition listed in Table 1.

 

Grid Size

Non-OpenMP (# processors)

OpenMP

128x64

16, 32, 64

8 tasks, 8 threads

256x128

32, 64, 128

16 tasks, 8 threads

Table 1:  Test Grids and Their Decompositions

 

The decomposition for the CAM grid is the Option 0 decomposition with local data split into chunks to maintain load-balancing. (http://swiki.ucar.edu/ccsm/35) However, after sorted in longitude/latitude order, the option 0 CAM decomposition is the same as latitude-only decomposition.  Figure 1 is an 8-task OpenMP decomposition of the CAM grid of size 128x64.  The CLM decomposition is different.  The CLM data is organized as 1D vector, an 8-task OpenMP decomposition of the grid size 128x64 is shown in Figure 2.  The grid point with value 0 (dark blue) are the missing data.   The redistribution benchmark for OpenMP case is no different from the non-OpenMP case. 

 

The grid point indices for each PET are stored in the input files in ASCII format, one file per PET per configuation.    We have input files for the above 8 configurations.   The CAM distribution file is read in by the AtmGrdComp component in the initialization routine to construct the CAM grid.  The CLM distribution file is read in by the LndGrdComp component in the initialization routine to construct the CLM grid.

Figure 1  128x64 CAM Grid Decomposition for 8 tasks

Figure 2 128x64 CLM Grid Decomposition for 8 tasks

 

The Results

 

The benchmark results shown below were generated on the IBM Cluster at NCAR, bluevista and the Cray X1E at Cray Inc., earth. We ran each benchmark program three times for each configuration and used the best (the fastest) result out of the three runs.   The run time is the average of 50 iterations built in the program.  All the units are in milliseconds.  The timing results for the 128x64 grid are shown in Table 2, Figure 3 and 4. On the IBM cluster, the initialization time is below 50 milliseconds and the redistribution run time is within 2 milliseconds (except for the 64 processors case). The timing results for the 256x128 grid are shown in Table 3, Figure 5 and 6.  On the IBM machine, the initialization time is below 150 milliseconds and the run time is within 7 milliseconds except for the 128 processor case. 

 

The performance of X1E is much worse than that of IBM-SP.  In general, the program on X1E runs over 10 times slower than on the IBM for all the eight configurations.  When we did the performance profiling on X1, we found out there are two functions performed very poorly on X1, i.e., MPI_Broadcast and memcpy.   These two functions were heavily used in the original ESMF code thus becoming the bottleneck in the ESMF benchmark program.   The poor performance of MPI_Broadcast affects the performance of ESMF_BundleRedistStore() and memcpy affects the performance of ESMF_BundleRedistRun().   We hence replaced a series of MPI_Broadcast calls by EMSF_VMAllGatherV, and memcpy calls by simple assignment statements.  The results shown below reflect the timing of the optimized code.

 

128x64 grid

# Nodes

Init (X1E)

Init (IBM)

run (X1E)

run (IBM)

8

357.0178

40.5002

16.5927

1.2776

16

218.2901

34.1019

14.8972

1.5684

32

389.4656

31.3586

34.928

1.9814

64

425.2421

29.9956

59.4735

2.9228

Table 2:  Redistribution Time for the 128x64 grid

 

 

Figure 3.  Redistribution Initialization Time for the 128x64 grid

 

Figure 4.  Redistribution Run Time for the 128x64 grid

 

256x128 grid

# Nodes

Init (X1E)

Init (IBM)

run (X1E)

run (IBM)

16

924.6599

150.6831

30.0566

4.0421

32

1087.6294

140.4841

40.592

3.3827

64

1149.7676

124.6631

64.6535

4.6124

128

1728.3291

128.1008

129.1839

7.5746

Table 3:  Redistribution Time for the 256x128 grid

 

Figure 5.  Redistribution Initialization Time for the 256x128 grid

 

Figure 6.  Redistribution Run Time for the 256x128 grid

 

Optimization

 

The main objective of this task is to find out the bottlenecks in the ESMF implementation for the arbitrary to arbitrary grid redistribution and optimize it.  In addition to the optimization we did specifically for X1 as described above, we also implemented several general optimizations in both ESMF_BundleRedistStore and ESMF_BundleRedistRun.  The major one is a redesign of  function ESMF_RoutePrecomputeRedistV() called by ESMF_BundleRedistStore for the arbitrary-to-arbitrary grid redistribution.  In this routine, the Send and Recv Routing Tables are calculated in each PET: the Send Routing Table determines which destination PETs the local source grid points should be sent to and the Recv Routing Table determines where the remote grid points come from for my local destination grid.  The new algorithm sorts the local and the global grid points in the order of its grid index to reduce the time to calculate the intersection of the source grid and the destination grid. 

 

Conclusion

 

From the timing results reported above, we concluded that the ESMF performance is comparable to the current CCSM implementation for the arbitrary grid to arbitrary grid redistribution using the CCSM grid distributions on the two target machines, i.e., the IBM-SP and the Cray X1E.