February 26, 2002
This note describes preliminary results of an effort to characterize the history output performance of the WRF model through different implementations of the WRF I/O API on the IBM SP using GPFS. A number of test cases were run on the IBM SP at NCEP (machine name: bsp), writing output to GPFS in NetCDF or Binary format every 36 time steps (that is, every simulated hour, assuming a 100 second time step) from the WRF model running on 12 compute processes with 0, 1, or 4 I/O quilt servers.
The first graph shows effective bytes/second when writing approximately 14 Mbytes of data every 36 time steps with 0, 1, and 4 Quilt Servers. The effective rate is the number of bytes written divided by the time WRF spends actively writing the data and not doing useful computation. In other words, time spent writing the data asynchronously is not counted, since useful work is going on in the foreground.
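The effective-rate definition above can be sketched as follows. This is a minimal illustration, not the instrumentation used in the tests: the function name is hypothetical, decimal megabytes are assumed (the note does not say which convention applies), and the 2.8 second blocked time is an illustrative value chosen to correspond to a 5 Mbytes/second rate.

```python
def effective_bandwidth(bytes_written: float, blocking_seconds: float) -> float:
    """Bytes per second as seen by the model.

    Only the time the compute processes spend blocked on output is
    counted; time a quilt server spends writing while the model keeps
    computing is excluded.
    """
    if blocking_seconds <= 0:
        raise ValueError("blocking_seconds must be positive")
    return bytes_written / blocking_seconds


MBYTE = 1.0e6  # assumption: decimal megabytes

# Illustrative (not measured) timing: 14 Mbytes with 2.8 s of blocked time.
rate = effective_bandwidth(14 * MBYTE, 2.8)
print(f"{rate / MBYTE:.1f} Mbytes/second")
```

Because overlapped writing is excluded from the denominator, the same physical write can yield a much higher effective rate once a quilt server takes it off the compute processes.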
As expected, the binary output is significantly faster, 15.7 Mbytes/second, compared with 5 Mbytes/second for NetCDF when there are no quilt servers and no asynchronous output going on. However, when an additional MPI process is dedicated as an output Quilt Server, both formats attain over 80 Mbytes/second because there is ample computational work to hide the time spent writing the data to GPFS using either format. With 4 Quilt Server processes, the binary output is again faster, but the reason for this is unknown, since the number, pattern, and volume of messages between the compute processes and the Quilt Servers are identical regardless of the format used to write the data to disk from the master Quilt Server. One possible explanation is that NetCDF moving the file pointer generates additional message traffic internal to GPFS, and that this traffic contends with the communication between the multiple quilt servers or with the model execution itself. In any case, the difference with 4 Quilt Servers is negligible: the cost of output when using four quilt servers for this 12 processor run is 1/3 of a percent of total run time with NetCDF, compared with 1/4 of a percent with Binary output.
Output performance must be considered in relation to the corresponding computational cost of a scenario. The ratio of output cost to computational cost will vary with problem size, configuration, and number of processors. The volume and frequency of output data are also variable. The relative cost of Binary and NetCDF output must be considered in this context.
The second graph projects output cost as a percentage of run time when computational speed is increased and the volume and frequency of output are held fixed. Note that this is an extrapolation based on the 12 processor timings. The X-axis is the computational speed of the run, that is, the speed of the run in simulated hours per wall-clock hour excluding I/O cost (the 12 processor data point represents 79 simulated hours per hour). The Y-axis is the percentage of total run time that will be consumed by I/O for a run whose computational speed is X simulated hours per hour.
NetCDF without asynchronous output, “netcdf (0)”, consumes 6 percent of the cost of the run on 12 processors. It will exceed 10 percent of the cost of the run if computational speed is doubled. On the other hand, binary output, “binary (0)”, consumes only about 2 percent of the cost of the run on 12 processors, and this cost does not exceed 10 percent until computational speed is increased 6-fold.
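The extrapolation for the non-asynchronous cases can be reproduced with a short calculation. The sketch below assumes the projection simply holds the per-output write time fixed and scales only the compute time; that is one plausible reading of the method, not a statement of how the graph was actually generated. The function name is hypothetical, and the inputs (14 Mbytes per output, one output per simulated hour, 5 and 15.7 Mbytes/second, 79 simulated hours per hour) are taken from the text.

```python
SECONDS_PER_HOUR = 3600.0


def output_cost_percent(speed, mbytes_per_output, mbytes_per_second,
                        sim_hours_between_outputs=1.0):
    """Percent of total wall-clock time spent blocked on output.

    speed: simulated hours per wall-clock hour, excluding I/O.
    """
    # Wall-clock seconds of computation between two outputs.
    compute_s = sim_hours_between_outputs * SECONDS_PER_HOUR / speed
    # Wall-clock seconds the model is blocked writing one output.
    output_s = mbytes_per_output / mbytes_per_second
    return 100.0 * output_s / (compute_s + output_s)


BASE_SPEED = 79.0  # 12-processor run, simulated hours per hour

for label, rate in [("netcdf (0)", 5.0), ("binary (0)", 15.7)]:
    for factor in (1, 2, 6):
        pct = output_cost_percent(BASE_SPEED * factor, 14.0, rate)
        print(f"{label} at {factor}x speed: {pct:.1f}% of run time")
```

Under these assumptions the model gives roughly 6 percent for NetCDF and 2 percent for Binary at the 12 processor speed, with NetCDF crossing 10 percent at double speed and Binary crossing it at about 6-fold speed, in line with the figures quoted above.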
Using asynchronous output with Quilt Servers, the cost of output with NetCDF, “netcdf (1,4)”, becomes indistinguishable from the cost of Binary output, “binary (1,4)”, up through a 20-fold increase in computational speed. At 20-fold computational speed, I/O is only 6 percent of the run time regardless of whether the data is being written via NetCDF or Binary. Beyond that point, the amount of computation becomes insufficient to mask the cost of writing the NetCDF output to GPFS on the Quilt Server, and the cost jumps to that of the non-asynchronous NetCDF output time, with the result that the code takes twice as long to write the data as to compute it. At this point, however, the model is running at almost 3000 times real time: total time for a 48 hour simulation, I/O and computation combined, is only 2 minutes anyway.
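The masking limit described above can be illustrated with a deliberately simple model, which is an assumption and not something measured in these tests: a quilt server hides a write only if it finishes before the compute processes deliver the next output frame. All function names are hypothetical, and the 2.0 second server write time is an illustrative value, not a measurement.

```python
SECONDS_PER_HOUR = 3600.0


def compute_seconds_per_output(speed, sim_hours_between_outputs=1.0):
    """Wall-clock seconds of computation between two output frames.

    speed: simulated hours per wall-clock hour, excluding I/O.
    """
    return sim_hours_between_outputs * SECONDS_PER_HOUR / speed


def write_is_hidden(speed, server_write_seconds,
                    sim_hours_between_outputs=1.0):
    """True if the server finishes writing before the next frame arrives."""
    available = compute_seconds_per_output(speed, sim_hours_between_outputs)
    return server_write_seconds <= available


def max_hidden_speed(server_write_seconds, sim_hours_between_outputs=1.0):
    """Fastest computational speed at which the write is still fully masked."""
    return sim_hours_between_outputs * SECONDS_PER_HOUR / server_write_seconds


WRITE_S = 2.0  # hypothetical server write time per output frame, in seconds

print(write_is_hidden(79.0, WRITE_S))    # 12-processor speed
print(write_is_hidden(3000.0, WRITE_S))  # near 3000 times real time
print(max_hidden_speed(WRITE_S))
```

In this model the threshold depends only on the server's write time per frame and the output interval, which is why a slower format reaches the limit at a lower computational speed than a faster one.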