John G. Michalakes1, Michael McAtee2, Jerry Wegiel3
1National Center for Atmospheric Research, NCAR/MMM Boulder, CO 80301
2 Aerospace Corporation, Los Angeles, California
3 Air Force Weather Agency, Offutt AFB, Nebraska
The Weather Research and Forecast (WRF) project is a multi-institution effort to develop an advanced mesoscale forecast and data assimilation system that is accurate, efficient, and scalable across a range of problems and over a host of computer platforms. The current version, WRF 1.2, was released in spring 2002. Operational deployment is targeted for the 2004-05 time frame. A significant portion of the WRF effort, funded under the DoD HPCMO CHSSI program (CWO-6), has focused on developing and implementing software infrastructure that facilitates modular, flexible, reusable, and performance-portable software. This presentation outlines key elements of the WRF Advanced Software Framework (ASF) and presents recent WRF model performance results on a number of computational platforms.
The WRF project is developing a next-generation mesoscale forecast model and data assimilation system that will advance both the understanding and the prediction of mesoscale precipitation systems and will promote closer ties between the research and operational forecasting communities. It is intended for a wide range of applications, from idealized research simulations to operational forecasting, with priority emphasis on horizontal grids of 1–10 kilometers. Features include advanced numerics and data assimilation techniques, a multiple relocatable nesting capability, and improved physics, particularly for treatment of convection and mesoscale precipitation. An entirely new model, WRF provides the opportunity to develop state-of-the-art software that is flexible, maintainable, extensible, efficient, and portable to a range of high-performance computing platforms (Michalakes et al. 1999, 2001a). The software infrastructure developed is the WRF Advanced Software Framework (ASF). The next section provides an overview of the WRF development effort, project organization, and objectives. Section 3 describes the WRF software infrastructure. Section 4 presents performance and scaling results. Section 5 is the conclusion.
2. WRF PROJECT OVERVIEW
The WRF project is a large, multi-institutional collaboration involving the NCAR Mesoscale and Microscale Meteorology Division; the National Centers for Environmental Prediction (NCEP), the Forecast Systems Laboratory (FSL), the National Severe Storms Laboratory (NSSL), and the Geophysical Fluid Dynamics Laboratory (GFDL) of the National Oceanic and Atmospheric Administration (NOAA); the Air Force Weather Agency (AFWA), the Naval Research Laboratory (NRL), and the High Performance Computing Modernization Office (HPCMO) within the U.S. Department of Defense (DoD); the Federal Aviation Administration (FAA); the University of Oklahoma Center for Analysis and Prediction of Storms (CAPS); the Environmental Protection Agency Atmospheric Modeling Division; the Atmospheric Sciences Division at the NASA Goddard Space Flight Center; and the university research community. Significant funding for WRF development was provided by project CWO-6: Scalable WRF Model Development under the DoD HPCMO Common High Performance Computing Software Support Initiative (CHSSI). The WRF project also has support from the United States Weather Research Program (USWRP), the FAA, and the participating institutions themselves.
The project schedule calls for release of a fully functional research-quality version of the model by the end of 2002; a fully functional operational code is scheduled for 2004, with some aspects of the development such as four-dimensional variational data assimilation (4DVAR) and ensemble assimilation techniques extending into the 2005-2008 time frame. The first release of a WRF prototype to the community was in November 2000. WRF is now in its third community release. WRF 1.2, released in April 2002, features two run-time selectable dynamical cores (Eulerian split-explicit with height- or mass-based vertical coordinates), at least one and in most cases several options for each major type of physical parameterization (microphysics, cumulus parameterization, planetary boundary layer, turbulence, radiation), and a new land surface model. WRF runs efficiently on single-processor, shared-memory parallel, distributed-memory parallel, and hybrid parallel systems, including IBM, SGI, Compaq, and Linux (PC and Alpha). As part of the ongoing validation and verification of WRF, the model is run daily in quasi-operational NWP mode at a number of the collaborating institutions’ sites. NCAR runs the model daily at 22-kilometer resolution over the U.S. The model is run at 10-kilometer resolution on the large Linux/Alpha Beowulf system at FSL. Real-time WRF NWP runs are also conducted at NSSL and at AFWA. WRF is still considered a prototype; however, it has been released to and supported for the community since late 2000, and there are now some 600 registered users, many of whom will attend the 3rd WRF Users’ Workshop and Tutorial at NCAR later this month (June 25-28, 2002).
3. SOFTWARE INFRASTRUCTURE
The WRF software infrastructure is designed to support the goals and requirements of the WRF model. WRF is a community model developed, maintained, distributed, and supported for a diverse community of users applying the model to a host of different problems, in many cases introducing their own local modifications to the code. Therefore, the WRF model must be portable, efficient, flexible, extensible, maintainable, and understandable. Multiple dynamics and physics options, nesting, and data options should be run-time configurable. The model design should support efficient and reasonably rapid software development and promote and facilitate code reuse, leveraging the investment in scientific and infrastructure development by other groups and contributing to the pool of such software.
Key aspects of the WRF software design are: (1) a single-source code methodology, since it is infeasible to support multiple versions of the WRF model software for different applications or computer platforms; (2) use of modern programming-language constructs in Fortran90, including modules, dynamic memory allocation, derived data types (structures), and recursion (array syntax is avoided for performance reasons); (3) a layered software architecture with well-defined interfaces that separate computer platform-specific concerns from model-specific concerns, enhancing modularity and reuse; (4) multi-level parallel decomposition to facilitate efficient execution on all foreseeable parallel computer architectures and processor types; (5) a code registry database and tool that helps programmers manage data structures and interfaces across the software hierarchy; (6) application program interfaces (APIs) to insulate the WRF software from external packages for communication, I/O, and data formats that may vary by institution, model application, or computer architecture; (7) a nesting/coupling infrastructure that is scalable and efficient; and (8) choice of a storage order and loop nesting order that is optimally performance-portable.
3.1 Hierarchical Software Architecture
The WRF software architecture (Figure 1) consists of three distinct layers: a model layer that is usually written by scientists, a driver layer that is responsible for parallel decomposition, allocating and deallocating memory, and controlling the integration sequence and I/O, and a mediation layer that communicates between the driver and model. In this manner, the user code is isolated from the concerns of parallelism. The design facilitates interchangeable, reusable software; for example other drivers can be used without affecting the underlying model. In like fashion, driver software written for the WRF model can be used for other models, provided that they conform to the interface specifications.
WRF model layer subroutines perform actual computations in the model: advection, diffusion, time-integration, physics, etc. and are required to be tile-callable, which means the routine can be called for an arbitrary rectangular subset of the domain. Whereas the driver layer stores all the fields for a domain within a single data object, a Fortran90 derived data-type that is opaque to the driver layer, the model layer deals exclusively with individual data arrays passed as arguments – the mediation layer is responsible for dereferencing the arrays from the derived data types. The interface between the WRF model layer and other layers of the architecture is well defined and public. Adherence to this interface is a requirement for model layer code to be incorporated into the WRF software.
The WRF model layer interface specifies a strong version of the interface rules outlined in Kalnay et al., 1989. The domain, memory, and run dimensions are passed separately and unambiguously into the model layer subroutine as integer arguments denoting the starts and ends of each of the three spatial dimensions (eighteen arguments total). Domain dimensions describe the logical (undecomposed) domain and are used primarily for boundary tests. Memory dimensions provide the dimensions of local memory arrays sized to handle the data and halo regions allocated to a process. The run dimensions provide starts and ends for loops over the local subdomain. All indexing within the WRF model is global; that is, the value of an i, j, or k reference into a state array is always the same (the undecomposed index of the point) regardless of how the domain is subdivided over process memories. To ensure thread-safety, there is no communication of model state through common blocks or as module data, and there may be no saved data within the subroutine itself. WRF model layer subroutines may not include I/O, interprocessor communication, or multi-threading mechanisms; therefore, the temporal scope of a model layer subroutine is that unit of computation that can be performed without concern for coherency, that is, without the need to communicate (e.g. update halo regions) with other processes or synchronize with other threads.
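The interface convention can be illustrated with a minimal sketch. It is written in Python rather than the Fortran90 used by WRF, it is two-dimensional (twelve dimension arguments rather than eighteen), and the names `LocalArray` and `zero_boundary_tile` are hypothetical; none of this is taken from the WRF source.

```python
# Illustrative sketch only; all names are hypothetical, not WRF code.

class LocalArray:
    """2-D array addressed with global (undecomposed) i, j indices.

    Storage covers only the memory dimensions ims..ime, jms..jme
    (a patch plus its halo), mimicking a Fortran array declared
    with explicit lower bounds.
    """

    def __init__(self, ims, ime, jms, jme, fill=0.0):
        self.ims, self.jms = ims, jms
        self.data = [[fill] * (ime - ims + 1) for _ in range(jme - jms + 1)]

    def __getitem__(self, ij):
        i, j = ij
        return self.data[j - self.jms][i - self.ims]

    def __setitem__(self, ij, value):
        i, j = ij
        self.data[j - self.jms][i - self.ims] = value


def zero_boundary_tile(field,
                       ids, ide, jds, jde,   # domain dimensions
                       ims, ime, jms, jme,   # memory dimensions
                       its, ite, jts, jte):  # run (tile) dimensions
    """Tile-callable routine: touches only points its..ite, jts..jte.

    No I/O, no communication, no saved state. The domain dimensions
    are used only for the boundary test. The memory dimensions are
    unused here; in Fortran they would size the dummy array.
    """
    for j in range(jts, jte + 1):
        for i in range(its, ite + 1):
            if i in (ids, ide) or j in (jds, jde):
                field[i, j] = 0.0
```

A caller standing in for the mediation layer might own points i = 5..8 of an 8 x 8 domain with a one-point halo, allocate `LocalArray(4, 9, 0, 9)`, and call the routine once per tile (for example j = 1..4 and j = 5..8). Because the indices are global, the routine body is identical for every tile and every decomposition.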
This layered approach to the WRF software design, and in particular the model-layer interface specification, supports a completely general but efficient approach to parallelism by insulating the model layer code from all computer architecture-specific concerns. By following the interface convention, a scientist can contribute code and be assured that it will run in parallel in WRF without any special effort on his or her part. Further, adherence to the rules for plug-compatible physics allows WRF model layer routines to be easily interchanged with other models. The WRF driver layer software can be separated from the WRF model layer and used as a framework to parallelize other components of the WRF system (WRF 3DVAR; see McAtee et al., 2002 in this volume) or other models entirely (the NOAA/NCEP MesoNH model). Finally, the layered design allows WRF to take advantage of existing software infrastructure if it exists at a particular institution -- for example, NOAA/GFDL’s Flexible Modeling System (FMS) (GFDL, 2002) – and to leverage broad-based community efforts to develop common model infrastructure such as the NASA-funded Earth System Modeling Framework project (Dickenson et al., 2002).
3.2 Multi-level Parallel Decomposition
Parallelism in WRF is by two-dimensional horizontal domain decomposition; that is, the X and Y (approximately west-east and south-north) dimensions are block-decomposed over processes or threads. In order to be completely general with respect to shared-memory, distributed-memory, and hybrid-parallel computers, WRF employs a two-level decomposition strategy. The X and Y dimensions of a model domain, for example CONUS, are decomposed into patches that are assigned to distributed-memory processes; the patches may be further subdivided into tiles that are assigned to multiple threads on each process. In this way, the code is trivially portable to computers that are single-processor, pure shared-memory, pure distributed-memory, distributed shared-memory, and distributed-memory clusters of shared-memory nodes. Further, the size and aspect ratio of the tiles are controllable in the WRF driver layer, which may specify long thin tiles on vector systems or small cache-friendly tiles on microprocessors, all without change to the code itself. This also supports irregular decomposition over differently sized rectangular subdomains to facilitate load balancing. A two-level decomposition is described in Foster and Michalakes, 1993, and others have also successfully employed two-level approaches to parallelizing geophysical models (Desgagne et al., 1997, Sawdey et al., 1997).
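The patch/tile idea can be sketched as follows. This is a simplified illustration with hypothetical function names and a plain even block split; it assumes no more parts than points in each direction and does not reproduce the decomposition algorithm in the WRF driver layer.

```python
def split_range(start, end, nparts):
    """Split the inclusive range start..end into nparts contiguous blocks.

    Assumes nparts does not exceed the number of points in the range.
    """
    n = end - start + 1
    base, extra = divmod(n, nparts)
    blocks, lo = [], start
    for p in range(nparts):
        size = base + (1 if p < extra else 0)
        blocks.append((lo, lo + size - 1))
        lo += size
    return blocks


def decompose(ids, ide, jds, jde, px, py, tx, ty):
    """Two-level decomposition of the horizontal domain.

    Level 1: px * py patches, one per distributed-memory process.
    Level 2: each patch is cut into tx * ty tiles for threads.
    Returns {patch bounds: [tile bounds, ...]}, with bounds given as
    (i_start, i_end, j_start, j_end) in global indices.
    """
    layout = {}
    for ps_j, pe_j in split_range(jds, jde, py):
        for ps_i, pe_i in split_range(ids, ide, px):
            tiles = [(ts_i, te_i, ts_j, te_j)
                     for ts_j, te_j in split_range(ps_j, pe_j, ty)
                     for ts_i, te_i in split_range(ps_i, pe_i, tx)]
            layout[(ps_i, pe_i, ps_j, pe_j)] = tiles
    return layout
```

Calling `decompose(1, 10, 1, 6, 2, 1, 1, 3)` gives two patches, each cut into three tiles that span the full west-east extent of the patch, the "long thin" shape suited to vector systems. Choosing `tx` and `ty` larger instead yields small, cache-friendly tiles with no change to the routines that are called on them.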
The WRF software infrastructure also provides X-Z and Y-Z parallel domain decompositions and communication for transposing data between these decompositions to support spectral transforms, recursive filters (currently used in WRF 3DVAR), and other non-local methods.
The WRF software infrastructure provides a great deal of flexibility with respect to computer architectures, dynamical cores, applications, and external libraries. The registry is a computer-aided software engineering (CASE) mechanism built into the WRF software framework to help manage the complexity that comes with this flexibility. Essentially an active data dictionary, the registry comprises (1) a user-accessible and user-modifiable database of information about the source code and (2) a mechanism that uses the database to automatically generate large sections of code that would otherwise be extremely error-prone and effort-intensive to generate and manage by hand. Currently, the WRF registry database contains 570 entries, and from these some 30,000 lines of code are automatically generated: code to declare, allocate, initialize, and associate model state data with particular dynamical cores; dummy argument lists and their declarations, as well as actual argument lists that are part of the interface between layers in the WRF software architecture (driver, mediation, and model layers); calls to routines for initial, restart, history, and boundary I/O; halo-exchange, periodic boundary updates, transpose, and nesting communications for selected state data fields in selected patterns; and code to define, input, and broadcast model configuration (e.g. namelist) information within the running application. The registry provides a single point of control for making changes that affect many different aspects of the code. Adding a state variable to a model, or modifying an existing one, involves modifying a single line of a single file. This capability has enhanced productivity in the WRF development effort. In addition, the registry serves as a prototype for the development of an Application-Specific Interactive Development Environment (ASIDE), a possible next step beyond computational frameworks.
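The "active data dictionary" idea can be illustrated with a toy generator. The table format below (name, type, dimensionality, description) is a hypothetical simplification invented for this sketch, not the actual WRF registry syntax, and the emitted Fortran-style lines only suggest the kind of code such a tool produces.

```python
# Hypothetical, simplified registry table: name, type, dims, description.
REGISTRY = """
u     real  3d  x-wind component
v     real  3d  y-wind component
psfc  real  2d  surface pressure
"""


def parse(text):
    """Turn the registry table into a list of entry dictionaries."""
    entries = []
    for line in text.strip().splitlines():
        name, ftype, dims, desc = line.split(None, 3)
        entries.append({"name": name, "type": ftype, "dims": dims, "desc": desc})
    return entries


def gen_declarations(entries):
    """Emit Fortran-style pointer declarations for a domain derived type."""
    shape = {"2d": "(:,:)", "3d": "(:,:,:)"}
    return [f"{e['type']}, pointer :: {e['name']}{shape[e['dims']]}  ! {e['desc']}"
            for e in entries]


def gen_allocations(entries):
    """Emit Fortran-style allocate statements using memory dimensions."""
    bounds = {"2d": "ims:ime,jms:jme", "3d": "ims:ime,kms:kme,jms:jme"}
    return [f"allocate(grid%{e['name']}({bounds[e['dims']]}))" for e in entries]
```

Adding one line to `REGISTRY` adds a matching declaration and allocation on the next generation pass, which is the single-point-of-control property described above. A fuller tool would emit argument lists, I/O calls, and communication code from the same table.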
3.5 I/O Interface and Data Architecture
Efficiency is a key concern for model I/O. Overlaying this concern, users of the WRF model, especially operational centers, also have considerable investment in existing data infrastructure: observational and preprocessing systems, analysis and visualization systems, sources of data, and forecast products that end-users expect in a particular format. The operational forecasting community uses the WMO standard GRIB and BUFR formats. The research community is similarly invested in NetCDF and, to a lesser degree, HDF. Over both communities, there is interest in flexible, easy-to-access, self-describing data formats, though the acceptable performance penalty for such utility varies between operational and research users. WRF must be adaptable to the needs of particular users without impacting the WRF code itself.
The WRF data architecture insulates the WRF software from installation-specific aspects of I/O, storage, and data formatting (Figure 2). The principal feature of this architecture is the WRF I/O API, a package-neutral interface to external data services. Adapting WRF to the requirements of a particular installation involves simply developing a package- or installation-specific implementation of the subroutines specified in the I/O API. WRF code that calls the API need not change. Features of the I/O API include:
· A two-step open-for-write that provides a package-neutral mechanism for pre-defining the contents of a dataset for packages such as NetCDF that can use this information to provide more efficient access,
· Support for both serial (for efficiency) and random access of fields within a dataset by time and field name,
· Support for transposition of 2- and 3-D fields between the various local memory orders within the application and the storage order found in the dataset,
· Transparent conversion between single- and double-precision (with possible loss of precision; stored precision is that of the writing application),
· Support for setting and getting global and field-associated metadata,
· Package-neutral support for observation data types (planned; currently BUFR),
· Built-in support for reading/writing parallel file systems (planned), and
· A non file-/device-centric view of datasets (planned), making storage details transparent and facilitating grid-computing based data systems.
A NetCDF implementation of the WRF I/O API has been developed by Jacques Middlecoff at NOAA/FSL, in collaboration with Carlie Coats (MCNC Corp.) and others, and is distributed with the WRF model. An HDF implementation is being developed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. GRIB format output is not generated by the I/O API itself but rather is handled as a post-processing task at the two operational centers, NOAA/NCEP and AFWA. This is because GRIB conversion is not reversible without loss of precision and is therefore unsuitable for storing model output for restarts or cycling applications. In addition to supporting fully self-describing and random access formats such as HDF and NetCDF, the I/O API specifies an “internal” access mode that trades some of the full generality of the I/O API for efficiency. In this mode, applications still read and write through the I/O API, but data is explicitly file-centric, there is no random access, and there is no transposition or translation of data, which is stored in the memory order, precision, and native machine format of the writing application.
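A package-neutral interface of this kind can be sketched as an abstract class with interchangeable implementations. All class and method names below are hypothetical and are not the subroutine names of the WRF I/O API; the explicit `define_field` call is one assumed way to model the definition phase between the two open steps, and the in-memory class merely stands in for a NetCDF- or HDF-backed package.

```python
from abc import ABC, abstractmethod


class IoPackage(ABC):
    """Hypothetical package-neutral I/O interface (names are illustrative)."""

    @abstractmethod
    def open_for_write_begin(self, dataset_name):
        """Step 1: start defining the dataset; returns a handle."""

    @abstractmethod
    def define_field(self, handle, field_name, shape):
        """Declare a field before any data is written."""

    @abstractmethod
    def open_for_write_commit(self, handle):
        """Step 2: contents are fixed; the package may pre-size storage."""

    @abstractmethod
    def write_field(self, handle, time, field_name, values):
        """Store one field at one time level."""

    @abstractmethod
    def read_field(self, handle, time, field_name):
        """Random access by time and field name."""


class InMemoryPackage(IoPackage):
    """Toy implementation standing in for a real data-format package."""

    def __init__(self):
        self._datasets = {}

    def open_for_write_begin(self, dataset_name):
        self._datasets[dataset_name] = {"shapes": {}, "committed": False, "data": {}}
        return dataset_name

    def define_field(self, handle, field_name, shape):
        ds = self._datasets[handle]
        if ds["committed"]:
            raise RuntimeError("dataset contents already committed")
        ds["shapes"][field_name] = shape

    def open_for_write_commit(self, handle):
        self._datasets[handle]["committed"] = True

    def write_field(self, handle, time, field_name, values):
        ds = self._datasets[handle]
        if not ds["committed"] or field_name not in ds["shapes"]:
            raise RuntimeError("field was not pre-defined")
        ds["data"][(time, field_name)] = list(values)

    def read_field(self, handle, time, field_name):
        return self._datasets[handle]["data"][(time, field_name)]


def write_history(io, times, fields):
    """Model-side code: depends only on the interface, not the package."""
    h = io.open_for_write_begin("history")
    for name, series in fields.items():
        io.define_field(h, name, (len(series[0]),))
    io.open_for_write_commit(h)
    for n, t in enumerate(times):
        for name, series in fields.items():
            io.write_field(h, t, name, series[n])
    return h
```

Because `write_history` sees only `IoPackage`, swapping in a different implementation requires no change to the calling code, which is the property the I/O API is designed to provide.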
Package-specific implementations of the WRF I/O API are provided as libraries and can be linked to and called directly by other applications that produce or process WRF data. The WRF model itself calls the I/O API through a number of additional layers of infrastructure (also package-neutral) that provide the following functionality:
· Fast, asynchronous history and restart output through one or more dedicated I/O-server processes (performance of this scheme is discussed in the next section),
· Built-in collection and distribution of decomposed data to and from serial datasets,
· Fast file-per-process I/O as an option,
· Multiple input and output streams, separately run-time assignable to particular data formats, and
· Control, through the WRF registry, over which fields make up a dataset.
3.6 Nesting and Coupling
The WRF software architecture is intended to support nesting for mesh refinement and coupling of large multi-model simulations for forecasting of severe weather, ecosystem simulations, air quality, and other research and operational applications. These functions are still under development.
Inter-model coupling will be implemented as two sub-levels within the WRF software infrastructure. The first level (A) will support regular, scheduled exchange of data and control suitable for tightly-to-moderately coupled model interactions, such as the exchange of atmosphere/ocean fluxes. The next level (B), will support peer-to-peer event-driven interactions between coupled components – individual models or components that are themselves coupled models -- in a large multi-scale multi-disciplinary simulation of a complex ecosystem. Both levels involve grid-to-grid translation, interpolation, and other computational transformations and efficient communication between component models running in single-program multiple-data (SPMD), multi-program multi-data (MPMD) or some mixed mode. As with other interfaces to external functionality in WRF, coupling will be implemented as an API for coupling services in the WRF ASF with initial reference implementations based on the Model Coupling Toolkit (MCT) (Larson and Jacob, 2001) and the Model Coupling Executable Library (MCEL) (Bettencourt, 2002). Ultimately, the NASA ESMF software may provide this functionality.
Conceptually, nesting is a special form of model coupling except that the requirement for efficiency is much stronger because domains interact every time step. Nesting in WRF is being implemented as two-way interacting domains that are mesh aligned and coincident (that is, non-rotated and each point in the parent domain corresponds exactly to a point in the nest). Nests may move, overlap, and telescope to arbitrary depth. Nests may be instantiated and dis-instantiated on-the-fly. The parallel implementation of nests will be SPMD – every domain will be decomposed independently over the set of processes. Forcing and feedback between domains will be through scatter-gather communications.
3.7 Storage Order
Good computational performance and scaling are key requirements for the WRF software infrastructure and have been the subject of considerable study and effort since early in the design process. The storage order of vertical (K), west-east (I), and south-north (J) axes in three-dimensional arrays and the corresponding loop nesting order are crucial issues for the design of NWP codes. Conventional wisdom, bolstered by some scattered experience, holds that a KIJ ordering is optimal for RISC processors, based on the assumption that having the vertical K-dimension stride-one/innermost allows entire vertical columns of the domain to fit into processor data-cache. Compared to IJK implementations, KIJ physics implementations reduce horizontal IJ size temporary arrays to scalars that can be stored in registers, thereby reducing the memory footprint and memory traffic relative to IJK formulations. Conventional wisdom also holds that KIJ ordering is anathema to vector processors because K is the shortest dimension in an atmospheric model domain and because the K dimension has data dependencies that break vectorization, particularly in model physics. It is axiomatic that vector systems prefer IJK or, better still, LijK data and loop organizations because these provide the long, dependency-free stride-one dimensions needed for efficient vectorization.
Michael Ashworth of Daresbury Laboratory, U.K., studied and reported on the performance effects of non-optimal storage order on vector and RISC-based systems running a geophysical kernel code (Ashworth, 1999). This study provided evidence to support the idea that IJK was better on vector and KIJ was better on microprocessors, but with an additional quantitative result: RISC systems were much more tolerant of vector-friendly code organizations, suffering considerably smaller penalties than did vector systems in the reverse case. Therefore, if WRF were to target only RISC systems, the K-innermost ordering would be the clear choice. However, if WRF were to target both RISC and vector systems, an I-innermost ordering should be chosen, provided the penalty on RISC were not too great. To inform the decision for WRF, a recapitulation of Ashworth's study was conducted using the WRF model prototype, which at that time was relatively small and dry (virtually no physics). The prototype was cast in IJK, KIJ, and one new storage order, IKJ.
The WRF storage order study showed that IKJ is a reasonable compromise that provides good efficiency on microprocessors yet retains the I-innermost property essential for good vector performance. Based on the results of the study, which was conducted using Compaq Alpha EV56 and EV6, IBM Power3, and Fujitsu VPP5000 processors, an IKJ ordering was chosen for WRF in April 2000. Since that time, with the addition of new dynamics and many model physics packages, the WRF code has grown too large to manipulate storage and loop order for study. However, a more thorough investigation was conducted using computational kernels from WRF (Michalakes et al., 2001b).
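What a storage order means for memory access can be made concrete by computing strides in column-major (Fortran) storage, where the first listed dimension is stride-one. The sketch below is illustrative; the function names and the sample extents (100 x 35 x 80) are hypothetical and not taken from the study.

```python
def stride_of(order, extents, dim):
    """Elements between consecutive indices of `dim` in column-major storage.

    order   -- storage order, e.g. "IKJ" (first letter is stride-one)
    extents -- dict of dimension sizes, e.g. {"I": 100, "K": 35, "J": 80}
    """
    stride = 1
    for d in order:
        if d == dim:
            return stride
        stride *= extents[d]
    raise ValueError(f"dimension {dim!r} not in order {order!r}")


def offset(order, extents, idx):
    """Zero-based linear offset of the point with zero-based indices idx."""
    return sum(idx[d] * stride_of(order, extents, d) for d in order)
```

With these sample extents, `stride_of("IKJ", extents, "I")` is 1 and `stride_of("IKJ", extents, "K")` is 100, so an innermost loop over I runs through contiguous memory (good for vectorization) while a step in K stays within one I-K slab. Under "IJK" a step in K jumps 8000 elements, and under "KIJ" the stride-one dimension is the short K axis.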
This section presents WRF I/O performance and preliminary computational performance on a large NWP forecast case.
4.1 I/O Performance
Moving data into and out of a running NWP model is a critical concern, especially since computational speed typically scales much better than I/O. Moreover, overhead associated with self-describing data formats such as NetCDF may increase the cost to write a dataset several fold, compared to native unformatted binary I/O. WRF I/O benchmarks were conducted on the IBM SP system at NCAR. Figure 3 shows that the effective bandwidth for writing a 14-Megabyte frame of model output to a GPFS-mounted file system was 16 MB/second using native unformatted binary I/O versus only 5 MB/second through NetCDF. Neither rate is good, however, considering that even on only modest numbers of processors, each second spent doing I/O is equivalent to multiple steps of model integration. For this reason, the WRF software infrastructure includes a parallel “quilting” scheme developed by Tom Black (NOAA/NCEP) and Jim Tuccillo (IBM) for NCEP’s operational Eta model. This allows extra processes assigned at the start of the run to serve as I/O “quilt-servers” whose sole responsibility is to receive decomposed output data from computational processes, then asynchronously reassemble and write the data while model integration proceeds without further delay. Figure 3 shows the improvement in effective bandwidth using first one additional process and then four additional processes for both binary and NetCDF output. Quilt-server I/O boosts the effective output bandwidth using either format to greater than 80 MB/second, as long as there is sufficient model computation to cover the time spent by the servers. If not, computation will still stall, waiting for the servers to catch up. To eliminate this potential bottleneck, future implementations will allow multiple banks of quilt-server processes to handle output data in round-robin fashion, ensuring scaling of output to the full bandwidth capacity of the underlying parallel file system.
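The quilt-server pattern can be sketched with a worker thread and a bounded queue. This is an analogy only: Python threads stand in for the dedicated I/O-server processes, appending to a list stands in for the file write, and all names are hypothetical.

```python
import queue
import threading


def quilt_server(inbox, written):
    """Dedicated output worker: reassemble decomposed pieces, then 'write'."""
    while True:
        item = inbox.get()
        if item is None:                  # shutdown sentinel
            break
        step, pieces = item
        frame = [x for piece in pieces for x in piece]   # reassemble patches
        written.append((step, frame))     # stand-in for the slow file write


def run_model(nsteps, output_every, inbox):
    """Compute loop: hands output to the server and keeps integrating."""
    for step in range(1, nsteps + 1):
        # ... model integration for this step would happen here ...
        if step % output_every == 0:
            pieces = [[step, 0], [step, 1]]   # output from two patches
            # put() blocks only if the server has not yet taken the
            # previous frame, i.e. there was too little computation
            # to cover the write time.
            inbox.put((step, pieces))


def main():
    inbox = queue.Queue(maxsize=1)
    written = []
    server = threading.Thread(target=quilt_server, args=(inbox, written))
    server.start()
    run_model(6, 2, inbox)
    inbox.put(None)
    server.join()
    return written
```

The bounded queue reproduces the stall described above: if the server is still busy when the next frame is ready, the compute loop waits. Multiple banks of servers used in round-robin fashion would correspond to several such queues.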
4.2 Computational Performance and Scaling
Figure 4 shows performance of WRF running on two target computing platforms. The first is Blackforest, an IBM SP system at NCAR. The system consists of 293 Winterhawk-II nodes, each with four 375-MHz Power3 processors. Theoretical peak performance is 1.5 Gflop/second per processor. The second system is the Terascale Computing System (TCS), a 750-node Compaq system at the Pittsburgh Supercomputing Center. Each node contains four 1-GHz Alpha EV68 processors. Theoretical peak performance for TCS is 2 Gflop/second per processor. The problem benchmarked is a large forecast domain covering the continental United States at 12-kilometer resolution with 35 vertical layers (4.5 million cells). Computational rates shown in Mflop/second are based on 22 billion floating point operations per average time step, as measured by the SGI Speedshop tool, and on timings gathered using the UNIX gettimeofday real-time system timer. I/O and initialization times were not included. On 512 processors of the Pittsburgh system, WRF runs at 110 Gflop/second, completing a 48-hour forecast in about 10 minutes; 512-processor scaling efficiency relative to 16 processors is 57 percent.
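The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical. The 16-processor time per step is not reported in the text; the value computed below is only what the quoted 57 percent efficiency implies, not a measurement.

```python
def seconds_per_step(flop_per_step, flop_per_second):
    """Wall-clock time for one average time step."""
    return flop_per_step / flop_per_second


def scaling_efficiency(p_ref, t_ref, p, t):
    """(p_ref * t_ref) / (p * t); 1.0 would be perfect scaling."""
    return (p_ref * t_ref) / (p * t)


def implied_reference_time(p_ref, p, t, efficiency):
    """Invert scaling_efficiency to recover the reference time."""
    return efficiency * p * t / p_ref


# 22 billion operations per step at 110 Gflop/second on 512 processors.
t512 = seconds_per_step(22e9, 110e9)                 # about 0.2 seconds

# Time per step on 16 processors implied by 57 percent efficiency.
t16 = implied_reference_time(16, 512, t512, 0.57)    # about 3.65 seconds
```

At about 0.2 seconds per step, a run of "about 10 minutes" corresponds to roughly 3000 time steps for the 48-hour forecast; that step count is an inference from the quoted figures, not a number given in the text.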
The 12-kilometer resolution WRF CONUS domain is too large to benchmark on a single processor. On a different NWP case, single-processor sustained performance has been measured at about 500 Mflop/second on the Compaq and about 180 Mflop/second on the IBM. Single-processor sustained performance on the latest generation IBM Power4 processors has been measured at about 500 Mflop/second; however, at this writing there were no large Regatta-class IBM systems available for benchmarks on large (>32) numbers of processors.
Efficiency as a percentage of theoretical peak performance is plotted on the right-hand axis of Figure 4. By this measure, the Compaq system runs WRF twice as efficiently as the IBM system; this is misleading, however, since theoretical peak is clock speed times the number of floating point units. The IBM processor has four floating point units and the Compaq has only two. In effect, the measure penalizes the IBM design for having more (albeit little-used) floating point hardware on the chip. When efficiency is normalized to simple clock rate rather than theoretical peak floating point performance, the efficiencies become roughly comparable, though the IBM is still slightly less efficient than the Compaq.
Computational efficiency of WRF relative to other NWP models is also being investigated. Benchmark comparisons with the operational NOAA/NCEP Eta model show that, in terms of computational cost per simulated time period, the WRF model is about twice as expensive as the Eta model at the same horizontal resolution. Some of this difference is related to code efficiency. At the time of the comparison in mid-2001, WRF ran at 0.78 times the floating-point rate of the Eta model on an SGI Origin2000 system at NCAR: 130 Mflop/second for WRF versus 167 Mflop/second for Eta. This difference can be at least partially attributed to the newness of the WRF code; even so, it does not account for all the speed difference between the codes. Counting floating point operations with the SGI Speedshop utility shows that WRF entails 1.6 times as many operations as Eta for a given period of model integration: 3200 floating point operations per cell per integrated minute for WRF versus 2000 for Eta. The fact that WRF is non-hydrostatic and uses higher order numerics (3rd order integration, 5th order advection), as well as differences in type and calling frequency of physics packages, accounts for much of the remaining difference. Investigation is ongoing as model development continues. It should be noted, finally, that a more important measure of WRF performance is “forecast efficiency” for NWP operations and “scientific efficiency” for researchers: the quality of a simulation realized for a given investment in computation. Understandably, this issue is the subject of considerable study and discussion within the WRF collaboration. As model verification with respect to observations continues, there is reason to believe that WRF is efficient by existing and developing measures of forecast quality. Results are preliminary and, in any case, the question is beyond the scope of this paper.
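The "about twice as expensive" figure can be recovered from the two quoted ratios, as the following sketch shows. It assumes the lower rate, 130 Mflop/second, is WRF's, which is the assignment consistent with the stated ratio of 0.78; the function name is hypothetical.

```python
# Figures quoted in the text (rate assignment inferred from the 0.78 ratio).
WRF_RATE, ETA_RATE = 130.0, 167.0    # sustained Mflop/second
WRF_OPS, ETA_OPS = 3200.0, 2000.0    # operations per cell per integrated minute


def relative_cost(ops_a, rate_a, ops_b, rate_b):
    """Wall-clock cost of model A relative to model B for the same
    simulated period: time is operations divided by rate."""
    return (ops_a / rate_a) / (ops_b / rate_b)


rate_ratio = WRF_RATE / ETA_RATE     # about 0.78
ops_ratio = WRF_OPS / ETA_OPS        # 1.6
cost_ratio = relative_cost(WRF_OPS, WRF_RATE, ETA_OPS, ETA_RATE)   # about 2.06
```

The cost ratio is the operation ratio divided by the rate ratio, 1.6 / 0.78, so the operation count accounts for most of the factor of two and the lower floating-point rate for the rest.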
The WRF project is developing a state-of-the-art limited-area forecast and data assimilation system that will provide a common tool for use by both the operational and research communities. Through the use of a hierarchical software design, multi-level parallel decomposition, CASE-like software management tools, and package-neutral approaches to external packages for communication, I/O, and other services, the WRF model is satisfying the countervailing concerns for a maintainable community model on the one hand and efficiency across a range of high-performance computing platforms on the other.
Acknowledgements: This work was supported by the Department of Defense (DoD) High Performance Computing (HPC) Modernization Office’s Common High Performance Computing Software Support Initiative (CHSSI).
Ashworth, M. (1999): Optimisation for vector and RISC processors. Towards Teracomputing. World Scientific, River Edge, New Jersey, pp. 353-359.
Bettencourt, M. T. Distributed Model Coupling Framework, to appear in proceedings of HPDC-11, July 2002.
Desgagne, M., S. J. Thomas, R. Benoit, M. Valin, and A.V. Malevsky, Making its Mark, World Scientific, River Edge, New Jersey (1997), pp. 155-181.
Dickenson, R.E., S.E. Zebiak, J.L. Anderson, M.L. Blackmon, C. DeLuca, T.F. Hogan, M. Iredell, M. Ji, R. Rood, M.J. Suarez and K.E. Taylor: “How Can We Advance Our Weather and Climate Models as a Community?” Bulletin of the American Meteorological Society, Volume 83, Number 3, March 2002. See also, http://www.esmf.ucar.edu.
Foster, I. and J. Michalakes, Parallel Supercomputing in Atmospheric Science, World Scientific, River Edge, New Jersey (1993), pp. 354-363
GFDL, Flexible Modeling System, 2002: http://www.gfdl.noaa.gov/~fms
Kálnay E., M. Kanamitsu, J. Pfaendtner, J. Sela, M. Suarez, J. Stackpole, J. Tuccillo, L. Umscheid, and D. Williamson: "Rules for interchange of physical parameterization," Bull. Amer. Meteor. Soc. 70 (1989) 620-622.
Larson, J.W., R.L. Jacob, I.T. Foster, J. Guo: “The Model Coupling Toolkit,” in Computational Science - ICCS 2001, Springer, New York, 2001. pp. 185-194.
McAtee, M., Bourgeois, A., Barker, D., Wegiel, J., and Michalakes, J.: “Data Assimilation in the Weather Research and Forecast Model,” in proceedings of UGC 2002. June. Austin, Texas.
Michalakes, J., S. Chen, J. Dudhia, L. Hart, J. Klemp, and J. Middlecoff (2001a): “Development of a Next Generation Regional Weather Research and Forecast Model,” in Developments in Teracomputing, World Scientific, River Edge, New Jersey, 2001. pp. 269-276.
Michalakes, J., R. Loft, and A. Bourgeois (2001b): “Performance-Portability and the Weather Research and Forecast Model,” in on-line proceedings of the HPC Asia 2001 conference, Gold Coast, Queensland, Australia, Sept. 24-28, 2001. http://www.gu.edu.au/ins/its/hpcasia2001.
Michalakes, J., J. Dudhia, D. Gill, J. Klemp, and W. Skamarock: "Design of a next-generation regional weather research and forecast model," Towards Teracomputing, World Scientific, River Edge, New Jersey, 1999. pp. 117-124.
Sawdey, A., M. O’Keefe, and W. Jones, Making its Mark, World Scientific, River Edge, New Jersey (1997), pp. 209-225.
 Lij means the horizontal I and J dimensions are combined into a single 1-D vector. Such a transformation is permitted in WRF – changes would be localized to the WRF mediation layer and would not affect the model layer in the vast majority of codes.
 This is the time on 16 processors divided by the time on 512 processors, divided by 32, the factor of increase in processors; in other words, $16\,T_{16} / (512\,T_{512})$, where $T_{16}$ and $T_{512}$ are the times on 16 and 512 processors.