FORGE Introduction
Note : All manuals for the FORGE tools can be found on the SP2 system in
/opt/forge/docs. They are all in PostScript format.
What Is FORGE?
FORGE is an integrated collection of interactive tools that enable the
straightforward parallelisation of FORTRAN programs.
Previously, converting large serial FORTRAN programs for parallel
execution has taken significant effort, especially on distributed memory
machines such as the IBM SP2. The FORGE tools enable the user to analyse
a program as a single entity before employing automatic code and data
distribution.
FORGE Overview
- Interactive FORTRAN program browser for the
analysis of code.
- Interprocedural view -
the entire program is treated as a single
entity. FORGE uses a global database representation of a program
that can follow the flow of data through call parameters and COMMON
blocks. References to variables can be traced throughout the program
call tree no matter how complicated the CALL/COMMON/EQUIVALENCE
aliasing.
- Intuitive Motif-based GUI - FORGE uses standard Motif-style menu
panels and display windows. Multiple display windows can be viewed
simultaneously and separately moved, iconified, resized, etc.
- Variable tracing -
the ability to display ALL source code
references to a variable, constant, parameter or external.
- Data and Control Flow -
from a reference to a variable at any point
in the program, a view of earlier assignments to the variable and
any future uses is possible. Subroutine calls and DO, IF and GOTO
control blocks can also be viewed.
- COMMON Block Grid -
view the usage of all COMMON blocks across all
routines in a single grid display.
- Query Searches - search the entire program for variables that
satisfy particular context criteria.
- Performance Profile -
instrumentation of a program to produce timing
profiles to determine where most time is spent during execution.
- Source Code Reformatter -
automatically change the format
(indentation, declaration order, character case, etc.) of a program
according to the user's preferences.
- Interactive Distributed Memory
Parallelizer (DMP).
- Generates a fully scalable FORTRAN 77 SPMD program.
- Data arrays are distributed along any single dimension.
- DO loops automatically have their iterations distributed across
multiple processors.
- Extensive checking to ensure the restructured program executes
correctly.
- Support for many different message passing libraries such as
IBM's MPL, PVM, Express, and Linda.
- Parallel Performance Profiler and Simulator.
- Produces execution times, communication cost and wait times for each
subroutine and parallel DO loop.
- Results are used to determine the performance of the parallel program
as well as fine-tune or modify code and data distribution.
- FORTRAN 90 and High Performance FORTRAN (HPF) Support.
- Use of standard FORTRAN 90 and HPF directives to control
parallelisation of the program.
- Consistency checks to ensure FORTRAN 90 and HPF directives do not
conflict with program context. Special passes to check all directives
against the static analysis of the program and issue diagnostics for
arrays that are illegally or inconsistently partitioned.
- Separate directives for both loop and data distribution.
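As a rough illustration of the directive style (the array name and size
here are hypothetical, not taken from the FORGE manuals), a standard HPF
data distribution directive in fixed-form FORTRAN looks like :

      REAL A(1000)
CHPF$ DISTRIBUTE A(BLOCK)
C     Each processor receives one contiguous block of A;
C     DISTRIBUTE A(CYCLIC) would instead deal elements out
C     one at a time across the processors.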
Getting started with FORGE
- Before the FORGE tools can be used, several environment variables must
be defined.
- For C-shell users :
% setenv PATH ${PATH}:/opt/forge:/opt/forge/run
% setenv APRHOME /opt/forge/env
% setenv APRDIR /opt/forge/apr
- For Korn-shell users :
$ export PATH=$PATH:/opt/forge:/opt/forge/run
$ export APRHOME=/opt/forge/env
$ export APRDIR=/opt/forge/apr
FORGE Explorer
- The FORGE tools share a common database representation of a FORTRAN
program.
- FORGE Explorer builds and manipulates this database.
- FORGE Explorer is a Motif-based GUI program that permits the user to
access information in this database in order to analyse a complete
FORTRAN program.
- FORGE Explorer requires that all FORTRAN source files, as well as
include files, are grouped together into a "package".
- Once a package is defined, FORGE Explorer parses all the files to build
the global database.
- All the FORGE information, including the database, is contained in the
APR directory within the currently specified directory.
FORGE Explorer
Starting FORGE Explorer
FORGE Explorer
Analysing FORTRAN Programs
- From the "View" pull-down menu various windows can be displayed. Some
of these are shown here.
Click here for "Call Subchain..." example.
Click here for "Common Blocks..." example.
- From these windows more information about the program, such as
variable tracing and data flow information, can be brought up in
other windows by pressing the "Variables..." button after selecting
an item (subroutine, DO loop, COMMON block) in the window.
- After selecting a variable in the window, a trace of this variable's
references can be displayed in another window.
- From the trace window, data flow information about selected variables
can be displayed via "Data Flow...Earlier Sets" and
"Data Flow...Later Uses".
FORGE Explorer
Instrumenting and Profiling FORTRAN Programs
- In order for the automatic paralleliser to make sensible decisions, it
is necessary to profile the sequential version. This is done by
instrumenting the code with timing routines and then executing the
sequential program. This execution generates a timing file that is
used by FORGE Explorer and the Distributed Memory Parallelizer.
- To instrument a program, select "Instrument..." from the "Tools"
pull-down menu. This brings up a window with all the source files
defined for the current package.
- Once you have selected all the files you want to instrument, press the
"Instrument..." button.
- FORGE Explorer will now ask how to produce the instrumented code:
as a temporary experimental file, with all source in one file, as new
files in a directory, or with each routine in a separate file.
- Once you select the type of output, FORGE Explorer will then generate
the instrumented code.
- When the new code is created, compile it using the normal
xlf command, but a special FORGE timing library
must be included. This is done by adding
-L$APRHOME/lib/serial/RS6K -lts_in
to the xlf command line.
- Once compiled, the instrumented program can be executed as normal; at
the completion of the run, profiling data is written to stdout. This
should be re-directed to a file (preferably with a
.tim extension).
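For example, assuming the instrumented source was written to a single
file called myprog_inst.f (the file and program names are illustrative):
% xlf -o myprog_inst myprog_inst.f -L$APRHOME/lib/serial/RS6K -lts_in
% ./myprog_inst > myprog.tim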
- The newly created timing file can now be specified for the package by
selecting "Define Package..." from the "Modify" pull-down menu.
FORGE Explorer
Re-formatting FORTRAN Programs
Distributed Memory Parallelizer (DMP)
Starting the DMP
- To start the X-windows program ensure that your
DISPLAY environment variable is defined. Then
type :
% exp_forge
This will bring up the Baseline FORGE Browser's Main Window Panel.
- The Baseline FORGE Browser has many of the same features as
FORGE Explorer. Once you have selected the
package you want to parallelise,
select "Analyze Program".
Baseline FORGE will now display a call tree
of the program. Select the part of the program that you want to
parallelise - normally the main program.
- Once you have selected the node to analyse, you can start the
DMP by selecting "Parallelize for Distributed Memory".
Distributed Memory Parallelizer (DMP)
Loop Parallelisation with the DMP
- The DMP defaults to "Loop Selection" which allows the user to
interactively select which loops to parallelise.
The left-hand side
of the window shows the program call tree which includes all
subroutine calls and DO loops. A column labelled "%INCL" lists the
percentage of execution time spent in the subroutine or DO loop. These
figures are obtained from the profile data generated from the
instrumented code created by FORGE Explorer.
- The buttons at the top of the call tree column allow the user to get
the DMP to help select which loops to parallelise. The "Next Highest
Time" button moves the select pointer to the loop with the next highest
time after the current position. The "Select Best" button gets the DMP
to select the best loops to parallelise based on the current data
distribution.
- Once you have selected all the loops you want to parallelise, click on
"Analyze Automatically" to get the DMP to check each loop selected
to see if it can be parallelised and if so, actually mark the loop as
a distributed loop.
- Any inhibitors that stop the parallelisation of
a loop are displayed and the user is asked if the inhibitors are to be
ignored. Some inhibitors cannot be ignored and therefore the loop
is not parallelised. Examples of solid inhibitors are (a short
FORTRAN sketch follows this list) :
- Premature exits from, or jumps into or out of, DO loops;
- I/O statements in a loop or a routine called from the loop;
- A call to an unknown routine;
- The loop contains a parallel loop; an outer loop enclosing a
parallel loop cannot itself be parallelised.
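As a hedged illustration (the array names and loop bounds are
hypothetical), the first loop below has independent iterations and is a
candidate for distribution, while the second contains a solid
inhibitor - a premature exit - and would be rejected :

C     Parallelisable: each iteration is independent of the others.
      DO 10 I = 2, N-1
         A(I) = 0.5 * (B(I-1) + B(I+1))
   10 CONTINUE

C     Not parallelisable: the GOTO is a premature exit from the loop.
      DO 20 I = 1, N
         IF (A(I) .LT. 0.0) GOTO 30
         A(I) = SQRT(A(I))
   20 CONTINUE
   30 CONTINUE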
- If you want to select and analyse each loop individually, select
"Analyze Interactively" before selecting the loops to parallelise.
- As each loop is selected in the call tree, the DMP analyses the loop
and displays the impact of parallelising it, including the data that
needs to be communicated, the arrays referenced and modified, any loop
inhibitors, etc.
- From the information displayed the user can decide whether to :
- go ahead and distribute the loop - select "Distribute";
- replicate the loop on all processors - select "Replicate"; or
- cancel the operation altogether - select "Cancel".
Distributed Memory Parallelizer (DMP)
Data Distribution with the DMP
- From the Main DMP window, move the select pointer to the point where
you want data distribution to occur. If you want data to be
distributed for the entire program, select the main program node.
- Select "Data Decomposition" from the menu list. This will change the
window to the list of variables available for distribution at this
point in the program.
- Select the variable you want to distribute and press the "Decompose"
button. This will bring up a list of available decomposition modes.
- There are several decomposition schemes available in the DMP :
- Block decompositions spread the data of an array into
contiguous blocks of elements. Each processor is allocated
a single block of data. The size of the blocks depends on the
array size and the number of processors :
block size = array size / number of processors
Block decompositions provide a reasonable load balance over
processors in finite element grid applications using fully
dimensioned arrays and loops that reference neighbouring grid
points.
- Cyclic decompositions distribute the elements of an array
sequentially, one element at a time, over all processors.
In general, cyclic decompositions achieve better load balancing
over processors where arrays are referenced with a stride inside
loops, or where inter-iteration dependencies are present.
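For instance, distributing a hypothetical 12-element array over 4
processors (block size = 12 / 4 = 3) gives :
Block  : P0 holds elements 1-3, P1 holds 4-6, P2 holds 7-9, P3 holds 10-12
Cyclic : P0 holds elements 1,5,9; P1 holds 2,6,10; P2 holds 3,7,11; P3 holds 4,8,12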
- There are two ways memory is allocated for distributed arrays :
- A "FULL" partitioned array is the same size as the original
array; or
- A "SHRUNK" partitioned array only has enough memory allocated to
contain the array elements distributed to the processor.
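Continuing the hypothetical 12-element example over 4 processors, a
minimal sketch of the two allocation schemes :

C     FULL partitioning: every processor keeps the original
C     full-size declaration but uses only its own elements.
      REAL A(12)
C     SHRUNK partitioning: each processor instead declares only
C     the 3 elements of its own block.
      REAL A(3)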
- Only one dimension of an array can be distributed and this is specified
in the "Dimension" field of the "Decomposition Operations".
- The "Move Data" field of the "Decomposition Operations" specifies
whether data in a distributed array has to be communicated between
processors.
If set to "ON" then when elements of a distributed array are modified
they are updated on all processors when required. This obviously
introduces communication overheads.
Scratch work arrays and temporary array values usually do not need to
be updated on all processors so the "Move Data" field should be "OFF".
- If the required decomposition operation is not available a new one may
be defined by clicking the "Create" button.
- In the Data Decomposition mode, variables can be traced throughout the
program in a similar way as in
FORGE Explorer.
- The "Do Automatic Distribution" selection in the main DMP window get
the DMP to develop its own array partitioning and loop distribution
strategy based upon an analysis of the performance profile data.
For some programs it may be found that this option is overly aggressive,
and introduces excessive parallel overheads. This can be overcome by
the user selecting some loops and arrays to distribute and then using
this option to complete the parallelisation based on these initial
choices.
Distributed Memory Parallelizer (DMP)
Compiling Parallel FORTRAN Programs
- Once you have completed parallelising your program with the DMP, exit
from the X-windows program.
- All the parallelisation information is stored as part of the FORGE
global database. Absolutely NO modifications have been made to any of
the original source code.
- In order to generate a parallel FORTRAN program you must invoke the
FORGE pre-processor. The command format is :
% pref77 -p package [-o outfile] [-f forgepath] [-u]
[-t sources] [-z]
This pre-processor accesses the specified package database and
generates a FORTRAN 77 source file with FORGE message passing calls.
The default file name is package_name_pf.f, but it can be changed
with the -o option.
The other pref77 options are :
- -f forgepath specifies the path to
the FORGE product directories. Default is
$APRHOME.
- -u specifies no optimisation of
communication calls.
- -t sources specifies individual
package source files to process.
- -z specifies to generate
parallel profiling calls.
- Once pref77 has generated parallel FORTRAN code,
it is compiled using the mpxlf compiler and linked
with the FORGE libraries by adding :
-L$APRHOME/lib/mpl/SP2 -ldd_n
to the mpxlf command line.
- The executable generated by mpxlf can be
submitted for parallel execution on any number of processors in the
same way as other parallel programs are executed, that is, via POE or
LoadLeveler.
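Putting these steps together, a hypothetical build-and-run sequence for
a package named mypack might look like the following (the package name,
output names and processor count are illustrative) :
% pref77 -p mypack -o mypack_pf.f
% mpxlf -o mypack_par mypack_pf.f -L$APRHOME/lib/mpl/SP2 -ldd_n
% poe mypack_par -procs 4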
Distributed Memory Parallelizer (DMP)
Profiling Parallel FORTRAN Programs
- FORGE parallel programs can be profiled very simply.
- Include the -z option to the
pref77 command line.
This generates special calls to the run-time timing library in the
FORTRAN code.
- Compile the new FORTRAN source with mpxlf but
include a different library by adding :
-L$APRHOME/lib/mpl/SP2 -ldd_tn
to the mpxlf command line.
- Execute the new parallel program and re-direct stdout to a file
(preferably with a .ptim extension).
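For example, again with hypothetical names :
% pref77 -p mypack -z -o mypack_pf.f
% mpxlf -o mypack_par mypack_pf.f -L$APRHOME/lib/mpl/SP2 -ldd_tn
% poe mypack_par -procs 4 > mypack.ptim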
- To view this profile data, re-start exp_forge
and jump to the main DMP window. From here select
"Show Parallel Statistics" and then select the parallel profile data
file.
- Select "Sorted View" to get FORGE to display the profile data in
different orders - Elapsed Time, CPU Time, Communication Time, Wait
Time, and FORGE Overhead Time.
References
"FORGE Explorer User's Guide" Applied Parallel Research
"FORGE 90 DMP User's Guide" Applied Parallel Research
"FORGE Technical Note" Applied Parallel Research
"The FORGE Product Set : Release Notes and Installation Guide"
Applied Parallel Research
"FORGE DataSheets" Applied Parallel Research
Dr. Simon Wail
Australian Computing and Communications Institute
simonw@acci.com.au