FORGE Introduction
Note : All manuals for the FORGE tools can be found on the SP2 system in
/opt/forge/docs. They are all in PostScript format.
What Is FORGE?
FORGE is an integrated collection of interactive tools that enable the
straightforward parallelisation of FORTRAN programs.
Previously, converting large serial FORTRAN programs for parallel
execution has taken significant effort, especially on distributed memory
machines such as the IBM SP2. The FORGE tools enable the user to analyse
a program as a single entity before employing automatic code and data
distribution.
FORGE Overview
- Interactive FORTRAN program browser for the
analysis of code.
- Interprocedural view -
the entire program is treated as a single
entity. FORGE uses a global database representation of a program
that can follow the flow of data through call parameters and COMMON
blocks. References to variables can be traced throughout the program
call tree no matter how complicated the CALL/COMMON/EQUIVALENCE
aliasing.
- Intuitive Motif-based GUI - FORGE uses standard Motif-style menu
panels and display windows. Multiple display windows can be viewed
simultaneously and separately moved, iconified, resized, etc.
- Variable tracing -
the ability to display ALL source code
references to a variable, constant, parameter or external.
- Data and Control Flow -
from a reference to a variable at any point
in the program, a view of earlier assignments to the variable and
any future uses is possible. Subroutine calls and DO, IF and GOTO
control blocks can also be viewed.
- COMMON Block Grid -
view the usage of all COMMON blocks across all
routines in a single grid display.
- Query Searches - search the entire program for variables that
satisfy particular context criteria.
- Performance Profile -
instrumentation of a program to produce timing
profiles to determine where most time is spent during execution.
- Source Code Reformatter -
automatically change the format
(indentation, declaration order, character case, etc.) of a program
according to the user's preferences.
- Interactive Distributed Memory
Parallelizer (DMP).
- Generates a fully scalable FORTRAN 77 SPMD program.
- Data arrays are distributed along any single dimension.
- DO loops automatically have their iterations distributed across
multiple processors.
- Extensive checking to ensure the restructured program executes
correctly.
- Support for many different message passing libraries such as
IBM's MPL, PVM, Express, and Linda.
- Parallel Performance Profiler and Simulator.
- Produces execution times, communication cost and wait times for each
subroutine and parallel DO loop.
- Results are used to determine the performance of the parallel program
as well as fine-tune or modify code and data distribution.
- FORTRAN 90 and High Performance FORTRAN (HPF) Support.
- Use of standard FORTRAN 90 and HPF directives to control
parallelisation of the program.
- Consistency checks to ensure FORTRAN 90 and HPF directives do not
conflict with program context. Special passes to check all directives
against the static analysis of the program and issue diagnostics for
arrays that are illegally or inconsistently partitioned.
- Separate directives for both loop and data distribution.
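As a rough illustration of the directive style (the array name and size
here are hypothetical, not taken from the FORGE manuals), a standard HPF
data distribution directive in fixed-form FORTRAN looks like :

      REAL A(1000)
CHPF$ DISTRIBUTE A(BLOCK)
C     Each processor receives one contiguous block of A;
C     DISTRIBUTE A(CYCLIC) would instead deal elements out
C     one at a time across the processors.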
Getting started with FORGE
- Before the FORGE tools can be used, several environment variables must
be defined.
- For C-shell users :
% setenv PATH ${PATH}:/opt/forge:/opt/forge/run
% setenv APRHOME /opt/forge/env
% setenv APRDIR /opt/forge/apr
- For Korn-shell users :
$ export PATH=$PATH:/opt/forge:/opt/forge/run
$ export APRHOME=/opt/forge/env
$ export APRDIR=/opt/forge/apr
FORGE Explorer
- The FORGE tools share a common database representation of a FORTRAN
program.
- FORGE Explorer builds and manipulates this database.
- FORGE Explorer is a Motif-based GUI program that permits the user to
access information in this database in order to analyse a complete
FORTRAN program.
- FORGE Explorer requires that all FORTRAN source files, as well as
include files, are grouped together into a "package".
- Once a package is defined, FORGE Explorer parses all the files to build
the global database.
- All the FORGE information, including the database, is contained in the
APR directory within the currently specified directory.
FORGE Explorer
Starting FORGE Explorer
FORGE Explorer
Analysing FORTRAN Programs
- From the "View" pull-down menu various windows can be displayed. Some
of these are shown here.
Click here for "Call Subchain..." example.
Click here for "Common Blocks..." example.
- From these windows more information about the program, such as
variable tracing and data flow information, can be brought up in
other windows by pressing the "Variables..." button after selecting
an item (subroutine, DO loop, COMMON block) in the window.
- After selecting a variable in the window, a trace of this variable's
references can be displayed in another window.
- From the trace window, data flow information about selected variables
can be displayed via "Data Flow...Earlier Sets" and
"Data Flow...Later Uses".
FORGE Explorer
Instrumenting and Profiling FORTRAN Programs
- In order for the automatic paralleliser to make sensible decisions, it
is necessary to profile the sequential version. This is done by
instrumenting the code with timing routines and then executing the
sequential program. This execution generates a timing file that is
used by FORGE Explorer and the Distributed Memory Parallelizer.
- To instrument a program, select "Instrument..." from the "Tools"
pull-down menu. This brings up a window with all the source files
defined for the current package.
- Once you have selected all the files you want to instrument, press the
"Instrument..." button.
- FORGE Explorer will now ask how to produce the instrumented code:
as a temporary experimental file, with all source in one file, as new
files in a directory, or with each routine in a separate file.
- Once you select the type of output, FORGE Explorer will then generate
the instrumented code.
- When the new code is created, compile it using the normal
xlf command, but a special FORGE timing library
must be included. This is done by adding
-L$APRHOME/lib/serial/RS6K -lts_in
to the xlf command line.
- Once compiled, the instrumented program can be executed as normal; at
the completion of the run, profiling data is written to stdout. This
should be re-directed to a file (preferably with a
.tim extension).
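For example, assuming the instrumented source was written to a single
file called myprog_inst.f (the file and program names are illustrative):
% xlf -o myprog_inst myprog_inst.f -L$APRHOME/lib/serial/RS6K -lts_in
% ./myprog_inst > myprog.tim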
- The newly created timing file can now be specified for the package by
selecting "Define Package..." from the "Modify" pull-down menu.
FORGE Explorer
Re-formatting FORTRAN Programs
Distributed Memory Parallelizer (DMP)
Starting the DMP
- To start the X-windows program ensure that your
DISPLAY environment variable is defined. Then
type :
% exp_forge
This will bring up the Baseline FORGE Browser's Main Window Panel.
- The Baseline FORGE Browser has many of the same features as
FORGE Explorer. Once you have selected the
package you want to parallelise,
select "Analyze Program".
Baseline FORGE will now display a call tree
of the program. Select the part of the program that you want to
parallelise - normally the main program.
- Once you have selected the node to analyse, you can start the
DMP by selecting "Parallelize for Distributed Memory".
Distributed Memory Parallelizer (DMP)
Loop Parallelisation with the DMP
- The DMP defaults to "Loop Selection" which allows the user to
interactively select which loops to parallelise.
The left-hand side
of the window shows the program call tree which includes all
subroutine calls and DO loops. A column labelled "%INCL" lists the
percentage of execution time spent in the subroutine or DO loop. These
figures are obtained from the profile data generated from the
instrumented code created by FORGE Explorer.
- The buttons at the top of the call tree column allow the user to get
the DMP to help select which loops to parallelise. The "Next Highest
Time" button moves the select pointer to the loop with the next highest
time after the current position. The "Select Best" button gets the DMP
to select the best loops to parallelise based on the current data
distribution.
- Once you have selected all the loops you want to parallelise, click on
"Analyze Automatically" to get the DMP to check each loop selected
to see if it can be parallelised and if so, actually mark the loop as
a distributed loop.
- Any inhibitors that stop the parallelisation of
a loop are displayed and the user is asked if the inhibitors are to be
ignored. Some inhibitors cannot be ignored and therefore the loop
is not parallelised. Examples of solid inhibitors are (a short
FORTRAN sketch follows this list) :
- Premature exits from, or jumps into or out of, DO loops;
- I/O statements in a loop or a routine called from the loop;
- A call to an unknown routine;
- The loop contains a parallel loop; an outer loop enclosing a
parallel loop cannot itself be parallelised.
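As a hedged illustration (the array names and loop bounds are
hypothetical), the first loop below has independent iterations and is a
candidate for distribution, while the second contains a solid
inhibitor - a premature exit - and would be rejected :

C     Parallelisable: each iteration is independent of the others.
      DO 10 I = 2, N-1
         A(I) = 0.5 * (B(I-1) + B(I+1))
   10 CONTINUE

C     Not parallelisable: the GOTO is a premature exit from the loop.
      DO 20 I = 1, N
         IF (A(I) .LT. 0.0) GOTO 30
         A(I) = SQRT(A(I))
   20 CONTINUE
   30 CONTINUE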
- If you want to select and analyse each loop individually, select
"Analyze Interactively" before selecting the loops to parallelise.
- As each loop is selected in the call tree, the DMP analyses the loop
and displays the impact of parallelising it, including the data that
needs to be communicated, the arrays referenced and modified, any loop
inhibitors, etc.
- From the information displayed the user can decide whether to :
- go ahead and distribute the loop - select "Distribute";
- replicate the loop on all processors - select "Replicate"; or
- cancel the operation altogether - select "Cancel".
Distributed Memory Parallelizer (DMP)
Data Distribution with the DMP
- From the Main DMP window, move the select pointer to the point where
you want data distribution to occur. If you want data to be
distributed for the entire program, select the main program node.
- Select "Data Decomposition" from the menu list. This will change the
window to the list of variables available for distribution at this
point in the program.
- Select the variable you want to distribute and press the "Decompose"
button. This will bring up a list of available decomposition modes.
- There are several decomposition schemes available in the DMP :
- Block decompositions spread the data of an array into
contiguous blocks of elements. Each processor is allocated
a single block of data. The size of the blocks depends on the
array size and the number of processors :
block size = array size / number of processors
Block decompositions provide a reasonable load balance over
processors in finite element grid applications using fully
dimensioned arrays and loops that reference neighbouring grid
points.
- Cyclic decompositions distribute the elements of an array
sequentially, one element at a time, over all processors.
In general, cyclic decompositions achieve better load balancing
over processors where arrays are referenced with a stride inside
loops, or where inter-iteration dependencies are present.
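For instance, distributing a hypothetical 12-element array over 4
processors (block size = 12 / 4 = 3) gives :
Block  : P0 holds elements 1-3, P1 holds 4-6, P2 holds 7-9, P3 holds 10-12
Cyclic : P0 holds elements 1,5,9; P1 holds 2,6,10; P2 holds 3,7,11; P3 holds 4,8,12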
- There are two ways memory is allocated for distributed arrays :
- A "FULL" partitioned array is the same size as the original
array; or
- A "SHRUNK" partitioned array only has enough memory allocated to
contain the array elements distributed to the processor.
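Continuing the hypothetical 12-element example over 4 processors, a
minimal sketch of the two allocation schemes :

C     FULL partitioning: every processor keeps the original
C     full-size declaration but uses only its own elements.
      REAL A(12)
C     SHRUNK partitioning: each processor instead declares only
C     the 3 elements of its own block.
      REAL A(3)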
- Only one dimension of an array can be distributed and this is specified
in the "Dimension" field of the "Decomposition Operations".
- The "Move Data" field of the "Decomposition Operations" specifies
whether data in a distributed array has to be communicated between
processors.
If set to "ON" then when elements of a distributed array are modified
they are updated on all processors when required. This obviously
introduces communication overheads.
Scratch work arrays and temporary array values usually do not need to
be updated on all processors so the "Move Data" field should be "OFF".
- If the required decomposition operation is not available a new one may
be defined by clicking the "Create" button.
- In the Data Decomposition mode, variables can be traced throughout the
program in a similar way as in
FORGE Explorer.
- The "Do Automatic Distribution" selection in the main DMP window get
the DMP to develop its own array partitioning and loop distribution
strategy based upon an analysis of the performance profile data.
For some programs it may be found that this option is overly aggressive,
and introduces excessive parallel overheads. This can be overcome by
the user selecting some loops and arrays to distribute and then using
this option to complete the parallelisation based on these initial
choices.
Distributed Memory Parallelizer (DMP)
Compiling Parallel FORTRAN Programs
- Once you have completed parallelising your program with the DMP, exit
from the X-windows program.
- All the parallelisation information is stored as part of the FORGE
global database. Absolutely NO modifications have been made to any of
the original source code.
- In order to generate a parallel FORTRAN program you must invoke the
FORGE pre-processor. The command format is :
% pref77 -p package [-o outfile] [-f forgepath] [-u]
[-t sources] [-z]
This pre-processor accesses the specified package database and
generates a FORTRAN 77 source file with FORGE message passing calls.
The default file name is package_name_pf.f, but it can be changed
with the -o option.
The other pref77 options are :
- -f forgepath specifies the path to
the FORGE product directories. Default is
$APRHOME.
- -u specifies no optimisation of
communication calls.
- -t sources specifies individual
package source files to process.
- -z specifies to generate
parallel profiling calls.
- Once pref77 has generated parallel FORTRAN code,
it is compiled using the mpxlf compiler and linked
with the FORGE libraries by adding :
-L$APRHOME/lib/mpl/SP2 -ldd_n
to the mpxlf command line.
- The executable generated by mpxlf can be
submitted for parallel execution on any number of processors in the
same way as other parallel programs are executed, that is, via POE or
LoadLeveler.
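Putting these steps together, a hypothetical build-and-run sequence for
a package named mypack might look like the following (the package name,
output names and processor count are illustrative) :
% pref77 -p mypack -o mypack_pf.f
% mpxlf -o mypack_par mypack_pf.f -L$APRHOME/lib/mpl/SP2 -ldd_n
% poe mypack_par -procs 4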
Distributed Memory Parallelizer (DMP)
Profiling Parallel FORTRAN Programs
- FORGE parallel programs can be profiled very simply.
- Include the -z option to the
pref77 command line.
This generates special calls to the run-time timing library in the
FORTRAN code.
- Compile the new FORTRAN source with mpxlf but
include a different library by adding :
-L$APRHOME/lib/mpl/SP2 -ldd_tn
to the mpxlf command line.
- Execute the new parallel program and re-direct stdout to a file
(preferably with a .ptim extension).
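For example, again with hypothetical names :
% pref77 -p mypack -z -o mypack_pf.f
% mpxlf -o mypack_par mypack_pf.f -L$APRHOME/lib/mpl/SP2 -ldd_tn
% poe mypack_par -procs 4 > mypack.ptim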
- To view this profile data, re-start exp_forge
and jump to the main DMP window. From here select
"Show Parallel Statistics" and then select the parallel profile data
file.
- Select "Sorted View" to get FORGE to display the profile data in
different orders - Elapsed Time, CPU Time, Communication Time, Wait
Time, and FORGE Overhead Time.
References
"FORGE Explorer User's Guide" Applied Parallel Research
"FORGE 90 DMP User's Guide" Applied Parallel Research
"FORGE Technical Note" Applied Parallel Research
"The FORGE Product Set : Release Notes and Installation Guide"
Applied Parallel Research
"FORGE DataSheets" Applied Parallel Research
Dr. Simon Wail
Australian Computing and Communications Institute
simonw@acci.com.au