Distributed Code Generation
Distributed Code Generation - The WP65 Scheme
Last update: 95/09/1 - Author: Corinne ANCOURT
This phase was designed to automatize the generation of Fortran 77
program onto distributed memory machines using an Emulated Shared
Memory scheme and to exploit the universal message passing
capability provided by the INMOS T9000 processor and C104 hardware
router. One half of the processors perform computations and the other
half emulate memory banks providing the compiler with a better
understood target machine, a multiprocessor with a fast local memory
managed as a software cache and a slow shared memory. The fast context
switching times and intelligent on-chip channel processors make possible
to overlap computations and communications when T9000 and C104 are used.
This work was partially funded by ESPRIT project 2701 (PUMA
- WorkPackage 6.5) and by DRET.
Input
This phase takes as input a sequential Fortran77 program meeting the following conditions:
- it is made of a single main program whose body is made of a set of
perfectly or non-perfectly nested loop(s) whose upper and lower
bounds are affine expressions and assignments ;
- it may contain indirections but neither guards nor calls and I/Os.
WP65 Approach
Task generation is based on control partitioning. The data
dependence graph between program instructions is used to build parallel
tasks. Loop transformations like tiling transformation and
distribution are used on nested loops in order to define blocks of
loop iterations that can be computed in parallel.
The dependence graph is used to decide if a given tiling is legal. The current implementation does not include an automatic estimation of the tile size and a default size is used.
Each tile is seen as a logically independent task. Each task is made of three parts: a prologue to read the input data from the emulated shared memory, a computational part and a final part to store the results. Ideally several tasks should be executed by the same physical processor to overlap communications and computations.
Output
A 2-PMD distributed Fortran77 program containing calls to the
runtime communication library PVM is generated. The input program is
transformed into two subroutines:
- one subroutine, called
COMPUTE(PROC_ID)
, contains the computational part of the code and receives a (logical) processor number as parameter; - one subroutine, called
BANK(BANK_ID)
, contains the shared memory emulator part of the code and receives a (logical) bank number as parameter.
The general structures of these two subroutines are very close since each send (receive) must be met by a corresponding receive (send). Like the input program they are sequences of nested loop. The outermost loop nest defines which tile is being executed. Each tile body is made of two or three sections:
- a sequence of nested loops used to load the different local
copies from the emulated shared memory; this sequence may
be empty when there is no input, as is the case with a simple
array initialization; it can be viewed as a software cache prefetch.
- a sequence of nested loops to execute all iterations of the
current tile; the loop body is a copy of the initial loop
body but references to user variables are replaced by
references to local variables; this section is empty in
the memory emulator code;
- a sequence of nested loops used to store the different local
copies in the emulated shared memory, i.e. a cache flush.
Features
The potential advantages of this approach are:
- implicit data distribution based on control partitioning, which
means there is no need for static or dynamic explicit data partitioning;
- efficient handling of more general constructs than with the owner
rule (e.g. Fortran HPF compilation);
- memory servers do not have to be run on the same processors as
computations; this increases the communication bandwidth and decreases
the number of context switches on computational processors;
- easier load balancing because any process can be started on any
processor;
- parallelism granularity can be easily tuned because it is not
implicitly linked to a data distribution; this also may improve load balancing.
The obvious disavantage is that a full software cache cannot be fully statically compiled. However regular code can exploit the underlying INMOS hardware very efficiently.
More information
A full description of the approach and examples are given in [3].
To run the WP65 phase with wpips, ask the distributed view.
e-mail: ancourt@cri.mines-paristech.fr and pipsgroup@cri.mines-paristech.fr
References
- 1
- C. Ancourt,
Generation automatique de codes de transfert pour multiprocesseurs
a memoires locales,
These de doctorat, Universite Paris 6, 1990
- 2
- C. Ancourt, F. Irigoin,
Scanning Polyhedra with DO loops,
Third ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'91), Williamsburg, 1991
- 3
- C. Ancourt, F. Irigoin,
Automatic Code Distribution,
: The Third Workshop on Compilers for Parallel
Computers (CPC'92), Vienna, Austria, July 6-9, 1992.
- 4
- F. Irigoin, R. Triolet,
Supernode Partitioning,
ACM Symposium on Principles of Programming Languages, San-Diego, 1988
- 5
- F. Irigoin, P. Jouvelot, R. Triolet,
Semantical Interprocedural Parallelization: An Overview of the PIPS Project,
ACM International Conference on Supercomputing (ICS'91), Cologne, 1991
- 6
- L. Zhou, Enhanced Static Evaluation of Fortran Programs in the PIPS Environment, Tech. Report E/160/CRI, Ecole des Mines de Paris, 1991
Document Actions