Author: Alexis Platonoff firstname.lastname@example.org Last update: September 14th, 1995
IntroductionThe polyhedric method was invented by Paul Feautrier of the PRiSM laboratory of University of Versailles (France). Initially, it was implemented in the parallelizer PAF (Automatic Parallelizer of Fortran) developped at PRiSM with Le_Lisp. The implementation of this method in the parallelizer PIPS was carried out under a contract between the University of Versailles, the Ecole des mines de Paris and the Commissariat à l'Energie Atomique (CEA). It has been done in CEA at Centre d'Etudes de Limeil-Valenton by Benoit de Dinechin and Arnauld Leservot and Alexis Platonoff [Pla95a], with the help of Antoine Cloué and Francois Dumontet.
The compilation is a sequence of five phases. The first phase checks that the imput is a static control program, i.e., fulfills some restrictions. The second phase builds an array data flow graph (DFG). The third one constructs a scheduling function compatible with the DFG. This function minimizes the execution time for an unbounded PRAM. The fourth one computes a mapping value for a physical distributed memory parallel machine. Finally, the last phase generates the parallel code.
Although this compilation scheme is implemented in PIPS, it is NOT interprocedural yet.
Static ControlThis method is limited to a certain class of programs called static control programs [Fea91]. In such programs, the authorized statements are:
- DO loop and the IF test control structures;
- assignment and input/output statements;
Array Data Flow GraphThe Array Data Flow Graph (DFG) is computed from a static control program [Fea91]. The DFG is in fact a kind of dependence graph in which only true dependences appear. Arrays and scalars are expended (implicitly) and assigned only once.
To each node of the DFG are associated an instruction and an execution domain. The execution domain is a set of constraints specifying the variation domain of the surrounding loop indices of the instruction.
Nodes are connected by arcs which represent true data dependences. To each edge of the DFG are associated the reference of the use dependence, a transformation function in which each index of the definition iteration domain is expressed as a linear function of the indices of the use iteration domain, and a governing predicate that specifies the sub-space of the sink's iteration domain on which the edge exists (it is a system of constraints upon the indices of the sink's iteration domain).
SchedulingThe scheduling function is computed from the DFG [Fea92a] to minimize the total execution time with no resource constraint. The schedule provides for each operation of a program the logical date at which it must be executed. For an occurence of indice s, it is expressed as a multidimensional affine function Ts of the loop indices and the structure parameters.
For a given time t, the scheduling function defines a set of independent operations which can be executed simultaneously. This set is called a front. The execution of two successive fronts must be sequential.
Even with the static control restrictions, it is not possible to always have a unidimensional linear schedule. Typically, this happens when there are two or more sequential loops in the same loop nest. In that case, a multidimensional linear schedule is computed [Fea92b].
MappingThe placement function associates to each statement a multidimensional affine function of the loop indices and the structure parameters [Fea93]. This function specifies explicitly the placement of the instruction on a virtual processor grid, i.e., gives the identity of the processor that executes each occurence of the instruction.
Each dimension of the placement function defines a placement direction for the instruction and all these directions constitute the distribution space of the target machine. The number of dimensions to compute is arbitrary with a maximum equal to the dimension of iteration space minus the dimension of the scheduling function.
Initially, the goal of the algorithm that computes the placement function was to reduce the number of communications. For an edge of the DFG, there is a potential communication between its source and its destination, which can be represented by a distance. If such a distance is equal to zero (the edge is ``cut'') then the source and the destination will be mapped onto the same processor, and there will be no communication at all.
The principle of the method is to nullify as many distances as possible. To each edge is associated a cutting condition represented by a system of equalities. The length of an edge is nullified if its cutting condition is satisfied, i.e. its equalities are satisfied. These equalities corresponds to the nullification of the factors of the variables (loop indices and structure parameters) appearing in the distance. In most cases, satisfying the cutting conditions of all the edges will lead to the trivial solution: some null dimensions, i.e., all the occurences of the instruction are placed on the same processor. To avoid this, some edges should not be cut. To choose them, a greedy algorithm has been proposed [Fea93], it treats the edges by decreasing importance. The importance of an edge is represented by its weight which is equal to the volume of data which has to be sent if the edge is not cut.
This method does not take into account the type (and hence the temporal cost) of the communications. To that purpose, an extension of the method has been implemented. It is based on a special treatment of potential communications that can be optimized [Pla95b], for instance spread, reduction...
Code GenerationThe final phase builds the parallel program using the results of all previous phases. At a given time step, the parallel program has to execute the corresponding front (see above), synchronize and pass to the next time step. This induces the following general architecture of the parallel program:
do t = 1, number of fronts execute concurrently all operations in F(t) synchronize end doThe code generation is based upon three transformations:
- Total expansion of scalars and arrays:
- transformation of the initial program into a dynamic single assignment form [Fea88].
- Loop reordering:
- rearrangment of the iteration domain of the initial loops according to the scheduling and placement functions (the first gives the sequential loops, the second the parallel ones). The reordering is equivalent to scanning polyhedra with do loops [AI91].
- substitution of all the array access functions with new ones computed according to the new loops.
In PIPS, the generated parallel code can be either CM Fortran (for the CM-5) or CRAFT Fortran (for the Cray-T3D).
References[AI91] C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. in PPOPP'91, 1991.
[CF93] J.-F. Collard and P. Feautrier. Automatic generation of data parallel code. In Fourth International Workshop on Compilers for Parallel Computers, Delft University of Technology, The Netherlands, December 1993.
[Fea88] P. Feautrier. Array expansion. in ACM Int. Conf. on Supercomputing, Saint-Malo, jul 1988.
[Fea91] P. Feautrier. Dataflow Analysis of Array and Scalar References. Int. Journal of Parallel Programming, 20(1):23-53, February 1991.
[Fea92a] P.Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, Part I : One-dimensional Time. Int. J. of Parallel Programming, 21(5):313-348, October 1992.
[Fea92b] P. Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, Part II : Multidimensional Time. Int. J. of Parallel Programming, 21(6):389-420, December 1992.
[Fea93] P. Feautrier. Toward Automatic Partitioning of Arrays on Distributed Memory Computers. In ACM ICS'93, pages 175-184, Tokyo, July 1993.
[Pla95a] A. Platonoff. Contribution à la distribution Automatique des Données pour Machines Massivement Parallèles. Thèse de Doctorat de l'Université P. et M. Curie, March 9th 1995.
[Pla95b] A. Platonoff. Automatic Data Distribution for Massively Parallel Computers. In 5th International Workshop on Compilers for Parallel Computers, Malaga University, Spain, June 1995.