You are here: Home / Documentation / WPips & EPips User Manual

WPips & EPips User Manual

 

 

 

WPips & EPips User Manual
(Paralléliseur Interprocédural de Programmes Scientifiques)
--
Linear Algebra based Automatic Parallelizer & Program Transformer

Ronan Keryell

A WWW version of this report can be shown at http://www.cri.ensmp.fr/pips/wpips-epips-user-manual.htdocand a printable version of this report at http://www.cri.ensmp.fr/pips/wpips-epips-user-manual.htdoc/wpips-epips-user-manual.pdf.

 

Contents

 

file=images/logo-pips-embossed.eps,width=0.25

 


Introduction

PIPS IS AN AUTOMATIC PARALLELIZER FOR SCIENTIFIC PROGRAMS THAT TAKES AS INPUT FORTRAN 77 AND HPF CODES. SINCE ITS INCEPTION IN 1988, IT EMPHASIZES INTERPROCEDURAL TECHNIQUES BASED ON LINEAR ALGEBRA TECHNIQUES.

PIPS IS A HIGHLY MODULAR FRAMEWORK WHICH CAN BE USED TO TEST AND IMPLEMENT INDEPENDENTLY VARIOUS COMPILATION SCHEMES, PROGRAM ANALYSES AND TRANSFORMATIONS, SUCH AS AUTOMATIC VECTORIZATION AND PARALLELIZATION (CRAY FORTRAN, CRAFT, CM FORTRAN, F90 & PVM OUTPUT), HPF COMPILATION, LOOP TRANSFORMATIONS, PARTIAL EVALUATION,...

THE COMPILER IS MADE OF INDEPENDENT PHASES THAT ARE CALLED ON DEMAND TO PERFORM THE ANALYSES OR TRANSFORMATIONS REQUIRED BY THE USER. CONSISTENCY AND INTERPROCEDURALITY ISSUES ARE DEALT WITH AUTOMATICALLY.

THE PHASE CHAINING IS DRIVEN BY AN INTERPROCEDURAL MAKE SYSTEM AND INTERNAL DATA STRUCTURES ARE AUTOMATICALLY STORED IN A DATABASE PIPSDBM WICH MANAGES PERSISTENCE THROUGH DIFFERENT RUNS. PERSISTENCE IS USED TO AVOID THE MEMORY BOTTLENECK WHEN LARGE PROGRAMS ARE ANALYZED INTERPROCEDURALLY. IT IS ALSO USED BY NEWCOMMERS TO DEVELOP NEW APPLICATIONS WITHOUT INTERFERRING WITH EXISTING CODE.

PIPS IS BUILT ON TOP OF TWO TOOLS. THE FIRST ONE IS DATA STRUCTURE DESCRIPTION LANGUAGE, , WHICH MANAGES ALLOCATION, DE-ALLOCATION, STORAGE FOR PERSISTENT OBJECTS, WALKS THRU COMPLEX DATA STRUCTURES, AND PROVIDES BASIC CLASSES, SUCH AS LIST, SET AND MAPPING. IS EMBEDDED IN C, COMMONLISP AND SML. ALL PIPS DATA-TYPES ARE DECLARED WITH DESCRIPTIONS AND USED WITH THE C OR THE COMMON LISP EMBEDDING.

THE SECOND TOOL IS THE LIBRARY WHICH HANDLES LINEAR FORMULæ AND STRUCTURES BASED ON THESE, SUCH AS VECTORS, CONSTRAINTS, POLYHEDRA, GENERATING SYSTEM. THIS LIBRARY IS USED TO ANALYZE CODES (PRECONDITIONS, REGIONS, DEPENDENCE TEST) AND TO GENERATE NEW VERSIONS OF CODES (PARTIAL EVALUATION, UNIMODULAR TRANSFORMATIONS, TILING, SEND/RECEIVE, TEMPORARY ARRAY ALLOCATION,...)

THREE INTERFACES ARE AVAILABLE: A SHELL INTERFACE (PIPS), AN X-WINDOWS INTERFACE (WPIPS) AND A HYPERTEXTUAL INTERFACE (EPIPS).

TO HAVE MORE INFORMATION: HTTP://WWW.CRI.ENSMP.FR/PIPS.

THIS REPORT GIVE A USER MANUAL ON THE X-WINDOWS (XVIEW) INTERFACE AND THE HYPERTEXTUAL GNU-EMACS BASED INTERFACE .

WITH THESE INTERFACES, A USER CAN USE MOST OF THE PIPS PROJECT ENVIRONMENT FEATURES:

  • CREATE WORKSPACES FROM FORTRAN SOURCES;
  • CLOSE AND RE-OPEN THESE WORKSPACE LATER BY EXPLOITING THE PERSISTENT DATA TYPES;
  • EDIT THE FORTRAN SOURCES;
  • ANALYSE THE PROGRAMS BY DISPLAYING VARIOUS INTERPROCEDURAL INFORMATIONS BASED ON THE INTERPRODURAL LINEAR ALGEBRA FRAMEWORK DEVELOPPED IN THE PIPS PROJECT, SUCH AS THE KNOWN EXECUTION CONTEXT FOR EACH INSTRUCTION (THE );
  • APPLY INTERACTIVELY BY USING MENUS VARIOUS PROGRAM TRANSFORMATIONS SUCH AS PARTIAL EVALUATION, LOOP TRANSFORMATIONS, ETC, TO IMPROVE PROGRAM EFFICIENCY;
  • PARALLELIZE THE CODE AND TARGET VARIOUS PARALLEL FORTRAN DIALECTS INCLUDING ONE EMULATING A SHARED MEMORY IN ;
  • COMPILE CODE TO FORTRAN 77 WITH A BACK-END

THE PIPS ENVIRONMENT IS ALSO CONFIGURABLE AND QUITE MODULAR AT THE DEVELOPPER LEVEL AND ALLOW EASILY SOME VARIOUS EXTENSIONS.

 


Getting started

In the GettingStarted directory, there is a collection of small programs to start up with the PIPS ENVIRONMENT.

 


Parallelizing a small program

We begin here by using 4 different parallelizing methods on a small program, a matrix multiplication.

 


Parallelize or vectorize the code

 


Generate parallel PVM code from HPF

 


Generate distributed code for processors and memory banks

This section comes mainly from Corinne ANCOURT.

 

To provide accurate results in reasonable time, real applications such as seismic, fluid mechanics and structure computation applications (that use large set of data and costly algorithms) need to be widely parallelized. Distributed memory machines are good candidates for these applications because they support large number of processors without the shared memory bottle neck problem. However, distributed memory machines are much more difficult to program efficiently than shared memory machines. Shared memories are often managed by the hardware. Conscientious programmer must only restructure its program in order to take care of cache use. In distributed memory machines, the distribution of data onto the local memories must be designed carefully in order to keep good performance and to avoid too much communications.

Many approaches to generate distributed code have been suggested. Some are language-based. The programmer has to specify the data distribution pattern and the parallelism but he does not have to generate processes nor send/receive pairs between processors. Processes are automatically derived from the data distribution using the so-called owner computes rule and a SPMD model: each processor executes instruction producing a piece of data located in its memory. Send/receive pairs are derived from the data distribution and from the instruction reference patterns.

Other approaches are based on the operating system or/and hardware support. A virtual shared memory provides the uniform address space. Some software mechanisms or complex cache systems maintains the memory consistency.

Three different approaches have been implemented in Pips. The first one is language-based. High Performance Fortran is designed to express data parallel applications for distributed memory machines. It provides the user with a set of directives to specify data distribution and parallelism. A prototype HPF compiler is implemented in Pips.

The second approach is also language-based but the data distribution and the scheduling are automatically computed. Designed for static control programs, this technique generates SIMD programs expressed in CRAFT (Fortran CRAY-T3D) or CM-Fortran.

Finally, the third approach suggests the emulation of a shared memory onto a distributed memory machine. Classical parallelization techniques are used to generate SPMD code. The compiler manages the emulated shared memory and the maintenance of memory coherency.

The following section introduces some characteristics of these three approaches. It focuses on the scheduling differences. More details are given in the associated presentation papers. Example presented in Figure [*] illustrates the approaches.

 

Figure: Program 1
1#1

 

HPF program

In HPF, the data distribution and the parallelism are specified via directives. The parallel execution of the distributed application is guided by the owner computes rule: a processor can update only its local variables. Let's take our example.

 

Figure: Data distribution
2#2

Due to the data dependences existing on the second dimension of Array B, the data distribution that minimizes the communications groups the array elements by rows. Here, blocks of 2 rows are distributed onto the 5 processor local memories as depicted in Figure [*].

 

Figure: HPF Directives
3#3

According to the owner computes rule, 2 blocks of iterations of J are executed on each processor. The corresponding generated code is presented in Figure [*].

 

Figure: Code generated from HPF
4#4

The "HPFC" presentation details all the characteristics of our HPF compiler.

 

Automatic placement

The problem of solving automatically data and control distributions is NP-complet. Thus, they are sequentially solved. In the suggested approach, scheduling is first computed. Then, the data distribution computes the mapping of each array onto a virtual processor grid so as to minimize the communication cost.

First, the array data flow graph is built. It characterizes the precedence relations between instruction instances and contains only true dependences because the program is put into a single assignment form. Figure [*] presents these precedence relations for program 1. Instructions S1 and S2 that assign array elements of B, should be executed before instruction S3. Some array elements are assigned by S1 and others by S2. The resulting code generated from the DFG is presented in Figure [*]

 

Figure: Precedence Relations
5#5

 

Figure: CRAFT Code
6#6

This technique is detailed in the "Polyhedric method" presentation.

 

Emulated Shared Memory

Since the management of data distribution is complex, this approach suggests the emulation of a shared memory onto a distributed memory machine. Control distribution is applied through tiling. The iteration domain is partitioned into tiles that can be executed concurrently on different processors. Figure [*] illustrates this tiling and the assignment to the processors. Tiles are distributed in a cyclic way.

 

Figure: Control distribution
7#7

Data distribution is implicit. One half of the processors perform computations and the other half emulate memory banks. Figure [*] presents the corresponding generated code for the computations. The necessary communications are inserted in the generated code automatically by the compiler. The "Distributed code generation" presentation details all the compilation phases of this approach.

 

Figure: Wp65 code
8#8

 

 


Parallelize the code with a polyhedric method

 


Optimizing a small program

 


Array Regions for Interprocedural Parallelization and Array Privatization

This is mainly the internal report A/279/CRI from Béatrice CREUSILLET you can get at http://www.cri.ensmp.fr/doc/A-279.ps.Zhttp://www.cri.ensmp.fr/doc/A-279.ps.Z.

 

 

 

 

 

 

 

 

 

 

 

 

.

 

 

 

Interprocedural Parallelization

AILE is an application from the ONERA, the French institute of aerospatial research. It has more than 3000 lines of FORTRAN code. It has been slightly modified to test the coherence of some input values.

The aim of this demonstration is to show that interprocedural analyses are necessary for an automatic parallelization.

For that purpose, we have chosen the subroutine (or module) EXTR, which is called by the module GEOM, itself called by the main routine AILE. An excerpt is given in Figure [*] (without the intermediate call to GEOM).

 

Figure: Excerpt from program AILE.
      PROGRAM AILE
      DIMENSION T(52,21,60)
      COMMON/CT/T
      COMMON/CI/I1,I2,IMAX,I1P1,I1P2,I2M1,I2M2,IBF
      COMMON/CJ/J1,J2,JMAX,J1P1,J1P2,J2M1,J2M2,JA,JB,JAM1,JBP1
      COMMON/CK/K1,K2,KMAX,K1P1,K1P2,K2M1,K2M2
      COMMON/CNI/L      
      ...
      READ(NXYZ) I1,I2,J1,JA,K1,K2
C     
      IF(J1.GE.1.AND.K1.GE.1) THEN
         N4=4
         J1=J1+1
         J2=2*JA+1
         JA=JA+1
         K1=K1+1
         ...
         CALL EXTR(NI,NC)
      ENDIF
      END

      SUBROUTINE EXTR(NI,NC)
      DIMENSION T(52,21,60)
      COMMON/CT/T
      COMMON/CI/I1,I2,IMAX,I1P1,I1P2,I2M1,I2M2,IBF
      COMMON/CJ/J1,J2,JMAX,J1P1,J1P2,J2M1,J2M2,JA,JB,JAM1,JBP1
      COMMON/CK/K1,K2,KMAX,K1P1,K1P2,K2M1,K2M2
      COMMON/CNI/L
      L=NI
      K=K1
      DO 300 J=J1,JA
         S1=D(J,K  ,J,K+1)
         S2=D(J,K+1,J,K+2)+S1
         S3=D(J,K+2,J,K+3)+S2
         T(J,1,NC+3)=S2*S3/
  
  
    
      
      S1-S2)*(S1-S3[+]
    
  


         T(J,1,NC+4)=S3*S1/
  
  
    
      
      S2-S3)*(S2-S1[+]
    
  


         T(J,1,NC+5)=S1*S2/
  
  
    
      
      S3-S1)*(S3-S2[+]
    
  


         JH=J1+J2-J
         T(JH,1,NC+3)=T(J,1,NC+3)
         T(JH,1,NC+4)=T(J,1,NC+4)
         T(JH,1,NC+5)=T(J,1,NC+5)
 300  CONTINUE      
      END

      REAL FUNCTION D(J,K,JP,KP)
      DIMENSION T(52,21,60)
      COMMON/CT/T
      COMMON/CNI/L
C     
      D=SQRT
  
  
    
      
      T(J,K,L  )-T(JP,KP,L  [+]
    
  

**2
     1     +(T(J,K,L+1)-T(JP,KP,L+1))**2
     2     +(T(J,K,L+2)-T(JP,KP,L+2))**2)
      END

 

EXTR

EXTR contains a DO loop that has several characteristics:

  1. There are several read and write references to elements of the array T. This induces dependences that cannot be disproved if we don't know the relations between index expressions, and more precisely between J and JH. We already know that JH=J1+J2-J, but we don't know the values of J1, J2 and JA, which are global variables initialized in AILE. Thus, we can disprove the loop-carried dependences between T(J,1,NC+3) and T(JH,1,NC+3) for instance, only if we interprocedurally propagate the values of J1, J2 and JA from AILE. This type of information is called precondition in PIPS [#!Irig:91!#,#!cnrs-nsf92:pips!#].

     

  2. There are three calls to the function D in EXTR. D contains several read references to the global array T. So, we must assume that the whole array is potentially read by each call to D. This induces dependences in EXTR between the calls to D and the other statements. In order to disprove these dependences, we need a way to represent the set of array elements read by any invocation of D, and be able to use this information at each call site. These sets are called array regions in PIPS [#!Creu:95a!#].

     

  3. S1, S2, S3 and JH are defined and used at each iteration. This induces loop-carried dependences. But we may notice that each use is preceded by a definition in the same iteration. These variables can be privatized (this means that a local copy is assigned to each iteration) to remove the spurious dependences.

 

D

As written before, there are several references to elements of the array T in D. Our aim is to represent this set of elements, such that it can be used at each call site to help disproving dependences.

If we know nothing about the relations between the values of K and KP or between J and JP, all we can deduce is that the third index of all the array elements ranges between L and L+2. This is represented by the region:

 

10#10


 

The 11#11 variables represent the dimensions of the array; R means that we consider the read effects on the variable; and MAY means that the region is an over-approximation of the set of elements that are actually read.

The relations between the values of K and KP or J and JP are those that exist between the real arguments. At each call site, we have JP==J and KP==K+1. These contidions hold true before each execution of D; we call them preconditions. Under these conditions, we can now recompute the region associated to the array T:

 

12#12


 

Notice that this is a MUST region, because it exactly represents the set of array elements read by any invocation of function D.

 

Parallelisation of EXTR

We can now parallelize EXTR by:

  1. privatizing the scalar variables;
  2. using array regions to summarize the read effects on the array T by each invocation of D;
  3. using the preconditions induced by the initialization of global scalar variables (in AILE) to disprove the remaining dependences.

This leads to the parallelized version of Figure [*].

 

Figure: Parallelized version of EXTR.
      SUBROUTINE EXTR(NI,NC)
      DIMENSION T(52,21,60)
      COMMON/CT/T
      COMMON/CI/I1,I2,IMAX,I1P1,I1P2,I2M1,I2M2,IBF
      COMMON/CJ/J1,J2,JMAX,J1P1,J1P2,J2M1,J2M2,JA,JB,JAM1,JBP1
      COMMON/CK/K1,K2,KMAX,K1P1,K1P2,K2M1,K2M2
      COMMON/CNI/L
      L = NI                                                        
      K = K1                                                        
      DOALL J = J1, JA
         PRIVATE S1,S2,S3
         S1 = D(J, K, J, K+1)                                       
         S2 = D(J, K+1, J, K+2)+S1                                  
         S3 = D(J, K+2, J, K+3)+S2                                  
         T(J,1,NC+3) = S2*S3/
  
  
    
      
      S1-S2)*(S1-S3[+]
    
  

                      
         T(J,1,NC+4) = S3*S1/
  
  
    
      
      S2-S3)*(S2-S1[+]
    
  

                      
         T(J,1,NC+5) = S1*S2/
  
  
    
      
      S3-S1)*(S3-S2[+]
    
  

                      
      ENDDO
      DOALL J = J1, JA
         PRIVATE JH
         JH = J1+J2-J                                               
         T(JH,1,NC+3) = T(J,1,NC+3)                                 
         T(JH,1,NC+4) = T(J,1,NC+4)                                 
         T(JH,1,NC+5) = T(J,1,NC+5)                                 
      ENDDO
      END

 

Array Privatization

Array privatization is not yet implemented in PIPS, but the information needed to perform the transformation is already available: IN and OUT regions [#!Creu:95a!#,#!Creu:95b!#].

To illustrate the characteritics of these regions, we will consider two examples: NORM is another excerpt from AILE, and RENPAR6 is a contrived example that highlights some details of the computation of regions and the possibilities opened up by IN and OUT regions.

 

NORM

 

Figure: Another excerpt from AILE: NORM
      PROGRAM AILE
      DIMENSION T(52,21,60)
      COMMON/CT/T
      COMMON/CI/I1,I2,IMAX,I1P1,I1P2,I2M1,I2M2,IBF
      COMMON/CJ/J1,J2,JMAX,J1P1,J1P2,J2M1,J2M2,JA,JB,JAM1,JBP1
      COMMON/CK/K1,K2,KMAX,K1P1,K1P2,K2M1,K2M2
      COMMON/CNI/L
      DATA N1,N3,N4,N7,N10,N14,N17/1,3,4,7,10,14,17/

      READ(NXYZ) I1,I2,J1,JA,K1,K2
C     
      IF(J1.GE.1.AND.K1.GE.1) THEN
         N4=4
         J1=J1+1
         J2=2*JA+1
         JA=JA+1
         K1=K1+1
         CALL NORM(N10,N7,N4,N14,N17,I2)
      ENDIF
      END

      SUBROUTINE NORM(LI,NI,MI,NN,NC ,I) 
      DIMENSION T(52,21,60)
      DIMENSION TI(3)

      COMMON/T/T
      COMMON/I/I1,I2,IMAX,I1P1,I1P2,I2M1,I2M2,IBF
      COMMON/J/J1,J2,JMAX,J1P1,J1P2,J2M1,J2M2,JA,JB,JAM1,JBP1
      COMMON/K/K1,K2,KMAX,K1P1,K1P2,K2M1,K2M2
      COMMON/IO/LEC ,IMP,KIMP,NXYZ,NGEO,NDIST

C ....
DO 300 K=K1,K2
      DO 300 J=J1,JA

      CALL PVNMUT(TI)
      T(J,K,NN  )=S*TI(1)
      T(J,K,NN+1)=S*TI(2)
      T(J,K,NN+2)=S*TI(3)
  300 CONTINUE
C ....
      END

      SUBROUTINE PVNMUT(C)
      DIMENSION C(3), CX(3)
      CX(1)= 1
      CX(2)= 2
      CX(3)= 3
      R=SQRT(CX(1)*CX(1)+CX(2)*CX(2)+CX(3)*CX(3))
      IF(R.LT.1.E-12) R=1.
      DO I = 1,3
      C(I) = CX(I)/R
      ENDDO
      RETURN
      END

This is a very simple example (see Figure [*]) that shows the necessity of array privatization, and the need for IN and OUT array regions.

In the loop of subroutine NORM, the references to the array T do not induce loop-carried dependences. Furthermore, there are only read-read dependences on S. However, notice that the array TI is a real argument in the call to PVNMUT, and that there are 3 read references to array TI. This induces potential interprocedural dependences. We have seen with the previous example that these dependences can sometimes be disproved with array regions.

We must first compute the regions of array TI that are referenced in PVNMUT. In PVNMUT, TI is called C. And the 3 elements of C are written, but not read. This leads to:

 

13#13


 

(W means that this is a write effect)

At the call site, C is translated into TI, which gives the region:

 

14#14


 

And finally, the regions corresponding to the whole body of the loop nest are:

 

15#15


 

These regions are identical, which means that each iteration of loops K and J reads and writes to the same memory locations of array TI. Thus, there are loop-carried dependences, and the loop cannot be parallelized.

However, these dependences are false dependences, because if we allocate a copy of array TI to each iteration (in fact to each processor), there are no more dependences. This is what is called array privatization. In order to privatize an array, we must be sure that, in each iteration, no element is read before being written in the same iteration. Thus, there are no loop-carried producer-consumer dependences.

This last property cannot be verified by using READ regions, because they contain all the elements that are read, and not only those that are read before being written. This is represented in PIPS by IN regions. In our case, we must verify that no element of TI belongs to the IN region corresponding to the loop body, which is the case.

We must also be sure that no element of TI that is initialized by a single iteration is used in the subsequent iterations or after the loops. This information is provided in PIPS by the OUT regions. They represent the set of live array elements, that is to say those that are used in the continuation.

We can now parallelize NORM by:

  1. using array regions to perform the dependence analysis;
  2. using IN and OUT array regions to privatize the array TI.

This leads to the parallelized version of Figure [*].

 

Figure: Parallelized version of NORM.
      SUBROUTINE NORM(LI,NI,MI,NN,NC ,I) 
      DIMENSION T(52,21,60)
      DIMENSION TI(3)

      COMMON/CT/T
      COMMON/I/I1,I2,IMAX,I1P1,I1P2,I2M1,I2M2,IBF
      COMMON/J/J1,J2,JMAX,J1P1,J1P2,J2M1,J2M2,JA,JB,JAM1,JBP1
      COMMON/K/K1,K2,KMAX,K1P1,K1P2,K2M1,K2M2
      COMMON/IO/LEC ,IMP,KIMP,NXYZ,NGEO,NDIST

C     ....
      DOALL K = K1, K2
         PRIVATE J
         DOALL J = J1, JA
            PRIVATE TI
            CALL PVNMUT(TI)                                             
            T(J,K,NN) = S*TI(1)                                         
            T(J,K,NN+1) = S*TI(2)                                       
            T(J,K,NN+2) = S*TI(3)                                       
         ENDDO
      ENDDO
C     ....
      END

 

RENPAR6

 

Figure: Contrived example: RENPAR6
      SUBROUTINE RENPAR6(A,N,K,M)
      INTEGER N,K,M,A(N)
      DIMENSION WORK(100,100)
      K = M * M
      DO I = 1,N
         DO J = 1,N
            WORK(J,K) = J + K
         ENDDO

         CALL INC1(K)

         DO J = 1,N
            WORK(J,K) = J * J - K * K
            A(I) = A(I) + WORK(J,K) + WORK(J,K-1)
         ENDDO
      ENDDO
      END
      
      SUBROUTINE INC1(I)
      I = I + 1
      END

RENPAR6 is a contrived example (see Figure [*]) designed to show on a very simple program the power of READ, WRITE, IN and OUT regions, and some particular details of their computations, especially when integer scalar variables that appear in array indices are modified.

The main purpose is to see that array WORK is only a temporary and can be privatized. Notice that the value of K is unknown on entry to the loop I, and that its value is modified by a call to INC1 at each iteration (INC1 simply increments its value by 1).

We are interested in the sets of array elements that are referenced in each iteration. However, since the value of K is not the same in the two written references, we cannot summarize the write accesses if we do not know the relation that exists between the two values of K. This is achieved in PIPS by using transformers, that here show how the new value of K is related to the value before the CALL (K#init):

 

16#16


 

And the transformer of the loop shows how the value of K at each step is related to the values of I and K#init (value of K before the loop):

 

17#17


 

This previous information is used to summarize the sets of elements that are read or written by each program structure. In order to compute the summary for the loop I, we must merge the sets for the two J loops. Be careful that the value of K is not the same for these two loops. We must use the transformer of the CALL to translate the value of K in the second region into the value of K before the CALL. At this step, we have a summary of what is done by a single iteration. We then compute the regions for the whole loop I. This is done with the help of the transformer of the loop that gives the relation between K and I.

However, as we have seen with NORM, READ and WRITE regions are not sufficient for array privatization, because we must verify that every element of WORK that is read by an iteration is previously written in the same iteration. This is achieved by the IN region. Then OUT regions allow us to verify that no element of WORK is used in the subsequent iterations or in the continuation of the loop.

We can now try to parallelize RENPAR6 by:

  1. using transfomers to compute array regions;
  2. using array regions to perform the dependence analysis;
  3. using IN and OUT array regions to privatize the array WORK.

This leads to the parallelized version of Figure [*]. The array WORK is privatized in loop I. However, the loop is not parallelized, because automatic induction variable substitution is not available in PIPS. This transformation has been performed by hand. This leads to the subroutine RENPAR6_2 in figure [*]. And after array privatization, PIPS is able to parallelize the loop I (see Figure [*]).

 

Figure: Parallelized version of RENPAR6.
      SUBROUTINE RENPAR6(A,N,K,M)
      INTEGER N,K,M,A(N)
      DIMENSION WORK(100,100)
      K = M*M                                                      
      DO I = 1, N
         PRIVATE WORK,I
         DOALL J = 1, N
            PRIVATE J
            WORK(J,K) = J+K                                         
         ENDDO
         CALL INC1(K)                                              
         DOALL J = 1, N
            PRIVATE J
            WORK(J,K) = J*J-K*K                                     
         ENDDO
         DO J = 1, N
            PRIVATE J
            A(I) = A(I)+WORK(J,K)+WORK(J,K-1)                        
         ENDDO
      ENDDO
      END

 

Figure: RENPAR6_2.
      SUBROUTINE RENPAR6_2(A,N,K,M)
      INTEGER N,K,M,A(N)
      DIMENSION WORK(100,100)
      K0 = M * M
      DO I = 1,N
         K = K0+I-1
         DO J = 1,N
            WORK(J,K) = J + K
         ENDDO

         CALL INC1(K)

         DO J = 1,N
            WORK(J,K) = J * J - K * K
            A(I) = A(I) + WORK(J,K) + WORK(J,K-1)
         ENDDO
      ENDDO
      END

 

Figure: Parallelized version of RENPAR6_2.
      SUBROUTINE RENPAR6_2(A,N,K,M)
      INTEGER N,K,M,A(N)
      DIMENSION WORK(100,100)
      K0 = M*M                                                 
      DOALL I = 1, N
         PRIVATE WORK,J,K,I
         K = K0+I-1                                                
         DOALL J = 1, N
            PRIVATE J
            WORK(J,K) = J+K                                        
         ENDDO
         CALL INC1(K)                                               
         DOALL J = 1, N
            PRIVATE J
            WORK(J,K) = J*J-K*K                                       
         ENDDO
         DO J = 1, N
            PRIVATE J
            A(I) = A(I)+WORK(J,K)+WORK(J,K-1)                         
         ENDDO
      ENDDO
      END

In fact, IN and OUT regions could also be used to reduce the set of elements of array WORK to allocate to each processor, because each iteration only accesses a sub-array. These regions provide an exact representation of the set of elements that are actually needed.

 

User manual

 


PIPS input language

To be defined... :-(

F77, implicit none, include, .F (cpp) & .f

What is not in the PIPSinput language...

 


The PIPS Unix commands

 


WPips

The wpips command is used to run the OpenLook/X11 interface for PIPS. wpips does not need any OpenWindows specific feature. You need to have a correctly initialized DISPLAY variable with enough access rights to the X11 display.

The default directory is the one where wpips is launched.

 


EPips

For more GNU Emacs familiar users, there is an extension to wpips that use some Emacs windows to display various PIPS informations. You can bennefit various Emacs advanced features, like couloured prettyprinted Fortran code, hypertextual interaction on the code, etc.

If you have already an Emacs running, M-X epips launches a special wpips instance from Emacs. You need to load some E-Lisp stuff before, for example by modifying your .emacs file according to the PIPS README. The default directory is the one of the Emacs buffer where wpips is launched.

You can also launch a separate Emacs that deals with

 


Some basics about the OpenLook interface

The most useful button is the right mouse button since it is used to select everything you want in the menus or the panels.

The left button is used to pick a default selection as a short cut if you want exactly what you want. By using the control+right mouse button you can change the default selection of a menu as you want.

Some menus have a ``push pin''. If you click on it, the menu is changed in a panel window you can place as you want. It is useful when you often use a menu.

Some menu items or display items may be shaded. That means that they cannot be selected by the user according to the current situation.

 


Basic Pips containers: workspaces & modules

In order to analyse Fortran programs, PIPS create a workspace where it puts all the information about the source, the transformated code, some compiled code, some executables for output, etc. Thus the first thing to begin with PIPS is to ask for a workspace creation in the current directory.

Each source code is splitted in modules that are the functions and procedures of the program. Most of the PIPS transformations can deal with a module but some other ones, like interprocedural analyses, deal with all the modules at once, that is with all the modules of the workspace.

Workspaces and modules are thus the basic containers used in PIPS.

 


WPips: the main panel

18#18

The main panel contains most of the menus usable in PIPS and is the window that appears first when PIPS begins. It also give various informations on the PIPS current state.

 


Message

The line give you the last information message given by PIPS, such as a warning, a log or an error message. Usually, this line is the same as the last line of the log window (see [*]) but is useful since this may be closed or hidden.

 


Directory

This line display the current directory and can be user-edited to change directory (but only when there is no workspace currently open). By using the small directory menu, one can change the directory by browsing the tree structure.

 


Workspace

This line of the main panel display the current open workspace. It can be edited to open an old workspace or create a new one if a workspace with this name does not already exist in the current directory. If a workspace is already open, it is first closed befor opening or creating a new one.

There is also a small workspace menu that allows to open a workspace from the current directory, to close the current one or to delete a workspace. If one try to delete the current workspace, it is closed first.

Creating a workspace ask for a workspace name. If a workspace with this name already exists, PIPS ask for its deletion. Then, PIPS pop a window with the list of all the Fortran programs of the current directory. One can give a list of file names separated with space(s) or more easily by selecting the files with the mouse in the scrolling list.

 


Module

After selecting a workspace, the module line and menu should become active.

The module line display the current module selected that is the main module (the module from a PROGRAM statement) if any.

The user can type a module name to select one or use the small menu to select one quickly (if there is not too many modules to fit in the screen).

 


Memory

This line display an idea of the break limit of the PIPS process, that is the memory used in megabytes by the code and the data (but not by the stack). This information cannot be modified by the user3.1.

 


CPU time

Another useful information for experimentation is the user time given by this line.

 


Number of display windows

This item shows the number of active windows at this time. The default number is 2, that means that when some code is displayed it uses, cyclicly the 2 windows availables (see also [*] to retains some windows).

The number of active windows can be modified by editing the line or easier by clicking up or down the small arrows.

 


PIPS icon button

This button is used to interrupt the current PIPS work. It is taken into account by at a phase boundary. That means that if you are doing a compute intensive phase, you will wait until the end of this one, since it is the only way to have clean and easy data coherence.

 


Log window

Since PIPSis an interactive environment, some information about what is hapening or what failed, etc. is important. A special window is allocated to this purpose.

Since dealing with big programs can lead to huge log information, think to empty the log windows from time to time.

 


WPips log window

In the mode, an XView textedit-like window is used. It can be opened, closed or emptied from the main panel.

 


EPips log window

In the Emacs mode, the log window is naturally an Emacs buffer. This able to display different messages with various text properties (such as colors, shape, etc.) according to the importance of each messages: a user warning, a user error, a pips error, etc.

There is also PIPS-specific menus added to the buffer.

 


View windows

According to the mode, that is or , different kind of windows are used to give information to the user, such as display the code, display the data dependences, edit the code, etc.

A window to be diplayed is chosen among a pool of 19#19 windows, 19#19 given by the number of display windows in the main panel (Section [*]). To do some advance usage, it is often useful to increase the number of available windows at the same time from the default value of 2 to a greater value.

Since it is useful to keep some information in a window, such as for example the original code when the user is applying various transformations on the code, windows can be frozen and retained unused when they have a retain attribute. According to the mode, different methods are used to change the retain state.

 


Wpips display windows

 

Default XView windows

By default a XView textedit window is open to display user information.

Functionalities of this kind of windows can be found in the manual page with man textedit. The most interesting thing to know about is that you can display a menu with the right button to do operations on files, etc.

At the bottom of the window there is a mark box to retain a window. Its retain state can be changer by clicking on this button.

 

Alternate view/edit windows

Some users found useful to be able to chose their editors. For this feature, an environment variable is used: PIPS_WPIPS_EDITOR. If this variable is set, its value is the name of the command to execute to display or edit a file. The name of the file to display is concatenated to that value before being executed. For example, in a csh-like shell,

setenv PIPS_WPIPS_EDITOR 'xterm -e vi'

will use a vi editor in an xterm.

Note that with this feature, the control of these windows are under the control of their users, that means that the retain mode is meaningless and there is no pool of windows. A new editor windows is used to display each new data and the user is responsible to remove no longer used windows.

 


Epips display windows

In the Emacs mode, an Emacs buffer is of course used to inherit of the editing power, language-orented editing, hypertextual interactivity, colored highlighting, etc.

Each window has some PIPSspecific menus and a special keymap.

As for the XView windows of , the number of windows in the pool is choosen according to the number of display windows in the main panel (Section [*]).

Note that the visited file name of the buffer is set to the displayed file but since a file cannot be directly visited in different buffer in Emacs, when is asked to display twice the same version of the same resource, a confirmation is asked to the user and the file name is not set in the last buffer.

 


View menu

It is used to display some code or some informations from PIPS. To display these, PIPS may execute lots of analyses or code transformations according to .

 

20#20

 

Lasts/No selection:
in only, to open all the last display windows;

 

:
it is the basic view of a module : the code is displayed as PIPS understands it. It may be decorated with some internal informations by using options ``'' (section [*]);

 

:
since the Sequential View is a prettyprinted version of the code, some program details may have been modified an ``'' give the code before parsing by PIPS. It is closer to the original code. Of course, after transforming the code by the user, this code may be meaningles... As for the Sequential View, the code can be decorated with some internal informations by using options ``'' (section [*]);

 

:
this button is used to display the of the with the graph editor . Each node contains some code statements that can be decorated as for the by using options ``'' (section [*]).

Displaying the control flow graph is interesting to precisely analyze some codes in order to figure out what structural optimizations to apply. This kind of graph view is also used to display the interprocedural control flow graph (see section [*]) and the more classical call graph (see section [*]) the linking the various procedures and functions of a program.

In PIPS, the control graph is represented with a hierachical control flow graph, as explained about the in the . The controf flow graph is hierarchical since a statement can contain an unstructured graph to represent another control graph with no edge outside of the statement.

The control flow is represented in a directed graph of nodes. For example, a leads to an edge from the source to the destination, an with some s leads to one edge to the branch and another one to the branch.

As a consequence a node without predecessor is unreachable and can be discarded (see section [*]).

In the output, the following style hints are used:

  • the first statement block of the programm is yellow;
  • the entry node of an is a light green ellipse;
  • the exit node of an is a light grey ellipse;
  • an unstructured (that is an with some s) is a cyan rhombus (if it is the entry of the it is a light green one);
  • a branch is blue;
  • a branch is red.

 

:
displays the dependence graph view. [*];

 

:
displays the array dataflow graph of the code, that is the information used to track array data flowing in the program as used in the method also known as Feautrier's one. Just an example to explain a little bit the output:
INS_100:
********
 Execution Domain for 100:
{
  J - 10 <= 0 ,
- J + I + 1 <= 0 ,
  I - 10 <= 0 ,
- I + 1 <= 0 ,
} 

 ---Def-Use---> ins_130:
  Reference: X
  Transformation: [I,J]
  Governing predicate:
{
  K - 1 <= 0 ,
} 
  Execution Domain for 130:
{
- I + K + 1 <= 0 ,
- K + 1 <= 0 ,
  J - 10 <= 0 ,
- J + I + 1 <= 0 ,
  I - 10 <= 0 ,
- I + 1 <= 0 ,
}
It first describes the data generated by the instruction 0 of line 10, that is INS100 with its execution domain. There is a use-def dependence with instruction 0 of line 13, that is INS130, about array X only if the governing predicate is true, that is 21#21 here. Then, we have the execution domain on X(I,J) that used the data previously defined in INS100. Transformation: [I,J] means that X(I,J) is defined in the loop-nest INS100(i,j) with 22#22;

 

:
give a tiem base for each instruction. For example
ins_140:
     pred: TRUE
     dims: -1+3*I
means that if the predicate is true (here it is of course always true...), these instruction ins executed at time 23#23;

 

:
gives where an instruction is executed. For example
Ins_140 :I , J
means that these instruction is executed on processor 24#24;

 

:
display a tree of all the functions or procedures called in the current module. The code can also be decorated according to the options ``'' (section [*]);

 

:
display a more precise tree than the . It is the Interprocedural COntrol Flow Graph, where each call, do-loops and added according to the options ``'' (section [*]);

 

:
ask PIPS for parallelizing the code with the WP65/PUMA method, with all the prerequisite of this method on the input code. The output is in fact not 1 code but 2 ones:
  • the computational code;
  • the memory bank code that does the parallel memory feature;

 

:
ask PIPS for parallelizing the code and displaying it according the parallel dialect given in the options ``'' (section [*]);

 

:
launch a Fortran lint on the module and give the information back to the user;

 

Close:
in only, to close all display windows.

 


Transform/Edit

The Transform/Edit menu is used to apply various transformations on the current module in PIPS. Furthermore, the user can edit the code of the module as a special transformation.

 

25#25

26#26

 

:
a special transformation : the user one! Load the original code of the module. Do not forget to save your modification after you have finished (the menu File/Save Current File in a Edit window or the menu Save the file after edit in the seminal .f in ).

27#27

 


Compile

 

28#28

 

:
compile all the modules with the compiler;

 

:
run make on the program;

 

:
go a step further by trying to run the Fortran 77 output of the compiler;

 

:
this menu allows you to view one of the files generated by the compiler. For each module, the main files are the _host.f file for the scalar code and the _node.f file for the parallel code.

 

 


Options

 

29#29

30#30

 


Conclusion

 


Footnotes

... user3.1
It is not possible to give some memory back...

Document Actions

« December 2024 »
December
MoTuWeThFrSaSu
1
2345678
9101112131415
16171819202122
23242526272829
3031