1.1 Motivations
Bulk Synchronous Parallel (BSP) computing is a parallel programming
model introduced by
Valiant [27, 21, 26] to offer a
high degree of abstraction in the same way as PRAM models and yet
allow portable and predictable performance on a wide variety of
architectures. A BSP computer has three components: a homogeneous set
of processor-memory pairs, a communication network allowing inter-processor delivery of messages, and a global synchronization unit which
executes collective requests for a synchronization barrier. A wide
range of actual architectures can be seen as BSP computers.
The BSP execution model represents a parallel computation on p
processors as an alternating sequence of computation super-steps
(p asynchronous computations) and communications super-steps (data
exchanges between processors) with global synchronization. The BSP
cost model estimates execution times by a simple formula. A
computation super-step takes as long as its longest sequential process,
a global synchronization takes a fixed, system-dependent time L and
a communication super-step is completed in time proportional to the
arity h of the data exchange: the maximal number of words sent or
received by a processor during that super-step. The system-dependent
constant g, measured in time/word, is multiplied by h to obtain
the estimated communication time. It is useful to measure times in
multiples of a Flop so as to normalize g and L w.r.t. the
sequential speed of processor nodes.
Bulk synchronous parallelism (and the Coarse-Grained Multicomputer
model, CGM, which can be seen as a special case of the BSP model) has
been used for a large variety of domains: scientific computing
[3, 14], genetic algorithms [4] and
genetic programming [7], neural networks
[25], parallel databases [2], constraint
solvers [10], etc. It is worth noting that ``A
comparison of the proceedings of the eminent conference in the field,
the ACM Symposium on Parallel Algorithms and Architectures, between
the late eighties and the time from the mid nineties to today reveals
a startling change in research focus. Today, the majority of research
in parallel algorithms is within the coarse-grained, BSP style,
domain'' [6].
The main advantages of the BSP model are:
- deadlocks are avoided, and indeterminism can be either avoided or restricted to very specific cases. For example in BSPlib [12], indeterminism can only occur when using the direct remote memory access operation put: two processes can write different values at the same memory address of a third process;
- portability and performance predictability [11, 15].
Nevertheless, the majority of parallel programs written are not BSP
programs. There are two main arguments against BSP. First, the global
synchronization barrier is claimed to be expensive.
[13], for example, shows the efficiency of BSPlib
against other libraries. A more recent work [16] also
points out the advantages of the BSP model over MPI on VIA (a
lightweight protocol) networks, in particular thanks to a scheduling of
messages which can be done at the synchronization barrier (using a
Latin square) in order to avoid sequentializing the receipt of
messages.
Second, the BSP model is claimed to be too restrictive: not all parallel
algorithms fit its structured parallelism. This argument
is not false, but it carries less weight than the opponents of the BSP
model think. BSP algorithms which bear no relation to older
algorithms but which compute the same thing can be found. The
performance predictability of the BSP model even allows the design of
algorithms which could not be imagined using unstructured parallelism
(for example [2]). Divide-and-conquer parallel algorithms
are a class of algorithms which seem difficult to write using
the BSP model, and several models derived from BSP and
allowing subset synchronization have been proposed for them. We showed that
divide-and-conquer algorithms can be written using
extensions [19, 18] of our framework for
functional bulk synchronous parallel
programming [20, 17]. The execution of such
programs even follows the pure BSP model.
As we faced those criticisms in our previous work on Bulk Synchronous
Parallel ML (BSML), we decided to investigate the semantics of a new
functional parallel language, without synchronization barriers, called
Minimally Synchronous Parallel ML (MSPML). As a first phase we aimed
at having (almost) the same source language and high-level semantics
(programming view) as BSML (in particular so as to reuse with
MSPML the work done on type systems [9] and on proofs of
parallel BSML programs [8]), but with a different lower-level
semantics and implementation.
With this new language we would like to:
- have a functional semantics and a deadlock-free language, a simple cost model being no longer mandatory;
- compare the efficiency of BSML with that of MSPML, as the comparisons of BSP with other parallel paradigms were done with classical imperative languages (C, Fortran);
- investigate the expressiveness of MSPML for non-BSP-like algorithms.
MSPML will also be our framework for investigating extensions which are
not suitable for BSML, such as the nesting of parallel values, or which
are not intuitive enough in BSML, such as spatial parallel
composition. We could also mix MSPML and BSML for distributed
supercomputing: several BSML programs could run on several
parallel machines and be coordinated by an MSPML-like program.
1.2 Informal presentation
1.2.1 The Execution and Cost Model
BSPWB, for BSP Without Barrier [24], is a model
directly inspired by the BSP model. It proposes to replace the notion
of super-step by the notion of m-step, defined as follows: at each m-step, each
process performs a sequential computation phase, then a communication
phase. During this communication phase the processes exchange the data
they need for the next m-step.
The parallel machine in this model is characterized by three
parameters (expressed as multiples of the processor speed): the
number of processes p, the latency L of the network, and the time g
taken for one word to be exchanged between two processes.
This model could be applied to MSPML but it would not be accurate
enough, because the bounds used in its cost model are too coarse.
A better bound, $\Phi_{s,i}$, is given by the Message Passing Machine
(MPM) model [23]. The parameters of the Message
Passing Machine are the same as those of the BSPWB model.
For a process $i$ and an m-step $s$, the model uses the set
$\Omega_{s,i}$ defined as:
$$\Omega_{s,i} = \{\, j \mid \text{process } j \text{ sends a message to process } i \text{ at m-step } s \,\} \cup \{ i \}$$
Processes included in $\Omega_{s,i}$ are called the
``incoming partners'' of process $i$ at m-step $s$.
$\Phi_{s,i}$ is inductively defined as:
$$\Phi_{1,i} = \max\{\, w_{1,j} \mid j \in \Omega_{1,i} \,\} + (g \times h_{1,i} + L)$$
$$\Phi_{s,i} = \max\{\, \Phi_{s-1,j} + w_{s-1,j} \mid j \in \Omega_{s,i} \,\} + (g \times h_{s,i} + L)$$
where $h_{s,i} = \max\{ h^{+}_{s,i}, h^{-}_{s,i} \}$, for $i \in \{0,\ldots,p-1\}$ and $s \in \{2,\ldots,R\}$, $R$ being the number of m-steps of the program, $h^{+}_{s,i}$ and $h^{-}_{s,i}$ being the numbers of words respectively sent and received by process $i$ during m-step $s$, and $w_{s,i}$ its local computation time at that m-step. The execution time of a program is thus bounded by $\Psi = \max\{\, \Phi_{R,j} \mid j \in \{0,1,\ldots,p-1\} \,\}$.
The MPM model takes into account that a process only synchronizes with
each of its incoming partners and is therefore more accurate. The MPM
model is used as the execution and cost model for our Minimally
Synchronous Parallel ML language.
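To make these definitions concrete, here is a small worked example (not taken from the cited papers; the scenario is chosen only for illustration). Consider p = 2 processes and a program made of a single m-step (R = 1) in which process 1 requests one word from process 0, while process 0 requests nothing. Then:
$$\Omega_{1,0} = \{0\}, \qquad \Omega_{1,1} = \{0,1\}, \qquad h_{1,0} = \max\{1,0\} = 1, \qquad h_{1,1} = \max\{0,1\} = 1,$$
$$\Phi_{1,0} = w_{1,0} + (g + L), \qquad \Phi_{1,1} = \max\{w_{1,0}, w_{1,1}\} + (g + L),$$
$$\Psi = \max\{\Phi_{1,0}, \Phi_{1,1}\} = \max\{w_{1,0}, w_{1,1}\} + (g + L).$$
The maximum over the incoming partners expresses that process 1 cannot finish its communication phase before process 0 has completed its local computation, whereas process 0, whose only incoming partner is itself, does not wait for process 1.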
1.2.2 The MSPML Library
There is no implementation of a full Minimally Synchronous Parallel ML
(MSPML) language but rather a partial implementation as a library for
the functional programming language Objective
Caml [22, 5] (using TCP/IP for communications).
The so-called MSPML library is based on the following elements.
It gives access to the parameters of the underling architecture which
is considered as a Message Passing Machine (MPM).
In particular, it offers the function p:unit->int such that the
value of p() is p, the static number of processes of the
parallel machine. The value of this variable does not change during
execution. There is also an abstract polymorphic type alpha par
which represents the type of p-wide parallel vectors of objects of
type alpha , one per process. The nesting of par types is
prohibited. This can be ensured by a type system [9].
The parallel constructs of MSPML operate on parallel vectors. Those
parallel vectors are created by:
mkpar: (int -> 'a) -> 'a par
so that (mkpar f) stores (f i) on process i, for i between
0 and (p-1). We usually write fun pid -> e for f to show
that the expression e may be different on each processor. This
expression e is said to be local. The expression
(mkpar f) is a parallel object and it is said to be global.
For example the expression mkpar(fun pid -> pid) will be evaluated
to the parallel vector < 0, ..., p-1 >.
In the MPM model, an algorithm is expressed as a combination of
asynchronous local computations and phases of communication.
Asynchronous phases are programmed with mkpar and with
apply, whose type is ('a -> 'b) par -> 'a par -> 'b par.
It is such that apply (mkpar f) (mkpar e) stores (f i) (e i)
on process i.
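For instance, the following lines (a small sketch using only the primitives just described; the names doubles, vf and squares are purely illustrative) build a vector of values and a vector of functions and combine them with apply:
(* process i holds the integer 2 * i *)
let doubles = mkpar (fun pid -> 2 * pid)
(* every process holds the same squaring function *)
let vf = mkpar (fun _ x -> x * x)
(* process i now holds (2 * i) * (2 * i) *)
let squares = apply vf doubles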
The communication phases are expressed by:
get: 'a par -> int par -> 'a par
The semantics of this function is given by:
$$\mathtt{get}\ \langle v_0, \ldots, v_{p-1} \rangle\ \langle i_0, \ldots, i_{p-1} \rangle \;=\; \langle\, v_{i_0 \bmod p}, \ldots, v_{i_{p-1} \bmod p} \,\rangle$$
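As a small example of its use (a sketch written only with the primitives above; shift is an illustrative name, not a library function), the following function rotates a parallel vector: process i receives the value held by its right neighbour modulo p.
(* process i receives the value held by process (i + 1) mod p, so
   shift (mkpar (fun pid -> pid)) evaluates to < 1, ..., p-1, 0 > *)
let shift vv = get vv (mkpar (fun pid -> (pid + 1) mod (p ())))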
During the execution of an MSPML program, at each process i the
system has a variable mstep_i containing the number of the
current m-step. Each time get vv vi is called at a given
process i:
1. mstep_i is increased by one.
2. the value this process holds in the parallel vector vv is stored, together with the value of mstep_i, in the communication environment. A communication environment can be seen as an association list which relates m-step numbers with values.
3. the value j this process holds in the parallel vector vi is the number of the process from which process i wants to receive a value. Thus process i sends a request to process j: it asks for the value at m-step mstep_i. When process j receives the request (threads are dedicated to handling requests, so the work of process j is not interrupted by the request), there are two cases:
- mstep_j >= mstep_i: process j has already reached the same m-step as process i, so it looks in its communication environment for the value associated with m-step mstep_i and sends it to process i;
- mstep_j < mstep_i: nothing can be done until process j reaches the same m-step as process i.
If i = j, step 2 is of course not performed. All this can work
only if all processes call get the same number of times and in the same
order. Incorrect programs could be written when nested
parallelism is used:
mkpar (fun i -> let this = mkpar (fun i -> i) in
                if i < (p () / 2) then this
                else get this this)
This is why it is currently forbidden (a type system can enforce this
restriction [9]).
The full language would also contain a synchronous conditional
operation:
if e at n then e1 else e2
It evaluates to v1 (the value obtained by evaluating
e1) or v2 (the value obtained by evaluating e2), depending
on the value held by the parallel vector of booleans e at the process given
by the integer n. But Objective Caml is an
eager language and this synchronous conditional operation cannot be
defined as a function. That is why the core MSPML library contains the
function:
at: 'a par -> int -> 'a
to be used in the constructions:
- if (at vec pid) then ... else ...
- match at e pid with ...
at expresses a communication phase. A global
conditional is necessary to express algorithms like:
Repeat Parallel Iteration
Until Max of local errors < epsilon
(a sketch of such a loop is given at the end of this section)
Without it, the global control cannot take into account data computed
locally. It is possible to use the at function in other
situations, but one should avoid the (hidden) nesting of parallel
vectors. For example the following expression (types are given for
subexpressions to ease understanding):
(* com: (int -> bool option) par *)
let com = put (mkpar (fun i d -> if i = d then Some true else None))
and this = mkpar (fun i -> i) in
mkpar (fun i -> if i < (bsp_p () / 2) then at (apply com this) 0 else None)
where 'a option = None | Some of 'a,
is not a correct program (it can be written and compiled with the
library, but its execution will fail, and it is rejected by
our type system [9]) because the parallel
expression (apply com this) would be evaluated inside a
mkpar. It breaks the model because one part of the parallel
machine would evaluate an expression with communications while the other
part would evaluate an expression without communications: the numbering
of m-steps would no longer be consistent between processes.
For a detailed discussion of these problems (in the case of BSML),
see [9]. The following program can be safely executed
because e1 is already evaluated when it is used inside the
mkpar function:
let com = put (mkpar (fun i d -> if i = d then Some true else None))
and this = mkpar (fun i -> i) in
let e1 = at (apply com this) 0 in
mkpar (fun i -> if i < (bsp_p () / 2) then e1 else None)
Some useful functions can be defined using only the primitives.
For example the function replicate creates a parallel
vector which contains the same value everywhere.
The primitive apply can be used only for a
parallel vector of functions which take only one argument.
To deal with functions which take two arguments we need to
define the apply2 function.
let replicate x = mkpar(fun pid->x)
let apply2 vf v1 v2 = apply (apply vf v1) v2
It is also very common to apply the same sequential function at each
process. It can be done using the parfun functions: they differ
only in the number of arguments of the function to apply:
let parfun f v = apply (replicate f) v
let parfun2 f v1 v2 = apply (parfun f v1) v2
let parfun3 f v1 v2 v3 = apply (parfun2 f v1 v2) v3
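For example (assuming the definitions above; names is an illustrative name), parfun applies an ordinary sequential Caml function at every process:
(* process i holds the string "0", "1", ..., "p-1" respectively *)
let names = parfun string_of_int (mkpar (fun pid -> pid))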
The this() vector is such that process i holds the value i:
let this () = mkpar(fun pid->pid)
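Using these functions, the iteration scheme mentioned above (repeat a parallel iteration until the maximum of the local errors is smaller than epsilon) can be sketched. This is only an illustrative sketch: the sequential functions iteration and error are parameters standing for user code, and the direct reduction, in which every process fetches every local error with get, is only one possible choice.
(* maximum over all processes of the values of a parallel vector:
   at step k every process fetches the value held by process k
   (one call to get, hence one m-step) and keeps the running maximum *)
let max_of vv =
  let rec gather k acc =
    if k >= p () then acc
    else gather (k + 1) (parfun2 max acc (get vv (replicate k)))
  in
  gather 0 vv
(* repeat the parallel iteration until the maximum of the local errors
   is smaller than epsilon; iteration computes the new local data and
   error the local error between two successive values *)
let rec until_convergence iteration error epsilon data =
  let data' = parfun iteration data in
  let errors = parfun2 error data data' in
  if at (max_of errors) 0 < epsilon then data'
  else until_convergence iteration error epsilon data'
Each call to get counts for one m-step, so this direct reduction costs p m-steps per iteration; a logarithmic reduction could be used instead, at the price of a longer program.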
The semantics of the broadcast is:
$$\mathtt{bcast}\ r\ \langle v_0, \ldots, v_{p-1} \rangle = \langle\, v_{r \bmod p}, \ldots, v_{r \bmod p} \,\rangle$$
The direct broadcast function which realizes this broadcast can be written:
let bcast root vv = get vv (replicate root)
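For instance, with these definitions, bcast 0 (this ()) evaluates to the parallel vector < 0, ..., 0 >: every process receives the value held by process 0.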
A formal semantics of MSPML can be found in [1]. The
preliminary experiments done with our implementation of MSPML showed
that the MPM model applies well to MSPML. For example, the parallel
cost of the direct broadcast is $(p-1) \times s \times g + L$, where
$s$ denotes the size, in words, of the broadcast value held at the root
process: the root sends $s$ words to each of the $p-1$ other processes,
so its arity is $h = (p-1) \times s$. These experiments showed that the
actual performance of bcast follows this cost formula.