1.1 Motivations
Bulk Synchronous Parallel (BSP) computing is a parallel programming
model introduced by
Valiant [27, 21, 26] to offer a
high degree of abstraction in the same way as PRAM models and yet
allow portable and predictable performance on a wide variety of
architectures. A BSP computer has three components: a homogeneous set
of processor-memory pairs, a communication network allowing inter-processor delivery of messages, and a global synchronization unit which
executes collective requests for a synchronization barrier. A wide
range of actual architectures can be seen as BSP computers.
The BSP execution model represents a parallel computation on p
processors as an alternating sequence of computation super-steps
(p asynchronous computations) and communications super-steps (data
exchanges between processors) with global synchronization. The BSP
cost model estimates execution times by a simple formula. A
computation super-step takes as long as its longest sequential process,
a global synchronization takes a fixed, system-dependent time L and
a communication super-step is completed in time proportional to the
arity h of the data exchange: the maximal number of words sent or
received by a processor during that super-step. The system-dependent
constant g, measured in time/word, is multiplied by h to obtain
the estimated communication time. It is useful to measure times in
multiples of a Flop so as to normalize g and L w.r.t. the
sequential speed of processor nodes.
Bulk synchronous parallelism (and the Coarse-Grained Multicomputer
model, CGM, which can be seen as a special case of the BSP model) has
been used for a large variety of domains: scientific computing
[3, 14], genetic algorithms [4] and
genetic programming [7], neural networks
[25], parallel databases [2], constraint
solvers [10], etc. It is worth noting that ``A
comparison of the proceedings of the eminent conference in the field,
the ACM Symposium on Parallel Algorithms and Architectures, between
the late eighties and the time from the mid nineties to today reveals
a startling change in research focus. Today, the majority of research
in parallel algorithms is within the coarse-grained, BSP style,
domain'' [6].
The main advantages of the BSP model are:
- deadlocks are avoided, and indeterminism can be either avoided or restricted to very specific cases. For example in BSPlib [12], indeterminism can only occur when using the direct remote memory access operation put: two processes can write different values at the same memory address of a third process;
- portability and performance predictability [11, 15].
Nevertheless, the majority of parallel programs written are not BSP
programs. There are two main arguments against BSP. First, the global
synchronization barrier is claimed to be expensive.
[13], for example, shows the efficiency of BSPlib
against other libraries. A more recent work [16] also
points out the advantages of the BSP model over MPI on VIA (a
lightweight protocol) networks, in particular thanks to a scheduling of
messages which can be done at the synchronization barrier (using a
Latin square) in order to avoid sequentializing the receipt of
messages.
Second, the BSP model is claimed to be too restrictive: not all parallel
algorithms fit its structured parallelism. This argument
is not false, but it carries less weight than the opponents of the BSP
model think. BSP algorithms which bear no relation to older
algorithms but which compute the same thing can be found. The
performance predictability of the BSP model even allows the design of
algorithms which could not be imagined using unstructured parallelism
(for example [2]). Divide-and-conquer parallel algorithms
are a class of algorithms which seem difficult to write using
the BSP model, and several models derived from BSP and
allowing subset synchronization have been proposed for them. We showed that
divide-and-conquer algorithms can be written using
extensions [19, 18] of our framework for
functional bulk synchronous parallel
programming [20, 17]. The execution of such
programs even follows the pure BSP model.
As we faced those criticisms in our previous work on Bulk Synchronous
Parallel ML (BSML), we decided to investigate the semantics of a new
functional parallel language, without synchronization barriers, called
Minimally Synchronous Parallel ML (MSPML). As a first phase we aimed
at having (almost) the same source language and high-level semantics
(programming view) as BSML (in particular so as to reuse with
MSPML the work done on type systems [9] and on proofs of
parallel BSML programs [8]), but with a different lower-level
semantics and implementation.
With this new language we would like to:
- have a functional semantics and a deadlock-free language, a simple cost model being no longer mandatory;
- compare the efficiency of BSML with that of MSPML, as the comparisons of BSP with other parallel paradigms were done with classical imperative languages (C, Fortran);
- investigate the expressiveness of MSPML for non-BSP-like algorithms.
MSPML will also be our framework for investigating extensions which are
not suitable for BSML, such as the nesting of parallel values, or which
are not intuitive enough in BSML, such as spatial parallel
composition. We could also mix MSPML and BSML for distributed
supercomputing: several BSML programs could run on several
parallel machines and be coordinated by an MSPML-like program.
1.2 Informal presentation
1.2.1 The Execution and Cost Model
BSPWB, for BSP Without Barrier [24], is a model
directly inspired by the BSP model. It proposes to replace the notion
of super-step by the notion of m-step, defined as follows: at each m-step, each
process performs a sequential computation phase, then a communication
phase. During this communication phase the processes exchange the data
they need for the next m-step.
The parallel machine in this model is characterized by three
parameters (expressed as multiples of the processor speed): the
number of processes p, the latency L of the network, and the time g
taken for one word to be exchanged between two processes.
This model could be applied to MSPML but it would not be accurate
enough, because the bounds used in its cost model are too coarse.
A better bound, $\Phi_{s,i}$, is given by the Message Passing Machine
(MPM) model [23]. The parameters of the Message
Passing Machine are the same as those of the BSPWB model.
For a process $i$ and an m-step $s$, the model uses the set
$\Omega_{s,i}$ defined as:
$$\Omega_{s,i} = \{\, j \mid \text{process } j \text{ sends a message to process } i \text{ at m-step } s \,\} \cup \{ i \}$$
Processes included in $\Omega_{s,i}$ are called the
``incoming partners'' of process $i$ at m-step $s$.
$\Phi_{s,i}$ is inductively defined as:
$$\Phi_{1,i} = \max\{\, w_{1,j} \mid j \in \Omega_{1,i} \,\} + (g \times h_{1,i} + L)$$
$$\Phi_{s,i} = \max\{\, \Phi_{s-1,j} + w_{s-1,j} \mid j \in \Omega_{s,i} \,\} + (g \times h_{s,i} + L)$$
where $h_{s,i} = \max\{ h^{+}_{s,i}, h^{-}_{s,i} \}$, for $i \in \{0,\ldots,p-1\}$ and $s \in \{2,\ldots,R\}$, $R$ being the number of m-steps of the program, $h^{+}_{s,i}$ and $h^{-}_{s,i}$ being the numbers of words respectively sent and received by process $i$ during m-step $s$, and $w_{s,i}$ its local computation time at that m-step. The execution time of a program is thus bounded by $\Psi = \max\{\, \Phi_{R,j} \mid j \in \{0,1,\ldots,p-1\} \,\}$.
The MPM model takes into account that a process only synchronizes with
each of its incoming partners and is therefore more accurate. The MPM
model is used as the execution and cost model for our Minimally
Synchronous Parallel ML language.
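To make these definitions concrete, here is a small worked example (not taken from the cited papers; the scenario is chosen only for illustration). Consider p = 2 processes and a program made of a single m-step (R = 1) in which process 1 requests one word from process 0, while process 0 requests nothing. Then:
$$\Omega_{1,0} = \{0\}, \qquad \Omega_{1,1} = \{0,1\}, \qquad h_{1,0} = \max\{1,0\} = 1, \qquad h_{1,1} = \max\{0,1\} = 1,$$
$$\Phi_{1,0} = w_{1,0} + (g + L), \qquad \Phi_{1,1} = \max\{w_{1,0}, w_{1,1}\} + (g + L),$$
$$\Psi = \max\{\Phi_{1,0}, \Phi_{1,1}\} = \max\{w_{1,0}, w_{1,1}\} + (g + L).$$
The maximum over the incoming partners expresses that process 1 cannot finish its communication phase before process 0 has completed its local computation, whereas process 0, whose only incoming partner is itself, does not wait for process 1.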
1.2.2 The MSPML Library
There is no implementation of a full Minimally Synchronous Parallel ML
(MSPML) language but rather a partial implementation as a library for
the functional programming language Objective
Caml [22, 5] (using TCP/IP for communications).
The so-called MSPML library is based on the following elements.
It gives access to the parameters of the underling architecture which
is considered as a Message Passing Machine (MPM).
In particular, it offers the function p:unit->int such that the
value of p() is p, the static number of processes of the
parallel machine. The value of this variable does not change during
execution. There is also an abstract polymorphic type alpha par
which represents the type of p-wide parallel vectors of objects of
type alpha , one per process. The nesting of par types is
prohibited. This can be ensured by a type system [9].
The parallel constructs of MSPML operate on parallel vectors. Those
parallel vectors are created by:
mkpar: (int -> 'a) -> 'a par
so that (mkpar f) stores (f i) on process i, for i between
0 and (p-1). We usually write fun pid -> e for f to show
that the expression e may be different on each processor. This
expression e is said to be local. The expression
(mkpar f) is a parallel object and it is said to be global.
For example the expression mkpar(fun pid -> pid) will be evaluated
to the parallel vector < 0, ..., p-1 >.
In the MPM model, an algorithm is expressed as a combination of
asynchronous local computations and phases of communication.
Asynchronous phases are programmed with mkpar and with
apply, whose type is ('a -> 'b) par -> 'a par -> 'b par.
It is such that apply (mkpar f) (mkpar e) stores (f i) (e i)
on process i.
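For instance, the following lines (a small sketch using only the primitives just described; the names doubles, vf and squares are purely illustrative) build a vector of values and a vector of functions and combine them with apply:
(* process i holds the integer 2 * i *)
let doubles = mkpar (fun pid -> 2 * pid)
(* every process holds the same squaring function *)
let vf = mkpar (fun _ x -> x * x)
(* process i now holds (2 * i) * (2 * i) *)
let squares = apply vf doubles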
The communication phases are expressed by:
get: 'a par -> int par -> 'a par
The semantics of this function is given by:
$$\mathtt{get}\ \langle v_0, \ldots, v_{p-1} \rangle\ \langle i_0, \ldots, i_{p-1} \rangle \;=\; \langle\, v_{i_0 \bmod p}, \ldots, v_{i_{p-1} \bmod p} \,\rangle$$
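As a small example of its use (a sketch written only with the primitives above; shift is an illustrative name, not a library function), the following function rotates a parallel vector: process i receives the value held by its right neighbour modulo p.
(* process i receives the value held by process (i + 1) mod p, so
   shift (mkpar (fun pid -> pid)) evaluates to < 1, ..., p-1, 0 > *)
let shift vv = get vv (mkpar (fun pid -> (pid + 1) mod (p ())))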
During the execution of an MSPML program, at each process i the
system has a variable mstep_i containing the number of the
current m-step. Each time get vv vi is called at a given
process i:
1. mstep_i is increased by one.
2. the value this process holds in the parallel vector vv is stored, together with the value of mstep_i, in the communication environment. A communication environment can be seen as an association list which relates m-step numbers with values.
3. the value j this process holds in the parallel vector vi is the number of the process from which process i wants to receive a value. Thus process i sends a request to process j: it asks for the value at m-step mstep_i. When process j receives the request (threads are dedicated to handling requests, so the work of process j is not interrupted by the request), there are two cases:
- mstep_j >= mstep_i: process j has already reached the same m-step as process i, so it looks in its communication environment for the value associated with m-step mstep_i and sends it to process i;
- mstep_j < mstep_i: nothing can be done until process j reaches the same m-step as process i.
If i = j, step 2 is of course not performed. All this can work
only if all processes call get the same number of times and in the same
order. Incorrect programs could be written when nested
parallelism is used:
mkpar (fun i -> let this = mkpar (fun i -> i) in
                if i < (p () / 2) then this
                else get this this)
This is why it is currently forbidden (a type system can enforce this
restriction [9]).
The full language would also contain a synchronous conditional
operation:
if e at n then e1 else e2
It evaluates to v1 (the value obtained by evaluating
e1) or v2 (the value obtained by evaluating e2), depending
on the value held by the parallel vector of booleans e at the process given
by the integer n. But Objective Caml is an
eager language and this synchronous conditional operation cannot be
defined as a function. That is why the core MSPML library contains the
function:
at: 'a par -> int -> 'a
to be used in the constructions:
- if (at vec pid) then ... else ...
- match at e pid with ...
at expresses a communication phase. A global
conditional is necessary to express algorithms like:
Repeat Parallel Iteration
Until Max of local errors < epsilon
(a sketch of such a loop is given at the end of this section)
Without it, the global control cannot take into account data computed
locally. It is possible to use the at function in other
situations, but one should avoid the (hidden) nesting of parallel
vectors. For example the following expression (types are given for
subexpressions to ease understanding):
(* com: (int -> bool option) par *)
let com = put (mkpar (fun i d -> if i = d then Some true else None))
and this = mkpar (fun i -> i) in
mkpar (fun i -> if i < (bsp_p () / 2) then at (apply com this) 0 else None)
where 'a option = None | Some of 'a,
is not a correct program (it can be written and compiled with the
library, but its execution will fail, and it is rejected by
our type system [9]) because the parallel
expression (apply com this) would be evaluated inside a
mkpar. It breaks the model because one part of the parallel
machine would evaluate an expression with communications while the other
part would evaluate an expression without communications: the numbering
of m-steps would no longer be consistent between processes.
For a detailed discussion of these problems (in the case of BSML),
see [9]. The following program can be safely executed
because e1 is already evaluated when it is used inside the
mkpar function:
let com = put (mkpar (fun i d -> if i = d then Some true else None))
and this = mkpar (fun i -> i) in
let e1 = at (apply com this) 0 in
mkpar (fun i -> if i < (bsp_p () / 2) then e1 else None)
Some useful functions can be defined using only the primitives.
For example the function replicate creates a parallel
vector which contains the same value everywhere.
The primitive apply can be used only for a
parallel vector of functions which take only one argument.
To deal with functions which take two arguments we need to
define the apply2 function.
let replicate x = mkpar(fun pid->x)
let apply2 vf v1 v2 = apply (apply vf v1) v2
It is also very common to apply the same sequential function at each
process. It can be done using the parfun functions: they differ
only in the number of arguments of the function to apply:
let parfun f v = apply (replicate f) v
let parfun2 f v1 v2 = apply (parfun f v1) v2
let parfun3 f v1 v2 v3 = apply (parfun2 f v1 v2) v3
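For example (assuming the definitions above; names is an illustrative name), parfun applies an ordinary sequential Caml function at every process:
(* process i holds the string "0", "1", ..., "p-1" respectively *)
let names = parfun string_of_int (mkpar (fun pid -> pid))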
The this() vector is such that process i holds the value i:
let this () = mkpar(fun pid->pid)
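Using these functions, the iteration scheme mentioned above (repeat a parallel iteration until the maximum of the local errors is smaller than epsilon) can be sketched. This is only an illustrative sketch: the sequential functions iteration and error are parameters standing for user code, and the direct reduction, in which every process fetches every local error with get, is only one possible choice.
(* maximum over all processes of the values of a parallel vector:
   at step k every process fetches the value held by process k
   (one call to get, hence one m-step) and keeps the running maximum *)
let max_of vv =
  let rec gather k acc =
    if k >= p () then acc
    else gather (k + 1) (parfun2 max acc (get vv (replicate k)))
  in
  gather 0 vv
(* repeat the parallel iteration until the maximum of the local errors
   is smaller than epsilon; iteration computes the new local data and
   error the local error between two successive values *)
let rec until_convergence iteration error epsilon data =
  let data' = parfun iteration data in
  let errors = parfun2 error data data' in
  if at (max_of errors) 0 < epsilon then data'
  else until_convergence iteration error epsilon data'
Each call to get counts for one m-step, so this direct reduction costs p m-steps per iteration; a logarithmic reduction could be used instead, at the price of a longer program.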
The semantics of the broadcast is:
$$\mathtt{bcast}\ r\ \langle v_0, \ldots, v_{p-1} \rangle = \langle\, v_{r \bmod p}, \ldots, v_{r \bmod p} \,\rangle$$
The direct broadcast function which realizes this broadcast can be written:
let bcast root vv = get vv (replicate root)
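For instance, with these definitions, bcast 0 (this ()) evaluates to the parallel vector < 0, ..., 0 >: every process receives the value held by process 0.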
A formal semantics of MSPML can be found in [1]. The
preliminary experiments done with our implementation of MSPML showed
that the MPM model applies well to MSPML. For example, the parallel
cost of the direct broadcast is $(p-1) \times s \times g + L$, where
$s$ denotes the size, in words, of the broadcast value held at the root
process: the root sends $s$ words to each of the $p-1$ other processes,
so its arity is $h = (p-1) \times s$. These experiments showed that the
actual performance of bcast follows this cost formula.