Third International Workshop on
High-level Parallel Programming and Applications (HLPP 2005)
July 4-5, 2005, Warwick University, Coventry, United Kingdom
Accepted Papers
Data-Parallel and BSP Languages and Libraries
Experimental Evaluation of BSP Programming Libraries
Peter Krusche
The model of bulk-synchronous parallel (BSP) computation supports
the implementation of portable, general-purpose algorithms with
predictable performance on different parallel computers.
Nevertheless, when programming in BSP style, the running time of an
algorithm's implementation can depend strongly on the underlying
communications library.
This study gives an overview of existing approaches to practical
BSP programming in C/C++ and Fortran and presents benchmark results
for the two main BSP-style communications libraries, BSPlib and
PUB. Furthermore, a memory-efficient matrix multiplication
algorithm was implemented and used to compare their performance on
different parallel computers and to evaluate how well the
measurements comply with theoretical predictions.
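For context, the standard BSP cost model behind such predictions
(general BSP background, not a result of this paper): a superstep in
which every processor performs at most w local operations and sends
or receives at most h words is charged

    w + g * h + l

where g is the machine's per-word communication gap and l its
synchronisation latency; summing over all supersteps gives the
predicted running time that such benchmarks test.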
A Hybrid Shared Memory Execution Model for
a Data Parallel Language with I/O
Clemens Grelck, Steffen Kuthe and Sven-Bodo Scholz
Execution of programs with data parallel language constructs is
either based on the fork/join or on the SPMD model. Whereas the
former executes a program sequentially and confines parallel
activity to the data parallel constructs, the latter executes the
whole program in parallel: while data parallel constructs are
performed cooperatively, the remaining code is replicated.
However, in the presence of I/O, not all operations may actually be
replicated without changing the program's extensional behaviour.
Consequently, even SPMD-style parallel execution contains pockets
of sequential execution, and the two execution models differ
mostly in the default execution mode.
Which execution model is better suited depends on an individual
program's characteristics. We therefore propose a hybrid
execution model that combines the advantages of fork/join- and
SPMD-style execution. The hybrid model adapts itself to the needs
of the program compiled. While some program parts are effectively
executed following a fork/join approach, others are executed in
SPMD mode, depending on the individual mix of operations. The
number of costly execution mode switches and, hence, the overhead
for synchronization and communication is reduced with respect to
both plain fork/join and SPMD approaches.
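To make the contrast concrete, here is a minimal sketch of the two
execution models using OCaml 5 domains; the names, the chunking
scheme, and the rank-0 I/O guard are illustrative assumptions of this
sketch, not the paper's implementation (which targets a compiled
data-parallel language).

    (* Fork/join: the program runs sequentially; parallel activity is
       confined to the data-parallel construct, after which we join. *)
    let nworkers = 4

    let map_fork_join f a =
      let n = Array.length a in
      let chunk = (n + nworkers - 1) / nworkers in
      let work w () =
        let lo = w * chunk and hi = min n ((w + 1) * chunk) in
        for i = lo to hi - 1 do a.(i) <- f a.(i) done
      in
      let ds = List.init nworkers (fun w -> Domain.spawn (work w)) in
      List.iter Domain.join ds          (* join: back to sequential mode *)

    (* SPMD: every worker executes the whole program; the data-parallel
       construct is split by rank, while I/O forms a sequential "pocket"
       confined to rank 0 (a real runtime would barrier before it). *)
    let spmd_main rank a =
      let n = Array.length a in
      let chunk = (n + nworkers - 1) / nworkers in
      let lo = rank * chunk and hi = min n ((rank + 1) * chunk) in
      for i = lo to hi - 1 do a.(i) <- a.(i) * 2 done  (* cooperative *)
      ;
      if rank = 0 then print_endline "done"   (* must not be replicated *)

    let () =
      let a = Array.init 16 (fun i -> i) in
      map_fork_join succ a;
      List.init nworkers (fun r -> Domain.spawn (fun () -> spmd_main r a))
      |> List.iter Domain.join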
Implementation of Parallel Data Structures in BSML
Frederic Gava
A functional data-parallel language called BSML was designed for
programming Bulk-Synchronous Parallel (BSP) algorithms; the BSP
model allows parallel programs to be ported to a wide range of
architectures. BSML is based on an extension of the ML
language with parallel operations on a parallel data structure
called a parallel vector. Execution times can be estimated, and
deadlocks and nondeterminism are avoided. Many
sequential algorithms do not have parallel counterparts and many
non-computer science researchers do not want to deal with parallel
programming. In sequential programming environments, common data
structures are often provided through reusable libraries to
simplify the development of applications. A parallel
representation of some data structures is thus an important issue
to write parallel programs without dealing with all the features
of a parallel language. In this paper we describe the
implementation in BSML of some important parallel data structures
and show how these data types can address the needs of many
potential users of parallel machines who have so far been deterred
by the complexity of parallelizing code.
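As a flavour of the programming model, here is a self-contained
sequential simulation of BSML's parallel vector and of a simple
distributed data structure built on top of it. The primitive names
follow BSML's core (bsp_p, mkpar, apply, proj), but modelling them
with arrays is an assumption of this sketch; real BSML evaluates an
'a par across the p processors of a BSP machine.

    let bsp_p = 4                            (* number of processors *)
    type 'a par = 'a array                   (* one value per processor *)

    let mkpar (f : int -> 'a) : 'a par = Array.init bsp_p f
    let apply (fs : ('a -> 'b) par) (xs : 'a par) : 'b par =
      Array.init bsp_p (fun i -> fs.(i) xs.(i))
    let proj (v : 'a par) : int -> 'a = fun i -> v.(i)

    (* A parallel data structure: a list distributed round-robin, so
       that each processor holds a local slice. *)
    let distribute (l : 'a list) : 'a list par =
      mkpar (fun i -> List.filteri (fun j _ -> j mod bsp_p = i) l)

    let () =
      let v = distribute [1; 2; 3; 4; 5; 6; 7; 8] in
      let sums = apply (mkpar (fun _ -> List.fold_left ( + ) 0)) v in
      Printf.printf "local sum at processor 0 = %d\n" (proj sums 0)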
Threads and Multitasking
HirondML: Fair Threads Migrations for Objective Caml
Julien Verlaguet and Emmanuel Chailloux
In this paper, we present HirondML, an Objective Caml library
implementing migrating fair threads. Our library is a new
adaptation of the reactive-style thread synchronisation system Fair
Threads, originally developed at Inria. The fair threads scheduling
policy is a mixture of cooperative and preemptive scheduling.
Building on its cooperative part, we design a semantics for thread
migration that takes into account the possible interactions with
the other fair threads. To minimize data copies, we adopt an
original rebind policy distinguishing local variables, which are
copied, from global variables, which are rebound when a migration
takes place. This choice allows two styles of programming, by copy
or by sharing, and is illustrated by different distributed
applications and parallelism models.
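To illustrate the cooperative half of such a scheduling policy, here
is a minimal round-robin scheduler in which threads advance between
explicit cooperation points. The types and names are assumptions of
this sketch and bear no relation to HirondML's actual API.

    type step =
      | Done                        (* thread has finished *)
      | Yield of (unit -> step)     (* cooperation point: resume later *)

    let rec run = function
      | [] -> ()
      | Done :: rest -> run rest
      | Yield k :: rest -> run (rest @ [k ()])  (* fair: requeue at end *)

    let counter name n =
      let rec loop i =
        if i > n then Done
        else begin
          Printf.printf "%s: %d\n" name i;
          Yield (fun () -> loop (i + 1))  (* cooperate after each step *)
        end
      in
      Yield (fun () -> loop 1)

    let () = run [counter "A" 3; counter "B" 3]  (* A and B alternate *)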
Integrating Remote Invocations with Asynchronism
and Cooperative Multitasking
Noemi Rodriguez and Silvana Rossetto
As the focus of distributed programming shifted to wide-area
networks in the late nineties, the limitations of classical
synchronous RPC mechanisms in this new environment triggered
research and development of new communication models. Many
developers turned their attention to message-oriented
communication, which had been regarded as too low level for
application programming in the previous decade. In this paper we
argue that, with appropriate language support, it is possible to
couple the benefits of programming with a well-known abstraction
such as RPC with the asynchronous execution model that is needed
for wide-area platforms such as grids. We propose asynchronous
function calls as the basis for all communication. The programmer
can associate a callback with an asynchronous invocation, allowing
results to be handled in an asynchronous way. But, because this is
not a natural model for programmers, we also provide synchronous
invocations, which are built over the asynchronous basic
primitive. We discuss how the concepts of functions as first-class
values and closure can help the construction of programming
abstractions. To allow the computation to proceed at a process
that has issued a remote invocation, we use cooperative
multitasking, implemented through Lua coroutines. This eliminates
many of the problems associated with multithreading, which is the
concurrency mechanism traditionally used in this setting. We use
the Lua programming language because it offers the necessary
support and because of our involvement with Lua for distributed
programming, but the approach we propose could equally be applied
in other languages. We include a performance comparison of
concurrent programs based on coroutines and on multithreading.
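The layering the paper advocates (a synchronous call built over an
asynchronous, callback-based primitive) can be sketched as follows.
The paper realises it with Lua coroutines; this stand-in drives a
plain OCaml event queue, and all names are assumptions of the sketch
(no real network calls are made).

    let pending : (unit -> unit) Queue.t = Queue.create ()

    (* Asynchronous primitive: schedule the "remote" work; the callback
       k fires later, when the event loop processes the reply. *)
    let async_call (f : 'a -> 'b) (x : 'a) (k : 'b -> unit) : unit =
      Queue.add (fun () -> k (f x)) pending

    let run_one () =
      match Queue.take_opt pending with
      | Some job -> job (); true
      | None -> false

    (* Synchronous call, built over the asynchronous one: issue the
       call, then serve other events until our own callback has fired. *)
    let sync_call f x =
      let reply = ref None in
      async_call f x (fun y -> reply := Some y);
      let rec wait () =
        match !reply with
        | Some y -> y
        | None -> if run_one () then wait () else failwith "no reply"
      in
      wait ()

    let () =
      async_call (fun n -> n * n) 7 (fun y -> Printf.printf "async: %d\n" y);
      Printf.printf "sync: %d\n" (sync_call (( + ) 1) 41);
      while run_one () do () done   (* drain any remaining callbacks *)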
Implementing Algorithmic Skeletons
Shared Message Buffering without Intermediate Memory Copy
Gagarine Yaikhom
This paper presents algorithms that allow message-passing APIs to
buffer messages while avoiding intermediate memory copies. Shared
buffering for one-to-many communications is also discussed. The
algorithms are explained and implemented based on the concept of
buffered branching channels. Differences between the new
interfaces and the MPI interfaces are discussed. To illustrate the
advantages of the new interfaces, an implementation of the
pipeline algorithmic skeleton is discussed.
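The core idea of sharing one buffer among several receivers instead
of copying it per receiver can be illustrated with a simple
reference-counting scheme; this models the concept only and is not
the paper's buffered-branching-channel interface, which this sketch
does not attempt to reproduce.

    (* Reference-counted shared buffering for a one-to-many send: all
       receivers read the same buffer, so no per-receiver copy is made. *)
    type 'a shared = { data : 'a; mutable refs : int }

    let publish data nreceivers = { data; refs = nreceivers }

    (* Each receiver reads the shared buffer and drops its reference;
       the buffer becomes reclaimable once the count reaches zero. *)
    let consume s =
      s.refs <- s.refs - 1;
      if s.refs = 0 then print_endline "buffer can be released";
      s.data

    let () =
      let msg = publish (Bytes.of_string "payload") 3 in
      for _ = 1 to 3 do ignore (consume msg) done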
Improving Functional Topology Skeletons with Dynamic Channels
Jost Berthold and Rita Loogen
Parallel functional programs with implicit communication often
generate purely hierarchical communication topologies during
execution: communication only happens between parent and child
processes. Messages between siblings must be passed via the
parent. This causes inefficiencies that can be avoided by enabling
direct communication between arbitrary processes. The Eden
parallel functional language provides dynamic channels to
implement arbitrary communication topologies. This paper analyses
the impact of dynamic channels on Eden's topology skeletons,
i.e. skeletons which define process topologies such as rings,
toroids, or hypercubes. We compare topology skeletons with and
without dynamic channels with respect to the number of
messages. Our case studies confirm that dynamic channels usually
decrease the number of messages by up to 50% and can reduce
runtime by up to 40%. Detailed analysis of Eden TV (trace viewer)
execution profiles reveals the reasons for these substantial
runtime gains.
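As a rough illustration of where the 50% figure comes from (our
gloss, not an analysis taken from the paper): if every message
between siblings must be relayed through the parent, each logical
message costs two physical ones (child to parent, then parent to
child), so replacing relayed traffic with direct dynamic channels can
at best halve the message count.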
On Implementing the Farm Skeleton
Michael Poldner and Herbert Kuchen
Algorithmic skeletons are intended to simplify parallel programming
by providing a higher level of abstraction than the usual message
passing. Task-parallel and data-parallel skeletons can be
distinguished. In the present paper, we will consider several
approaches to implement one of the most classical task parallel
skeletons, namely the farm, and compare them w.r.t. scalability,
overhead, potential bottlenecks, and load balancing. We will also
investigate several communication modes for the implementation of
skeletons. Based on experimental results, the advantages and
disadvantages of the different approaches are shown. Moreover, we
will show how to terminate the system of processes properly.
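For readers unfamiliar with the skeleton, here is a minimal farm in
OCaml 5: workers pull tasks from a shared pool, which gives implicit
load balancing, and terminate when the pool drains. This is a
hypothetical stand-in for illustration, not one of the implementation
variants the paper compares; result order is nondeterministic.

    let farm ~workers (f : 'a -> 'b) (tasks : 'a list) : 'b list =
      let pool = Queue.create () and lock = Mutex.create () in
      List.iter (fun t -> Queue.add t pool) tasks;
      let results = ref [] in
      let worker () =
        let rec loop () =
          Mutex.lock lock;
          let task = Queue.take_opt pool in
          Mutex.unlock lock;
          match task with
          | None -> ()                        (* pool empty: terminate *)
          | Some t ->
            let r = f t in
            Mutex.lock lock; results := r :: !results; Mutex.unlock lock;
            loop ()
        in
        loop ()
      in
      List.init workers (fun _ -> Domain.spawn worker)
      |> List.iter Domain.join;
      !results

    let () =
      farm ~workers:4 (fun n -> n * n) [1; 2; 3; 4; 5; 6; 7; 8]
      |> List.iter (Printf.printf "%d ");
      print_newline ()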
Algorithmic Skeleton Tools and Libraries
High Level Parallel Skeletons for Dynamic Programming
Ignacio Pelaez, Francisco Almeida, Daniel Gonzalez
Dynamic Programming is an important problem-solving technique used
for solving a wide variety of optimization problems. Dynamic
Programming programs are commonly designed as individual
applications, and software tools are usually tailored to specific
classes of recurrences and methodologies. This contrasts with some
other algorithmic techniques where a single generic program may
solve all the instances. We have developed a general skeleton tool
providing support for a wide range of dynamic programming
methodologies on different parallel paradigms. Genericity,
extensibility and efficiency are basic issues of the design
strategy. Parallelism is supplied to the user in a transparent
manner through a common sequential interface. A set of test
problems representative of different classes of Dynamic Programming
formulations has been used to validate our skeleton on an IBM SP.
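To convey the kind of interface such a tool offers, here is a small
sequential memoising skeleton in which the user supplies only the
recurrence. The names and the single-integer subproblem index are
assumptions of this sketch; the paper's tool additionally
parallelises the table evaluation and supports richer recurrence
classes.

    (* The recurrence receives the solver itself for subproblem
       lookups; the table is filled on demand. *)
    let dp_solve (recurrence : (int -> int) -> int -> int) (goal : int) =
      let table = Hashtbl.create 64 in
      let rec solve i =
        match Hashtbl.find_opt table i with
        | Some v -> v                      (* already tabulated *)
        | None ->
          let v = recurrence solve i in
          Hashtbl.add table i v;
          v
      in
      solve goal

    (* Example instance: the Fibonacci recurrence as a trivial DP. *)
    let () =
      Printf.printf "%d\n"
        (dp_solve (fun f i -> if i < 2 then i else f (i - 1) + f (i - 2)) 30)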
Skeletal Parallel Programming with OcamlP3L 2.0
R. Di Cosmo, Z. Li, S. Pelagatti, P. Weis
Parallel programming has proved to be an effective technique
to improve the performance of computationally intensive
programs. However, writing parallel programs is not easy, as new
issues need to be addressed, and activities such as debugging are
usually hard and time consuming. To cope with these difficulties,
skeletal parallel programming has been widely explored in recent
years with very promising results. However, prototypal skeletal
systems developed so far tend to be rather inflexible and
difficult to adapt to many practical parallelization scenarios.
For instance, many systems require all the parallel structure of
an application to be expressed in terms of possibly nested
skeletons, which may be hard when parallelizing a large complex
application. Moreover, it is usually difficult to share resources
among different skeleton instances and to reuse similar instances
of the same skeleton without duplicating code. This paper
describes OcamlP3L 2.0, which significantly changes the skeletal
model of version 1.0 to make the system more usable and flexible. We
report on the current status of the OcamlP3L system. In
particular, we describe the new skeletons and the skeletal
execution model, how applications can be structured using the
available skeletons, and how skeletons are compiled and optimized,
and we provide some small examples.
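As a flavour of the skeletal style, here is a purely sequential model
of skeleton nesting expressed as OCaml combinators. The combinator
names echo common P3L skeletons, but these definitions are
illustrative assumptions: they ignore the parallel semantics and do
not match the actual OcamlP3L signatures.

    let seq f = f                        (* wrap a sequential function *)
    let pipeline f g = fun x -> g (f x)  (* stage f feeds stage g *)
    let farm f = f                       (* workers replicated in real runs *)
    let mapvector f = Array.map f        (* data-parallel map over a vector *)

    (* Nesting skeletons: a farm of (+1) workers feeding a (*2) stage. *)
    let app = pipeline (farm (seq (fun x -> x + 1))) (seq (fun x -> x * 2))

    let () =
      mapvector app [| 1; 2; 3 |]
      |> Array.iter (Printf.printf "%d ");   (* prints: 4 6 8 *)
      print_newline ()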
Performance Models
Evaluating Computational Costs While Handling Data and Control
Parallelism
Sonia Campa
The aim of this work is to introduce a computational cost system
associated with a semantic framework for handling data and control
parallelism orthogonally. In this framework a parallel application
is described by a semantic expression combining data-access and
control-parallelism abstractions in an orthogonal manner. The
evaluation of such an expression is driven by a set of rewriting
rules, each of which is associated with a computational cost. We
show how to evaluate the final cost of the application, and how
this information, together with the capabilities of the semantic
framework, can be exploited to increase overall performance.
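A toy version of the idea, with the usual caveat that the concrete
syntax, rewriting rules, and cost combinators of the paper's calculus
differ: each constructor of an expression carries a rule for
combining the costs of its parts.

    type expr =
      | Work of int           (* local computation with a known cost *)
      | Seq of expr * expr    (* sequential composition *)
      | Par of expr * expr    (* parallel composition *)

    let rec cost = function
      | Work w -> w
      | Seq (a, b) -> cost a + cost b        (* costs add up in sequence *)
      | Par (a, b) -> max (cost a) (cost b)  (* slower branch dominates *)

    let () =
      Printf.printf "cost = %d\n"
        (cost (Seq (Par (Work 4, Work 6), Work 2)))
      (* max(4, 6) + 2 = 8 *)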
Automatic deployment of ASSIST applications using process algebra
Marco Aldinucci and Anne Benoit
The majority of real-world applications tend to be demanding in
both computing power and storage requirements. Grid computing
facilitates the development of large scientific applications on an
unprecedented scale. The development of grid applications can be
simplified by using high-level programming environments.
Moreover, the use of such environments allows the programmer to
optimise the performance of the application for both static and
dynamic considerations. Thus, in the ASSIST environment, a dynamic
management of the application can be performed, such that the
application adapts itself to the changing availability and
performance of the resources, which is specific to grid
environments. In the present work, we address the problem of
application deployment by introducing performance models of ASSIST
applications based on process algebra. These models are automatically
generated before deployment, and thus they allow us to optimise
the static deployment of the application. Results are shown to
demonstrate the efficiency of this approach. Our methodology is
presented through an example of a classical Divide-and-Conquer
algorithm.
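The optimisation step can be pictured as follows: evaluate a
performance model for every candidate deployment and keep the best
one. The model below (a pipeline bounded by its slowest stage) is a
deliberately simplistic assumption of this sketch, standing in for
the process-algebra models the paper generates automatically.

    (* Predicted throughput of one candidate mapping, given the service
       rate of each stage under that mapping. *)
    let predicted_rate (stage_rates : float list) : float =
      List.fold_left min infinity stage_rates  (* slowest stage dominates *)

    let best_deployment (candidates : float list list) =
      List.fold_left
        (fun best c ->
           match best with
           | Some (_, r) when r >= predicted_rate c -> best
           | _ -> Some (c, predicted_rate c))
        None candidates

    let () =
      match best_deployment [ [2.0; 3.0]; [2.5; 2.5] ] with
      | None -> ()
      | Some (m, r) ->
        Printf.printf "best rate %.2f for [%s]\n" r
          (String.concat "; " (List.map string_of_float m))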