Third International Workshop on
High-level Parallel Programming and Applications (HLPP 2005)
July 4-5, 2005, Warwick University, Coventry, United Kingdom
Accepted Papers
Data-Parallel and BSP Languages and Libraries
Experimental Evaluation of BSP Programming Libraries
Peter Krusche
The model of bulk-synchronous parallel (BSP) computation supports
the implementation of portable, general-purpose algorithms with
predictable performance on different parallel computers.
Nevertheless, when programming in BSP style, the running time of an
algorithm's implementation can depend strongly on the underlying
communications library.
This study gives an overview of existing approaches to practical
BSP programming in C/C++ and Fortran and presents benchmark results
for the two main BSP-style communications libraries, BSPlib and
PUB. Furthermore, a memory-efficient matrix multiplication
algorithm was implemented and used to compare their performance on
different parallel computers and to evaluate how well the
measurements comply with theoretical predictions.
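For context, the standard BSP cost model behind such predictions
(general BSP background, not a result of this paper): a superstep in
which every processor performs at most w local operations and sends
or receives at most h words is charged

    w + g * h + l

where g is the machine's per-word communication gap and l its
synchronisation latency; summing over all supersteps gives the
predicted running time that such benchmarks test.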
A Hybrid Shared Memory Execution Model for
a Data Parallel Language with I/O
Clemens Grelck, Steffen Kuthe and Sven-Bodo Scholz
Execution of programs with data parallel language constructs is
either based on the fork/join or on the SPMD model. Whereas the
former executes a program sequentially and confines parallel
activity to the data parallel constructs, the latter executes the
whole program in parallel: while data parallel constructs are
performed cooperatively, the remaining code is replicated.
However, in the presence of I/O, not all operations may actually be
replicated without changing the program's extensional behaviour.
Consequently, even SPMD-style parallel execution contains pockets
of sequential execution, and the two execution models differ
mostly in the default execution mode.
Which execution model is better suited depends on an individual
program's characteristics. We therefore propose a hybrid
execution model that combines the advantages of fork/join- and
SPMD-style execution. The hybrid model adapts itself to the needs
of the program compiled. While some program parts are effectively
executed following a fork/join approach, others are executed in
SPMD mode, depending on the individual mix of operations. The
number of costly execution mode switches and, hence, the overhead
for synchronization and communication is reduced with respect to
both plain fork/join and SPMD approaches.
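To make the contrast concrete, here is a minimal sketch of the two
execution models using OCaml 5 domains; the names, the chunking
scheme, and the rank-0 I/O guard are illustrative assumptions of this
sketch, not the paper's implementation (which targets a compiled
data-parallel language).

    (* Fork/join: the program runs sequentially; parallel activity is
       confined to the data-parallel construct, after which we join. *)
    let nworkers = 4

    let map_fork_join f a =
      let n = Array.length a in
      let chunk = (n + nworkers - 1) / nworkers in
      let work w () =
        let lo = w * chunk and hi = min n ((w + 1) * chunk) in
        for i = lo to hi - 1 do a.(i) <- f a.(i) done
      in
      let ds = List.init nworkers (fun w -> Domain.spawn (work w)) in
      List.iter Domain.join ds          (* join: back to sequential mode *)

    (* SPMD: every worker executes the whole program; the data-parallel
       construct is split by rank, while I/O forms a sequential "pocket"
       confined to rank 0 (a real runtime would barrier before it). *)
    let spmd_main rank a =
      let n = Array.length a in
      let chunk = (n + nworkers - 1) / nworkers in
      let lo = rank * chunk and hi = min n ((rank + 1) * chunk) in
      for i = lo to hi - 1 do a.(i) <- a.(i) * 2 done  (* cooperative *)
      ;
      if rank = 0 then print_endline "done"   (* must not be replicated *)

    let () =
      let a = Array.init 16 (fun i -> i) in
      map_fork_join succ a;
      List.init nworkers (fun r -> Domain.spawn (fun () -> spmd_main r a))
      |> List.iter Domain.join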
Implementation of Parallel Data Structures in BSML
Frederic Gava
A functional data-parallel language called BSML was designed for
programming Bulk-Synchronous Parallel (BSP) algorithms; the BSP
model allows parallel programs to be ported to a wide range of
architectures. BSML is based on an extension of the ML
language with parallel operations on a parallel data structure
called a parallel vector. Execution times can be estimated, and
deadlocks and nondeterminism are avoided. Many
sequential algorithms do not have parallel counterparts and many
non-computer science researchers do not want to deal with parallel
programming. In sequential programming environments, common data
structures are often provided through reusable libraries to
simplify the development of applications. A parallel
representation of some data structures is thus an important issue
to write parallel programs without dealing with all the features
of a parallel language. In this paper we describe the
implementation in BSML of some important parallel data structures
and show how these data types can address the needs of many
potential users of parallel machines who have so far been deterred
by the complexity of parallelizing code.
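As a flavour of the programming model, here is a self-contained
sequential simulation of BSML's parallel vector and of a simple
distributed data structure built on top of it. The primitive names
follow BSML's core (bsp_p, mkpar, apply, proj), but modelling them
with arrays is an assumption of this sketch; real BSML evaluates an
'a par across the p processors of a BSP machine.

    let bsp_p = 4                            (* number of processors *)
    type 'a par = 'a array                   (* one value per processor *)

    let mkpar (f : int -> 'a) : 'a par = Array.init bsp_p f
    let apply (fs : ('a -> 'b) par) (xs : 'a par) : 'b par =
      Array.init bsp_p (fun i -> fs.(i) xs.(i))
    let proj (v : 'a par) : int -> 'a = fun i -> v.(i)

    (* A parallel data structure: a list distributed round-robin, so
       that each processor holds a local slice. *)
    let distribute (l : 'a list) : 'a list par =
      mkpar (fun i -> List.filteri (fun j _ -> j mod bsp_p = i) l)

    let () =
      let v = distribute [1; 2; 3; 4; 5; 6; 7; 8] in
      let sums = apply (mkpar (fun _ -> List.fold_left ( + ) 0)) v in
      Printf.printf "local sum at processor 0 = %d\n" (proj sums 0)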
Threads and Multitasking
HirondML: Fair Threads Migrations for Objective Caml
Julien Verlaguet and Emmanuel Chailloux
In this paper, we present HirondML, an Objective Caml library
implementing migrating fair threads. Our library is a new
adaptation of the reactive-style thread synchronisation system Fair
Threads, originally developed at Inria. The fair threads scheduling
policy is a mixture of cooperative and preemptive scheduling.
Building on its cooperative part, we design a semantics for thread
migration that takes into account the possible interactions with
the other fair threads. To minimize data copies, we adopt an
original rebind policy distinguishing local variables, which are
copied, from global variables, which are rebound when a migration
takes place. This choice allows two styles of programming, by copy
or by sharing, and is illustrated by different distributed
applications and parallelism models.
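To illustrate the cooperative half of such a scheduling policy, here
is a minimal round-robin scheduler in which threads advance between
explicit cooperation points. The types and names are assumptions of
this sketch and bear no relation to HirondML's actual API.

    type step =
      | Done                        (* thread has finished *)
      | Yield of (unit -> step)     (* cooperation point: resume later *)

    let rec run = function
      | [] -> ()
      | Done :: rest -> run rest
      | Yield k :: rest -> run (rest @ [k ()])  (* fair: requeue at end *)

    let counter name n =
      let rec loop i =
        if i > n then Done
        else begin
          Printf.printf "%s: %d\n" name i;
          Yield (fun () -> loop (i + 1))  (* cooperate after each step *)
        end
      in
      Yield (fun () -> loop 1)

    let () = run [counter "A" 3; counter "B" 3]  (* A and B alternate *)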
Integrating Remote Invocations with Asynchronism
and Cooperative Multitasking
Noemi Rodriguez and Silvana Rossetto
As the focus of distributed programming shifted to wide-area
networks in the late nineties, the limitations of classical
synchronous RPC mechanisms in this new environment triggered
research and development of new communication models. Many
developers turned their attention to message-oriented
communication, which had been regarded as too low level for
application programming in the previous decade. In this paper we
argue that, with appropriate language support, it is possible to
couple the benefits of programming with a well-known abstraction
such as RPC with the asynchronous execution model that is needed
for wide-area platforms such as grids. We propose asynchronous
function calls as the basis for all communication. The programmer
can associate a callback with an asynchronous invocation, allowing
results to be handled in an asynchronous way. But, because this is
not a natural model for programmers, we also provide synchronous
invocations, which are built over the asynchronous basic
primitive. We discuss how the concepts of functions as first-class
values and closure can help the construction of programming
abstractions. To allow the computation to proceed at a process
that has issued a remote invocation, we use cooperative
multitasking, implemented through Lua coroutines. This eliminates
many of the problems associated with multithreading, which is the
concurrency mechanism traditionally used in this setting. We use
the Lua programming language because it offers the necessary
support and because of our involvement with Lua for distributed
programming, but the approach we propose could equally be applied
in other languages. We include a performance comparison of
concurrent programs based on coroutines and on multithreading.
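The layering the paper advocates (a synchronous call built over an
asynchronous, callback-based primitive) can be sketched as follows.
The paper realises it with Lua coroutines; this stand-in drives a
plain OCaml event queue, and all names are assumptions of the sketch
(no real network calls are made).

    let pending : (unit -> unit) Queue.t = Queue.create ()

    (* Asynchronous primitive: schedule the "remote" work; the callback
       k fires later, when the event loop processes the reply. *)
    let async_call (f : 'a -> 'b) (x : 'a) (k : 'b -> unit) : unit =
      Queue.add (fun () -> k (f x)) pending

    let run_one () =
      match Queue.take_opt pending with
      | Some job -> job (); true
      | None -> false

    (* Synchronous call, built over the asynchronous one: issue the
       call, then serve other events until our own callback has fired. *)
    let sync_call f x =
      let reply = ref None in
      async_call f x (fun y -> reply := Some y);
      let rec wait () =
        match !reply with
        | Some y -> y
        | None -> if run_one () then wait () else failwith "no reply"
      in
      wait ()

    let () =
      async_call (fun n -> n * n) 7 (fun y -> Printf.printf "async: %d\n" y);
      Printf.printf "sync: %d\n" (sync_call (( + ) 1) 41);
      while run_one () do () done   (* drain any remaining callbacks *)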
Implementing Algorithmic Skeletons
Shared Message Buffering without Intermediate Memory Copy
Gagarine Yaikhom
This paper presents algorithms that allow message-passing APIs to
buffer messages while avoiding intermediate memory copies. Shared
buffering for one-to-many communications is also discussed. The
algorithms are explained and implemented based on the concept of
buffered branching channels. Differences between the new
interfaces and the MPI interfaces are discussed. To illustrate the
advantages of the new interfaces, an implementation of the
pipeline algorithmic skeleton is discussed.
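The core idea of sharing one buffer among several receivers instead
of copying it per receiver can be illustrated with a simple
reference-counting scheme; this models the concept only and is not
the paper's buffered-branching-channel interface, which this sketch
does not attempt to reproduce.

    (* Reference-counted shared buffering for a one-to-many send: all
       receivers read the same buffer, so no per-receiver copy is made. *)
    type 'a shared = { data : 'a; mutable refs : int }

    let publish data nreceivers = { data; refs = nreceivers }

    (* Each receiver reads the shared buffer and drops its reference;
       the buffer becomes reclaimable once the count reaches zero. *)
    let consume s =
      s.refs <- s.refs - 1;
      if s.refs = 0 then print_endline "buffer can be released";
      s.data

    let () =
      let msg = publish (Bytes.of_string "payload") 3 in
      for _ = 1 to 3 do ignore (consume msg) done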
Improving Functional Topology Skeletons with Dynamic Channels
Jost Berthold and Rita Loogen
Parallel functional programs with implicit communication often
generate purely hierarchical communication topologies during
execution: communication only happens between parent and child
processes. Messages between siblings must be passed via the
parent. This causes inefficiencies that can be avoided by enabling
direct communication between arbitrary processes. The Eden
parallel functional language provides dynamic channels to
implement arbitrary communication topologies. This paper analyses
the impact of dynamic channels on Eden's topology skeletons,
i.e. skeletons which define process topologies such as rings,
toroids, or hypercubes. We compare topology skeletons with and
without dynamic channels with respect to the number of
messages. Our case studies confirm that dynamic channels usually
decrease the number of messages by up to 50% and can reduce
runtime by up to 40%. Detailed analysis of Eden TV (trace viewer)
execution profiles reveals the reasons for these substantial
runtime gains.
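As a rough illustration of where the 50% figure comes from (our
gloss, not an analysis taken from the paper): if every message
between siblings must be relayed through the parent, each logical
message costs two physical ones (child to parent, then parent to
child), so replacing relayed traffic with direct dynamic channels can
at best halve the message count.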
On Implementing the Farm Skeleton
Michael Poldner and Herbert Kuchen
Algorithmic skeletons are intended to simplify parallel programming
by providing a higher level of abstraction than the usual message
passing. Task-parallel and data-parallel skeletons can be
distinguished. In the present paper, we will consider several
approaches to implement one of the most classical task parallel
skeletons, namely the farm, and compare them w.r.t. scalability,
overhead, potential bottlenecks, and load balancing. We will also
investigate several communication modes for the implementation of
skeletons. Based on experimental results, the advantages and
disadvantages of the different approaches are shown. Moreover, we
will show how to terminate the system of processes properly.
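For readers unfamiliar with the skeleton, here is a minimal farm in
OCaml 5: workers pull tasks from a shared pool, which gives implicit
load balancing, and terminate when the pool drains. This is a
hypothetical stand-in for illustration, not one of the implementation
variants the paper compares; result order is nondeterministic.

    let farm ~workers (f : 'a -> 'b) (tasks : 'a list) : 'b list =
      let pool = Queue.create () and lock = Mutex.create () in
      List.iter (fun t -> Queue.add t pool) tasks;
      let results = ref [] in
      let worker () =
        let rec loop () =
          Mutex.lock lock;
          let task = Queue.take_opt pool in
          Mutex.unlock lock;
          match task with
          | None -> ()                        (* pool empty: terminate *)
          | Some t ->
            let r = f t in
            Mutex.lock lock; results := r :: !results; Mutex.unlock lock;
            loop ()
        in
        loop ()
      in
      List.init workers (fun _ -> Domain.spawn worker)
      |> List.iter Domain.join;
      !results

    let () =
      farm ~workers:4 (fun n -> n * n) [1; 2; 3; 4; 5; 6; 7; 8]
      |> List.iter (Printf.printf "%d ");
      print_newline ()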
Algorithmic Skeleton Tools and Libraries
High Level Parallel Skeletons for Dynamic Programming
Ignacio Pelaez, Francisco Almeida, Daniel Gonzalez
Dynamic Programming is an important problem-solving technique used
for solving a wide variety of optimization problems. Dynamic
Programming programs are commonly designed as individual
applications, and software tools are usually tailored to specific
classes of recurrences and methodologies. This contrasts with some
other algorithmic techniques where a single generic program may
solve all the instances. We have developed a general skeleton tool
providing support for a wide range of dynamic programming
methodologies on different parallel paradigms. Genericity,
extensibility and efficiency are basic issues of the design
strategy. Parallelism is supplied to the user in a transparent
manner through a common sequential interface. A set of test
problems representative of different classes of Dynamic Programming
formulations has been used to validate our skeleton on an IBM SP.
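To convey the kind of interface such a tool offers, here is a small
sequential memoising skeleton in which the user supplies only the
recurrence. The names and the single-integer subproblem index are
assumptions of this sketch; the paper's tool additionally
parallelises the table evaluation and supports richer recurrence
classes.

    (* The recurrence receives the solver itself for subproblem
       lookups; the table is filled on demand. *)
    let dp_solve (recurrence : (int -> int) -> int -> int) (goal : int) =
      let table = Hashtbl.create 64 in
      let rec solve i =
        match Hashtbl.find_opt table i with
        | Some v -> v                      (* already tabulated *)
        | None ->
          let v = recurrence solve i in
          Hashtbl.add table i v;
          v
      in
      solve goal

    (* Example instance: the Fibonacci recurrence as a trivial DP. *)
    let () =
      Printf.printf "%d\n"
        (dp_solve (fun f i -> if i < 2 then i else f (i - 1) + f (i - 2)) 30)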
Skeletal Parallel Programming with OcamlP3L 2.0
R. Di Cosmo, Z. Li, S. Pelagatti, P. Weis
Parallel programming has proved to be an effective technique
to improve the performance of computationally intensive
programs. However, writing parallel programs is not easy, as new
issues need to be addressed, and activities such as debugging are
usually hard and time consuming. To cope with these difficulties,
skeletal parallel programming has been widely explored in recent
years with very promising results. However, prototypal skeletal
systems developed so far tend to be rather inflexible and
difficult to adapt to many practical parallelization scenarios.
For instance, many systems require all the parallel structure of
an application to be expressed in terms of possibly nested
skeletons, which may be hard when parallelizing a large complex
application. Moreover, it is usually difficult to share resources
among different skeleton instances and to reuse similar instances
of the same skeleton without duplicating code. This paper
describes OcamlP3L 2.0, which significantly changes the skeletal
model of version 1.0 to make the system more usable and flexible. We
report on the current status of the OcamlP3L system. In
particular, we describe the new skeletons and the skeletal
execution model, how applications can be structured using the
available skeletons, and how skeletons are compiled and optimized,
and we provide some small examples.
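As a flavour of the skeletal style, here is a purely sequential model
of skeleton nesting expressed as OCaml combinators. The combinator
names echo common P3L skeletons, but these definitions are
illustrative assumptions: they ignore the parallel semantics and do
not match the actual OcamlP3L signatures.

    let seq f = f                        (* wrap a sequential function *)
    let pipeline f g = fun x -> g (f x)  (* stage f feeds stage g *)
    let farm f = f                       (* workers replicated in real runs *)
    let mapvector f = Array.map f        (* data-parallel map over a vector *)

    (* Nesting skeletons: a farm of (+1) workers feeding a (*2) stage. *)
    let app = pipeline (farm (seq (fun x -> x + 1))) (seq (fun x -> x * 2))

    let () =
      mapvector app [| 1; 2; 3 |]
      |> Array.iter (Printf.printf "%d ");   (* prints: 4 6 8 *)
      print_newline ()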
Performance Models
Evaluating Computational Costs While Handling Data and Control
Parallelism
Sonia Campa
The aim of this work is to introduce a computational cost system
associated with a semantic framework for handling data and control
parallelism orthogonally. In this framework a parallel application
is described by a semantic expression combining data-access and
control-parallelism abstractions in an orthogonal manner. The
evaluation of such an expression is driven by a set of rewriting
rules, each of which is associated with a computational cost. We
show how to evaluate the final cost of the application, and how
this information, together with the capabilities of the semantic
framework, can be exploited to increase overall performance.
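A toy version of the idea, with the usual caveat that the concrete
syntax, rewriting rules, and cost combinators of the paper's calculus
differ: each constructor of an expression carries a rule for
combining the costs of its parts.

    type expr =
      | Work of int           (* local computation with a known cost *)
      | Seq of expr * expr    (* sequential composition *)
      | Par of expr * expr    (* parallel composition *)

    let rec cost = function
      | Work w -> w
      | Seq (a, b) -> cost a + cost b        (* costs add up in sequence *)
      | Par (a, b) -> max (cost a) (cost b)  (* slower branch dominates *)

    let () =
      Printf.printf "cost = %d\n"
        (cost (Seq (Par (Work 4, Work 6), Work 2)))
      (* max(4, 6) + 2 = 8 *)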
Automatic deployment of ASSIST applications using process algebra
Marco Aldinucci and Anne Benoit
The majority of real-world applications tend to be demanding in
both computing power and storage requirements. Grid computing
facilitates the development of large scientific applications on an
unprecedented scale. The development of grid applications can be
simplified by using high-level programming environments.
Moreover, the use of such environments allows the programmer to
optimise the performance of the application for both static and
dynamic considerations. Thus, in the ASSIST environment, a dynamic
management of the application can be performed, such that the
application adapts itself to the changing availability and
performance of the resources, which is specific to grid
environments. In the present work, we address the problem of
application deployment by introducing performance models of ASSIST
applications based on process algebra. These models are automatically
generated before deployment, and thus they allow us to optimise
the static deployment of the application. Results are shown to
demonstrate the efficiency of this approach. Our methodology is
presented through an example of a classical Divide-and-Conquer
algorithm.
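The optimisation step can be pictured as follows: evaluate a
performance model for every candidate deployment and keep the best
one. The model below (a pipeline bounded by its slowest stage) is a
deliberately simplistic assumption of this sketch, standing in for
the process-algebra models the paper generates automatically.

    (* Predicted throughput of one candidate mapping, given the service
       rate of each stage under that mapping. *)
    let predicted_rate (stage_rates : float list) : float =
      List.fold_left min infinity stage_rates  (* slowest stage dominates *)

    let best_deployment (candidates : float list list) =
      List.fold_left
        (fun best c ->
           match best with
           | Some (_, r) when r >= predicted_rate c -> best
           | _ -> Some (c, predicted_rate c))
        None candidates

    let () =
      match best_deployment [ [2.0; 3.0]; [2.5; 2.5] ] with
      | None -> ()
      | Some (m, r) ->
        Printf.printf "best rate %.2f for [%s]\n" r
          (String.concat "; " (List.map string_of_float m))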