HLPP 2003: Second International Workshop on High-Level Parallel Programming and Applications

15-17 June 2003, Paris, France

Invited Lecture

The Real-Time Revolution

Bill McColl
Professor of Computer Science, Oxford University
Founder and CTO, Sychron Inc

Like Moore's Law, the Personal Computer, and the Internet in previous decades, two major new megatrends will shape the development of the mainstream computing industry in the coming decade: (1) the acceleration of business processes to real-time levels of responsiveness in order to achieve and maintain competitive advantage, (2) the shift to a computing model where computer power is delivered on demand, in the form of a utility, from vast server pools. Billions of dollars are now being invested in new R&D initiatives to solve the major challenges that these two massive new requirements (and new opportunities) present. In particular, there is a need to understand how the required global awareness, instantaneous agility, and optimized resource allocation might be achieved across massive server pools. In this talk I will explain how certain ideas from the world of scalable, predictable parallel computing can be adapted to solve these major problems in large scale distributed server pool architectures.

Accepted papers

Eskimo : Experimenting Skeletons on the Shared Address Model
Marco Aldinucci

We discuss the lack of expressivity in some skeleton-based parallel programming frameworks. The problem is further exacerbated when approaching irregular problems and dealing with dynamic data structures. Shared memory programming has been argued to have substantial ease of programming advantages for this class of problems. We present "eskimo" library which represents an attempt to merge the two programming models by introducing skeletons in a shared memory framework.
Pipelining a Skew-insensitive Parallel Join Algorithm
Mostafa Bamha, Matthieu Exbrayat

Most standard parallel join algorithms try to overcome data skews with a relatively static approach. The way they distribute data (and then computation) over nodes depends on a data re-distribution algorithm (hashing or range partitioning) that is determined before the actual join begins. On the contrary we choose to pre-scan data in order to choose an efficient join method for each given value of the join attribute. This approach has already proved to be efficient both theoretically and practically in our previous papers.
In this paper we introduce a new pipelined version of our frequency adaptive join algorithm. The use of pipelining offers flexible strategies for resource allocation while avoiding unnecessary disk input/output of intermediate join result when computing multi-join queries. We present a detailed version of the algorithm and a cost analysis based on the BSP model, showing that our pipelined algorithm achieves noticeable improvements compared to the one-step version. We thus show that the frequency adaptive approach remains efficient for pipelined multi-join queries.
A Portable Implementation for Objective Caml Flight
Emmanuel Chailloux

Caml-Flight is a data-parallel extension for Caml-Light. It is an intermediate, portable and mostly deterministic functional language for distributed memory multiprocessors (DMM). Its current implementation modifies the Caml-Light compiler to introduce new instructions for the language and its virtual machine, and to define a kind of persistence to communicate values. Caml-Light has now evolved to become Objective Caml, including an object extension and a native compiler. To benefit from this new distribution, we propose a new implementation called Objective Caml Flight. To do so we use new libraries and tools included in Objective Caml such as multi-threading, persistence and parsers, i.e. without modifying to the original compiler needed with the Caml-Flight approach. This paper presents an easier way to implement parallel extensions for Objective Caml and an actual realization called Objective Caml Flight. This new implementation is compared to Caml-FLight.
Top-Down Design of Bulk-Synchronous Parallel Programs
Yifeng Chen, J. W. Sanders

This paper studies top-down program development techniques for Bulk-Synchronous Parallelism. We use a specification language called Logic of Global Synchrony for high-level specifications. Reactive processes, unbounded nondeterminism and general recursion are supported. A number of algebraic laws of BSP have been identified, sufficient to transform any finite program to normal form. The transformation from high-level specifications to BSP programs is supported with a number of refinement laws. Any specification that causes potential communication interference is infeasible, a property that is decidable. An important novel feature is the support provided for protection of local variables in BSP. The foregoing techniques are applied to the specification and refinement of the dining-philosopher problem.
A Calculus for Dense Array Distributions
Roberto Di Cosmo, Susanna Pelagatti
We introduce a very simple, yet powerful, distribution calculus aimed at describing different distribution strategies that can be used to distribute multidimensional dense arrays over a set of processor, like what is done in the implementation of the P3L map skeleton. We give a formal semantics, that allows to prove equations between distributions, and we show how to associate a cost to such distributions, allowing to choose between semantically equivalent distribution strategies.
llc: A parallel skeletal language
Antonio J. Dorta, Jesus A. Gonzalez, Casiono Rodriguez and Francisco de Sande

The skeletal approach to the development of parallel applications has been revealed to be one of the most successful and has been widely explored in the recent years. The goal of this approach is to develop a methodology of parallel programming based on a restricted set of parallel constructs.

This paper presents llc, a parallel skeletal language, the theoretical model that gives support to the language and a prototype implementation for its compiler. The language is based on directives, uses a C-like syntax and gives support to the most widely used skeletal constructs. llCoMP is a source to source compiler for the language built on top of MPI. We evaluate the performance of our prototype compiler using four different parallel architectures and five algorithms. We present the results obtained in both shared and distributed memory architectures. Our model guarantees the portability of the language to any platform and its simplicity greatly eases its implementation.
Formal Proofs of Functional BSP Programs
Frederic Gava

The Bulk Synchronous Parallel ML (BSML) is a functional language for BSP programming, a model of computing which allows parallel programs to be ported to a wide range of architectures. It is based on an extension of the ML language by parallel operations on a parallel data structure named parallel vector, which is given by intention. We present a new approach to certifying BSML programs in the context of Type Theory. Given a specification and a program, an incomplete proof of the specification (of which algorithmic contents correspond to the given programs) is built in the Type Theory, of which gaps would correspond to the proof obligation. This development demonstrates the usefulness of higher-order logic in the process of software certification of parallel applications. It also shows that the proof of rather complex parallel algorithms may be done with inductive types without great difficulty by using already existing certified programs. This work has been implemented in the Coq Proof Assistant, applied on non-trivial examples and is the basis of a certified library of BSML programs.
Automatic Skeletons in Template Haskell
Kevin Hammond, Jost Berthold, Rita Loogen

This paper introduces the use of Template-Haskell for automatically selecting appropriate skeleton implementations in the Eden parallel dialect of Haskell. The approach allows implementation parameters to be statically tuned according to architectural cost models, permits us to target a range of parallel architecture classes ranging from hyperthreaded uniprocessors to computational Grids, and allows the use of nested skeletons at the programmer level without these impacting on execution performance.
High Level Parallel Software Development with Python and BSP
Konrad Hinsen

One of the main obstacles to a more widespread use of parallel computing in computational science is the difficulty of implementing, testing, and maintaining parallel programs. The combination of a simple parallel computation model, BSP, and a high-level programming language, Python, simplifies these tasks significantly. It allows the rapid development facilities of Python to be applied to parallel programs, providing interactive development as well as interactive debugging of parallel programs.
Aligned Multithreaded Computations and Performance Guarantees
Xie Yong, Hsu Wen-Jing

This paper considers the problem of scheduling of dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single processor execution. Earlier research in the Cilk project proposed the "strict" computational model, and guaranteed both linear speedup and linear expansion of space. However, Cilk's model has several limitations. Cilk threads are stateless, and the task graph that Cilk language expresses is series-parallel graph, which is a proper subset of arbitrary task graph. Cilk does not support applications with pipelining.
We propose the "aligned" multithreaded computational model, which extends the "strict" computational model in Cilk. In the aligned multithreaded computational model, dependencies can go from arbitrary thread x not only to x's ancestor threads, but also to x's younger brother threads, that are spawned by x's parent thread but after x. We use the same measures of time and space as those used in Cilk: T1 is the time required for executing the computation on 1 processor, T8 is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. We show that for any aligned computation, there exists an execution schedule that achieves both efficient time and efficient space. Specifically, we show that for an execution of any aligned multithreaded computation on P processors, the time required is bounded by O(T1/P+T8) and the space required can be loosely bounded by O(h?S1P) where h is the height of the spawn tree of threads, and ? is the maximum number of younger brother threads that have the same parent thread and can be blocked during execution. If we assume ? is a constant, and the space requirements for elder and younger brother threads are the same, then the space required would be bounded by O(S1P).

Based on the aligned multithreaded computational model, we show that the aligned multithreaded computational model supports pipelining applications. Further, we propose a multithreaded programming language and show that it can express arbitrary task graph.
DatTel: A Data-parallel C++ Template Library
Holger Bischof, Sergei Gorlatch, and Roman Leshchinskiy

The concept of C++ templates, realized in the Standard Template Library (STL), allows development of generic programs suitable for various concrete data structures. Our aim in this paper is to provide an opportunity for efficient execution of STL programs on parallel machines. We present DatTeL, a data-parallel library, which allows simple, efficient programming for various parallel architectures while staying within the paradigm of classical C++ template programming. We describe the principles of our approach and explain our design decisions and their implementation in the library. The presentation is illustrated with a case study | parallelization of a generic algorithm for carry-lookahead addition. We compare DatTeL to related work and report experimental results for the current version of our library.
Extending OpenMP for Task Parallelism
Ami Marowka

In a wide variety of scientific parallel applications, both task and data parallelism must be exploited to achieve the best possible performance on a multiprocessor machine. These applications induce task-graph parallelism with coarse-grain granularity. Nevertheless, using the available task-graph parallelism and combining it with data parallelism can increase the performance of parallel applications considerably since an additional degree of parallelism is exploited.

The OpenMP standard supports data-parallelism but does not support task-graph parallelism. In this paper we present an integration of task-graph parallelism in OpenMP by extending the parallel sections constructs to include task-index and precedence-relations matrix clauses. There are many ways in which task-graph parallelism can be supported in a programming environment. A fundamental design decision is whether the programmer has to write programs with explicit precedence relations, or if the responsibility of precedence relations generation is delegated to the compiler. One of the benefits provided by parallel programming models like OpenMP is that they liberate the programmer from dealing with the underlying details of communication and synchronization, which are cumbersome and error-prone tasks. If task-graph parallelism is to find acceptance, writing task-graph parallel programs must be no harder than writing data parallel programs, and therefore, in our design, all precedence relations are generated by the compiler.

This paper concludes with a description of several parallel application kernels that were developed to study the practical aspects of task-graph parallelism in OpenMP. The examples demonstrate that exploiting data and task parallelism in a single framework is the key to achieving good performance in a variety of applications.
SaC - From High-Level Programming with Arrays to Efficient Parallel Execution
Clemens Grelck, Sven-Bodo Scholz

SaC is a purely functional array processing language designed with numerical applications in mind. It supports generic, high-level program specifications in the style of APL. However, rather than providing a fixed set of built-in array operations, SaC provides means to specify such operations in the language itself in a way that still allows their application to arrays of any rank and size. This paper illustrates the major steps in compiling generic, rank- and shape-invariant SaC specifications into efficiently executable multithreaded code for parallel execution on shared memory multiprocessors. The effectiveness of the compilation techniques is demonstrated by means of a small case study on the PDE1 benchmark, which implements 3-dimensional red/black successive over-relaxation. Comparisons with HPF and ZPL show that despite the genericity of code, SaC achieves highly competitive runtime performance characteristics.
BSPGRID: Variable Resources Parallel Computation and Multiprogrammed Parallelism
Vasil Vasilev

We introduce a new framework for the design of parallel algorithms that may be executed on multiprogrammed architectures with variable resources. These features, in combination with an implied ability to handle fault tolerance, facilitates environments such as the GRID. A new model, BSPGRID is presented, which exploits the bulk synchronous paradigm to allow existing algorithms to be easily adapted and used. It models computation, communication, external memory accesses (I/O) and synchronization. By combining the communication and I/O operations BSPGRID allows the easy design of portable algorithms whilst permitting them to execute on non-dedicated hardware and/or changing resources, which is typical for machines in a GRID. However, even with this degree of dynamicity, the model still offers a simple and tractable cost model. Each program runs in its own virtual BSPGRID machine. Their emulation on a real computer is demonstrated to show the practicality of the framework. A dense matrix multiplication algorithm and its emulation in a multiprogrammed environment is presented as an example.