See reference paper for description.
The sort function is named bin_sort() instead of upc_all_sort() so it can coexist with the original sorting function in the upc collectives. A compiled version of the upc collectives library is included with the source code.
Implementing Sort in UPC: Performance and Optimization
Technical Report 05-06, Michigan Technological University, Department of
Computer Science (2005)
Author(s): K. Begum and S. Seidel
Abstract: Unified Parallel C (UPC) is a parallel extension of ANSI C that is based on a partitioned
shared address space. It supports development of parallel applications over diverse hardware
platforms and does not require a true shared memory system. It simplifies programming by
providing the same syntax for local references as for remote references. This helps the user
focus on performance, and allows for the exploitation of locality. Our project is to implement
a generic sort function in UPC. It will help programmers focus on design, implementation and
performance of the entire application rather than work on details of sorting.
upc_all_sort is an implementation of a generic sort function in UPC using a bin sort algorithm.
Techniques applied to improve the performance of the sort function include: sampling
of data to ensure more nearly uniform bucket sizes among threads, an all-to-all exchange of
bucket sizes to minimize the amount of dynamically allocated memory, rebalancing data at
the end of the sort to ensure that each thread finishes with the same number of elements as it
started with. In this paper, we analyze and compare the performance of sequential and parallel
sort, the potential of performance improvements over sequential sort and also the bottlenecks
limiting these improvements. This sort function runs 3.9 times faster than the sequential sort
for problem sizes of at least 128K when run on 8 threads. The performance scales up with the
increasing number of threads and bigger problem sizes. It runs 6.8 times faster than sequential
sort for problem size of 512K when run on 16 threads.
(
pdf |
ps )