Scalable Runtime Support for Data-Intensive Applications on the Single-Chip Cloud Computer

Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Institute of Computer Science (ICS), Foundation for Research and Technology Hellas (FORTH)
GR 70013, Heraklion, Crete, GREECE
{apapag,dsn}@ics.forth.gr
3rd MARC Symposium, 2011

Outline
- Intel Single-Chip-Cloud
- Execution Time Breakdowns
- SCC vs. Cell BE

Motivation and Contributions
We are in the transition from multi-core to many-core processors.
Programmers have to deal with:
- many cores
- many forms of implicit or explicit communication
- many forms of synchronization
- potential lack of cache coherence
Contributions of this work:
- First implementation of a high-level, domain-specific parallel programming model (Google's MapReduce) on a cache-based many-core processor with no cache coherence, based on explicit communication (SCC)
- Evaluation showing that the Intel SCC effectively supports:
  - high-level programming models that hide communication, synchronization, and parallelization under the hood
  - scalable execution of data-intensive applications


Intel SCC
- Many-core processor with 24 tiles, 2 IA cores per tile
- Tiles organized in a 4×6 mesh network with 256 GB/s bisection bandwidth
- Per core: private 16 KB L1 instruction cache, private 16 KB L1 data cache, private unified 256 KB L2 cache
- 16 KB message passing buffer (MPB) per tile (the only on-chip memory shared between cores)
[Figure: SCC block diagram. A mesh of routers (R) connects the tiles to four DDR memory controllers (MC), the VRC, and the system interface. Each tile holds two P54C cores (16 KB each L1), a 256 KB L2 per core, an MIU, a traffic generator, and the message passing buffer.]


MapReduce
- A framework for large-scale data processing
- Programming model (API) and runtime system for a variety of parallel architectures: clusters, SMPs, multi-cores, GPUs, among others
- Based on functional programming language primitives
- Used extensively in real applications: indexing systems, distributed grep, document clustering, machine learning, statistical machine translation
- Relies heavily on a scalable runtime system: fault tolerance, parallelization, scheduling, synchronization, and communication

Example: counting word occurrences in a set of documents
Input: Sally sells sea shells by the sea shore
Map: (sally, 1) (sells, 1) (sea, 1) (shells, 1) (by, 1) (the, 1) (sea, 1) (shore, 1)
Group By Key: (by, 1) (sally, 1) (sea, 1:1) (sells, 1) (shells, 1) (the, 1) (shore, 1)
Reduce: (by, 1) (sally, 1) (sea, 2) (sells, 1) (shells, 1) (the, 1) (shore, 1)
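The three steps of the word-count example can be sketched in a few lines of Python. This is only an illustration of the programming model on a single process, not the SCC runtime; the function names `map_words`, `group_by_key`, and `reduce_counts` are hypothetical.

```python
from collections import defaultdict

def map_words(text):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for word in text.split()]

def group_by_key(pairs):
    """Group By Key: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_counts(groups):
    """Reduce: sum the list of values attached to each key."""
    return {key: sum(values) for key, values in groups.items()}

pairs = map_words("Sally sells sea shells by the sea shore")
groups = group_by_key(pairs)
counts = reduce_counts(groups)
```

Here `groups["sea"]` holds the two values emitted for "sea", and `counts["sea"]` is their sum.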


Seven-stage runtime system for MapReduce:
1. Map
2. Combine (optional)
3. Partition
4. Group
5. Reduce
6. Sort (optional)
7. Merge (optional)


Map
[Figure: core 0 maps the input "by the the by" to (by,1) (by,1) (the,1) (the,1); core 1 maps "by the by the by" to (by,1) (by,1) (by,1) (the,1) (the,1).]
- Each core executes the user-defined map function on chunks of input data located in local memory
- The map function emits one or more intermediate key-value pairs

Map (continued)
- Intermediate key-value pairs are stored in a contiguous buffer
- The runtime preallocates large chunks of memory (64 MB) for intermediate data buffers
- More buffering space is allocated on demand, if needed
- This allocation strategy reduces memory management overhead
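A minimal sketch of such a buffer, assuming an append-only byte buffer that is extended by one whole chunk whenever a record does not fit. The class name `IntermediateBuffer`, the record encoding, and the `grow_count` counter are hypothetical; the usage below uses a 16-byte chunk so that the growth path is visible without allocating 64 MB.

```python
class IntermediateBuffer:
    """Contiguous append-only buffer that grows one chunk at a time."""

    def __init__(self, chunk_size=64 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.data = bytearray(chunk_size)  # preallocate one chunk up front
        self.used = 0                      # bytes written so far
        self.grow_count = 0                # number of on-demand extensions

    def append(self, record):
        # Extend by whole chunks, and only when the record does not fit.
        while self.used + len(record) > len(self.data):
            self.data.extend(bytes(self.chunk_size))
            self.grow_count += 1
        self.data[self.used:self.used + len(record)] = record
        self.used += len(record)

# Tiny chunk size for illustration only.
buf = IntermediateBuffer(chunk_size=16)
for record in (b"by,1;", b"by,1;", b"the,1;", b"the,1;"):
    buf.append(record)
```

The first three records fill the initial 16-byte chunk exactly; the fourth triggers a single extension, so most appends cost one bounds check and one copy.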

Map (continued)
- Each core produces as many intermediate data partitions as the total number of cores
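One way to picture this is a map function that writes each emitted pair directly into the partition of its destination core. The slides do not say how keys are assigned to partitions; the sketch below assumes a stable hash of the key modulo the core count, and `partition_of` and `map_into_partitions` are hypothetical names.

```python
import zlib

def partition_of(key, num_cores):
    """Assumed partitioner: stable hash of the key modulo the core count."""
    return zlib.crc32(key.encode("utf-8")) % num_cores

def map_into_partitions(words, num_cores):
    """Emit (word, 1) pairs into one partition per destination core."""
    partitions = [[] for _ in range(num_cores)]
    for word in words:
        partitions[partition_of(word, num_cores)].append((word, 1))
    return partitions

# Two cores, each mapping its own chunk of the input.
core0 = map_into_partitions(["by", "the", "the", "by"], num_cores=2)
core1 = map_into_partitions(["by", "the", "by", "the", "by"], num_cores=2)
```

Because every core uses the same partitioner, all pairs for a given key end up in the same partition index on every core, which is what lets the later exchange route them to a single destination.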

Combine
[Figure: each core combines its partitions locally. Core 0: (by,1) (by,1) becomes (by,2) and (the,1) (the,1) becomes (the,2); core 1: (by,1) (by,1) (by,1) becomes (by,3) and (the,1) (the,1) becomes (the,2).]
- Optional stage, executed if the user provides a combiner function
- Locally reduces the size of each partition produced during the map stage

Combine (continued)
- Takes as input a key and a list of partially aggregated intermediate values associated with that key
- Produces a new intermediate key-value pair based on the intermediate key and its corresponding list of values
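The combine step from the figure can be sketched as follows, assuming a summing combiner for word count. The names `combine_sum` and `combine_partition` are hypothetical, and the sketch runs on plain Python lists rather than on the runtime's buffers.

```python
def combine_sum(key, values):
    """User-provided combiner: fold a list of partial counts into one pair."""
    return (key, sum(values))

def combine_partition(partition, combiner):
    """Apply the combiner locally to one partition produced by the map stage."""
    grouped = {}
    for key, value in partition:
        grouped.setdefault(key, []).append(value)
    return [combiner(key, values) for key, values in grouped.items()]

# Partitions held by each core after the map stage, as in the figure.
core0 = [[("by", 1), ("by", 1)], [("the", 1), ("the", 1)]]
core1 = [[("by", 1), ("by", 1), ("by", 1)], [("the", 1), ("the", 1)]]

core0_combined = [combine_partition(part, combine_sum) for part in core0]
core1_combined = [combine_partition(part, combine_sum) for part in core1]
```

Each partition shrinks to one pair per distinct key before any data leaves the core, which is why the combiner reduces the volume moved in the partition stage.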

Partition
[Figure: all-to-all exchange among partitions P0 to P3 over iterations 0, 1, and 2.]
- Requires an all-to-all exchange between cores
- Data partitions generated during the map stage may differ in size
- First, execute an all-to-all exchange of the size of each partition
- Knowing the size of each partition, execute a second all-to-all exchange with the actual data
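The two-phase exchange can be simulated in one process to show why the sizes travel first: each destination can then allocate its receive buffer exactly once. This is a sketch under the assumption that partitions are byte strings; `exchange_sizes` and `exchange_data` are hypothetical names, and no real message passing takes place.

```python
def exchange_sizes(outgoing):
    """Phase one. outgoing[src][dst] is the payload src holds for dst.
    Returns sizes[dst][src], the sizes each destination learns."""
    p = len(outgoing)
    return [[len(outgoing[src][dst]) for src in range(p)] for dst in range(p)]

def exchange_data(outgoing, sizes):
    """Phase two. Copy each payload into a receive buffer sized in phase one."""
    p = len(outgoing)
    received = []
    for dst in range(p):
        buf = bytearray(sum(sizes[dst]))  # exact size, no reallocation
        offset = 0
        for src in range(p):
            n = sizes[dst][src]
            buf[offset:offset + n] = outgoing[src][dst]
            offset += n
        received.append(bytes(buf))
    return received

# Two cores with partitions of different sizes.
outgoing = [[b"aa", b"b"], [b"", b"cccc"]]
sizes = exchange_sizes(outgoing)
received = exchange_data(outgoing, sizes)
```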

Partition (continued)
Let p be the number of available cores and rank the core ID. The algorithm uses p - 1 steps; in each step k, core rank receives data from core rank - k and sends data to core rank + k (core IDs taken modulo p).
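The communication pattern of this algorithm can be enumerated directly. The sketch below only computes who talks to whom in each step, assuming core IDs wrap around modulo p; `exchange_schedule` is a hypothetical helper, not part of the runtime.

```python
def exchange_schedule(p):
    """For each step k = 1 .. p-1, list (rank, recv_from, send_to) per core."""
    return [
        [(rank, (rank - k) % p, (rank + k) % p) for rank in range(p)]
        for k in range(1, p)
    ]

schedule = exchange_schedule(4)
for k, step in enumerate(schedule, start=1):
    for rank, recv_from, send_to in step:
        print(f"step {k}: core {rank} receives from {recv_from}, sends to {send_to}")
```

In every step each core sends to exactly one peer and receives from exactly one peer, and after p - 1 steps every ordered pair of distinct cores has exchanged data once.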

Group
- Groups all (key, value) pairs with the same key
- Uses radix sort instead of a conventional merge sort
- Radix sort sorts strings of bytes and cannot use a user-defined comparator
- If radix sort does not sort the native application type, the output is sorted using a user-specified compare function
- Conventional sorting algorithms have complexity O(n log n); radix sort has complexity O(kn), where k is the size of the key in bytes
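A minimal sketch of grouping by radix sort, assuming fixed-width byte keys (shorter keys are zero-padded) and a least-significant-byte-first variant; the slides do not specify which radix sort variant the runtime uses. The helper names `pad_key`, `radix_sort_pairs`, and `group_sorted` are hypothetical.

```python
from itertools import groupby

def pad_key(word, width):
    """Encode a key as exactly `width` bytes, zero-padded on the right."""
    return word.encode("utf-8").ljust(width, b"\0")

def radix_sort_pairs(pairs, key_width):
    """LSD radix sort of (key_bytes, value) pairs: k stable passes over n items."""
    for pos in range(key_width - 1, -1, -1):
        buckets = [[] for _ in range(256)]
        for pair in pairs:
            buckets[pair[0][pos]].append(pair)  # bucket by the byte at `pos`
        pairs = [pair for bucket in buckets for pair in bucket]
    return pairs

def group_sorted(pairs):
    """Collapse runs of equal keys into (key, [values]) entries."""
    return [(key, [value for _, value in run])
            for key, run in groupby(pairs, key=lambda kv: kv[0])]

pairs = [(pad_key(w, 3), v) for w, v in
         [("the", 2), ("by", 2), ("the", 2), ("by", 3)]]
grouped = group_sorted(radix_sort_pairs(pairs, key_width=3))
```

The sort never compares two keys with a user function; it only indexes bytes, which is why a separate comparator-based pass is needed when byte order does not match the application's ordering.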

Reduce
[Figure: core 0 reduces the values collected for "by" (2 and 3); core 1 reduces the values collected for "the" (2 and 2).]
- The group stage exports distinct keys, each with a list of corresponding values
- The reduce stage executes the user-defined aggregation function on each (key, list of values) pair

Reduce (continued)
- The reduce function emits one or more output key-value pairs
- The total output size is known prior to reduction, so the output buffer is preallocated
- This minimizes memory management overhead
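A sketch of a reduce stage writing into a preallocated output, under the simplifying assumption that the reducer emits exactly one pair per key so the output size equals the number of distinct keys. `reduce_sum` and `run_reduce` are hypothetical names.

```python
def reduce_sum(key, values):
    """User-defined aggregation: total of all values for one key."""
    return (key, sum(values))

def run_reduce(groups, reducer):
    """Fill a preallocated output with one result per (key, values) group."""
    output = [None] * len(groups)  # size known before reduction starts
    for slot, (key, values) in enumerate(groups):
        output[slot] = reducer(key, values)
    return output

result = run_reduce([("by", [2, 3]), ("the", [2, 2])], reduce_sum)
```

Since every slot is written exactly once, the loop never appends or resizes, which mirrors the point about avoiding memory management during reduction.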

Sort and Merge
[Figure: merge tree. Step 0: P0, P1, P2, P3; step 1: P0, P2; step 2: P0, which holds the output buffer.]
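The figure suggests a binary merge tree in which half of the remaining cores hand their sorted output to a partner in each step. The sketch below simulates that reading in one process, assuming each core starts with an already sorted run; `tree_merge` is a hypothetical name and the pairing rule (rank merges from rank + stride) is an assumption drawn from the figure.

```python
from heapq import merge

def tree_merge(sorted_runs):
    """Pairwise merge of per-core sorted runs down to core 0.
    Returns the merged output and the set of active cores after each step."""
    runs = {rank: list(run) for rank, run in enumerate(sorted_runs)}
    steps = [sorted(runs)]
    stride = 1
    while len(runs) > 1:
        for rank in sorted(runs):
            partner = rank + stride
            if rank % (2 * stride) == 0 and partner in runs:
                # The surviving core absorbs its partner's sorted run.
                runs[rank] = list(merge(runs[rank], runs.pop(partner)))
        stride *= 2
        steps.append(sorted(runs))
    return runs[0], steps

output, steps = tree_merge([[1, 5], [2, 6], [0, 7], [3, 4]])
```

With four cores, `steps` lists the active cores as P0 to P3, then P0 and P2, then P0 alone, matching the three steps in the figure.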


Benchmarks:
- Histogram (partition-dominated): counts the frequency of occurrences of each RGB color component in an image file
- Word Count (partition-dominated): counts the number of occurrences of each word in a text file
- Kmeans (map-dominated): creates clusters from a set of data points
- Linear Regression (map-dominated): computes a line of best fit for a set of points, given their 2D coordinates
Configuration:
- Cores run at 533 MHz
- Mesh interconnect runs at 800 MHz
- DRAM runs at 800 MHz

[Figure: two panels, with combiner and without combiner, plotting Histogram, WordCount, KMeans, Linear Regression, and the ideal curve against the number of cores (0 to 50); vertical axis ticks from 0 to 16.]
- The combiner function improves scalability
- Kmeans and Linear Regression are map-dominated benchmarks
- Superlinear speedup arises because the complexity of the group stage decreases exponentially with the number of cores

Execution Time Breakdowns
[Figure: stacked bars of execution time in seconds (0 to 8) for Histogram, KMeans, WordCount, and LinearRegression, broken down into Map, Combine, Partition, Group, Reduce, Sort, and Merge. Left bars with combiner, right bars without combiner.]
- Using a combiner function reduces execution time
- The partition stage does not scale
- The combiner minimizes total partition time and group time

SCC vs. Cell BE
[Figure: execution time in seconds (0 to 50) versus number of cores (0 to 50) for SCC with combiner, SCC without combiner, Cell blade with 1 processor, and Cell blade with 2 processors.]
- The QS22 blade consists of 2 Cell BE processors at 3.2 GHz
- Each processor has 8 SPEs (accelerators)
- WordCount benchmark with 60 MB input size
- A single SCC node outperforms the dual-Cell blade by up to 1.87x

Related Work
- Other ports of MapReduce on clusters, SMPs, multicores, and GPUs (HPCA07, PACT08, IISWC09, ICPP10)
- Shared-memory ports are based on shared data structures in a cache-coherent address space
  - The SCC port is based on distributed data structures and scalable exchange algorithms, while utilizing caches for fast message exchange
- Distributed-memory ports are based on generic sorting and grouping algorithms
  - The SCC port is based on combiners and the radix sort algorithm

Our implementation of MapReduce on the Intel SCC demonstrates:
- Feasibility of implementing high-level, domain-specific parallel programming models that hide explicit communication
- SCC chip scalability when using optimized, chip-specific global communication algorithms
- Good adaptivity to diverse workloads: map-dominated, partition-dominated, group-dominated

Thank you!
The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under the I-CORES project, grant agreement no. 224759.

Appendix: Radix Sort execution time
[Figure: execution time in seconds (0.0 to 4.0) versus number of items (0 to 60000).]