Scalable Runtime Support for Data-Intensive Applications on the Single-Chip Cloud Computer

Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Institute of Computer Science (ICS), Foundation for Research and Technology Hellas (FORTH)
GR 70013, Heraklion, Crete, GREECE
{apapag,dsn}@ics.forth.gr
3rd MARC Symposium, 2011

Outline
- Intel Single-Chip-Cloud
- Execution Time Breakdowns
- SCC vs. Cell BE

Motivation and Contributions
We are in the transition from multi-core processors to many-core processors.
Programmers have to deal with:
- many cores
- many forms of implicit or explicit communication
- many forms of synchronization
- potential lack of cache coherence
Contributions of this work:
- First implementation of a high-level, domain-specific parallel programming model (Google's MapReduce) on a cache-based many-core processor with no cache coherence, based on explicit communication (SCC)
- Evaluation showing that the Intel SCC effectively supports:
  - high-level programming models that hide communication, synchronization, and parallelization under the hood
  - scalable execution of data-intensive applications

Intel SCC
- Many-core processor with 24 tiles, 2 IA cores per tile
- Tiles organized in a 4×6 mesh network with 256 GB/s bisection bandwidth
- Per core: private 16 KB L1 instruction cache, private 16 KB L1 data cache, private unified 256 KB L2 cache
- Per tile: 16 KB message passing buffer (MPB), the only on-chip memory shared between cores
[Figure: SCC block diagram. Chip-level labels: DDR MC (x4), routers (R), VRC, System Interface. Tile detail: two P54C cores (16 KB each L1), 256 KB L2 per core, CC, FSB, MIU, Traffic Gen, Message Passing Buffer, Router.]
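
The per-tile and per-core figures above imply a few chip-wide totals. The following is only a back-of-the-envelope sketch of that arithmetic; the constant names are made up for illustration, and the even split of the MPB between the two cores of a tile is an assumption, not something the slide states.

```python
# Chip-wide totals derived from the per-tile / per-core figures on the slide.
TILES = 24
CORES_PER_TILE = 2
L2_KB_PER_CORE = 256
MPB_KB_PER_TILE = 16


def scc_totals():
    """Return derived totals; pure arithmetic on the slide's numbers."""
    cores = TILES * CORES_PER_TILE
    return {
        "cores": cores,
        "l2_total_kb": cores * L2_KB_PER_CORE,
        "mpb_total_kb": TILES * MPB_KB_PER_TILE,
        # Assumption: each tile's MPB is shared evenly by its two cores.
        "mpb_kb_per_core": MPB_KB_PER_TILE / CORES_PER_TILE,
    }


if __name__ == "__main__":
    for name, value in scc_totals().items():
        print(name, value)
```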

MapReduce
- A framework for large-scale data processing
- Programming model (API) and runtime system for a variety of parallel architectures
  - Clusters, SMPs, multi-cores, GPUs, among others
- Based on functional programming language primitives
- Used extensively in real applications
  - Indexing systems, distributed grep, document clustering, machine learning, statistical machine translation
- Relies heavily on a scalable runtime system
  - Fault tolerance, parallelization, scheduling, synchronization, and communication

Example: counting word occurrences in a set of documents
Input: Sally sells sea shells by the sea shore
Map: (sally, 1) (sells, 1) (sea, 1) (shells, 1) (by, 1) (the, 1) (sea, 1) (shore, 1)
Group By Key: (by, 1) (sally, 1) (sea, 1:1) (sells, 1) (shells, 1) (the, 1) (shore, 1)
Reduce: (by, 1) (sally, 1) (sea, 2) (sells, 1) (shells, 1) (the, 1) (shore, 1)
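
The three steps of this example can be sketched in a few lines of sequential Python. This is an illustration of the programming model only, not the SCC runtime; the function names are hypothetical and lowercasing the words is an assumption made to match the example.

```python
from collections import defaultdict


def map_words(text):
    """Map: emit one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in text.split()]


def group_by_key(pairs):
    """Group By Key: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(groups):
    """Reduce: sum the list of values of each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text):
    return reduce_counts(group_by_key(map_words(text)))


if __name__ == "__main__":
    print(word_count("Sally sells sea shells by the sea shore"))
```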

Seven-stage runtime system for MapReduce:
1. Map
2. Combine (optional)
3. Partition
4. Group
5. Reduce
6. Sort (optional)
7. Merge (optional)

Map
[Figure: Core 0 maps "by the the by" to by,1 by,1 the,1 the,1; Core 1 maps "by the by the by" to by,1 by,1 by,1 the,1 the,1.]
- Each core executes the user-defined map function on chunks of input data located in local memory
- Map function emits one or more intermediate key-value pairs
- Intermediate key-value pairs are stored in a contiguous buffer
- Runtime preallocates large chunks of memory (64 MB) for intermediate data buffers
- More buffering space is allocated on demand, if needed
- This allocation strategy reduces memory management overhead
- Each core produces as many intermediate data partitions as the total number of cores
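
A minimal sketch of the map stage on one core, assuming keys are assigned to partitions by a hash of the key modulo the number of cores. The slide only says that each core produces one partition per core; the CRC32 hash, the function names, and the use of Python lists in place of preallocated contiguous buffers are all assumptions.

```python
import zlib


def partition_of(key, num_cores):
    """Assumed partitioning rule: CRC32 of the key, modulo the core count."""
    return zlib.crc32(key.encode("utf-8")) % num_cores


def map_stage(chunk, map_fn, num_cores):
    """Run map_fn on every record of this core's chunk and place each
    emitted (key, value) pair into one of num_cores partitions."""
    partitions = [[] for _ in range(num_cores)]
    for record in chunk:
        for key, value in map_fn(record):
            partitions[partition_of(key, num_cores)].append((key, value))
    return partitions


def word_map(word):
    """User-defined map function for word count."""
    yield (word, 1)


if __name__ == "__main__":
    print(map_stage("by the the by".split(), word_map, num_cores=2))
```

Because the rule depends only on the key, every core sends all pairs for a given key to the same destination core.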

Combine
[Figure: on Core 0, Combine turns by,1 by,1 the,1 the,1 into by,2 the,2; on Core 1, it turns by,1 by,1 by,1 the,1 the,1 into by,3 the,2.]
- Optional stage, executed if the user provides a combiner function
- Locally reduces the size of each partition produced during the map stage
- Takes as input a key and a list of partially aggregated intermediate values associated with that key
- Produces a new intermediate key-value pair based on the intermediate key and its corresponding list of values
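
The combine step can be sketched as a local aggregation over one partition. The names `combine_partition` and `sum_combiner` are hypothetical, and the dictionary-based grouping is only one possible way to collect the values of each key.

```python
def combine_partition(pairs, combiner):
    """Apply a user-supplied combiner to the values of each key in one
    locally produced partition, yielding one pair per key."""
    values_by_key = {}
    for key, value in pairs:
        values_by_key.setdefault(key, []).append(value)
    return [(key, combiner(key, values)) for key, values in values_by_key.items()]


def sum_combiner(key, values):
    """Combiner for word count: partial sum of the counts."""
    return sum(values)


if __name__ == "__main__":
    core0 = [("by", 1), ("by", 1), ("the", 1), ("the", 1)]
    print(combine_partition(core0, sum_combiner))
```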

Partition
[Figure: exchange pattern among partitions P0 to P3 over iterations 0, 1, and 2.]
- Requires an all-to-all exchange between cores
- Data partitions generated during the map stage may differ in size
- First, execute an all-to-all exchange of the sizes of each partition
- Knowing the size of each partition, execute a second all-to-all exchange with the actual data
- Let p be the number of available cores and rank the core ID. The algorithm uses p - 1 steps; in each step k, core rank receives data from core rank - k and sends data to core rank + k (core IDs taken modulo p).
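
The exchange schedule described above can be simulated sequentially. This is a sketch, not the SCC implementation: messages are modeled as list assignments, step numbering is assumed to run from 1 to p - 1, and core IDs are assumed to wrap modulo p. The same schedule would serve for both the size exchange and the data exchange.

```python
def exchange_schedule(p):
    """For each step k = 1..p-1, list (rank, send_to, recv_from) per core."""
    return [
        [(rank, (rank + k) % p, (rank - k) % p) for rank in range(p)]
        for k in range(1, p)
    ]


def all_to_all(outgoing):
    """outgoing[src][dst] is the partition that core src produced for core dst.
    Returns incoming, where incoming[dst][src] is what dst received from src."""
    p = len(outgoing)
    incoming = [[None] * p for _ in range(p)]
    for rank in range(p):
        incoming[rank][rank] = outgoing[rank][rank]  # own partition stays local
    for step in exchange_schedule(p):
        for rank, send_to, _recv_from in step:
            incoming[send_to][rank] = outgoing[rank][send_to]
    return incoming


if __name__ == "__main__":
    for k, step in enumerate(exchange_schedule(4), start=1):
        print("step", k, step)
```

In every step each core sends to exactly one peer and receives from exactly one peer, so no core is the target of two transfers at once.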

Group
- Groups all (key, value) pairs with the same key
- Uses radix sort instead of a conventional merge sort
- Radix sort sorts strings of bytes and cannot use a user-defined comparator
- If radix sort does not produce the order of the application's native key type, the output is sorted again using a user-specified compare function
- Conventional comparison-based sorting algorithms have complexity O(n log n); radix sort has complexity O(kn), where k is the size of the key in bytes
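
A sketch of grouping by radix sort, assuming byte-string keys of at most `key_len` bytes that contain no NUL bytes (shorter keys are padded with NUL for sorting). The least-significant-byte-first variant and the helper names are assumptions; the slides do not say which radix sort variant the runtime uses.

```python
def radix_sort_pairs(pairs, key_len):
    """Stable LSD radix sort of (key, value) pairs by key bytes.
    Makes key_len passes over n items, i.e. O(k * n) work."""
    items = [(key.ljust(key_len, b"\x00"), key, value) for key, value in pairs]
    for pos in range(key_len - 1, -1, -1):
        buckets = [[] for _ in range(256)]  # one bucket per byte value
        for item in items:
            buckets[item[0][pos]].append(item)
        items = [item for bucket in buckets for item in bucket]
    return [(key, value) for _padded, key, value in items]


def group_sorted(pairs):
    """Collapse runs of equal keys in sorted input into (key, [values])."""
    groups = []
    for key, value in pairs:
        if groups and groups[-1][0] == key:
            groups[-1][1].append(value)
        else:
            groups.append((key, [value]))
    return groups


if __name__ == "__main__":
    pairs = [(b"the", 2), (b"by", 2), (b"the", 2), (b"by", 3)]
    print(group_sorted(radix_sort_pairs(pairs, key_len=3)))
```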

Reduce
[Figure: Core 0 and Core 1 apply Reduce to the grouped pairs by,2 by,3 and the,2 the,2; output labels in the figure: by,2 and the,2.]
- Group stage exports distinct keys, each with a list of corresponding values
- Reduce stage executes the user-defined aggregation function on each (key, list of values) pair
- Reduce function emits one or more output key-value pairs
- Total output size is known prior to reduction, so the output buffer is preallocated
- This minimizes memory management overhead
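
A sketch of the reduce stage with a preallocated output buffer. It assumes the simplest case of exactly one output pair per key, so the output size equals the number of groups; the function names are hypothetical.

```python
def reduce_stage(groups, reduce_fn):
    """Apply reduce_fn to each (key, values) group.
    Assumption: one output pair per key, so the buffer size is known up front."""
    output = [None] * len(groups)  # preallocated output buffer
    for i, (key, values) in enumerate(groups):
        output[i] = (key, reduce_fn(key, values))
    return output


def sum_reduce(key, values):
    """Reduce function for word count: total of the partial counts."""
    return sum(values)


if __name__ == "__main__":
    print(reduce_stage([("by", [2, 3]), ("the", [2, 2])], sum_reduce))
```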

Sort and Merge
[Figure: tree merge into the output buffer. Step 0: P0, P1, P2, P3; Step 1: P0, P2; Step 2: P0.]
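
The figure suggests a binary tree merge in which the number of cores holding data halves at every step. The following sequential sketch assumes each core's output is already sorted locally and that there is at least one run; the pairing rule (core rank absorbs core rank + stride) is inferred from the figure rather than stated in the slides.

```python
import heapq


def tree_merge(sorted_runs):
    """Merge per-core sorted runs pairwise in a binary tree.
    Returns the merged output and the list of active cores after each step."""
    runs = {rank: list(run) for rank, run in enumerate(sorted_runs)}
    active_per_step = [sorted(runs)]
    stride = 1
    while stride < len(sorted_runs):
        for rank in range(0, len(sorted_runs), 2 * stride):
            partner = rank + stride
            if partner in runs:
                # Core `rank` absorbs the run held by core `partner`.
                runs[rank] = list(heapq.merge(runs[rank], runs.pop(partner)))
        active_per_step.append(sorted(runs))
        stride *= 2
    return runs[0], active_per_step


if __name__ == "__main__":
    merged, active = tree_merge([[1, 5], [2, 6], [0, 7], [3, 4]])
    print(merged)
    print(active)
```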

Benchmarks:
- Histogram (partition-dominated): counts the frequency of occurrences of each RGB color component in an image file
- Word Count (partition-dominated): counts the number of occurrences of each word in a text file
- Kmeans (map-dominated): creates clusters from a set of data points
- Linear Regression (map-dominated): computes a line of best fit for a set of points, given their 2D coordinates
Configuration:
- Cores run at 533 MHz
- Mesh interconnect runs at 800 MHz
- DRAM runs at 800 MHz

[Figure: two plots, with combiner and without combiner, showing Histogram, WordCount, KMeans, and Linear Regression against the ideal curve. x-axis: number of cores (0 to 50).]
- Combiner function improves scalability
- Kmeans and Linear Regression are map-dominated benchmarks
- Superlinear speedup because the complexity of the group stage decreases exponentially with the number of cores

Execution Time Breakdowns
[Figure: stacked bars of execution time in seconds for Histogram, KMeans, WordCount, and LinearRegression, broken down into Map, Combine, Partition, Group, Reduce, Sort, and Merge. Left bars with combiner, right bars without combiner.]
- Using a combiner function reduces execution time
- Partition stage does not scale
- Combiner minimizes total partition time and group time

SCC vs. Cell BE
[Figure: execution time in seconds versus number of cores (0 to 50) for SCC with combiner, SCC without combiner, Cell Blade with 1 processor, and Cell Blade with 2 processors.]
- QS22 Blade consists of 2 Cell BE processors at 3.2 GHz
- Each processor has 8 SPEs (accelerators)
- WordCount benchmark with 60 MB input size
- A single SCC node outperforms the dual-Cell blade by up to 1.87x

Related Work
- Other ports of MapReduce on clusters, SMPs, multicores, and GPUs (HPCA07, PACT08, IISWC09, ICPP10)
- Shared-memory ports are based on shared data structures in a cache-coherent address space
  - SCC port is based on distributed data structures and scalable exchange algorithms, while utilizing caches for fast message exchange
- Distributed-memory ports are based on generic sorting and grouping algorithms
  - SCC port is based on combiners and a radix sort algorithm

Our implementation of MapReduce on the Intel SCC demonstrates:
- Feasibility of implementing high-level, domain-specific parallel programming models that hide explicit communication
- SCC chip scalability when using optimized, chip-specific global communication algorithms
- Good adaptivity to diverse workloads: map-dominated, partition-dominated, group-dominated

Thank you!
The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under the I-CORES project, grant agreement no. 224759.

Appendix
[Figure: radix sort execution time in seconds (0.0 to 4.0) versus number of items (0 to 60000).]