Bringing hardware affinity information into MPI communication strategies

Brice Goglin (and Stéphanie Moreaud)
Inria Runtime Team-Project, Bordeaux
JLPC, Rennes, 2012/06/14

Hardware is increasingly complex
- Many nodes
- Hierarchical interconnects between them
- Multiple processors per node
- Many cores per processor
- Multiple levels of (shared) caches
- NUMA memory
- NUIOA peripherals
JLPC - Rennes - 2012/06/14-2

Outline
1. Why affinity matters and how to deal with it
2. Affinity-aware intra-node MPI
3. Affinity-aware inter-node MPI
4. Software support for managing hardware affinities
5. Conclusions

1. Why affinity matters and how to deal with it

Two ways to deal with affinities
1. Adapt placement to affinities
   - Place tasks according to hardware/software affinity
2. Given a placement, optimize the execution
   - Adapt communication strategies to process locality
Ideally we would do both at the same time.

Affinity-aware placing of tasks
1. Place processes when launching them
2. Reorder processes' roles at runtime
   - MPI_Dist_graph_create
- Define an affinity metric and build a task tree
  - Number of messages, communication volume, etc.
- Map the task tree onto the hardware topology tree
  - For MPI and other paradigms (see F. Tessier's talk tomorrow)
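
The placement idea on this slide can be illustrated with a small sketch. It is a simplified greedy heuristic written for illustration, not the algorithm used by the authors: it groups tasks by a communication-volume metric and then assigns each group to one socket. The function names `group_tasks` and `place`, the matrix `comm`, and the two-socket machine are all assumptions.

```python
def group_tasks(comm, group_size):
    """Greedily partition tasks into groups of `group_size` so that tasks
    exchanging the most data end up in the same group (e.g. one socket).

    comm[i][j] is a symmetric affinity metric (here: bytes exchanged).
    """
    unassigned = set(range(len(comm)))
    groups = []
    while unassigned:
        # Seed each group with the task that talks most to the remaining ones.
        seed = max(sorted(unassigned),
                   key=lambda t: sum(comm[t][u] for u in unassigned))
        group = [seed]
        unassigned.remove(seed)
        while len(group) < group_size and unassigned:
            # Add the task with the highest affinity to the current group.
            best = max(sorted(unassigned),
                       key=lambda t: sum(comm[t][g] for g in group))
            group.append(best)
            unassigned.remove(best)
        groups.append(sorted(group))
    return groups


def place(groups, cores_per_socket):
    """Map each group onto one socket; returns {task: core index}."""
    placement = {}
    for socket, group in enumerate(groups):
        for slot, task in enumerate(group):
            placement[task] = socket * cores_per_socket + slot
    return placement
```

With four tasks where 0 and 2 exchange a lot of data, as do 1 and 3, the sketch puts each heavy pair on its own socket. In an MPI program, the resulting permutation could then be applied at launch time or through rank reordering at runtime.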

Affinity-aware communication
Assuming the task placement is chosen in advance:
- Thread synchronization
  - Hierarchical barriers
- Data movement inside machines
  - Explicit memory migration or implicit remote memory access in NUMA systems?
  - Map host memory in the GPU or explicit DMA transfer in CUDA?
- MPI communication (topic of this talk)

What about MPI communication?
- Inside a node
  - Locality of communicating processes
- Between the nodes
  - Locality of communicating processes and NICs

2. Affinity-aware intra-node MPI

(Too?) many communication strategies
1. Double-buffering across a shared memory buffer
2. Direct copy between process address spaces
   - KNEM, CMA, etc.
   - Sender writing or receiver reading
3. Offloading to specialized copy hardware
Choose between them depending on hardware locality
- NUMA distance, shared caches, etc.

Example: ping-pong on dual-Nehalem (shared cache)

Adapting thresholds to locality
- Double-buffering likes shared caches
  - Preferred strategy inside sockets
- Double-buffering does not like NUMA distance
  - Direct copy preferred between sockets
- Copy offload is only useful for overlap
  - Is the MPI call asynchronous?
  - Only if the copy hardware is close to the buffers?
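
A minimal sketch of how such locality-dependent thresholds might be encoded follows. The threshold values are made up for illustration and would have to be measured on the target machine; the names `Strategy`, `choose_strategy`, `DIRECT_COPY_THRESHOLD` and `OFFLOAD_THRESHOLD` are hypothetical. Treating "copy hardware close to the buffers" as a hard condition is also an assumption, since the slide only raises it as a question.

```python
from enum import Enum


class Strategy(Enum):
    DOUBLE_BUFFERING = "double-buffering through a shared memory buffer"
    DIRECT_COPY = "direct copy between address spaces"
    COPY_OFFLOAD = "offload to specialized copy hardware"


# Hypothetical thresholds in bytes, keyed by "do the processes share a cache?".
# Shared cache: keep double-buffering up to large messages.
# No shared cache (NUMA distance): switch to direct copy much earlier.
DIRECT_COPY_THRESHOLD = {True: 1 << 20, False: 16 << 10}
OFFLOAD_THRESHOLD = 1 << 20  # hypothetical


def choose_strategy(size, shares_cache, asynchronous=False,
                    engine_near_buffers=False):
    """Pick an intra-node copy strategy from message size and locality."""
    # Offload only pays off when the copy can overlap with computation.
    if asynchronous and engine_near_buffers and size >= OFFLOAD_THRESHOLD:
        return Strategy.COPY_OFFLOAD
    if size >= DIRECT_COPY_THRESHOLD[shares_cache]:
        return Strategy.DIRECT_COPY
    return Strategy.DOUBLE_BUFFERING
```

With these placeholder values, a 64 KiB message stays on double-buffering between cores sharing a cache but switches to direct copy between sockets.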

Adapting thresholds to locality (2/2)
- Past collaborations with ANL (and UTK)
  - The hardware landscape has changed a lot since then
  - NUMA is everywhere
  - To be revived
- Ongoing PhD thesis (B. Putigny) at Inria Bordeaux
  - Modeling data movement performance to ease the choice of the right strategy

3. Affinity-aware inter-node MPI

Why affinities matter for inter-node MPI
- Non-Uniform I/O Access (NUIOA)
  - The I/O chipset is close to a single socket
- Data-I/O locality affects I/O throughput and latency
  - Depends a lot on the architecture
  - Impact on DMA throughput can be asymmetrical
- Mostly matters for high-performance I/O
  - High bandwidth and/or low latency
[Figure: block diagrams of memory banks (M), processors (P), I/O chipsets and the NIC on Intel Sandy-Bridge-EP and on AMD platforms]

MPI vs. NUIOA
1. Try to move communication processing near the NIC
   - Collective operation leaders
   - Keep the master thread near the NIC in hybrid programs
2. Use the local NIC first
   - Don't blindly use multirail
   - It stresses the internal interconnect more
   - Depends on collective patterns

NUIOA leader election for collectives
- Bcast between 8 nodes x 8 cores
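
As a rough illustration of NUIOA-aware leader election, the sketch below picks, within one node, the rank whose NUMA node is closest to the NIC. The function `elect_leader`, the rank-to-NUMA mapping and the distance matrix (values in the style of NUMA distance tables, 10 for local and 20 for remote) are assumptions, not data from the talk.

```python
def elect_leader(rank_to_numa, nic_numa, distance):
    """Return the rank closest to the NIC; ties go to the lowest rank.

    rank_to_numa: {rank: NUMA node the rank is bound to}
    nic_numa:     NUMA node the NIC is attached to
    distance:     distance[a][b] between NUMA nodes a and b (smaller is closer)
    """
    # min() keeps the first minimum, and ranks are visited in increasing order.
    return min(sorted(rank_to_numa),
               key=lambda rank: distance[rank_to_numa[rank]][nic_numa])
```

The elected rank would then handle the inter-node part of a collective such as a broadcast, while the other local ranks receive the data through shared memory.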

NUIOA NIC selection, point-to-point
- Quad-Opteron, with two IB NICs

NUIOA NIC selection, all-to-all
- Quad-Opteron, with two IB NICs
- Processes should only use the local NIC, if any. Otherwise, send half to each NIC.
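
The rule stated on this slide can be sketched as follows. The function name `split_across_nics` and the example attachment of the two NICs to NUMA nodes 0 and 3 are assumptions made for illustration; the slide does not say where the NICs sit.

```python
def split_across_nics(size, process_numa, nic_numa_nodes):
    """Decide how many bytes of one message go through each NIC.

    process_numa:   NUMA node the sending process is bound to
    nic_numa_nodes: NUMA node of each NIC, indexed by NIC number
    Returns {nic index: bytes}.
    """
    local = [i for i, node in enumerate(nic_numa_nodes) if node == process_numa]
    if local:
        # A local NIC exists: use only that one.
        return {local[0]: size}
    # No local NIC: spread the message evenly over all NICs.
    share, extra = divmod(size, len(nic_numa_nodes))
    return {i: share + (1 if i < extra else 0)
            for i in range(len(nic_numa_nodes))}
```

On a four-socket machine with NICs assumed to be on nodes 0 and 3, processes on those nodes use a single rail, while processes on nodes 1 and 2 split each message in half.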

4. Software support for managing hardware affinities

hwloc: Portable Hardware Locality
- Initially developed for affinity-based hierarchical thread scheduling at Inria Bordeaux
- Extracted as a standalone library for MPI users
  - Merged with Open MPI PLPA
- Portable to many OSes
- Wide community, no serious competitor
- Now used by most MPI implementations, some batch schedulers, parallel libraries, etc.
  - Mostly for binding, based on user-given policies

hwloc's view of the hardware
- Tree of resource objects
  - Sockets, caches, cores, threads, memory, etc.
- Logical identifiers
  - Portability is the rule
  - Which caches are shared by which cores?
- Support for multiple nodes
  - Global view of clusters, etc.
- Attached I/O objects
  - Which cores/memory are close to a NIC or GPU?
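
To make the tree view concrete, here is a toy model in plain Python. It is not the hwloc API; the class `Obj` and the helpers `ancestors` and `common_ancestor` are hypothetical names. It answers the question "which cache is shared by these two cores?" by walking up to the deepest common ancestor.

```python
class Obj:
    """One node of a toy topology tree (Machine, Socket, L3, Core, ...)."""

    def __init__(self, kind, index, children=()):
        self.kind = kind
        self.index = index
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def ancestors(obj):
    """Return obj followed by all of its ancestors, up to the root."""
    chain = []
    while obj is not None:
        chain.append(obj)
        obj = obj.parent
    return chain


def common_ancestor(a, b):
    """Deepest object that contains both a and b, or None."""
    seen = {id(o) for o in ancestors(a)}
    for o in ancestors(b):
        if id(o) in seen:
            return o
    return None


# Toy machine: 2 sockets, each with one L3 cache shared by 2 cores.
cores = [Obj("Core", i) for i in range(4)]
caches = [Obj("L3", 0, cores[:2]), Obj("L3", 1, cores[2:])]
sockets = [Obj("Socket", 0, [caches[0]]), Obj("Socket", 1, [caches[1]])]
machine = Obj("Machine", 0, sockets)
```

In this toy machine, cores 0 and 1 meet at an L3 cache, whereas cores 0 and 2 only meet at the machine level, which is the kind of distinction a communication layer can use to pick a strategy.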

The need for quantitative criteria
- hwloc only provides logical distances
  - Useful for placement decisions
  - Not for knowing if placement choices are critical
  - Not for choosing between communication strategies
    - Is the NUMA interconnect fast enough to hide the distance?
    - Is the cache too slow/small to help much?
- Annotate the hwloc tree with quantitative information
  - Ongoing work with Inria Grenoble
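
One way to picture what quantitative annotations would enable is a simple latency-plus-bandwidth cost model. This is an assumed model chosen for illustration, not the one developed in the work mentioned on the slide; the functions `transfer_time` and `placement_is_critical`, the 10% tolerance and all numbers are hypothetical.

```python
def transfer_time(size, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move `size` bytes over one link (linear model)."""
    return latency_s + size / bandwidth_bytes_per_s


def placement_is_critical(size, local, remote, tolerance=0.10):
    """True if the remote path is slower than the local one by more than
    `tolerance` (relative), i.e. the interconnect does not hide the distance.

    local, remote: (latency in seconds, bandwidth in bytes per second)
    """
    t_local = transfer_time(size, *local)
    t_remote = transfer_time(size, *remote)
    return (t_remote - t_local) / t_local > tolerance
```

With such annotations attached to the topology tree, a runtime could skip locality-specific tuning on machines where the remote path is nearly as fast as the local one, and apply it where the gap is large.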

5. Conclusions

Locality and affinity matter
- Hardware affinities are everywhere
- You can live without looking at them
  - The impact on performance depends a lot on the hardware
- But you can easily get small improvements everywhere in the HPC stack (and hwloc can help you)

Thank you! Questions?
Brice.Goglin@inria.fr
http://www.open-mpi.org/projects/hwloc