Big Data: Architectures and Data Analytics

September 14, 2017

Student ID          First Name          Last Name

The exam is open book and lasts 2 hours.

Part I

Answer the following questions. There is only one right answer for each question.

1. (2 points) Consider the HDFS folder inputdata containing the following two files:

    Filename    Size      Content of the file    HDFS blocks
    Mark1.txt   9 bytes   21 28 24               B1, B2
    Mark2.txt   9 bytes                          B3, B4

    Block ID    Content of the block
    B1          21 28
    B2          24
    B3
    B4

Suppose that you are using a Hadoop cluster that can potentially run up to 10 mappers in parallel and suppose that the HDFS block size is 6 bytes. Suppose that the following MapReduce program is executed by providing the folder inputdata as input folder and the folder results as output folder.

    /* Driver */
    import ...;

    public class DriverBigData extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = this.getConf();
            Job job = Job.getInstance(conf);
            job.setJobName("2017/09/14 - Theory");
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setJarByClass(DriverBigData.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapperClass(MapperBigData.class);
            job.setMapOutputKeyClass(DoubleWritable.class);
            job.setMapOutputValueClass(NullWritable.class);
            // Map-only job: no reducers
            job.setNumReduceTasks(0);
            if (job.waitForCompletion(true) == true)
                return 0;
            else
                return 1;
        }

        public static void main(String args[]) throws Exception {
            int res = ToolRunner.run(new Configuration(), new DriverBigData(), args);
            System.exit(res);
        }
    }

    /* Mapper */
    import ...;

    class MapperBigData extends Mapper<LongWritable, Text, DoubleWritable, NullWritable> {
        Double minimumMark;

        protected void setup(Context context) {
            minimumMark = null;
        }

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Double mark = new Double(value.toString());
            if (minimumMark == null || mark.doubleValue() < minimumMark) {
                minimumMark = mark;
            }
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {
            // emit the content of minimumMark
            context.write(new DoubleWritable(minimumMark), NullWritable.get());
        }
    }

What is the output generated by the execution of the application reported above?

a) The output folder contains two files
   21
b) The output folder contains only one file
   The output file contains the following line
   21
c) The output folder contains four files
   21
   24
   A fourth file that contains the same content of the previous file, i.e., the following line
d) The output folder contains only one file
   One file that contains the following four lines
   21
   24

2. (2 points) Consider the HDFS folder logsfolder, which contains two files: logs.txt and logs2.txt. The size of logs.txt is 1048MB and the size of logs2.txt is 1000MB. Suppose that you are using a Hadoop cluster that can potentially run up to 20 mappers in parallel and that you execute a MapReduce-based program that selects the rows of the files in logsfolder containing the word WARNING. Which of the following values is a proper HDFS block size if you want to force Hadoop to run exactly 2 mappers in parallel when you execute the application by specifying the folder logsfolder as input?

a) Block size: 256MB
b) Block size: 512MB
c) Block size: 1024MB
d) Block size: 2048MB
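Both questions in Part I depend on how many mappers Hadoop starts for an input folder. The following plain-Java sketch (no Hadoop dependency) counts them under a simplified model that is an assumption of this example: one mapper per HDFS block, blocks never span two files, and no split-size tuning is applied. The class and method names are hypothetical.

```java
// Simplified model (assumption): one mapper per HDFS block,
// and a block never contains data from two different files.
class MapperCountSketch {

    /** Number of HDFS blocks needed to store a file of the given size. */
    static long blocksForFile(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    /** Total number of mappers for a folder, assuming one mapper per block. */
    static long mappersForFolder(long[] fileSizes, long blockSize) {
        long total = 0;
        for (long size : fileSizes) {
            total += blocksForFile(size, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        // Question 1: two 9-byte files, 6-byte blocks (sizes in bytes).
        System.out.println("inputdata: "
                + mappersForFolder(new long[] {9, 9}, 6) + " mappers");

        // Question 2: logs.txt (1048MB) and logs2.txt (1000MB), sizes in MB.
        long[] logs = {1048, 1000};
        for (long blockSize : new long[] {256, 512, 1024, 2048}) {
            System.out.println("block size " + blockSize + "MB -> "
                    + mappersForFolder(logs, blockSize) + " mappers");
        }
    }
}
```

Under this model the folder inputdata is read by four mappers, and the four candidate block sizes of question 2 give 9, 5, 3 and 2 mappers respectively. A real cluster may compute input splits differently, so the numbers should be read as an illustration of the reasoning rather than as measured behaviour.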

Part II

PoliTravel is a web site that aggregates data from several booking services and allows booking flights. To suggest reliable flights, PoliTravel computes a set of statistics that are used to characterize routes and airports based on the number of cancelled flights and delays. The analyses are based on the following input data sets/files.

- Airports.txt
  o Airports.txt is a text file containing the information about airports. Each line contains the information about one airport.
  o Each line of Airports.txt has the following format

        AirportID,City,Country,AirportName

    where AirportID is the identifier of the airport, AirportName is its name, and City and Country are the city and the country where the airport is located, respectively.
    For example, the line

        CDG,Paris,France,Charles de Gaulle

    means that CDG is the id of the Charles de Gaulle airport, which is located in Paris, France.

- Flights.txt
  o Flights.txt is a text file containing the historical information about the flights of the airlines managed by PoliTravel. The number of flights per day is more than 50,000 and Flights.txt contains the historical data about the last 15 years.
  o Each line of the input file has the following format

        Flight_number,Airline,date,scheduled_departure_time,scheduled_arrival_time,departure_airport_id,arrival_airport_id,delay,cancelled,number_of_seats,number_of_booked_seats

    where Flight_number is the identifier of the flight, Airline is the airline that operated the flight, date is the date of the flight, scheduled_departure_time and scheduled_arrival_time are its scheduled departure and arrival times, departure_airport_id and arrival_airport_id are the identifiers of the departure and arrival airports, delay is the delay in minutes of the flight with respect to the scheduled_arrival_time, and cancelled is a flag that is yes if the flight has been cancelled and no otherwise. number_of_seats is the total number of seats of the flight, while number_of_booked_seats is the number of booked seats.
    For example, the line

        LH1103,Lufthansa,2016/06/02,15:35,17:10,CDG,TRN,15,no,120,101

    means that the flight LH1103, operated by Lufthansa, from CDG to TRN, scheduled for June 2, 2016, with scheduled departure time 15:35 and scheduled arrival time 17:10, arrived at the TRN airport 15 minutes late. The flight had 120 seats and only 101 were booked.

Exercise 1 - MapReduce and Hadoop (9 points)

The managers of PoliTravel are interested in selecting the airports characterized by many late arriving flights in year 2016 (i.e., from January 1, 2016 to December 31, 2016). Specifically, the airports with more than 5% of the landing flights arriving at least 15 minutes late must be selected by the application and stored in the output HDFS folder. Design a single application, based on MapReduce and Hadoop, and write the corresponding Java code, to address the following point:

A. Destination airports with too many delayed flights in year 2016. Considering only the subset of historical data/flights from January 1, 2016 to December 31, 2016, the application must select the ids of the airports with more than 5% of the flights arriving at least 15 minutes late in year 2016. The percentage of flights arriving late for an airport is given by the ratio between the number of flights arriving at least 15 minutes late at that airport and the total number of flights arriving at that airport. Store the result of the analysis in an HDFS folder. The output file contains one line for each of the selected airports. Each line of the output file has the following format

        arrival_airport_id\tpercentage of delayed flights

The name of the output folder is one argument of the application. The other argument is the path of the input file Flights.txt. Note that Flights.txt contains the data of the last 15 years, but the analysis is focused only on the flights of year 2016.

Fill in the provided template for the Driver of this exercise. Use your papers for the other parts (Mapper and Reducer).
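The selection rule of Exercise 1 can be illustrated without Hadoop. The sketch below is plain Java, not a MapReduce solution: in a MapReduce design the filtering and per-flight emission would roughly correspond to the mapper and the per-airport ratio to the reducer. The class name, the method name and every sample line except LH1103 are hypothetical, and counting cancelled flights among the arriving flights is an assumption, since the text does not say how they should be treated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LateArrivalsSketch {

    /** Returns "airportId\tpercentage" for airports with more than 5% late arrivals in 2016. */
    static List<String> lateAirports(List<String> flightLines) {
        // arrival airport id -> {flights at least 15 minutes late, all flights}, year 2016 only
        Map<String, int[]> counts = new TreeMap<>();
        for (String line : flightLines) {
            String[] f = line.split(",");
            if (!f[2].startsWith("2016/")) {
                continue; // Flights.txt covers 15 years; keep only year 2016
            }
            int delay = Integer.parseInt(f[7]);
            int[] c = counts.computeIfAbsent(f[6], k -> new int[2]);
            if (delay >= 15) {
                c[0]++;
            }
            c[1]++;
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            double percentage = 100.0 * e.getValue()[0] / e.getValue()[1];
            if (percentage > 5.0) {
                result.add(e.getKey() + "\t" + percentage);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                // Line taken from the text
                "LH1103,Lufthansa,2016/06/02,15:35,17:10,CDG,TRN,15,no,120,101",
                // Hypothetical lines, invented for this example
                "AF0001,Air France,2016/07/01,10:00,11:30,CDG,TRN,0,no,100,80",
                "AF0002,Air France,2015/07/01,10:00,11:30,TRN,CDG,30,no,100,80",
                "AZ0003,Alitalia,2016/03/10,08:00,09:30,TRN,CDG,5,no,150,90");
        lateAirports(sample).forEach(System.out::println);
    }
}
```

With the sample above, TRN has one late flight out of two in 2016 and is selected, while CDG has no late flight in 2016 (its only late arrival dates from 2015) and is not.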

Exercise 2 - Spark and RDDs (18 points)

The managers of PoliTravel are interested in analyzing the number of cancelled flights for each airline depending on the departure airport, considering only the flights with a departure airport located in France. Specifically, they are interested in counting, for each couple (airline, departure airport) with a departure airport located in France, the number of cancelled flights, and in sorting the results by the number of cancelled flights.

Another analysis of interest is the identification of the underused routes between couples of airports. Each couple of airports (departure airport, arrival airport) is a route, and a route is an underused route if at least 25% of the flights of that route have at least % of not booked seats and at least 10% of the flights of that route were cancelled. The percentage of not booked seats is given by

        100*(number_of_seats - number_of_booked_seats)/number_of_seats
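As a rough illustration of the first analysis (cancelled flights per airline and French departure airport), the following sketch uses plain Java collections instead of Spark RDDs; in Spark the lookup map would typically become a join and the counting a reduce by key. The class and method names and all flight lines are hypothetical, as is the TRN airport line. The sketch also assumes that airport names contain no commas and that couples without cancelled flights are simply omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CancelledInFranceSketch {

    /** Returns "count,airportName,airline" lines, sorted by decreasing count. */
    static List<String> cancelledByAirportAndAirline(List<String> airportLines,
                                                     List<String> flightLines) {
        // airport id -> airport name, restricted to airports located in France
        Map<String, String> frenchAirports = new HashMap<>();
        for (String line : airportLines) {
            String[] a = line.split(",");
            if (a[2].equals("France")) {
                frenchAirports.put(a[0], a[3]);
            }
        }
        // "airportName,airline" -> number of cancelled flights
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : flightLines) {
            String[] f = line.split(",");
            String airportName = frenchAirports.get(f[5]); // null if not in France
            if (airportName != null && f[8].equals("yes")) {
                counts.merge(airportName + "," + f[1], 1, Integer::sum);
            }
        }
        // Sort by decreasing number of cancelled flights (stable sort)
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.add(e.getValue() + "," + e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> airports = List.of(
                "CDG,Paris,France,Charles de Gaulle", // line taken from the text
                "TRN,Turin,Italy,Torino Caselle");    // hypothetical line
        List<String> flights = List.of(               // all hypothetical lines
                "AF0001,Air France,2016/07/01,10:00,11:30,CDG,TRN,0,yes,100,80",
                "AF0002,Air France,2016/07/02,10:00,11:30,CDG,TRN,0,yes,100,80",
                "LH0009,Lufthansa,2016/06/03,15:35,17:10,CDG,TRN,0,yes,120,101",
                "AZ0003,Alitalia,2016/03/10,08:00,09:30,TRN,CDG,0,yes,150,90");
        cancelledByAirportAndAirline(airports, flights).forEach(System.out::println);
    }
}
```

The Alitalia flight in the sample is cancelled but departs from an Italian airport, so it does not contribute to the result.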

The managers of PoliTravel asked you to develop an application to address all the analyses they are interested in. The application has 4 arguments/parameters: the files Flights.txt and Airports.txt and two output folders (associated with the outputs of the following points A and B, respectively). Specifically, design a single application, based on Spark and RDDs, and write the corresponding Java code, to address the following points:

A. (9 points) Airlines with many cancelled flights departing from France. The application must select only the flights with a departure airport located in France and then compute, for each couple (departure airport, airline), the number of cancelled flights. The application stores in the first HDFS output folder the information number of cancelled flights, departure airport name, airline. The results are stored in decreasing order of the number of cancelled flights. The output contains one couple (departure airport name, airline), and the associated number of cancelled flights, per line. Note that the application stores the name of the departure airport. You can suppose that the departure airport name is unique (i.e., there are no two airports with the same name).

B. (9 points) Underused routes. The application must select the underused routes. Every couple of airport ids (departure_airport_id,arrival_airport_id) associated with at least one flight is a route [1]. A route is an underused route if at least 25% of the flights of that route have at least % of not booked seats and at least 10% of the flights of that route were cancelled, based on the historical data available in Flights.txt. The percentage of not booked seats is given by 100*(number_of_seats - number_of_booked_seats)/number_of_seats. The application stores in the second HDFS output folder the information (departure_airport_id,arrival_airport_id) for the selected routes. Note that the application stores couples of airport ids and not their names. The output contains one line per selected route.

[1] Note that the direction is important. For instance, the couple of airport ids (TRN, CDG) is a route and the couple (CDG, TRN) is a different route.
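The underused-route rule combines two per-route percentages, which the following plain-Java sketch tries to make concrete (again without Spark). Because the minimum percentage of not booked seats is not given in the text, it is passed in as the hypothetical parameter minNotBookedPct. The class and method names and the sample flights used to exercise the code are likewise assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UnderusedRoutesSketch {

    /** Percentage of not booked seats, following the formula given in the text. */
    static double notBookedPercentage(int seats, int bookedSeats) {
        return 100.0 * (seats - bookedSeats) / seats;
    }

    /**
     * Returns the routes "departureId,arrivalId" that are underused.
     * minNotBookedPct is a hypothetical parameter: the text does not state the value.
     */
    static List<String> underusedRoutes(List<String> flightLines, double minNotBookedPct) {
        // route -> {flights with many free seats, cancelled flights, all flights}
        Map<String, int[]> stats = new TreeMap<>();
        for (String line : flightLines) {
            String[] f = line.split(",");
            // Direction matters: "TRN,CDG" and "CDG,TRN" are different keys
            String route = f[5] + "," + f[6];
            int[] s = stats.computeIfAbsent(route, k -> new int[3]);
            int seats = Integer.parseInt(f[9]);
            int booked = Integer.parseInt(f[10]);
            if (notBookedPercentage(seats, booked) >= minNotBookedPct) {
                s[0]++;
            }
            if (f[8].equals("yes")) {
                s[1]++;
            }
            s[2]++;
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, int[]> e : stats.entrySet()) {
            int[] s = e.getValue();
            boolean manyFreeSeats = 100.0 * s[0] / s[2] >= 25.0;
            boolean manyCancelled = 100.0 * s[1] / s[2] >= 10.0;
            if (manyFreeSeats && manyCancelled) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Flight LH1103 from the text: 120 seats, 101 booked
        System.out.println(notBookedPercentage(120, 101));
    }
}
```

For flight LH1103 the formula gives 100*(120-101)/120, i.e. slightly less than 16% of the seats were not booked.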

Use the following template for the Driver of Exercise 1. Fill in the missing parts. You can strike through the second job if you do not need it.

    import ...;

    /* Driver class. */
    public class DriverBigData extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            int exitCode;
            Path inputPath = new Path(args[0]);
            Path outputDir = new Path(args[1]);
            Configuration conf = this.getConf();

            // First job
            Job job1 = Job.getInstance(conf);
            job1.setJobName("Exercise 1 - Job 1");

            // Job 1 - Input path
            FileInputFormat.addInputPath(job1, ..........);

            // Job 1 - Output path
            FileOutputFormat.setOutputPath(job1, ..........);

            // Job 1 - Driver class
            job1.setJarByClass(DriverBigData.class);

            // Job 1 - Input format
            job1.setInputFormatClass(..........);

            // Job 1 - Output format
            job1.setOutputFormatClass(..........);

            // Job 1 - Mapper class
            job1.setMapperClass(Mapper1BigData.class);

            // Job 1 - Mapper: output key and output value data types/classes
            job1.setMapOutputKeyClass(..........);
            job1.setMapOutputValueClass(..........);

            // Job 1 - Reducer class
            job1.setReducerClass(Reducer1BigData.class);

            // Job 1 - Reducer: output key and output value data types/classes
            job1.setOutputKeyClass(..........);
            job1.setOutputValueClass(..........);

            // Job 1 - Number of reducers
            job1.setNumReduceTasks( 0[ _ ] or 1[ _ ] or >=1[ _ ] ); /* Select only one of the three options */

            // Execute the first job and wait for completion
            if (job1.waitForCompletion(true) == true) {
                // Second job
                Job job2 = Job.getInstance(conf);
                job2.setJobName("Exercise 1 - Job 2");

                // Set path of the input folder of the second job
                FileInputFormat.addInputPath(job2, ..........);

                // Set path of the output folder for the second job
                FileOutputFormat.setOutputPath(job2, ..........);

                // Class of the Driver for this job
                job2.setJarByClass(DriverBigData.class);

                // Set input format
                job2.setInputFormatClass(..........);

                // Set output format
                job2.setOutputFormatClass(..........);

                // Set map class
                job2.setMapperClass(Mapper2BigData.class);

                // Set map output key and value classes
                job2.setMapOutputKeyClass(..........);
                job2.setMapOutputValueClass(..........);

                // Set reduce class
                job2.setReducerClass(Reducer2BigData.class);

                // Set reduce output key and value classes
                job2.setOutputKeyClass(..........);
                job2.setOutputValueClass(..........);

                // Set number of reducers of the second job
                job2.setNumReduceTasks( 0[ _ ] or 1[ _ ] or >=1[ _ ] ); /* Select only one of the three options */

                // Execute the job and wait for completion
                if (job2.waitForCompletion(true) == true)
                    exitCode = 0;
                else
                    exitCode = 1;
            } else
                exitCode = 1;

            return exitCode;
        }

        /* Main of the driver */
        public static void main(String args[]) throws Exception {
            int res = ToolRunner.run(new Configuration(), new DriverBigData(), args);
            System.exit(res);
        }
    }