Predicting Flight Delays Using Data Mining Techniques


Todd Keech
CSC 600 Project Report

Background

According to the FAA, air carriers operating in the US in 2012 carried 837.2 million passengers, and the aviation industry supported 11.8 million jobs, resulting in a 5.4 percent contribution to the US gross domestic product. Despite these positive contributions to the US economy, nearly all commercial airlines experience flight delays and cancellations, which can negatively impact economic growth. A 2010 study by researchers at the University of California, Berkeley concluded that flight delays, cancellations and missed connections cost the US economy approximately $32.9 billion, resulting in a $4 billion reduction to the US gross domestic product. The researchers found that $16.7 billion of the $32.9 billion cost is borne by passengers in the form of lost time and additional expenses such as food and lodging.

Problem

The purpose of this project is to use publicly available data and data mining techniques to construct an analytical model that predicts flight delays based on flight attributes such as origin, destination, date/time and distance. Additional models are created to determine the most likely cause of a flight delay and to predict the approximate length of the delay.

Motivation

There are a number of practical uses for flight delay modeling. The FAA and commercial carriers can use flight delay analytical models to identify the most frequently congested network paths and determine appropriate solutions to address the most common delay causes. Commercial carriers can use predictive models as an input into scheduling optimization algorithms that seek to minimize aircraft ground time and maximize flight time. Retailers and consumers can also benefit from predictive flight delay modeling. Retailers such as Kayak and Travelocity may provide flight analysis information to end consumers as a means to empower informed decision-making.
For example, a customer booking a flight on Kayak may reconsider travel arrangements based on the predicted likelihood of a delay. Retailers looking to increase their customer base can offer information services that enable consumers to make better decisions. Consumers can use this information to avoid the costs and inconvenience of flight delays. Fewer flight reservations across congested paths will force carriers to adjust prices downward or discontinue such flights, resulting in greater efficiency in the overall airline system.

Data

Data for this project is available on the Bureau of Transportation Statistics (BTS) website. The BTS On-Time Performance database contains flight information, including scheduled and actual arrival times, as reported by US commercial airlines in CSV-formatted data files. Each row in the data set represents an individual flight. Data files are available by month dating back to 1987. Each monthly file contains approximately 500,000 observations. The following attributes are included in each observation:

- Flight date
- Quarter
- Month
- Carrier
- Flight Distance
- CRS elapsed time
- CRS arrival time
- CRS departure time
- Day of month
- Day of week
- Origin
- Destination
- Destination city
- Destination state
- Origin city
- Origin state
- Arrival delay
- CARRIER_DELAY
- LATE_AIRCRAFT_DELAY
- NAS_DELAY
- WEATHER_DELAY
- SECURITY_DELAY

In the BTS data set, flights are characterized as on-time or delayed based on the arrival delay attribute. This attribute contains the total number of delay minutes relative to the scheduled arrival time. The data set also contains five delay category attributes that quantify the reason for the delay:

- CARRIER_DELAY: Delays due to circumstances under the carrier's control (mechanical, administrative, etc.)
- LATE_AIRCRAFT_DELAY: The aircraft scheduled to service the route was late in arriving
- NAS_DELAY: National Aviation System delays (volume, air traffic control, etc.)
- SECURITY_DELAY: Security-related incidents resulting in a late departure or arrival
- WEATHER_DELAY: Weather-related delays

Each of these fields contains the total number of delay minutes for that category. Since a flight can be delayed for multiple reasons, Arrival delay is the summation of the five delay categories. In some cases, the delay cause is not reported.

In addition to the On-Time Performance data set, this study used two additional data sets. The BTS holiday travel data set is used to identify peak travel periods based on airline industry standards. The NOAA Storm Events Database is used to identify severe weather periods. Peak travel periods and severe weather are both used as predictors for determining flight delays, delay lengths and delay causes.

Preprocessing

The BTS On-Time Performance data contains over 10 million observations based on flight data dating back to 1987. To limit the amount of data used in this study, observations were limited to 2014 domestic flight data, including flights to US territories. This project also limits the study area to flights arriving at or departing from Philadelphia International Airport (PHL) via Southwest Airlines.
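
The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the column names (FlightDate, Carrier, Origin, Dest, ArrDelay) and the sample rows are hypothetical, and the actual BTS export headers may differ. WN is the carrier code for Southwest Airlines.

```python
import csv
import io

# Made-up sample rows with hypothetical column names;
# a real BTS monthly file would be read from disk instead.
SAMPLE_CSV = """FlightDate,Carrier,Origin,Dest,ArrDelay
2014-01-05,WN,PHL,MDW,12
2014-01-05,AA,PHL,ORD,0
2014-02-10,WN,BNA,PHL,-3
2013-12-31,WN,PHL,MCO,45
2014-03-01,WN,DEN,LAS,20
"""


def filter_study_flights(rows, year="2014", carrier="WN", airport="PHL"):
    """Keep flights from the given year and carrier that arrive at or depart from the airport."""
    return [
        row for row in rows
        if row["FlightDate"].startswith(year)
        and row["Carrier"] == carrier
        and airport in (row["Origin"], row["Dest"])
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
study_rows = filter_study_flights(rows)
print(len(study_rows), "of", len(rows), "sample flights fall inside the study area")
```

Keeping the year, carrier and airport as parameters makes it easy to repeat the analysis for a different airport or carrier.
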
The rationale behind targeting a narrow band of location-specific and carrier-specific flight data was to isolate local delay patterns with the intention of improving the accuracy of the models. A data set covering a large geographic area and multiple carriers may be more generalizable, but local delay patterns may be lost in the noise, resulting in a less accurate model.

Preprocessing consisted of the following steps:

- Create a binary field, IS_DELAYED, to indicate whether or not a flight was delayed based on the arrival delay attribute.
- Map NOAA severe weather data to flight data based on flight date. The end result was a binary indicator, SEVERE_WEATHER, for each flight record that specified whether or not the flight occurred on a severe weather day. This study included only severe weather that occurred in the greater Philadelphia area.
- Map peak travel data to flight data based on flight date. The end result was a binary indicator, PEAK_TRAVEL, for each flight record that specified whether or not the flight occurred on a peak travel day.
- Create a categorical attribute, DELAY_TYPE, to capture the leading flight delay cause.
  o DELAY_TYPE is the category with the largest number of delay minutes among CARRIER_DELAY, LATE_AIRCRAFT_DELAY, NAS_DELAY, SECURITY_DELAY and WEATHER_DELAY.
  o In the event of a tie, the delay type was chosen randomly.
  o In some instances, all values for CARRIER_DELAY, LATE_AIRCRAFT_DELAY, NAS_DELAY, SECURITY_DELAY and WEATHER_DELAY were 0 even though the Arrival delay field was non-zero. In these instances, DELAY_TYPE = UNKNOWN.
- Create new fields to capture the CRS arrival hour and CRS departure hour based on the CRS arrival time and CRS departure time fields.

After preprocessing, the total number of observations used in the study was 17,477. The data was randomly sampled to produce a training set containing 13,107 observations and a test set containing 4,370 observations.

Data Exploration

Exploration of the training set revealed that 47.2% of Southwest flights destined for or originating from PHL were delayed in 2014. Further exploration of the data revealed that flight date and flight origin may be good predictors of flight delay, given the difference in their delay percentages compared to the overall population.
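
The preprocessing steps listed in the Preprocessing section can be sketched as follows. This is an illustrative Python sketch under several assumptions that the report does not state: the field names (FL_DATE, ARR_DELAY, CRS_DEP_TIME, CRS_ARR_TIME) are hypothetical, CRS times are assumed to be hhmm integers, a flight is treated as delayed when its arrival delay exceeds a configurable threshold (0 minutes here), flights that are not delayed receive a placeholder DELAY_TYPE of NONE, and the 75/25 split fraction is inferred from the reported set sizes.

```python
import random

DELAY_CAUSES = ["CARRIER_DELAY", "LATE_AIRCRAFT_DELAY", "NAS_DELAY",
                "SECURITY_DELAY", "WEATHER_DELAY"]


def leading_delay_type(flight, is_delayed, rng):
    """Return the delay cause with the most minutes, breaking ties at random."""
    if not is_delayed:
        return "NONE"  # assumption: the report does not name this case
    minutes = {cause: flight.get(cause) or 0 for cause in DELAY_CAUSES}
    top = max(minutes.values())
    if top == 0:
        return "UNKNOWN"  # delayed, but no cause was reported
    tied = [cause for cause, value in minutes.items() if value == top]
    return rng.choice(tied)


def add_derived_fields(flight, severe_days, peak_days, rng, delay_threshold=0):
    """Return a copy of one flight record with the derived study fields added."""
    out = dict(flight)
    out["IS_DELAYED"] = int(flight["ARR_DELAY"] > delay_threshold)
    out["SEVERE_WEATHER"] = int(flight["FL_DATE"] in severe_days)
    out["PEAK_TRAVEL"] = int(flight["FL_DATE"] in peak_days)
    # Assumes hhmm integers, e.g. 1435 means 14:35.
    out["CRS_DEP_HOUR"] = flight["CRS_DEP_TIME"] // 100
    out["CRS_ARR_HOUR"] = flight["CRS_ARR_TIME"] // 100
    out["DELAY_TYPE"] = leading_delay_type(flight, out["IS_DELAYED"], rng)
    return out


def train_test_split(records, rng, train_fraction=0.75):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up flight record and severe weather date, for illustration only.
flight = {"FL_DATE": "2014-02-13", "ARR_DELAY": 42,
          "CRS_DEP_TIME": 1435, "CRS_ARR_TIME": 1610,
          "CARRIER_DELAY": 0, "LATE_AIRCRAFT_DELAY": 30, "NAS_DELAY": 12,
          "SECURITY_DELAY": 0, "WEATHER_DELAY": 0}
record = add_derived_fields(flight, severe_days={"2014-02-13"},
                            peak_days=set(), rng=rng)
print(record["IS_DELAYED"], record["DELAY_TYPE"], record["CRS_DEP_HOUR"])

train, test = train_test_split(range(17477), rng)
print(len(train), len(test))
```

Passing the random number generator explicitly keeps both the tie-breaking and the train/test split reproducible from a single seed.
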

Examination of flight delay length revealed that the majority of flight delays are relatively short. 61.9% of delays were less than 30 minutes in length. The following graph shows the breakdown of flight delay reasons.

Learning Algorithm Summary

Predicting Flight Delays

This project used several different machine learning techniques to predict whether or not a flight would be delayed. The accuracy, precision and recall scores for each model are given in the table below.

Method               Accuracy %   Precision   Recall
Simple               53.0         0           0
GLM                  66.1         64.8        61.4
Tree                 65.0         61.2        70.2
Bagging              67.5         66.0        63.6
AdaBoost             67.3         65.6        64.5
Random Forest        65           61.6        68
Naïve Bayes          66.2         64.1        64.3
K-Nearest Neighbor   70           68.6        71.4

The Simple model predicted that no flights would be delayed, resulting in a 53% accuracy rate on the test data set. The simple predictor is used as a reference for comparing more complicated models.

Based on data exploration, the Naïve Bayes model was hypothesized to produce the best model, while GLM was hypothesized to produce the lowest-quality model. The author presumed that a non-linear model between flight delays and the predictor variables would produce the best fit. Overall, the linear and non-linear models performed similarly, with accuracy rates ranging between 65% and 70%. The ROC curve comparison between the GLM model and the Naïve Bayes model actually indicates that the GLM model is a slightly better fit, since its area under the curve is greater by 0.003.

Surprisingly, the k-nearest neighbor model performed slightly better than the other models in terms of accuracy rate and precision and recall scores. The k-nearest neighbor model used the nearest three neighboring points (Euclidean distance) to determine whether or not a flight would be delayed. The accuracy of the model may be explained by clustering of the flight date, departure time and origin attributes.

The adabag R package contains an importance function which graphs the relative importance of each variable in the model. The importance function used the decrease in the Gini index as the metric to determine the impact of a variable on the model. Based on the importance graph, flight date and departure hour are the most important predictors in the model. Removing predictors such as severe weather and peak travel did not impact the model significantly.

Predicting Flight Delay Length

The Random Forest algorithm produced the best-fitting model for predicting the length of a flight delay. Linear regression resulted in a model with an adjusted R-squared of 20%, meaning that the model could explain only 20% of the variability in flight delay length based on the predictors. Similarly, a standard rpart regression tree resulted in a high RSE relative to the mean error, indicating a poorly fit model.

Method            RSE     Mean Error   Adj R-Squared (%)
Linear            38.5    13.5         20
Regression Tree   35.7    13.35        NA
Random Forest     26.81   9.75         NA

Predicting Flight Delay Reason

The Random Forest model produced an accuracy rate of 55% for predicting the cause of a flight delay. K-nearest neighbor was slightly less accurate, with a 54% accuracy rate. A simple model that predicted no flight delays had an accuracy rate of 52%.

Method               Accuracy %
Simple               52
Random Forest        55
K-Nearest Neighbor   54

Conclusion

In this study, flight attributes were used to predict flight delays, flight delay lengths and flight delay causes. Several non-flight attributes such as weather and peak travel data were also used in each model, although removing these predictors did not change the overall models drastically. Flight delay models resulted in 65-70% accuracy rates. Flight delay length and delay type models were far less accurate.

One of the goals of this study was to create a forward-looking flight prediction model. The models that resulted in the highest accuracy rates relied on flight date as an important predictor. Flight date, however, is a backward-looking predictor in that it relies on historical events to predict future events. When flight date was removed from the model, accuracy rates fell by 4-5%.

Extension

The commercial aviation system is complex. Events occurring in one area of the system can cause ripple effects that impact events in other areas. Consequently, a model that relies solely on flight attributes will fail to account for complex interactions across the system. As an extension to this study, future researchers may choose to include several additional pieces of data to account for these factors:

- Weather: This study used local Philadelphia weather to predict flight delay impacts at PHL. Inclusion of weather conditions at the origin, at the destination and along the flight path may better capture the impacts of weather.
- Aircraft age: Aircraft age might be used to determine the likelihood of a mechanical issue as the cause of a delay.
- System and airport congestion: Airport capacity metrics and scheduled flight data may be used to calculate expected system and airport congestion levels as a means of predicting delays.

Project Data Sources

Bureau of Transportation Statistics, On-Time Performance Database, http://www.transtats.bts.gov/dl_selectfields.asp?table_id=236&db_short_name=on-time

Bureau of Transportation Statistics, holiday travel data set, http://www.transtats.bts.gov/holidaydelay.asp

National Oceanic and Atmospheric Administration, Storm Events Database, http://www.ncdc.noaa.gov/stormevents/

References

Flight delays cost $32.9 billion, passengers foot half the bill, University of California, Berkeley, October 2010, http://newscenter.berkeley.edu/2010/10/18/flight_delays/

The Impact of Civil Aviation on the US Economy, Federal Aviation Administration, June 2014, https://www.faa.gov/air_traffic/publications/media/2014-economic-impact-report.pdf