Anomaly Detection in Airline Schedules
Asmaa Fillatre, Data Scientist, Amadeus
AMADEUS PRESENTATION
1. IT company that develops business solutions for the travel and tourism industry
2. Operates globally in the travel and technology market
Travel buyers: consumers/general public, corporate travel departments
Travel providers: 711 airlines (over 420 bookable), 24 insurance companies, 50+ cruise and ferry lines, 207 tour operators, 110,000+ hotel properties, 30 car rental companies, 95 railways
IT SOLUTIONS: including direct distribution technology, with common/overlapping platforms and applications, a common data centre, common customers, and a common sales and marketing infrastructure
DISTRIBUTION BUSINESS: provision of indirect distribution services to travel agencies and travel management companies (business, leisure and online travel agencies, consolidators, single-site agencies, travel search companies), and to airline sales offices and airline websites connected to Amadeus direct-sell technology
1. Airline Schedules
Airline schedules support:
- Flight connections and network analysis
- Market capacity evolution and trends
- Airport hub analysis
- Market competition analysis
Airline schedules data
About 110,000 daily flights; one year = 40,150,000 flights.
- Flight schedules: flight number, departure time, arrival time, aircraft type, departure airport, arrival airport, airline code
- Airlines: airline code, airline country
- Airports: airport code, airport name, airport location (longitude, latitude)
- Aircraft: aircraft code, aircraft capacity
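The schema above can be sketched as plain records; the class and field names here are illustrative only, not Amadeus's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Flight:
    """One scheduled flight (hypothetical field names)."""
    flight_number: str
    airline_code: str
    departure_airport: str
    arrival_airport: str
    departure_time: str
    arrival_time: str
    aircraft_type: str

@dataclass
class Airport:
    code: str        # IATA airport code
    name: str
    longitude: float
    latitude: float

@dataclass
class Aircraft:
    code: str        # IATA aircraft type code
    capacity: int    # seats

f = Flight("UA123", "UA", "SFO", "JFK", "08:00", "16:30", "B738")
```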
Motivations
1. Airline schedules contain many errors.
2. It is important to identify outliers prior to modelling and analysis.
3. Anomalies should be detected automatically.
4. The absence of prior knowledge (no ground truth) must be overcome.
Anomaly examples (1)
- Airlines use wrong IATA airport codes
- Missing airlines, e.g. after a merger between two companies
- Flown distance much higher than the aircraft average
- Elapsed time/distance ratio not appropriate
- Traffic on new routes
- Sports events (Olympic Games, FIFA World Cup, etc.)
- Sudden growth in monthly aircraft capacity for United Airlines
Anomaly examples (2)
4. Unsupervised Anomaly Detection
Goal: process unlabelled data and detect anomalies
Machine learning
- Supervised: labelled data (normal/abnormal), direct feedback, predict outcomes/the future
- Unsupervised: no labels, no feedback, find hidden structure
- Semi-supervised: some labelled data; either supervised learning with additional unlabelled data, or unsupervised learning with additional labelled data
Residuals-based anomaly detection in three steps
Input data → low-rank approximation → residuals generation → anomaly detection
[Figure: example series with their low-rank approximations and residuals, labelled "No Anomaly" and "Anomaly"]
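A minimal sketch of the three steps, using a truncated SVD as the low-rank approximation (later slides use an autoencoder for this role instead); the toy data and function name are illustrative:

```python
import numpy as np

def low_rank_residuals(X, rank):
    """Approximate X by its top-`rank` singular components and return the residuals."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return X - X_approx

# Toy example: rank-1 "normal" data with one injected anomaly.
rng = np.random.default_rng(1)
X = np.outer(rng.normal(size=50), rng.normal(size=20))
X[10, 5] += 5.0
R = low_rank_residuals(X, rank=1)
```

The injected spike survives the low-rank reconstruction, so it dominates the residual matrix R and can be picked up by the thresholding step.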
Residual and Anomaly Detection
- Residual: R_i = Input_i − Reconstruction_i
- Residual normalization: Z_i = (R_i − μ) / σ
- Residual thresholding (three-sigma rule): any data sample outside the interval [μ − 3σ, μ + 3σ] is considered a potential anomaly
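The normalization and three-sigma thresholding above can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def three_sigma_anomalies(residuals):
    """Flag samples whose residual falls outside [mu - 3*sigma, mu + 3*sigma]."""
    mu, sigma = residuals.mean(), residuals.std()
    z = (residuals - mu) / sigma        # normalized residuals Z_i
    return np.abs(z) > 3.0

# Toy example: Gaussian residuals with one large outlier.
rng = np.random.default_rng(2)
r = rng.normal(size=1000)
r[42] = 10.0
flags = three_sigma_anomalies(r)
```

Under the three-sigma rule roughly 0.3% of normal Gaussian residuals are flagged as well, which is why the slides call flagged samples *potential* anomalies to be reviewed.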
6. Deep Learning: Stacked Autoencoder
Goal: learn the internal structure and features of the data itself
Autoencoder
- One hidden layer
- Minimize ‖X − X̂‖ w.r.t. all weights W_e^(l), W_d^(l) and biases b_e^(l), b_d^(l)
- Trained with backpropagation
- A self-supervised technique: learns a meaningful representation of the data in some other dimensionality
- Encoding: h = f(W_e x + b_e); decoding: x̂ = g(W_d h + b_d)
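A one-hidden-layer autoencoder trained with backpropagation can be sketched in plain NumPy; this is an illustrative toy on synthetic data, not the implementation used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples lying near a 2-D subspace of R^5.
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One hidden layer: encoder (We, be) and decoder (Wd, bd).
n_in, n_hidden = X.shape[1], 2
We = 0.1 * rng.normal(size=(n_in, n_hidden)); be = np.zeros(n_hidden)
Wd = 0.1 * rng.normal(size=(n_hidden, n_in)); bd = np.zeros(n_in)

mse0 = np.mean((X - (sigmoid(X @ We + be) @ Wd + bd)) ** 2)  # error before training

lr = 0.1
for _ in range(2000):
    H = sigmoid(X @ We + be)       # encoding h = f(We x + be)
    X_hat = H @ Wd + bd            # decoding with a linear output layer
    err = X_hat - X
    # Backpropagation of the mean squared reconstruction error.
    gWd = H.T @ err / len(X); gbd = err.mean(axis=0)
    dH = (err @ Wd.T) * H * (1.0 - H)
    gWe = X.T @ dH / len(X); gbe = dH.mean(axis=0)
    We -= lr * gWe; be -= lr * gbe
    Wd -= lr * gWd; bd -= lr * gbd

mse = np.mean((X - (sigmoid(X @ We + be) @ Wd + bd)) ** 2)   # error after training
```

Because the target of the regression is the input itself, no labels are needed, which is what makes the technique self-supervised.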
PCA vs Autoencoder
- PCA: input → (W, b) → output, a purely linear mapping
- Autoencoder: input → (W, b) with a non-linear activation → (W, b) → output
- The autoencoder thus introduces non-linearity into the mapping
Regularization
For a deep (stacked) autoencoder, the cost function combines:
- the average sum of squared reconstruction errors,
- a weight-decay term controlled by λ, and
- a sparsity penalty controlled by β, which constrains the average activation ρ̂ of each hidden unit to stay close to a target sparsity ρ.
Stacked Autoencoder training
Training one hidden layer at a time; example with 2 hidden layers.
Hello world of deep learning: anomaly detection on MNIST
[Figure: input images, learned features and output images of the autoencoder, with the digits having the lowest and highest reconstruction errors]
7. Autoencoder-based Anomaly Detection for Airline Schedules
Raw data: multivariate time series
- One series per airline and region-to-region (R2R) pair: normalized number of flights per week
- Preprocessing gives more natural representations of the data, from which the autoencoder can learn patterns
[Figure: some region-to-region time series, 2012 week 10]
Autoencoder for time series anomaly detection
1. Data preparation: time-series preprocessing and data normalization; split into training and testing sets
2. Data transformation: train the autoencoder (layer sizes 175-100-50-100-175) to learn the weights W and biases B, given the hyperparameters β, λ and ρ
3. Outlier detection: reconstruction-error thresholding
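The pipeline can be sketched end to end; here a learned low-rank subspace stands in for the trained 175-100-50-100-175 autoencoder, and all function names and parameter defaults are illustrative:

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max scaling to [0, 1] (the data-preparation step)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def fit_low_rank(X_train, rank):
    # Stand-in for training the autoencoder: learn a rank-k subspace
    # of the training series.
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    return Vt[:rank]

def reconstruction_errors(X, V):
    """Squared reconstruction error per sample after projecting onto the subspace."""
    X_hat = X @ V.T @ V
    return np.sum((X - X_hat) ** 2, axis=1)

def detect_outliers(train, test, rank=2, k=3.0):
    """Flag test samples whose reconstruction error exceeds mu + k*sigma."""
    V = fit_low_rank(train, rank)
    err = reconstruction_errors(test, V)
    return err > err.mean() + k * err.std()

# Toy example: series living on a 2-D subspace, plus one anomalous test row.
rng = np.random.default_rng(3)
B = rng.normal(size=(2, 10))
train = rng.normal(size=(100, 2)) @ B
test = rng.normal(size=(20, 2)) @ B
test[7] = 5.0 * rng.normal(size=10)
flags = detect_outliers(train, test)
```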
8. United Airlines (UA) Schedules Data Processing
Goal: highlight how the autoencoder performs in practice
UA anomaly detection (1): world normalization of input data
Input from UA, 2010 to 2016: world min-max normalization → autoencoder output → residuals per R2R series R2R_i
UA anomaly detection (2): regional normalization of input data
Input from UA, 2010 to 2016: min-max normalization per region → autoencoder output → residuals per R2R series R2R_i
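The difference between the two normalization schemes on these slides can be sketched as follows (the helper names are illustrative):

```python
import numpy as np

def world_min_max(X):
    """Normalize all series with a single global min/max."""
    lo, hi = X.min(), X.max()
    return (X - lo) / (hi - lo)

def regional_min_max(X, regions):
    """Normalize each series with the min/max of its own region only."""
    Xn = np.empty_like(X, dtype=float)
    for r in np.unique(regions):
        mask = regions == r
        lo, hi = X[mask].min(), X[mask].max()
        Xn[mask] = (X[mask] - lo) / (hi - lo if hi > lo else 1.0)
    return Xn

# Two regions with very different traffic volumes.
X = np.array([[0.0, 10.0], [5.0, 10.0], [100.0, 200.0], [150.0, 200.0]])
regions = np.array([0, 0, 1, 1])
Xr = regional_min_max(X, regions)
Xw = world_min_max(X)
```

Under world normalization, low-traffic regions are compressed near zero by the busiest region's scale; regional normalization keeps each region's variation visible to the autoencoder.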
8. Air France (AF)
Goal: highlight how the autoencoder performs in practice
AF anomaly detection (1): world normalization of input data
Input from AF, 2010 to 2016: world min-max normalization → autoencoder output → residuals per R2R series R2R_i
AF anomaly detection (2): regional normalization of input data
Input from AF, 2010 to 2016: min-max normalization per region → autoencoder output → residuals per R2R series R2R_i
Autoencoder pros and cons
Conclusion
Unsupervised machine learning (no ground truth):
- Well adapted to the absence of labels
- Hard to interpret: the review of outliers relies on domain experts
Deep learning reduces the need for manual feature engineering.
Thanks for your attention