Controlling the False Discovery Rate in Bayesian Network Structure Learning


Ioannis Tsamardinos, Asst. Prof., CSD, Univ. of Crete and ICS, FORTH; Laura E. Brown, DBMI, Vanderbilt Univ.; Sofia Triantafyllou, CSD, Univ. of Crete and ICS, FORTH. May 29, 2008.


State of the art: BN learning algorithms exist that scale up to problems with thousands of variables [7], with decent quality of learning.

Theory of learning quality: asymptotic (in sample size) proofs of consistency under the Faithfulness assumption (e.g., PC [6], GES [2]), and rates of convergence under specific distributional assumptions such as normality [4].

Quality Assessment. Empirical evaluation of the networks: learn multiple networks and estimate how reliably a feature (e.g., an edge) appears [3]. This is relatively inefficient and does not scale up, since each sampling requires learning the complete network; this hinders practical applications.

Lack of Statistical Interpretation of the Output. The lack of efficiently computable, quality-estimating methods deters practitioners from using BN learning technology.

Local Structure Learning Problem. N(T) is the set of neighbors of variable T in any network that faithfully captures the data. This is a well-defined problem: N(T) is the same in all networks faithful to the data distribution.

Importance of inducing N(T). Variable selection: N(T) is part of the Markov Blanket of T. Bayesian network structure induction: it can be used to learn the complete skeleton or a Region of Interest. Causal discovery: N(T) includes the direct causes and direct effects of T.

Problem Definition. Produce an estimate N(T, a) of N(T) from finite data with a False Discovery Rate below a given threshold a. FDR = E(V/R), where V is the number of variables in N(T, a) that are not in N(T) and R is the number of variables in N(T, a): the expected (average) proportion of false positives in the output.
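
As a small illustration of the definition, the quantity inside the expectation, V/R, can be computed for a single output; the FDR is its average over repeated datasets. A minimal Python sketch (the function name and the convention V/R = 0 for an empty output are assumptions, not from the slides):

def false_discovery_proportion(returned, true_neighbors):
    # V/R for a single estimate N(T, a); taken to be 0 when the output is empty.
    returned = set(returned)
    if not returned:
        return 0.0
    false_positives = returned - set(true_neighbors)   # V: returned but not in N(T)
    return len(false_positives) / len(returned)        # V / R

# Example: true N(T) = {A, B}, returned {A, B, C, D} -> 2 false positives out of 4 = 0.5
print(false_discovery_proportion({"A", "B", "C", "D"}, {"A", "B"}))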

in BN Learning Ioannis Tsamardinos, Univ. of Crete 10 / 78

Certificate of Exclusion. All variables V = {V_i}. I(V_i; T | Z_k): V_i is independent of T given Z_k. A set Z such that I(V_i; T | Z) holds is called a certificate of exclusion of V_i from N(T). C(V_i, T, S) ≡ ∃ Z ⊆ S \ {V_i, T} s.t. I(V_i; T | Z): true if there is a certificate of exclusion of V_i from T within S. Determining its truth value requires 2^|S| tests of independence in the worst case.

Learning N(T). Theorem 1 [6]: if P(V) is faithful, then V_i ∉ N(T) ⟺ C(V_i, T, V). Theorem 2 [8]: if P(V) is faithful, then V_i ∉ N(T) ⟺ C(V_i, T, E(T)) or C(V_i, T, E(V_i)), where E(T) ⊇ N(T) and E(V_i) ⊇ N(V_i).

Extended Neighbor Sets. E(T) is any subset of V such that: 1. it is a superset of N(T); 2. ∀ V_i ∈ E(T), ¬C(V_i, T, E(T)), i.e., for no variable in E(T) is there a certificate of exclusion within E(T). E(T) contains N(T) plus the variables that cannot be d-separated from T conditioned on any subset of N(T).

Learning E(T)
Algorithm 1: Algorithm-E(T)
1: procedure E(T)
2:   Initialize E(T) ⊆ V \ {T}; R = V \ (E(T) ∪ {T})
3:   repeat (arbitrarily alternating between (4) and (5))
4:     Select and move a variable in R into E(T)
5:     if C(V_i, T, E(T)) for some V_i ∈ E(T) then
6:       Remove V_i from E(T)
7:     end if
8:   until E(T) is stable and R = ∅
9:   return E(T)
10: end procedure
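
A minimal Python sketch of this template, assuming a conditional-independence oracle indep(Vi, T, Z) and an admission heuristic select (both hypothetical stand-ins, not the authors' implementation); the exhaustive subset search makes the worst case exponential, as the complexity slide below notes:

from itertools import combinations

def has_certificate(indep, Vi, T, S):
    # C(Vi, T, S): is there a Z ⊆ S \ {Vi, T} with Vi independent of T given Z?
    # `indep(Vi, T, Z)` is an assumed oracle; worst case 2^|S| calls.
    rest = [v for v in S if v not in (Vi, T)]
    return any(indep(Vi, T, set(Z))
               for size in range(len(rest) + 1)
               for Z in combinations(rest, size))

def algorithm_E(T, variables, indep, select=min):
    # Template of Algorithm-E(T): admit candidates from R into E(T) and repeatedly
    # remove members that have a certificate of exclusion within the current E(T).
    E, R = set(), set(variables) - {T}
    while True:
        if R:                                  # step (4): admit a candidate
            v = select(R)
            R.discard(v)
            E.add(v)
        removed = False
        for Vi in sorted(E):                   # step (5): prune
            if has_certificate(indep, Vi, T, E):
                E.discard(Vi)
                removed = True
        if not R and not removed:              # E(T) is stable and R is empty
            return E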

Learning E(T): correctness. Recall that E(T) is any subset of V such that: 1. it is a superset of N(T); 2. ∀ V_i ∈ E(T), ¬C(V_i, T, E(T)). Algorithm-E(T) returns such an E(T): all variables are considered and will enter E(T) eventually; a variable is removed only if a certificate of exclusion is found, so no variable of N(T) is removed once it enters E(T) (superset of N(T)); and the condition ∀ V_i ∈ E(T), ¬C(V_i, T, E(T)) is directly enforced.

Learning E(T): time complexity. Assuming a near-perfect heuristic that selects the members of N(T) first, Algorithm-E(T) requires on the order of |V| · 2^|E(T)| tests of independence. For sparse networks quite effective heuristics exist.

Learning N(T)
Theorem 2: V_i ∉ N(T) ⟺ C(V_i, T, E(T)) or C(V_i, T, E(V_i))
Algorithm 2: Algorithm-N(T)
1: procedure N(T)
2:   E(T) = Algorithm-E(T)
3:   N(T) = E(T)    (neighbors plus false positives)
4:   for all V_i ∈ E(T) do
5:     E(V_i) = Algorithm-E(V_i)
6:     if T ∉ E(V_i) then
7:       remove V_i from N(T)
8:     end if
9:   end for
10:  return N(T)
11: end procedure
Complexity of N(T): on the order of |V| · |E(T)| · 2^|E(T)| tests, assuming |E(T)| ≈ |E(V_i)|.
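
Building on the algorithm_E sketch above (hypothetical names, not the authors' code), the symmetry correction implied by Theorem 2 can be sketched as:

def algorithm_N(T, variables, indep, select=min):
    # Template of Algorithm-N(T): start from E(T) and keep only those Vi whose own
    # extended set E(Vi) contains T (the symmetry check of Theorem 2).
    E_T = algorithm_E(T, variables, indep, select)
    N_T = set(E_T)                             # neighbors plus false positives
    for Vi in sorted(E_T):
        E_Vi = algorithm_E(Vi, variables, indep, select)
        if T not in E_Vi:                      # a certificate of exclusion was found from Vi's side
            N_T.discard(Vi)
    return N_T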

Instantiations of the General Template. Max-Min Parents and Children (MMPC): start with E(T) = ∅; selection heuristic: the variable in R that maximizes the minimum association with T conditioned on all subsets of the current E(T). HITON: start with E(T) = ∅; selection heuristic: the variable in R that maximizes the pairwise association with T; slightly different strategy for searching for certificates of exclusion.

Asymptotic Correctness. Algorithm-N(T) is correct under faithfulness, provided we can determine from data exactly whether I(X; T | Z) holds.


Practical Implementation. Let T(X; Y | Z) denote testing the respective independence. T(X; Y | Z) is typically implemented using statistical hypothesis tests of conditional independence. Primitive hypothesis H_{i,j,k}: I(V_i; V_j | Z_k). Let p_{i,j,k} be the p-value returned by the test T(V_i; V_j | Z_k) for hypothesis H_{i,j,k}.

Practical Implementation. Threshold on p_{i,j,k} to reject or accept independencies: we assume I(V_i; T | Z_k) iff p_{i,T,k} > t. Define the finite-sample version of C: C(V_i, T, V, t) ≡ ∃ Z_k ⊆ V \ {V_i, T} s.t. p_{i,T,k} > t.

Finite sample: Algorithm-E(T, t)
Algorithm 3: Algorithm-E(T, t)
1: procedure E(T, t)
2:   Initialize E(T) ⊆ V \ {T}; R = V \ (E(T) ∪ {T})
3:   repeat (arbitrarily alternating between (4) and (5))
4:     Select and move a variable in R into E(T)
5:     if C(V_i, T, E(T), t) for some V_i ∈ E(T) then
6:       Remove V_i from E(T)
7:     end if
8:   until E(T) is stable and R = ∅
9:   return E(T)
10: end procedure
Call the output E(T, t) to show the dependence on t.

Finite sample: Algorithm-N(T, t)
Algorithm 4: Algorithm-N(T, t)
1: procedure N(T, t)
2:   E(T) = Algorithm-E(T, t)
3:   N(T) = E(T)    (neighbors plus false positives)
4:   for all V_i ∈ E(T) do
5:     E(V_i) = Algorithm-E(V_i, t)
6:     if T ∉ E(V_i) then
7:       remove V_i from N(T)
8:     end if
9:   end for
10:  return N(T)
11: end procedure
Call the output N(T, t) to show the dependence on t.

Finite Sample Learning Quality. What is the False Discovery Rate of Algorithm-N(T, t)? That is, what is the average proportion of false positives within the returned N(T, t)? How can we control the FDR?

Statistical Error

A Strategy of Controlling Error in Structure Learning
1. Express a BN-learning task as a multiple hypotheses testing problem
2. Approximate or bound the p-values of the hypotheses
3. Use statistical methods for controlling the error

N(T) as a Multiple Hypotheses Testing Problem. Define the complex null hypotheses H_{i,T}: V_i ∉ N(T), or equivalently (Theorem 2) H_{i,T}: C(V_i, T, E(T)) or C(V_i, T, E(V_i)). Rejecting H_{i,T} means we accept V_i as a member of N(T). A False Discovery occurs when H_{i,T} is rejected but V_i ∉ N(T).

A Strategy of Controlling Error in Structure Learning
1. Express a BN-learning task as a multiple hypotheses testing problem
2. Approximate or bound the p-values of the hypotheses
3. Use statistical methods for controlling the error

Bound the p-value of H_{i,T}. The p-value p_{i,T} of H_{i,T} is hard to compute exactly, but it can be bounded from above. Theorem 3 [8]: let p_{i,T} be the p-value of H_{i,T}; then p_{i,T} ≤ max{p_{i,T,k} : Z_k ⊆ E(T) or Z_k ⊆ E(V_i)}. In what follows, p_{i,T} denotes this computable upper bound.

Bound the p-value of H_{i,T}. The p-value of a complex hypothesis H_{i,T} is bounded by the maximum p-value of the primitive hypotheses H_{i,T,k}: I(V_i; T | Z_k).
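
A sketch of this bound in Python, assuming a hypothetical test ci_pvalue(Vi, T, Z) that returns the p-value of T(Vi; T | Z); the maximization ranges over every conditioning set drawn from E(T) or from E(Vi):

from itertools import chain, combinations

def subsets(S, exclude=()):
    # All subsets of S after dropping the excluded variables (worst case 2^|S| of them).
    S = [v for v in S if v not in exclude]
    return chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))

def pvalue_bound(Vi, T, E_T, E_Vi, ci_pvalue):
    # Theorem 3 bound on p_{i,T}: the maximum primitive p-value over Z_k ⊆ E(T) or Z_k ⊆ E(Vi).
    conditioning_sets = {frozenset(Z)
                         for Z in chain(subsets(E_T, exclude=(Vi, T)),
                                        subsets(E_Vi, exclude=(Vi, T)))}
    return max(ci_pvalue(Vi, T, set(Z)) for Z in conditioning_sets)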

A Strategy of Controlling Error in Structure Learning
1. Express a BN-learning task as a multiple hypotheses testing problem
2. Approximate or bound the p-values of the hypotheses
3. Use statistical methods for controlling the error

Control FDR. FDR(d): the FDR we obtain when we reject all hypotheses H_{i,T} such that p_{i,T} ≤ d.

Control FDR. The control-FDR problem: given P = {p_{i,T}} and a desired maximum FDR level a, select as large a threshold d as possible such that FDR(d) ≤ a; call such a procedure F(P, a). The inverse problem: given a threshold d, find as small a guaranteed FDR level a as possible; call such a procedure F^{-1}(P, d). Such procedures exist, e.g., [1].

Reality Creeps In. H_{i,T}: C(V_i, T, E(T)) or C(V_i, T, E(V_i)). How do we obtain E(T) and E(V_i) to produce p_{i,T}? Use Algorithm-E(T, t) to obtain an approximation. If E(T, t) is a superset of E(T), the H_{i,T} are still tested correctly (Theorem 2). We just need to assume that we have enough power to detect dependencies at the t level for all members of N(T).

Assumption About Statistical Power. Assumption 1: the network is t-faithful(t), i.e., if V_i ∈ N(T), then ∀ Z_k ⊆ V \ {T, V_i}, p_{i,T,k} < t. This is a form of (local) Faithfulness assumption for the finite sample. We call t the power threshold.

C-FDR(T, t, a): Control FDR. Control the FDR of estimating N(T) at the a-level, assuming the network is t-faithful(t).
Algorithm 5: C-FDR(T, t, a)
1: procedure C-FDR(T, t, a)
2:   N(T) = Algorithm-N(T, t)
3:   for all V_i ∈ N(T) do
4:     E(V_i) = Algorithm-E(V_i, t)
5:     p_{i,T} = max{p_{i,T,k} : Z_k ⊆ N(T) or Z_k ⊆ E(V_i)}
6:   end for
7:   d = F({p_{i,T}}, a)    (call d the cut-off threshold)
8:   return {V_i : p_{i,T} ≤ d}
9: end procedure
Call the output N(T, t, a) to show the dependence on t and a.
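
A Python sketch of this procedure, reusing pvalue_bound from the earlier sketch; algorithm_N_t, algorithm_E_t, and fdr_threshold are assumed stand-ins for Algorithm-N(T, t), Algorithm-E(V_i, t), and the FDR procedure F:

def c_fdr(T, variables, t, alpha, ci_pvalue,
          algorithm_N_t, algorithm_E_t, fdr_threshold):
    # Sketch of C-FDR(T, t, alpha): estimate N(T, t), bound each candidate's p-value,
    # then keep the variables that survive the cut-off d = F({p_{i,T}}, alpha).
    N_T = algorithm_N_t(T, variables, t)              # Algorithm-N(T, t)
    p_bound = {}
    for Vi in N_T:
        E_Vi = algorithm_E_t(Vi, variables, t)        # Algorithm-E(Vi, t)
        p_bound[Vi] = pvalue_bound(Vi, T, N_T, E_Vi, ci_pvalue)
    d = fdr_threshold(list(p_bound.values()), alpha)  # the procedure F, e.g., [1]
    return {Vi for Vi, p in p_bound.items() if p <= d}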

C-FDR(T, t, a) remarks. We only calculate the p-value bounds p_{i,T} for V_i ∈ N(T, t). We assume the network is t-faithful(t), and so N(T) ⊆ N(T, t); there is no need to consider the other variables.

B-FDR(T, t, d): Bound FDR. Bound the FDR of estimating N(T), assuming the network is t-faithful(t) and rejecting a complex hypothesis when p_{i,T} ≤ d.
Algorithm 6: B-FDR(T, t, d)
1: procedure B-FDR(T, t, d)
2:   N(T) = Algorithm-N(T, t)
3:   for all V_i ∈ N(T) do
4:     E(V_i) = Algorithm-E(V_i, t)
5:     p_{i,T} = max{p_{i,T,k} : Z_k ⊆ N(T) or Z_k ⊆ E(V_i)}
6:   end for
7:   a = F^{-1}({p_{i,T}}, d)
8:   return a
9: end procedure

N(T) FDR Procedures. C-FDR(T, t, a) can be used to learn an estimate of N(T) with a maximum predefined FDR level a. B-FDR(T, t, d) can be used to bound the FDR level of an algorithm that returns N(T) using a cut-off threshold d.

N(T) FDR Procedures. Time complexity: the same as running Algorithm-N(T, t); all required p-values are calculated and can be cached by Algorithm-N(T, t), and the F and F^{-1} procedures are linear. They correctly control/bound the FDR under the assumptions: A1, the network is t-faithful(t); A2, the FDR procedures F and F^{-1} are correct.


V(d): the number of false positives when we reject all hypotheses with a p-value less than d. R(d): the number of positives (rejections) when we reject all hypotheses with a p-value less than d. FDR(d) = E(V(d)/R(d)).

Approximating FDR(d). Approximation 1: E(V(d)/R(d)) ≈ E(V(d))/E(R(d)). Approximation 2: E(V(d)) ≤ d·m, where m = |V \ {T}|. Approximation 3: E(R(d)) ≈ R_o(d), where R_o(d) is the observed number of rejections (i.e., |N(T, t, d)|).

Approximating FDR(d). E(V(d)) ≤ d·m, where m = |V \ {T}|: in the worst case all m hypotheses are truly null, and P(p_{i,T} ≤ d | null is true) = d, so on average in the worst case we reject d·m true nulls.

FDR Approximation. FDR(d) = E(V(d)/R(d)) ≈ m·d / R_o(d) = m·d / #{p_{i,T} ≤ d}, and we require this to be ≤ a. F({p_{i,T}}, a) = p_{k,T}, where p_{k,T} is the largest p-value such that FDR(p_{k,T}) ≤ a. F^{-1}({p_{i,T}}, d) = FDR(p_{k,T}) ≈ FDR(d), where p_{k,T} is the largest p-value such that p_{k,T} ≤ d.
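
Under these approximations, F and F^{-1} reduce to a Benjamini-Hochberg-style step-up rule [1]; a minimal sketch (the slides use m = |V \ {T}|, so pass it explicitly when it differs from the number of candidate p-values):

def fdr_estimate(d, pvalues, m):
    # Approximate FDR(d) ≈ m·d / R_o(d), with R_o(d) = #{p_i ≤ d}.
    rejections = sum(p <= d for p in pvalues)
    return (m * d / rejections) if rejections else 0.0

def F(pvalues, alpha, m=None):
    # F(P, a): the largest p-value d in P with estimated FDR(d) ≤ alpha (0 if none).
    m = len(pvalues) if m is None else m
    feasible = [p for p in sorted(pvalues) if fdr_estimate(p, pvalues, m) <= alpha]
    return feasible[-1] if feasible else 0.0

def F_inv(pvalues, d, m=None):
    # F^{-1}(P, d): the estimated FDR at the largest p-value in P not exceeding d.
    m = len(pvalues) if m is None else m
    below = [p for p in pvalues if p <= d]
    return fdr_estimate(max(below), pvalues, m) if below else 0.0

# Example with four p-value bounds for the candidates in N(T, t):
p = [0.001, 0.004, 0.03, 0.2]
print(F(p, alpha=0.05))   # cut-off threshold d = 0.03
print(F_inv(p, d=0.05))   # FDR bound of about 0.04 when rejecting at d = 0.05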

C-FDR(T, t, a): Control FDR
Algorithm 7: C-FDR(T, t, a)
1: procedure C-FDR(T, t, a)
2:   N(T) = Algorithm-N(T, t)
3:   for all V_i ∈ N(T) do
4:     E(V_i) = Algorithm-E(V_i, t)
5:     p_{i,T} = max{p_{i,T,k} : Z_k ⊆ N(T) or Z_k ⊆ E(V_i)}
6:   end for
7:   d = F({p_{i,T}}, a)
8:   return {V_i : p_{i,T} ≤ d} and FDR(d)
9: end procedure
FDR(d) ≤ a: FDR(d) is the largest FDR smaller than a that could be achieved (called the Estimated Alpha).


Experimental setup
Network     Total Vars   Selected
Alarm       37           37
Child       20           20
Insurance   27           27
Pigs        801          37
Gene        441          37
Total                    158
158 variables × 5 samplings = 790 runs per sample size and threshold setting.

Algorithm Evaluated: B-FDR(T, t, t). Returns a bound on the FDR for a given target and dataset. The bound is provided assuming t-faithfulness(t) and rejecting a hypothesis if p_{i,T} < t. The instantiation of Algorithm-N(T, t) used is Max-Min Parents and Children (MMPC).

Results: Sample Size Effect. [Figure: empirical FDR on MMPC output (y-axis) vs. estimated bound on FDR (x-axis), for sample sizes 1000, 5000, and 10000 with alpha = 0.05, panels (a)-(c); networks: Child, Insurance, Alarm, Pigs, Gene.] Average bound over samplings vs. average true FDR.

Results: Threshold Effect (t). [Figure: empirical FDR on MMPC output vs. estimated bound on FDR at sample size 5000, for alpha = 0.01, 0.05, 0.10, and 0.15, panels (a)-(d); networks: Child, Insurance, Alarm, Pigs, Gene.]

Quantitative Results
SS      Alpha   Num UE   Avg. Error   Avg. Slack
1000    0.05    59       0.018        0.068
5000    0.05    18       0.005        0.034
10000   0.05    22       0.006        0.025
5000    0.01    12       0.006        0.005
5000    0.05    18       0.005        0.034
5000    0.10    20       0.006        0.064
5000    0.15    16       0.004        0.087
Error = max(0, true FDR - FDR bound); Slack = max(0, FDR bound - true FDR); UE: experiments where the bound was underestimated.

Algorithm Evaluated: C-FDR(T, t, a). The instantiation of Algorithm-N(T, t) used is Max-Min Parents and Children; the threshold t is set to 0.15 and a ranges over {10%, 20%, ..., 90%}. We compare the Estimated Alpha returned with the true FDR achieved.

Results: Alpha Effect

Execution Time. Average time to run Algorithm-N(T, t), B-FDR(T, t, d), and C-FDR(T, t, a): 33 seconds. Excluding the Pigs network, the average drops to 12 seconds.


Effect of Small R_o(d). FDR(d) = E(V(d)/R(d)) ≈ m·d / R_o(d) = m·d / #{p_{i,T} ≤ d}. When R_o(d) is small, a small difference in it dramatically changes the FDR estimate, which increases the variance.

Variance vs. R_o(d). [Figure: standard deviation of (bound FDR - empirical FDR), ranging from about 0.02 to 0.22, vs. the number of rejections (1 to 9).]

Dealing with small R_o(d). Either make no claim when R_o(d) is small, or increase R_o(d) by solving a learning problem where R_o(d) is relatively large: learn the whole skeleton of the network (all the edges), with similar theory and similar algorithms, or learn a Region of Interest around T, e.g., the region at most 3 edges away from T.

C-FDR(t, a): Control FDR
Algorithm 8: C-FDR(t, a)
1: procedure C-FDR(t, a)
2:   for all V_i ∈ V do
3:     E(V_i) = Algorithm-E(V_i, t)
4:   end for
5:   for all V_i and V_j ∈ E(V_i) do
6:     p_{i,j} = max{p_{i,j,k} : Z_k ⊆ E(V_i) or Z_k ⊆ E(V_j)}
7:   end for
8:   d = F({p_{i,j} : i < j}, a)    (i < j because of symmetry)
9:   return {edge(V_i, V_j) : p_{i,j} ≤ d}
10: end procedure
The number of tests/hypotheses is n(n-1)/2, i.e., the number of possible edges.
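
A sketch of the skeleton variant, reusing pvalue_bound and F from the earlier sketches; algorithm_E_t is an assumed stand-in for Algorithm-E(V_i, t), and one complex hypothesis is kept per unordered pair because of symmetry:

def skeleton_c_fdr(variables, t, alpha, algorithm_E_t, ci_pvalue):
    # Sketch of C-FDR(t, alpha) for the whole skeleton: FDR control over the
    # n(n-1)/2 possible edges, using the edge p-value bounds p_{i,j}.
    E = {Vi: algorithm_E_t(Vi, variables, t) for Vi in variables}
    p_edge = {}
    for Vi in variables:
        for Vj in E[Vi]:
            key = tuple(sorted((Vi, Vj)))      # one hypothesis per unordered pair
            if key not in p_edge:
                p_edge[key] = pvalue_bound(Vi, Vj, E[Vi], E[Vj], ci_pvalue)
    n = len(variables)
    d = F(list(p_edge.values()), alpha, m=n * (n - 1) // 2)
    return {edge for edge, p in p_edge.items() if p <= d}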

B-FDR(t, d): Bound FDR. Bound the FDR of learning the edges in the network, assuming the network is t-faithful(t) and rejecting a complex hypothesis when p_{i,j} ≤ d.
Algorithm 9: B-FDR(t, d)
1: procedure B-FDR(t, d)
2:   for all V_i ∈ V do
3:     E(V_i) = Algorithm-E(V_i, t)
4:   end for
5:   for all V_i and V_j ∈ E(V_i) do
6:     p_{i,j} = max{p_{i,j,k} : Z_k ⊆ E(V_i) or Z_k ⊆ E(V_j)}
7:   end for
8:   a = F^{-1}({p_{i,j} : i < j}, d)
9:   return a
10: end procedure


The Reality About Tests of Independence. When p_{i,j,k} is large, it may be that the independence I(V_i; V_j | Z_k) indeed holds, or that there is not enough sample to obtain a statistically significant result (low power); the larger Z_k is, the lower the power. If we perform all possible tests, some will return a high p_{i,j,k} because of low power, and we will believe we have found a certificate of exclusion.

The Reality About Tests of Independence. All algorithms based on tests of independence only perform a test when they believe they have enough statistical power. PC/MMPC only perform a test T(V_i; V_j | Z_k) when there are at least 10/5 samples per cell of the contingency tables. It is therefore possible for an important test T(V_i; V_j | Z_k) not to be performed.

Assumptions. Assumption 1: the network is t-faithful(t). Assumption 2: the FDR procedure is correct. Assumption 3: all tests required by the algorithms that contain certificates of exclusion will be performed.

Sample Size 1000 Experiment. [Figure: empirical FDR on MMPC output vs. estimated bound on FDR at sample size 1000, alpha threshold 0.05, original and filtered; networks: Child, Insurance, Alarm, Pigs, Gene.]
            SS      Alpha   Num UE   Avg. Error   Avg. Slack
Original    1000    0.05    59       0.018        0.068
Filtered    1000    0.05    26       0.009        0.068
Experiments where Assumption 3 is violated are filtered out.

Guarantees in Small Sample Sizes. Suppose there is not enough sample to condition on large enough sets. Then we expect some structural false positives within the returned estimate of N(T). What can we do? Provide looser guarantees: change the definition of a False Discovery.

False Discovery Definition. False Discovery: V_i is returned in N(T, t, a) but V_i ∉ N(T). FD(T, k): the set of V_i ∈ V for which there is no Z ⊆ N(T) or Z ⊆ N(V_i) with |Z| ≤ k such that I(V_i; T | Z). When k ≥ max(|N(T)|, |N(V_i)|), FD(T, k) = N(T).

False Discovery Definition. Before: a False Discovery is a V_i returned in N(T, t, a) with V_i ∉ N(T). Now: a False Discovery is a V_i returned in N(T, t, a) with V_i ∉ FD(T, k).

Using FD(T, k). Assume the maximum conditioning-set size is 1, and suppose we can d-separate V_3 from T conditioned on 2 variables but not on just 1 variable. V_3 is a false discovery under our previous definition, but V_3 ∈ FD(T, 1), and so it is not a false discovery under the new definition.

Using FD(T, k). Suppose we determine that, with the given dataset, we can reliably condition on at most k variables. Assuming the network is t-faithful(t) and the FDR procedure is correct, we guarantee that the FDR of N(T, t, a) is at most a, where a False Discovery is defined as returning a V_i ∉ FD(T, k).


Ideas for Extensions. Replace Approximation 2, E(V(d)) ≤ d·m, with E(V(d)) ≈ d·m_0, where m_0 is an estimate of the number of true null hypotheses; one could use Storey's procedures to estimate m_0. Alternatively, estimate E(V(d)) by permutation testing: permute the data of T, run Algorithm-N(T, t), and measure the size of N(T, t), whose members are all false positives by construction; the average size of N(T, t) over all permutations is then used as E(V(d)).
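
A sketch of the permutation idea (names assumed; algorithm_N_t is a stand-in for Algorithm-N(T, t) operating on a dict of column name to numpy array): permuting the target column destroys all associations with T, so every variable returned is a false positive by construction.

import numpy as np

def permutation_false_positive_estimate(data, T, t, algorithm_N_t, n_perm=20, seed=0):
    # Estimate E(V(d)) by permutation: permute the column of T, rerun Algorithm-N(T, t),
    # and average the size of the output over the permutations.
    rng = np.random.default_rng(seed)
    sizes = []
    for _ in range(n_perm):
        permuted = dict(data)
        permuted[T] = rng.permutation(data[T])   # permute only the target column
        sizes.append(len(algorithm_N_t(permuted, T, t)))
    return float(np.mean(sizes))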

Ideas for Extensions. What about false negatives? There are statistical procedures that could be applied. What about other tasks? Orientation of edges based on independence testing; discovery of hidden variables.

Discussion. A step towards a theory of causal discovery and BN learning in the finite-sample regime. The algorithms work well when the assumptions hold and the sample is relatively large. The statistical guarantees produced should be intuitive to practitioners.

Limitations. The p-values are only approximated and the FDR procedures used are conservative, which leads to reduced power for the complex hypotheses (increased false negatives). Further evaluation is needed, especially for untested algorithms and extensions, over a larger range of networks and parameters, with particular attention to the variance of the FDR bounds and to comparison with other related work [5].

References
[1] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289-300, 1995.
[2] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507-554, 2002.
[3] N. Friedman and D. Koller. Being Bayesian about network structure. In UAI, pages 201-210, 2000.
[4] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613-636, 2007.
[5] J. Listgarten and D. Heckerman. Determining the number of non-spurious arcs in a learned DAG model: investigation of a Bayesian and a frequentist approach. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
[6] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
[7] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.
[8] I. Tsamardinos and L. E. Brown. Bounding the false discovery rate in local Bayesian network learning. In AAAI, 2008.