Controlling the False Discovery Rate in Bayesian Network Structure Learning


Ioannis Tsamardinos, Asst. Prof., CSD, Univ. of Crete and ICS, FORTH; Laura E. Brown, DBMI, Vanderbilt Univ.; Sofia Triantafyllou, CSD, Univ. of Crete and ICS, FORTH. May 29, 2008.


State of the art: BN learning algorithms exist that scale up to problems with thousands of variables [7], with decent quality of learning.

Theory of learning quality: asymptotic (in sample size) proofs of consistency under the Faithfulness assumption (e.g., PC [6], GES [2]), and rates of convergence under specific distributional assumptions such as normality [4].

Quality Assessment. Empirical evaluation of the networks: learn multiple networks and estimate how reliably a feature (e.g., an edge) appears [3]. This is relatively inefficient and does not scale up, since each sampling requires learning the complete network; this hinders practical applications.

Lack of Statistical Interpretation of the Output. The lack of efficiently computable, quality-estimating methods deters practitioners from using BN learning technology.

Local Structure Learning Problem. N(T) is the set of neighbors of variable T in any network that faithfully captures the data. This is a well-defined problem: N(T) is the same in all networks faithful to the data distribution.

Importance of inducing N(T). Variable selection: N(T) is part of the Markov Blanket of T. Bayesian network structure induction: it can be used to learn the complete skeleton or a Region of Interest. Causal discovery: N(T) includes the direct causes and direct effects of T.

Problem Definition. Produce an estimate N(T, a) of N(T) from finite data with a False Discovery Rate below a given threshold a. FDR = E(V/R), where V is the number of variables in N(T, a) that are not in N(T) and R is the number of variables in N(T, a): the expected (average) proportion of false positives in the output.
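
As a small illustration of the definition, the quantity inside the expectation, V/R, can be computed for a single output; the FDR is its average over repeated datasets. A minimal Python sketch (the function name and the convention V/R = 0 for an empty output are assumptions, not from the slides):

def false_discovery_proportion(returned, true_neighbors):
    # V/R for a single estimate N(T, a); taken to be 0 when the output is empty.
    returned = set(returned)
    if not returned:
        return 0.0
    false_positives = returned - set(true_neighbors)   # V: returned but not in N(T)
    return len(false_positives) / len(returned)        # V / R

# Example: true N(T) = {A, B}, returned {A, B, C, D} -> 2 false positives out of 4 = 0.5
print(false_discovery_proportion({"A", "B", "C", "D"}, {"A", "B"}))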

in BN Learning Ioannis Tsamardinos, Univ. of Crete 10 / 78

Certificate of Exclusion. All variables V = {V_i}. I(V_i; T | Z_k): V_i is independent of T given Z_k. A set Z such that I(V_i; T | Z) holds is called a certificate of exclusion of V_i from N(T). C(V_i, T, S) ≡ ∃ Z ⊆ S \ {V_i, T} s.t. I(V_i; T | Z): true if there is a certificate of exclusion of V_i from T within S. Determining its truth value requires 2^|S| tests of independence in the worst case.

Learning N(T). Theorem 1 [6]: if P(V) is faithful, then V_i ∉ N(T) ⟺ C(V_i, T, V). Theorem 2 [8]: if P(V) is faithful, then V_i ∉ N(T) ⟺ C(V_i, T, E(T)) or C(V_i, T, E(V_i)), where E(T) ⊇ N(T) and E(V_i) ⊇ N(V_i).

Extended Neighbor Sets. E(T) is any subset of V such that: 1. it is a superset of N(T); 2. ∀ V_i ∈ E(T), ¬C(V_i, T, E(T)), i.e., for no variable in E(T) is there a certificate of exclusion within E(T). E(T) contains N(T) plus the variables that cannot be d-separated from T conditioned on any subset of N(T).

Learning E(T)
Algorithm 1: Algorithm-E(T)
1: procedure E(T)
2:   Initialize E(T) ⊆ V \ {T}; R = V \ (E(T) ∪ {T})
3:   repeat (arbitrarily alternating between (4) and (5))
4:     Select and move a variable in R into E(T)
5:     if C(V_i, T, E(T)) for some V_i ∈ E(T) then
6:       Remove V_i from E(T)
7:     end if
8:   until E(T) is stable and R = ∅
9:   return E(T)
10: end procedure
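
A minimal Python sketch of this template, assuming a conditional-independence oracle indep(Vi, T, Z) and an admission heuristic select (both hypothetical stand-ins, not the authors' implementation); the exhaustive subset search makes the worst case exponential, as the complexity slide below notes:

from itertools import combinations

def has_certificate(indep, Vi, T, S):
    # C(Vi, T, S): is there a Z ⊆ S \ {Vi, T} with Vi independent of T given Z?
    # `indep(Vi, T, Z)` is an assumed oracle; worst case 2^|S| calls.
    rest = [v for v in S if v not in (Vi, T)]
    return any(indep(Vi, T, set(Z))
               for size in range(len(rest) + 1)
               for Z in combinations(rest, size))

def algorithm_E(T, variables, indep, select=min):
    # Template of Algorithm-E(T): admit candidates from R into E(T) and repeatedly
    # remove members that have a certificate of exclusion within the current E(T).
    E, R = set(), set(variables) - {T}
    while True:
        if R:                                  # step (4): admit a candidate
            v = select(R)
            R.discard(v)
            E.add(v)
        removed = False
        for Vi in sorted(E):                   # step (5): prune
            if has_certificate(indep, Vi, T, E):
                E.discard(Vi)
                removed = True
        if not R and not removed:              # E(T) is stable and R is empty
            return E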

Learning E(T): correctness. Recall that E(T) is any subset of V such that: 1. it is a superset of N(T); 2. ∀ V_i ∈ E(T), ¬C(V_i, T, E(T)). Algorithm-E(T) returns such an E(T): all variables are considered and will enter E(T) eventually; a variable is removed only if a certificate of exclusion is found, so no variable of N(T) is removed once it enters E(T) (superset of N(T)); and the condition ∀ V_i ∈ E(T), ¬C(V_i, T, E(T)) is directly enforced.

Learning E(T): time complexity. Assuming a near-perfect heuristic that selects the members of N(T) first, Algorithm-E(T) requires on the order of |V| · 2^|E(T)| tests of independence. For sparse networks quite effective heuristics exist.

Learning N(T)
Theorem 2: V_i ∉ N(T) ⟺ C(V_i, T, E(T)) or C(V_i, T, E(V_i))
Algorithm 2: Algorithm-N(T)
1: procedure N(T)
2:   E(T) = Algorithm-E(T)
3:   N(T) = E(T)    (neighbors plus false positives)
4:   for all V_i ∈ E(T) do
5:     E(V_i) = Algorithm-E(V_i)
6:     if T ∉ E(V_i) then
7:       remove V_i from N(T)
8:     end if
9:   end for
10:  return N(T)
11: end procedure
Complexity of N(T): on the order of |V| · |E(T)| · 2^|E(T)| tests, assuming |E(T)| ≈ |E(V_i)|.
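
Building on the algorithm_E sketch above (hypothetical names, not the authors' code), the symmetry correction implied by Theorem 2 can be sketched as:

def algorithm_N(T, variables, indep, select=min):
    # Template of Algorithm-N(T): start from E(T) and keep only those Vi whose own
    # extended set E(Vi) contains T (the symmetry check of Theorem 2).
    E_T = algorithm_E(T, variables, indep, select)
    N_T = set(E_T)                             # neighbors plus false positives
    for Vi in sorted(E_T):
        E_Vi = algorithm_E(Vi, variables, indep, select)
        if T not in E_Vi:                      # a certificate of exclusion was found from Vi's side
            N_T.discard(Vi)
    return N_T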

Instantiations of the General Template. Max-Min Parents and Children (MMPC): start with E(T) = ∅; selection heuristic: the variable in R that maximizes the minimum association with T conditioned on all subsets of the current E(T). HITON: start with E(T) = ∅; selection heuristic: the variable in R that maximizes the pairwise association with T; slightly different strategy for searching for certificates of exclusion.

Asymptotic Correctness. Algorithm-N(T) is correct under faithfulness, provided we can determine from data exactly whether I(X; T | Z) holds.


Practical Implementation. Let T(X; Y | Z) denote testing the respective independence. T(X; Y | Z) is typically implemented using statistical hypothesis tests of conditional independence. Primitive hypothesis H_{i,j,k}: I(V_i; V_j | Z_k). Let p_{i,j,k} be the p-value returned by the test T(V_i; V_j | Z_k) for hypothesis H_{i,j,k}.

Practical Implementation. Threshold on p_{i,j,k} to reject or accept independencies: we assume I(V_i; T | Z_k) iff p_{i,T,k} > t. Define the finite-sample version of C: C(V_i, T, V, t) ≡ ∃ Z_k ⊆ V \ {V_i, T} s.t. p_{i,T,k} > t.

Finite sample: Algorithm-E(T, t)
Algorithm 3: Algorithm-E(T, t)
1: procedure E(T, t)
2:   Initialize E(T) ⊆ V \ {T}; R = V \ (E(T) ∪ {T})
3:   repeat (arbitrarily alternating between (4) and (5))
4:     Select and move a variable in R into E(T)
5:     if C(V_i, T, E(T), t) for some V_i ∈ E(T) then
6:       Remove V_i from E(T)
7:     end if
8:   until E(T) is stable and R = ∅
9:   return E(T)
10: end procedure
Call the output E(T, t) to show the dependence on t.

Finite sample: Algorithm-N(T, t)
Algorithm 4: Algorithm-N(T, t)
1: procedure N(T, t)
2:   E(T) = Algorithm-E(T, t)
3:   N(T) = E(T)    (neighbors plus false positives)
4:   for all V_i ∈ E(T) do
5:     E(V_i) = Algorithm-E(V_i, t)
6:     if T ∉ E(V_i) then
7:       remove V_i from N(T)
8:     end if
9:   end for
10:  return N(T)
11: end procedure
Call the output N(T, t) to show the dependence on t.

Finite Sample Learning Quality. What is the False Discovery Rate of Algorithm-N(T, t)? That is, what is the average proportion of false positives within the returned N(T, t)? How can we control the FDR?

Statistical Error

A Strategy of Controlling Error in Structure Learning
1. Express a BN-learning task as a multiple hypotheses testing problem
2. Approximate or bound the p-values of the hypotheses
3. Use statistical methods for controlling the error

N(T) as a Multiple Hypotheses Testing Problem. Define the complex null hypotheses H_{i,T}: V_i ∉ N(T), or equivalently (Theorem 2) H_{i,T}: C(V_i, T, E(T)) or C(V_i, T, E(V_i)). Rejecting H_{i,T} means we accept V_i as a member of N(T). A False Discovery occurs when H_{i,T} is rejected but V_i ∉ N(T).

A Strategy of Controlling Error in Structure Learning
1. Express a BN-learning task as a multiple hypotheses testing problem
2. Approximate or bound the p-values of the hypotheses
3. Use statistical methods for controlling the error

Bound the p-value of H_{i,T}. The p-value p_{i,T} of H_{i,T} is hard to compute exactly, but it can be bounded from above. Theorem 3 [8]: let p_{i,T} be the p-value of H_{i,T}; then p_{i,T} ≤ max{p_{i,T,k} : Z_k ⊆ E(T) or Z_k ⊆ E(V_i)}. In what follows, p_{i,T} denotes this computable upper bound.

Bound the p-value of H_{i,T}. The p-value of a complex hypothesis H_{i,T} is bounded by the maximum p-value of the primitive hypotheses H_{i,T,k}: I(V_i; T | Z_k).
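
A sketch of this bound in Python, assuming a hypothetical test ci_pvalue(Vi, T, Z) that returns the p-value of T(Vi; T | Z); the maximization ranges over every conditioning set drawn from E(T) or from E(Vi):

from itertools import chain, combinations

def subsets(S, exclude=()):
    # All subsets of S after dropping the excluded variables (worst case 2^|S| of them).
    S = [v for v in S if v not in exclude]
    return chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))

def pvalue_bound(Vi, T, E_T, E_Vi, ci_pvalue):
    # Theorem 3 bound on p_{i,T}: the maximum primitive p-value over Z_k ⊆ E(T) or Z_k ⊆ E(Vi).
    conditioning_sets = {frozenset(Z)
                         for Z in chain(subsets(E_T, exclude=(Vi, T)),
                                        subsets(E_Vi, exclude=(Vi, T)))}
    return max(ci_pvalue(Vi, T, set(Z)) for Z in conditioning_sets)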

A Strategy of Controlling Error in Structure Learning
1. Express a BN-learning task as a multiple hypotheses testing problem
2. Approximate or bound the p-values of the hypotheses
3. Use statistical methods for controlling the error

Control FDR. FDR(d): the FDR we obtain when we reject all hypotheses H_{i,T} such that p_{i,T} ≤ d.

Control FDR. The control-FDR problem: given P = {p_{i,T}} and a desired maximum FDR level a, select as large a threshold d as possible such that FDR(d) ≤ a; call such a procedure F(P, a). The inverse problem: given a threshold d, find as small a guaranteed FDR level a as possible; call such a procedure F^{-1}(P, d). Such procedures exist, e.g., [1].

Reality Creeps In. H_{i,T}: C(V_i, T, E(T)) or C(V_i, T, E(V_i)). How do we obtain E(T) and E(V_i) to produce p_{i,T}? Use Algorithm-E(T, t) to obtain an approximation. If E(T, t) is a superset of E(T), the H_{i,T} are still tested correctly (Theorem 2). We just need to assume that we have enough power to detect dependencies at the t level for all members of N(T).

Assumption About Statistical Power. Assumption 1: the network is t-faithful(t), i.e., if V_i ∈ N(T), then ∀ Z_k ⊆ V \ {T, V_i}, p_{i,T,k} < t. This is a form of (local) Faithfulness assumption for the finite sample. We call t the power threshold.

C-FDR(T, t, a): Control FDR. Control the FDR of estimating N(T) at the a-level, assuming the network is t-faithful(t).
Algorithm 5: C-FDR(T, t, a)
1: procedure C-FDR(T, t, a)
2:   N(T) = Algorithm-N(T, t)
3:   for all V_i ∈ N(T) do
4:     E(V_i) = Algorithm-E(V_i, t)
5:     p_{i,T} = max{p_{i,T,k} : Z_k ⊆ N(T) or Z_k ⊆ E(V_i)}
6:   end for
7:   d = F({p_{i,T}}, a)    (call d the cut-off threshold)
8:   return {V_i : p_{i,T} ≤ d}
9: end procedure
Call the output N(T, t, a) to show the dependence on t and a.
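
A Python sketch of this procedure, reusing pvalue_bound from the earlier sketch; algorithm_N_t, algorithm_E_t, and fdr_threshold are assumed stand-ins for Algorithm-N(T, t), Algorithm-E(V_i, t), and the FDR procedure F:

def c_fdr(T, variables, t, alpha, ci_pvalue,
          algorithm_N_t, algorithm_E_t, fdr_threshold):
    # Sketch of C-FDR(T, t, alpha): estimate N(T, t), bound each candidate's p-value,
    # then keep the variables that survive the cut-off d = F({p_{i,T}}, alpha).
    N_T = algorithm_N_t(T, variables, t)              # Algorithm-N(T, t)
    p_bound = {}
    for Vi in N_T:
        E_Vi = algorithm_E_t(Vi, variables, t)        # Algorithm-E(Vi, t)
        p_bound[Vi] = pvalue_bound(Vi, T, N_T, E_Vi, ci_pvalue)
    d = fdr_threshold(list(p_bound.values()), alpha)  # the procedure F, e.g., [1]
    return {Vi for Vi, p in p_bound.items() if p <= d}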

C-FDR(T, t, a) remarks. We only calculate the p-value bounds p_{i,T} for V_i ∈ N(T, t). We assume the network is t-faithful(t), and so N(T) ⊆ N(T, t); there is no need to consider the other variables.

B-FDR(T, t, d): Bound FDR. Bound the FDR of estimating N(T), assuming the network is t-faithful(t) and rejecting a complex hypothesis when p_{i,T} ≤ d.
Algorithm 6: B-FDR(T, t, d)
1: procedure B-FDR(T, t, d)
2:   N(T) = Algorithm-N(T, t)
3:   for all V_i ∈ N(T) do
4:     E(V_i) = Algorithm-E(V_i, t)
5:     p_{i,T} = max{p_{i,T,k} : Z_k ⊆ N(T) or Z_k ⊆ E(V_i)}
6:   end for
7:   a = F^{-1}({p_{i,T}}, d)
8:   return a
9: end procedure

N(T) FDR Procedures. C-FDR(T, t, a) can be used to learn an estimate of N(T) with a maximum predefined FDR level a. B-FDR(T, t, d) can be used to bound the FDR level of an algorithm that returns N(T) using a cut-off threshold d.

N(T) FDR Procedures. Time complexity: the same as running Algorithm-N(T, t); all required p-values are calculated and can be cached by Algorithm-N(T, t), and the F and F^{-1} procedures are linear. They correctly control/bound the FDR under the assumptions: A1, the network is t-faithful(t); A2, the FDR procedures F and F^{-1} are correct.


V(d): the number of false positives when we reject all hypotheses with a p-value less than d. R(d): the number of positives (rejections) when we reject all hypotheses with a p-value less than d. FDR(d) = E(V(d)/R(d)).

Approximating FDR(d). Approximation 1: E(V(d)/R(d)) ≈ E(V(d))/E(R(d)). Approximation 2: E(V(d)) ≤ d·m, where m = |V \ {T}|. Approximation 3: E(R(d)) ≈ R_o(d), where R_o(d) is the observed number of rejections (i.e., |N(T, t, d)|).

Approximating FDR(d). E(V(d)) ≤ d·m, where m = |V \ {T}|: in the worst case all m hypotheses are truly null, and P(p_{i,T} ≤ d | null is true) = d, so on average in the worst case we reject d·m true nulls.

FDR Approximation. FDR(d) = E(V(d)/R(d)) ≈ m·d / R_o(d) = m·d / #{p_{i,T} ≤ d}, and we require this to be ≤ a. F({p_{i,T}}, a) = p_{k,T}, where p_{k,T} is the largest p-value such that FDR(p_{k,T}) ≤ a. F^{-1}({p_{i,T}}, d) = FDR(p_{k,T}) ≈ FDR(d), where p_{k,T} is the largest p-value such that p_{k,T} ≤ d.
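
Under these approximations, F and F^{-1} reduce to a Benjamini-Hochberg-style step-up rule [1]; a minimal sketch (the slides use m = |V \ {T}|, so pass it explicitly when it differs from the number of candidate p-values):

def fdr_estimate(d, pvalues, m):
    # Approximate FDR(d) ≈ m·d / R_o(d), with R_o(d) = #{p_i ≤ d}.
    rejections = sum(p <= d for p in pvalues)
    return (m * d / rejections) if rejections else 0.0

def F(pvalues, alpha, m=None):
    # F(P, a): the largest p-value d in P with estimated FDR(d) ≤ alpha (0 if none).
    m = len(pvalues) if m is None else m
    feasible = [p for p in sorted(pvalues) if fdr_estimate(p, pvalues, m) <= alpha]
    return feasible[-1] if feasible else 0.0

def F_inv(pvalues, d, m=None):
    # F^{-1}(P, d): the estimated FDR at the largest p-value in P not exceeding d.
    m = len(pvalues) if m is None else m
    below = [p for p in pvalues if p <= d]
    return fdr_estimate(max(below), pvalues, m) if below else 0.0

# Example with four p-value bounds for the candidates in N(T, t):
p = [0.001, 0.004, 0.03, 0.2]
print(F(p, alpha=0.05))   # cut-off threshold d = 0.03
print(F_inv(p, d=0.05))   # FDR bound of about 0.04 when rejecting at d = 0.05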

C-FDR(T, t, a): Control FDR
Algorithm 7: C-FDR(T, t, a)
1: procedure C-FDR(T, t, a)
2:   N(T) = Algorithm-N(T, t)
3:   for all V_i ∈ N(T) do
4:     E(V_i) = Algorithm-E(V_i, t)
5:     p_{i,T} = max{p_{i,T,k} : Z_k ⊆ N(T) or Z_k ⊆ E(V_i)}
6:   end for
7:   d = F({p_{i,T}}, a)
8:   return {V_i : p_{i,T} ≤ d} and FDR(d)
9: end procedure
FDR(d) ≤ a: FDR(d) is the largest FDR smaller than a that could be achieved (called the Estimated Alpha).


Experimental setup
Network     Total Vars   Selected
Alarm       37           37
Child       20           20
Insurance   27           27
Pigs        801          37
Gene        441          37
Total                    158
158 variables × 5 samplings = 790 runs per sample size and threshold setting.

Algorithm Evaluated: B-FDR(T, t, t). Returns a bound on the FDR for a given target and dataset. The bound is provided assuming t-faithfulness(t) and rejecting a hypothesis if p_{i,T} < t. The instantiation of Algorithm-N(T, t) used is Max-Min Parents and Children (MMPC).

Results: Sample Size Effect. [Figure: empirical FDR on MMPC output (y-axis) vs. estimated bound on FDR (x-axis), for sample sizes 1000, 5000, and 10000 with alpha = 0.05, panels (a)-(c); networks: Child, Insurance, Alarm, Pigs, Gene.] Average bound over samplings vs. average true FDR.

Results: Threshold Effect (t). [Figure: empirical FDR on MMPC output vs. estimated bound on FDR at sample size 5000, for alpha = 0.01, 0.05, 0.10, and 0.15, panels (a)-(d); networks: Child, Insurance, Alarm, Pigs, Gene.]

Quantitative Results
SS      Alpha   Num UE   Avg. Error   Avg. Slack
1000    0.05    59       0.018        0.068
5000    0.05    18       0.005        0.034
10000   0.05    22       0.006        0.025
5000    0.01    12       0.006        0.005
5000    0.05    18       0.005        0.034
5000    0.10    20       0.006        0.064
5000    0.15    16       0.004        0.087
Error = max(0, true FDR - FDR bound); Slack = max(0, FDR bound - true FDR); UE: experiments where the bound was underestimated.

Algorithm Evaluated: C-FDR(T, t, a). The instantiation of Algorithm-N(T, t) used is Max-Min Parents and Children; the threshold t is set to 0.15 and a ranges over {10%, 20%, ..., 90%}. We compare the Estimated Alpha returned with the true FDR achieved.

Results: Alpha Effect

Execution Time. Average time to run Algorithm-N(T, t), B-FDR(T, t, d), and C-FDR(T, t, a): 33 seconds. Excluding the Pigs network, the average drops to 12 seconds.


Effect of Small R_o(d). FDR(d) = E(V(d)/R(d)) ≈ m·d / R_o(d) = m·d / #{p_{i,T} ≤ d}. When R_o(d) is small, a small difference in it dramatically changes the FDR estimate, which increases the variance.

Variance vs. R_o(d). [Figure: standard deviation of (bound FDR - empirical FDR), ranging from about 0.02 to 0.22, vs. the number of rejections (1 to 9).]

Dealing with small R_o(d). Either make no claim when R_o(d) is small, or increase R_o(d) by solving a learning problem where R_o(d) is relatively large: learn the whole skeleton of the network (all the edges), with similar theory and similar algorithms, or learn a Region of Interest around T, e.g., the region at most 3 edges away from T.

C-FDR(t, a): Control FDR
Algorithm 8: C-FDR(t, a)
1: procedure C-FDR(t, a)
2:   for all V_i ∈ V do
3:     E(V_i) = Algorithm-E(V_i, t)
4:   end for
5:   for all V_i and V_j ∈ E(V_i) do
6:     p_{i,j} = max{p_{i,j,k} : Z_k ⊆ E(V_i) or Z_k ⊆ E(V_j)}
7:   end for
8:   d = F({p_{i,j} : i < j}, a)    (i < j because of symmetry)
9:   return {edge(V_i, V_j) : p_{i,j} ≤ d}
10: end procedure
The number of tests/hypotheses is n(n-1)/2, i.e., the number of possible edges.
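
A sketch of the skeleton variant, reusing pvalue_bound and F from the earlier sketches; algorithm_E_t is an assumed stand-in for Algorithm-E(V_i, t), and one complex hypothesis is kept per unordered pair because of symmetry:

def skeleton_c_fdr(variables, t, alpha, algorithm_E_t, ci_pvalue):
    # Sketch of C-FDR(t, alpha) for the whole skeleton: FDR control over the
    # n(n-1)/2 possible edges, using the edge p-value bounds p_{i,j}.
    E = {Vi: algorithm_E_t(Vi, variables, t) for Vi in variables}
    p_edge = {}
    for Vi in variables:
        for Vj in E[Vi]:
            key = tuple(sorted((Vi, Vj)))      # one hypothesis per unordered pair
            if key not in p_edge:
                p_edge[key] = pvalue_bound(Vi, Vj, E[Vi], E[Vj], ci_pvalue)
    n = len(variables)
    d = F(list(p_edge.values()), alpha, m=n * (n - 1) // 2)
    return {edge for edge, p in p_edge.items() if p <= d}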

B-FDR(t, d): Bound FDR. Bound the FDR of learning the edges in the network, assuming the network is t-faithful(t) and rejecting a complex hypothesis when p_{i,j} ≤ d.
Algorithm 9: B-FDR(t, d)
1: procedure B-FDR(t, d)
2:   for all V_i ∈ V do
3:     E(V_i) = Algorithm-E(V_i, t)
4:   end for
5:   for all V_i and V_j ∈ E(V_i) do
6:     p_{i,j} = max{p_{i,j,k} : Z_k ⊆ E(V_i) or Z_k ⊆ E(V_j)}
7:   end for
8:   a = F^{-1}({p_{i,j} : i < j}, d)
9:   return a
10: end procedure


The Reality About Tests of Independence. When p_{i,j,k} is large, it may be that the independence I(V_i; V_j | Z_k) indeed holds, or that there is not enough sample to obtain a statistically significant result (low power); the larger Z_k is, the lower the power. If we perform all possible tests, some will return a high p_{i,j,k} because of low power, and we will believe we have found a certificate of exclusion.

The Reality About Tests of Independence. All algorithms based on tests of independence only perform a test when they believe they have enough statistical power. PC/MMPC only perform a test T(V_i; V_j | Z_k) when there are at least 10/5 samples per cell of the contingency tables. It is therefore possible for an important test T(V_i; V_j | Z_k) not to be performed.

Assumptions. Assumption 1: the network is t-faithful(t). Assumption 2: the FDR procedure is correct. Assumption 3: all tests required by the algorithms that contain certificates of exclusion will be performed.

Sample Size 1000 Experiment. [Figure: empirical FDR on MMPC output vs. estimated bound on FDR at sample size 1000, alpha threshold 0.05, original and filtered; networks: Child, Insurance, Alarm, Pigs, Gene.]
            SS      Alpha   Num UE   Avg. Error   Avg. Slack
Original    1000    0.05    59       0.018        0.068
Filtered    1000    0.05    26       0.009        0.068
Experiments where Assumption 3 is violated are filtered out.

Guarantees in Small Sample Sizes. Suppose there is not enough sample to condition on large enough sets. Then we expect some structural false positives within the returned estimate of N(T). What can we do? Provide looser guarantees: change the definition of a False Discovery.

False Discovery Definition. False Discovery: V_i is returned in N(T, t, a) but V_i ∉ N(T). FD(T, k): the set of V_i ∈ V for which there is no Z ⊆ N(T) or Z ⊆ N(V_i) with |Z| ≤ k such that I(V_i; T | Z). When k ≥ max(|N(T)|, |N(V_i)|), FD(T, k) = N(T).

False Discovery Definition. Before: a False Discovery is a V_i returned in N(T, t, a) with V_i ∉ N(T). Now: a False Discovery is a V_i returned in N(T, t, a) with V_i ∉ FD(T, k).

Using FD(T, k). Assume the maximum conditioning-set size is 1, and suppose we can d-separate V_3 from T conditioned on 2 variables but not on just 1 variable. V_3 is a false discovery under our previous definition, but V_3 ∈ FD(T, 1), and so it is not a false discovery under the new definition.

Using FD(T, k). Suppose we determine that, with the given dataset, we can reliably condition on at most k variables. Assuming the network is t-faithful(t) and the FDR procedure is correct, we guarantee that the FDR of N(T, t, a) is at most a, where a False Discovery is defined as returning a V_i ∉ FD(T, k).


Ideas for Extensions. Replace Approximation 2, E(V(d)) ≤ d·m, with E(V(d)) ≈ d·m_0, where m_0 is an estimate of the number of true null hypotheses; one could use Storey's procedures to estimate m_0. Alternatively, estimate E(V(d)) by permutation testing: permute the data of T, run Algorithm-N(T, t), and measure the size of N(T, t), whose members are all false positives by construction; the average size of N(T, t) over all permutations is then used as E(V(d)).
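
A sketch of the permutation idea (names assumed; algorithm_N_t is a stand-in for Algorithm-N(T, t) operating on a dict of column name to numpy array): permuting the target column destroys all associations with T, so every variable returned is a false positive by construction.

import numpy as np

def permutation_false_positive_estimate(data, T, t, algorithm_N_t, n_perm=20, seed=0):
    # Estimate E(V(d)) by permutation: permute the column of T, rerun Algorithm-N(T, t),
    # and average the size of the output over the permutations.
    rng = np.random.default_rng(seed)
    sizes = []
    for _ in range(n_perm):
        permuted = dict(data)
        permuted[T] = rng.permutation(data[T])   # permute only the target column
        sizes.append(len(algorithm_N_t(permuted, T, t)))
    return float(np.mean(sizes))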

Ideas for Extensions. What about false negatives? There are statistical procedures that could be applied. What about other tasks? Orientation of edges based on independence testing; discovery of hidden variables.

Discussion. A step towards a theory of causal discovery and BN learning in the finite-sample regime. The algorithms work well when the assumptions hold and the sample is relatively large. The statistical guarantees produced should be intuitive to practitioners.

Limitations. The p-values are only approximated and the FDR procedures used are conservative, which leads to reduced power for the complex hypotheses (increased false negatives). Further evaluation is needed, especially for untested algorithms and extensions, over a larger range of networks and parameters, with particular attention to the variance of the FDR bounds and to comparison with other related work [5].

References
[1] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289-300, 1995.
[2] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507-554, 2002.
[3] N. Friedman and D. Koller. Being Bayesian about network structure. In UAI, pages 201-210, 2000.
[4] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613-636, 2007.
[5] J. Listgarten and D. Heckerman. Determining the number of non-spurious arcs in a learned DAG model: investigation of a Bayesian and a frequentist approach. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
[6] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
[7] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.
[8] I. Tsamardinos and L. E. Brown. Bounding the false discovery rate in local Bayesian network learning. In AAAI, 2008.