Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains


Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains
Maricel Kann, Assistant Professor, University of Maryland, Baltimore County
mkann@umbc.edu
Maricel Kann. Feb-08

The Human Genome Project
[Slide image: a page filled with raw DNA sequence, beginning GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATC]

[Cartoon: Sidney Harris]

Protein Classification
QUERY: A L I G N M E N T
Alignment Algorithm, Scoring Function, Accurate Statistics
Set of related sequences or protein family from database
[Figure: the query aligned against database sequences such as A L I G - N M E N T and A L I G G N M E N -]
Column scores: 4 3 4 7 1 2 -2 0 0, score=19
Scoring functions: PAM: Dayhoff et al. (1978); BLOSUM: Henikoff & Henikoff (1992); OPTIMA: Kann et al. (2000).

Significance of a score
E = estimated number of non-related sequences in the database that score higher than the query
D = size of database
$E = p(S_Q < S_R) \cdot D$

[Figure: number of alignments with score S, showing the distribution of random scores S_R and the alignment score S_Q of the query]
$p(S_Q < S_R) = 1 - \exp[-K M N e^{-\lambda S_Q}]$
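The two formulas on these slides can be evaluated with a few lines of Python. This is a minimal sketch, not the author's code: the function names `p_value` and `e_value` are hypothetical, and the values of K and λ must come from a fit for the scoring system in use (none are given on the slides).

```python
import math

def p_value(score, K, lam, m, n):
    """p = 1 - exp(-K*m*n*exp(-lam*score)) for search-space dimensions m and n."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-expected_hits)

def e_value(score, K, lam, m, n, db_size):
    """E = p * D, where D is the size of the database."""
    return p_value(score, K, lam, m, n) * db_size
```

Because the p-value decays roughly exponentially with the score, a modest gain in score can move a hit from insignificant to highly significant.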

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
  Introduction
    Definition of protein domain
    Main features of the Conserved Domain Database (CDD)
    Position-specific scoring matrices (PSSM)
    Classification of alignment methods
  Current methods for protein domain searches
  Our approach (Global Blocks Aligned Locally)
  Results

The term protein domain (or domain) refers to a region of the protein with compact structure, usually with a hydrophobic core.

Conserved Domains
In 1974 Michael Rossmann recognized the NADH-binding domain in several dehydrogenases (the fold is named after him).
Conserved domains are determined by comparative sequence analysis.
Molecular evolution uses such domains as building blocks; they may be recombined in different arrangements to make proteins with different functions.
Many proteins contain multiple domains (65% in eukaryotes, 40% in prokaryotes), giving rise to a variety of combinations of domains.

CDD: a collection of domain multiple alignments linked to protein 3D structure

[Figure label: heme-binding site]
CDD combines information about protein sequences, their conservation patterns across evolution, and protein structure, and provides useful functional annotation.
Marchler-Bauer et al (2003) NAR 31:383-387

Protein Classification
QUERY
Alignment Algorithm, Scoring Function, Accurate Statistics
Set of related sequences or protein family from database: a PSSM can be derived from the MSA.
A PSSM, or Position-Specific Scoring Matrix (or profile), is a type of scoring matrix in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment (MSA).
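As an illustration of how a PSSM can be derived from an MSA, the sketch below counts residues per column and converts them to log-odds scores. It is a simplified sketch under stated assumptions: a uniform background distribution and a flat pseudocount, both chosen here for brevity. The names `build_pssm` and `score_sequence` are hypothetical, and production tools use observed background frequencies, sequence weighting, and more elaborate pseudocounts.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0, background=None):
    """Return pssm[col][residue] = log2-odds score for each MSA column."""
    if background is None:
        # Assumption: uniform background frequencies
        background = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")
    pssm = []
    for col in range(length):
        # Gap characters and unknown residues are simply not counted
        counts = Counter(row[col] for row in msa if row[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        column = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / bg)
        pssm.append(column)
    return pssm

def score_sequence(pssm, seq):
    """Ungapped score of a sequence laid directly over the PSSM columns."""
    return sum(col[aa] for col, aa in zip(pssm, seq))
```

With the toy MSA `["ALIG", "ALIG", "AVIG"]`, column 2 scores L above V and V above every unobserved residue, which is the position-specific behaviour the slide describes.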

MSA contains conserved blocks

Protein Sequence Conservation Occurs in Blocks with Intervening Gaps
[Figure: protein structure alignment of a red sequence and a blue sequence, with α-helix, β-strand, and loops labeled]
Subsequences corresponding to secondary structure elements (SSEs: α-helices and β-strands) are more conserved than the intervening loops.

CDD representation
[Figure: CDD footprint; labels: 1, 2, gap, gap]

Sequence-PSSM alignment
[Figure: the sequence A L I G N M E N T aligned to a PSSM]

Sequence-PSSM alignment
[Figure: query aligned to a PSSM made of blocks; labels: gaps in PSSM, gaps in query, block, PSSM]

Three Types of Sequence Alignments
  Local Alignment: Subsequence To Subsequence
  Semi-Global Alignment: Subsequence Onto Sequence
  Global Alignment: Sequence To Sequence
BW Erickson & P Sellers (1983) Time Warps, String Edits, and Macromolecules, p. 55
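The semi-global case can be sketched as a small dynamic program in which the whole domain must be aligned but leading and trailing query residues are free. This is an illustrative sketch only: `semi_global_score` is a hypothetical name, and the match, mismatch, and linear gap values are assumptions, not parameters from the talk.

```python
def semi_global_score(domain, query, match=2, mismatch=-1, gap=-2):
    """Best score aligning all of `domain` onto some subsequence of `query`."""
    rows, cols = len(domain), len(query)
    # Row 0 is all zeros: skipping a prefix of the query costs nothing
    prev = [0] * (cols + 1)
    for i in range(1, rows + 1):
        # Domain residues placed before the query starts must pay for a gap
        curr = [prev[0] + gap]
        for j in range(1, cols + 1):
            sub = match if domain[i - 1] == query[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align the two residues
                            prev[j] + gap,       # domain residue against a gap
                            curr[j - 1] + gap))  # query residue against a gap
        prev = curr
    # Taking the maximum over the last row makes the query suffix free as well
    return max(prev)
```

The asymmetry is the point: `semi_global_score("LIGN", "ALIGNMENT")` scores the four matches with no penalty for the flanking query residues, while swapping the arguments forces the five unmatched residues of the longer sequence to pay gap penalties.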

Semi-global Alignment
Finding a complete domain in the query calls for semi-global alignment, the natural choice in the context of protein structure, function, and evolution.
[Figure: domain aligned onto a query sequence]

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
  Introduction
  Current methods for protein domain searches
    RPS-BLAST
    HMMer
    SALTO
  GLOBAL (Global Blocks Aligned Locally)
    Derivation of Statistics
  Results

Reverse Position-Specific BLAST (RPS-BLAST)
[Figure: query aligned to a PSSM containing three blocks]
The role of the PSSM has changed from being the query in PSI-BLAST to being the subject, hence the term "reverse" in RPS-BLAST (Reverse Position-Specific BLAST).
rpsblast doesn't incorporate the concept of a block.
A Schaffer et al (1999) Bioinformatics 15:1000

HMM
HMMer is trained on the CDD sequences.
HMMer does not specifically incorporate the concept of a block.

HMMer's Statistics are a (Poor) Empirical Fit
HMMer fits the extreme value distribution (EVD) parameters λ and K to simulated sequences with a Gaussian length distribution.
The Gumbel E-value approximation of semi-global HMMer is sometimes very inaccurate.
ftp://ftp.genetics.wustl.edu.pub/eddy/hmmer_current/userguide.pdf

SALTO: Structure-based ALignment TOol
[Figure: SALTO and G-SALTO alignments, with gaps between blocks]
Kann MG et al. Bioinformatics, 21(8):1451-6. (2005)

Properties of an Ideal Alignment Method
Semi-global: a semi-global alignment method is intrinsically the right tool for searching for domains within proteins. Local alignment methods, such as Reverse Position-Specific BLAST (rpsblast), match only a portion of a domain against a query.
Fast: screening a database for matches needs to be fast. HMMs have no intrinsic heuristics to speed computation. The word heuristics in rpsblast speed screening and are available for any local alignment method.
Accurate statistics.

GLOBAL (Global Blocks Aligned Locally)
A Semi-Global Alignment Method for Querying a Database of Protein Domains with Accurate Statistics
M G. Kann et al (2007) NAR, 35(14):4678-4685.

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
  Introduction
  Current methods for protein domain searches
  Method (Global Blocks Aligned Locally)
    Algorithm and scoring scheme
    Derivation of Statistics
  Results

GLOBAL: aligns blocks locally
[Figure: G-SALTO and GLOBAL alignments compared, with gaps between blocks]

GLOBAL
[Figure: QUERY aligned to PSSM]
GLOBAL algorithm: uses dynamic programming (DP) to find the alignment of a protein query sequence to all blocks of the PSSM (in order).
Penalty=0 both for unaligned regions of the PSSM at the ends of the blocks and for unaligned regions of the query between blocks.
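The dynamic program described on this slide can be illustrated with a simplified sketch. It is not the published GLOBAL implementation: it assumes each block is aligned as a single ungapped segment of at least one column, and the helpers `block_from_consensus` and `global_blocks_score` (with their toy match and mismatch scores) are hypothetical names introduced here. What it does reproduce is the stated scoring rule: blocks are placed in order, unaligned block ends are free, and query residues between blocks are free.

```python
NEG_INF = float("-inf")
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def block_from_consensus(consensus, match=3, mismatch=-2):
    """Toy PSSM block: one score dict per column (hypothetical scores)."""
    return [{aa: (match if aa == c else mismatch) for aa in ALPHABET}
            for c in consensus]

def global_blocks_score(blocks, query):
    """Best total score placing every block, in order, on the query."""
    n = len(query)
    # best_prev[j]: best total of the blocks placed so far within query[:j]
    best_prev = [0.0] * (n + 1)
    for block in blocks:
        diag = [NEG_INF] * (n + 1)      # segment scores for the previous column
        end_here = [NEG_INF] * (n + 1)  # best segment of this block ending at j
        for column in block:
            new_diag = [NEG_INF] * (n + 1)
            for j in range(1, n + 1):
                s = column[query[j - 1]]
                # Extend the current segment, or start it here at no cost
                # after whatever the earlier blocks achieved in query[:j-1]
                new_diag[j] = s + max(diag[j - 1], best_prev[j - 1])
                end_here[j] = max(end_here[j], new_diag[j])
            diag = new_diag
        # Prefix maximum: query residues after a block are free
        best = [NEG_INF] * (n + 1)
        for j in range(1, n + 1):
            best[j] = max(best[j - 1], end_here[j])
        best_prev = best
    return best_prev[n]
```

For two blocks with consensus `ALIG` and `MENT`, the query `KKALIGPPPPMENTKK` reaches the maximum of 24, while `MENTPPPPALIG` scores only 7 because the blocks must appear in PSSM order.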

GLOBAL: statistics for b blocks
[Figure: QUERY aligned to PSSM]
For b blocks, the total alignment score T is:
$T = \sum_{i=1}^{b} \hat{M}_i(L_{\mathrm{eff}})$
  L_eff = effective length
  n = size of query
  b = number of blocks in the PSSM
$L_{\mathrm{eff}} = \left( \frac{(n+b-1)!}{(n-1)!\, b!} \right)^{1/b}$
e.g., n=160, b=3, L_eff=89
Assuming the scores of the blocks are independent of each other, GLOBAL estimates the total alignment p-value by a convolution algorithm.
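The effective length and the convolution step can be checked numerically. The sketch below is an illustration under assumptions, not the published code: `effective_length`, `convolve`, and `total_p_value` are hypothetical names, and the per-block score distributions are supplied as plain dictionaries rather than derived from local alignment statistics.

```python
from math import comb

def effective_length(n, b):
    """L_eff = ((n+b-1)! / ((n-1)! b!)) ** (1/b), i.e. C(n+b-1, b) ** (1/b)."""
    return comb(n + b - 1, b) ** (1.0 / b)

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q (score -> prob)."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

def total_p_value(block_dists, total_score):
    """P(T >= total_score) when T is the sum of independent block scores."""
    dist = {0: 1.0}
    for d in block_dists:
        dist = convolve(dist, d)
    return sum(p for s, p in dist.items() if s >= total_score)
```

`effective_length(160, 3)` evaluates to about 88.6, which rounds to the L_eff=89 quoted on the slide. The convolution is quadratic in the number of distinct scores per step, which is acceptable for integer scores with a bounded range.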

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
  Introduction
  Current methods for protein domain searches
  Method (Global Blocks Aligned Locally)
  Results
    Benchmarking database
    ROC (L-ROC) curves
    P-value Accuracy

Benchmarking test set
Database of queries: ~10,000 sequences with known structure (from the MMDB database).
To define a true relationship to a CDD entry, a query sequence needs to be a structure neighbor (using VAST) of a protein from the CD for which the structure is known.
The resulting test set has >300 families with almost 30,000 known true positives.

Benchmarking test set
[Figure: two panels plotting the number of true positives against the percentage of sequence identity (VAST), 0 to 100; the y-axes run to 2000 and 18000]

ROC curve for GLOBAL
[Figure: ROC curves for GLOBAL, HMMer-semi-global, HMMer-local, and RPS-BLAST. x-axis: fraction of false positives (0.00 to 0.05); y-axis: fraction of true positives (0.00 to 0.40)]

Method             *LROC 10000   LROC 50000   LROC 200000
GLOBAL             0.181         0.224        0.313
HMMer semiglobal   0.185         0.224        0.299
HMMer local        0.169         0.194        0.239
rpsblast           0.168         0.192        0.229

*LROC: Swensson RG: Med Phys 1996, 23(10):1709-25

P-value accuracy
[Figure: two panels, GLOBAL and HMMer. x-axis: true P-value (1.E+00 down to 1.E-07); y-axis: estimated P-value / true P-value (0.1 to 1000); series: Cd00030, Cd00083, Cd00288]
1,000,000 simulations using random sequences of length 350
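The quantity plotted here, the ratio of estimated to true P-value, can be approximated by simulation. The following is a minimal sketch of that bookkeeping, with hypothetical function names and the assumption of uniform residue composition for the random sequences; the scoring of each random sequence against a domain model is left to whichever method is being evaluated.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_sequence(length, rng):
    """Random protein sequence with uniform composition (an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def empirical_p_value(simulated_scores, threshold):
    """Fraction of simulated scores at or above the threshold."""
    hits = sum(1 for s in simulated_scores if s >= threshold)
    return hits / len(simulated_scores)

def accuracy_ratio(estimated_p, simulated_scores, threshold):
    """Estimated P-value divided by the simulated one; None if no hits."""
    true_p = empirical_p_value(simulated_scores, threshold)
    return None if true_p == 0 else estimated_p / true_p
```

A ratio near 1 means the method's statistics are accurate at that threshold. With N simulations the empirical P-value cannot resolve anything much below 1/N, which is why a run of 1,000,000 sequences supports an axis reaching about 1.E-06 to 1.E-07.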

Conclusions
The GLOBAL algorithm and p-value provide a flexible format for semi-global sequence alignments.
GLOBAL respects block structure but adds flexibility at the ends of each block.
The GLOBAL p-value is based on local alignment p-values; BLAST heuristics from local alignment therefore apply to GLOBAL.
While the overall performance is similar to that of HMMer semi-global, GLOBAL has more accurate statistics, and the possibility of implementing heuristics similar to those used in local methods could make it orders of magnitude faster.

Future work
Implementation of GLOBAL:
  Blockalizer: creates blocks within the MSA.
  Heuristics to increase the speed.
Optimization of domain discovery:
  Can we mix and match methods/CDs?

Acknowledgments
SALTO: Stephen Altschul, Anna Panchenko, Paul Thiessen and Steve Bryant.
GLOBAL: John Spouge, Sergey Sheetlin and Yonil Park.
PROTEIN INTERACTIONS:
  Predicting protein-protein interaction by searching evolutionary tree automorphism space: Teresa Przytycka and Raja Jothi.
  Predicting protein domain interactions from co-evolution of conserved regions: Teresa Przytycka, Praveen Cherukuri and Raja Jothi.
UMBC Computational Biology lab team.

Kann's Computational Biology lab.