Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains

Maricel Kann
Assistant Professor, University of Maryland, Baltimore County
mkann@umbc.edu
Feb-08

The Human Genome Project
[Slide shows a full page of raw genomic DNA sequence, beginning GATCCTCCATATACAACGGTATCTCCACCTCAGG...]

[Cartoon by Sidney Harris]

Protein Classification
[Diagram: QUERY (A L I G N M E N T) -> alignment algorithm, scoring function, accurate statistics -> set of related sequences or protein family from database]
Example alignment:
A L I G - N M E N T
A L I G G N M E N -
Per-position scores: 4 3 4 7 1 2 -2 0 0; score = 19
Scoring matrices: PAM: Dayhoff et al. (1978); BLOSUM: Henikoff & Henikoff (1992); OPTIMA: Kann et al. (2000).

Significance of a score
E: estimated number of non-related sequences in the database that score higher than the query.
D = size of database
$$E = p(S_Q < S_R)\, D$$

[Figure: histogram of the number of alignments with score S, showing the distribution of random scores S_R and the position of the query alignment score S_Q]
$$p(S_Q < S_R) = 1 - \exp\left[-K M N e^{-\lambda S_Q}\right]$$
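As a rough illustration of the two formulas on these slides, the following Python sketch computes the p-value and the E-value. It assumes the usual reading of the symbols: K and λ are the extreme value distribution parameters, and M and N are the lengths of the two sequences being compared. The function names and the numeric values in the usage lines are hypothetical and are not taken from the talk.

```python
import math

def p_value(score, K, lam, m, n):
    """P(random score > observed score) = 1 - exp(-K*m*n*exp(-lam*score))."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)

def e_value(score, K, lam, m, n, db_size):
    """E = p * D, following the definition on the slide (D = database size)."""
    return p_value(score, K, lam, m, n) * db_size

# Hypothetical parameter values, for illustration only.
p = p_value(score=100, K=0.1, lam=0.3, m=100, n=300)
e = e_value(score=100, K=0.1, lam=0.3, m=100, n=300, db_size=10_000)
```

For very significant scores the p-value is close to K·M·N·e^(−λS), which is why `expm1` is used instead of subtracting from 1 directly.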

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
- Introduction
  - Definition of protein domain
  - Main features of the Conserved Domain Database (CDD)
  - Position-specific scoring matrices (PSSM)
  - Classification of alignment methods
- Current methods for protein domain searches
- Our approach (Global Blocks Aligned Locally)
- Results

The term protein domain (or domain) refers to a region of the protein with compact structure, usually with a hydrophobic core.

Conserved Domains
- In 1974 Michael Rossmann recognized the NADH-binding domain in several dehydrogenases (the domain is named after him).
- Conserved domains are determined by comparative sequence analysis.
- Molecular evolution uses such domains as building blocks. They may be recombined in different arrangements to make proteins with different functions.
- Most proteins contain multiple domains (65% in eukaryotes, 40% in prokaryotes), giving rise to a variety of domain combinations.

CDD: a collection of domain multiple alignments linked to protein 3D structure

[Figure: CDD entry with an annotated heme-binding site]
CDD combines information about protein sequences, their conservation patterns across evolution, and protein structure, and provides useful functional annotation.
Marchler-Bauer et al. (2003) NAR 31:383-387

Protein Classification
[Diagram: QUERY -> alignment algorithm, scoring function, accurate statistics -> set of related sequences or protein family from database]
A PSSM can be derived from the multiple sequence alignment (MSA).
A PSSM, or Position-Specific Scoring Matrix (or profile), is a type of scoring matrix in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment.
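To make the idea of position-specific scores concrete, here is a minimal Python sketch that derives a log-odds PSSM from an ungapped alignment. It is a deliberate simplification: the uniform background frequencies, the fixed pseudocount, and the function names are assumptions made for illustration, and production tools use sequence weighting and more careful pseudocount schemes.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0, background=None):
    """Return one {residue: log-odds score} dict per alignment column.

    msa: list of equal-length, ungapped sequences (an assumption of this sketch).
    """
    if background is None:
        # Assumed uniform background; real methods use observed frequencies.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    pssm = []
    for col in range(len(msa[0])):
        counts = {a: pseudocount for a in AMINO_ACIDS}
        for seq in msa:
            if seq[col] in counts:
                counts[seq[col]] += 1.0
        total = sum(counts.values())
        pssm.append({a: math.log2((counts[a] / total) / background[a])
                     for a in AMINO_ACIDS})
    return pssm

def score_segment(pssm, segment):
    """Score an ungapped segment placed at the first column of the PSSM."""
    return sum(pssm[i][res] for i, res in enumerate(segment))

# Toy alignment, for illustration only.
pssm = build_pssm(["ALIG", "ALIG", "ALVG"])
conserved = score_segment(pssm, "ALIG")
unrelated = score_segment(pssm, "WWWW")
```

The same residue can receive different scores in different columns, which is the property that distinguishes a PSSM from a single substitution matrix such as PAM or BLOSUM.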

MSA contains conserved blocks

Protein Sequence Conservation Occurs in Blocks with Intervening Gaps
Subsequences corresponding to secondary structure elements (SSEs: α-helices and β-strands) are more conserved than the intervening loops.
[Figure: protein structure alignment of a red sequence and a blue sequence, with α-helix, β-strand, and loops labeled]

CDD representation
[Figure labels: 1, 2, gap, gap, CDD footprint]

Sequence-PSSM alignment
[Figure: the sequence A L I G N M E N T aligned to a PSSM]

Sequence-PSSM alignment
[Figure: a query aligned to a PSSM made of blocks, with gaps in the PSSM and gaps in the query marked]

Three Types of Sequence Alignments
- Local alignment: subsequence to subsequence
- Semi-global alignment: subsequence onto sequence
- Global alignment: sequence to sequence
BW Erickson & P Sellers (1983) Time Warps, String Edits, and Macromolecules, p. 55

Semi-global Alignment
Semi-global alignment, which finds a complete domain in the query, is the natural choice in the context of protein structure, function, and evolution.
[Figure: a domain aligned within the query sequence]
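The difference from local and global alignment shows up in the boundary conditions of the dynamic programming table. The Python sketch below aligns a whole domain against any stretch of the query: unaligned query residues before and after the domain are free, while every domain residue must be accounted for. The match, mismatch, and gap values are toy assumptions, and a plain string stands in for the domain's PSSM.

```python
def semi_global_score(domain, query, match=2, mismatch=-1, gap=-2):
    """Best score for aligning the entire domain onto a subsequence of query."""
    n = len(query)
    # Row 0: no domain residues consumed yet, so leading query residues are free.
    prev = [0] * (n + 1)
    for i in range(1, len(domain) + 1):
        # Column 0: i domain residues aligned against nothing must be paid for.
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if domain[i - 1] == query[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align the two residues
                          prev[j] + gap,       # domain residue against a gap
                          curr[j - 1] + gap)   # query residue against a gap
        prev = curr
    # Taking the maximum over the last row makes trailing query residues free.
    return max(prev)

full = semi_global_score("ALIGN", "XXALIGNXX")
partial = semi_global_score("ALIGN", "XXALIXX")
```

A local method would report the partial match "ALI" at its full score, whereas the semi-global score is lowered by the two domain residues that cannot be matched.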

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
- Introduction
- Current methods for protein domain searches
  - RPS-BLAST
  - HMMer
  - SALTO
- Method (Global Blocks Aligned Locally)
  - Derivation of Statistics
- Results

Reverse Position-Specific BLAST (RPS-BLAST)
[Figure: a query aligned to a PSSM containing three blocks]
The role of the PSSM has changed from being the query in PSI-BLAST to being the subject, hence the term "reverse" in RPS-BLAST (Reverse Position-Specific BLAST).
rpsblast does not incorporate the concept of a block.
A Schaffer et al (1999) Bioinformatics 15:1000

HMM
HMMer is trained on the CDD sequences.
HMMer does not specifically incorporate the concept of a block.

HMMer's Statistics are a (Poor) Empirical Fit
HMMer fits the extreme value distribution (EVD) parameters λ and K to simulated sequences with a Gaussian length distribution.
The Gumbel E-value approximation of semi-global HMMer is sometimes very inaccurate.
ftp://ftp.genetics.wustl.edu.pub/eddy/hmmer_current/userguide.pdf

SALTO: Structure-based ALignment TOol
[Figure labels: gap, gap; SALTO, G-SALTO]
Kann MG et al. (2005) Bioinformatics 21(8):1451-6

Properties of an Ideal Alignment Method
- Semi-global alignment is intrinsically the right tool for searching for domains within proteins. Local alignment methods, such as Reverse Position-Specific BLAST (rpsblast), match only a portion of a domain against a query.
- Screening a database for matches needs to be fast. HMMs have no intrinsic heuristics to speed up computation. The word heuristics in rpsblast speed up screening and are available for any local alignment method.
- Accurate statistics.

GLOBAL (Global Blocks Aligned Locally)
A Semi-Global Alignment Method for Querying a Database of Protein Domains with Accurate Statistics
M. G. Kann et al. (2007) NAR 35(14):4678-4685

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
- Introduction
- Current methods for protein domain searches
- Method (Global Blocks Aligned Locally)
  - Algorithm and scoring scheme
  - Derivation of Statistics
- Results

GLOBAL: aligns blocks locally
[Figure labels: gap, gap; G-SALTO, GLOBAL]

GLOBAL algorithm
[Figure: QUERY aligned to the blocks of a PSSM]
- Uses dynamic programming (DP) to find the alignment of a protein query sequence to all blocks of the PSSM (in order).
- Penalty = 0 both for unaligned regions of the PSSM at the ends of the blocks and for unaligned regions of the query between blocks.
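The two rules on this slide can be sketched as a small dynamic program. The Python code below is not the published GLOBAL implementation; it is a simplified illustration that assumes ungapped alignment inside each block, requires every block to align at least one position, and uses a hypothetical toy scoring helper (`toy_block`) in place of real PSSM rows.

```python
NEG_INF = float("-inf")

def toy_block(residues, match=2.0):
    """Hypothetical helper: one PSSM row per residue, rewarding only that residue."""
    return [{r: match} for r in residues]

def blocks_in_order_score(blocks, query, other=-1.0):
    """Best total score aligning every block, in order, to the query.

    blocks: list of blocks, each a list of {residue: score} rows.
    Free (penalty 0): trimming block ends, query residues between blocks.
    """
    n = len(query)
    # best_prev[j]: best score for all earlier blocks using only query[:j].
    best_prev = [0.0] * (n + 1)
    for block in blocks:
        run_prev = [NEG_INF] * (n + 1)   # runs ending at the previous block row
        col_best = [NEG_INF] * (n + 1)   # best run ending at query position j
        for row in block:
            run = [NEG_INF] * (n + 1)
            for j in range(1, n + 1):
                # Extend the current run, or start a new one at this row
                # (starting anywhere in the block trims its front for free).
                origin = max(run_prev[j - 1], best_prev[j - 1])
                if origin > NEG_INF:
                    run[j] = origin + row.get(query[j - 1], other)
                    col_best[j] = max(col_best[j], run[j])
            run_prev = run
        # Ending a run at any row trims the block's tail for free; carrying the
        # maximum forward makes the query residues after the run free as well.
        best = [NEG_INF] * (n + 1)
        for j in range(1, n + 1):
            best[j] = max(best[j - 1], col_best[j])
        best_prev = best
    return best_prev[n]

blocks = [toy_block("ALI"), toy_block("MENT")]
in_order = blocks_in_order_score(blocks, "XXALIGGGMENTXX")
swapped = blocks_in_order_score(blocks, "MENTALI")
```

In the toy example the query that contains the blocks in the right order reaches the maximum score, while the query with the blocks swapped cannot, because the order constraint forces one of the blocks onto a poor match.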

GLOBAL: statistics for b blocks
[Figure: QUERY aligned to the blocks of a PSSM]
For b blocks, the total alignment score T is:
$$T = \sum_{i=1}^{b} \hat{M}_i(L_{\mathrm{eff}})$$
where the effective length is
$$L_{\mathrm{eff}} = \left[ \frac{(n+b-1)!}{(n-1)!\, b!} \right]^{1/b}$$
with n = size of query and b = number of blocks in the PSSM (e.g., n = 160, b = 3, L_eff = 89).
Assuming the scores of the blocks are independent of each other, GLOBAL estimates the total alignment p-value by a convolution algorithm.
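The slide's worked example can be checked numerically. The Python sketch below evaluates the effective length, read here as L_eff = [(n+b-1)! / ((n-1)! b!)]^(1/b), and illustrates the convolution step for independent block scores. The discrete per-block distributions used in the convolution are a simplifying assumption of this sketch (the method itself works with local-alignment p-values), and the function names are hypothetical.

```python
import math

def effective_length(n, b):
    """L_eff = C(n + b - 1, b) ** (1 / b) for query size n and b blocks."""
    return math.comb(n + b - 1, b) ** (1.0 / b)

def convolve(p, q):
    """Distribution of the sum of two independent integer-valued scores.

    p[s] and q[s] are the probabilities of score s (s = 0, 1, 2, ...).
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def total_score_p_value(block_distributions, threshold):
    """P(T >= threshold), where T is the sum of independent block scores."""
    total = [1.0]  # distribution of an empty sum: score 0 with probability 1
    for dist in block_distributions:
        total = convolve(total, dist)
    return sum(total[threshold:])

l_eff = effective_length(160, 3)   # about 88.6, which rounds to 89
p = total_score_p_value([[0.5, 0.5], [0.5, 0.5]], threshold=2)
```

With n = 160 and b = 3 the formula gives roughly 88.6, which agrees with the value of 89 quoted on the slide.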

Outline
A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.
- Introduction
- Current methods for protein domain searches
- Method (Global Blocks Aligned Locally)
- Results
  - Benchmarking database
  - ROC (L-ROC) curves
  - P-value accuracy

Benchmarking test set
- Database of queries: ~10,000 sequences with known structure (from the MMDB database).
- To define a true relationship to a CDD entry, a query sequence needs to be a structure neighbor (using VAST) of a protein from the CD for which the structure is known.
- The resulting test set has >300 families with almost 30,000 known true positives.

Benchmarking test set
[Figure: two histograms of the number of true positives versus the percentage of sequence identity (VAST)]

ROC curve for GLOBAL
[Figure: fraction of true positives (0.00 to 0.40) versus fraction of false positives (0.00 to 0.05) for GLOBAL, HMMer-semi-global, HMMer-local, and RPS-BLAST]

Method              *LROC 10000   LROC 50000   LROC 200000
GLOBAL              0.181         0.224        0.313
HMMer semiglobal    0.185         0.224        0.299
HMMer local         0.169         0.194        0.239
rpsblast            0.168         0.192        0.229

*LROC: Swensson RG: Med Phys 1996, 23(10):1709-25

P-value accuracy
[Figure: two panels, GLOBAL and HMMer, plotting estimated P-value / true P-value (log scale, 0.1 to 1000) against true P-value (1.E+00 to 1.E-07) for Cd00030, Cd00083, and Cd00288]
1,000,000 simulations using random sequences of length 350
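The quantity on the vertical axis can be sketched as follows. This Python snippet assumes that the "true" p-value is the empirical fraction of simulated random-sequence scores at or above a threshold, which is one plausible reading of the slide; the function names and the toy numbers are hypothetical.

```python
def empirical_p_value(random_scores, threshold):
    """Fraction of simulated random scores that reach the threshold."""
    hits = sum(1 for s in random_scores if s >= threshold)
    return hits / len(random_scores)

def accuracy_ratio(estimated_p, random_scores, threshold):
    """Estimated p-value divided by the empirical one (1.0 means exact).

    Returns None when no simulated score reaches the threshold, because the
    empirical p-value is then below the resolution of the simulation.
    """
    true_p = empirical_p_value(random_scores, threshold)
    if true_p == 0.0:
        return None
    return estimated_p / true_p

# Toy scores, for illustration only.
ratio = accuracy_ratio(0.25, [1, 2, 3, 4], threshold=3)
```

A ratio above 1 means the method overestimates the p-value (conservative), and a ratio below 1 means it underestimates it. With 1,000,000 simulations, empirical p-values much smaller than about 1e-6 cannot be resolved.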

Conclusions
- The GLOBAL algorithm and p-value provide a flexible format for semi-global sequence alignments.
- GLOBAL respects block structure but adds flexibility at the ends of each block.
- The GLOBAL p-value is based on local alignment p-values. BLAST heuristics from local alignment therefore apply to GLOBAL.
- While its overall performance is similar to that of HMMer semi-global, GLOBAL has more accurate statistics, and the possibility of implementing heuristics similar to those used in local methods could make it orders of magnitude faster.

Future work
- Implementation of GLOBAL:
  - Blockalizer: creates blocks within the MSA.
  - Heuristics to increase the speed.
- Optimization of domain discovery: can we mix and match methods/CDs?

Acknowledgments
- SALTO: Stephen Altschul, Anna Panchenko, Paul Thiessen and Steve Bryant.
- GLOBAL: John Spouge, Sergey Sheetlin and Yonil Park.
- PROTEIN INTERACTIONS:
  - Predicting protein-protein interaction by searching evolutionary tree automorphism space: Teresa Przytycka and Raja Jothi.
  - Predicting protein domain interactions from co-evolution of conserved regions: Teresa Przytycka, Praveen Cherukuri and Raja Jothi.
- UMBC Computational Biology lab team.

Kann's Computational Biology lab.