Accuracy of PHDtopology
PHDtopology: Refined prediction of the location and topology for transmembrane helices
****************************************************************************
* *
* PredictProtein@EMBL-Heidelberg.DE *
* Prediction of helical transmembrane regions by PHDhtm *
* *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* Refined prediction of the location and topology for *
* transmembrane helices by PHDtopology *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Author: Burkhard Rost *
* EMBL, Heidelberg, FRG *
* Meyerhofstrasse 1, 69 117 Heidelberg *
* Internet: Rost@EMBL-Heidelberg.DE *
* *
* All rights reserved. *
* *
* *
****************************************************************************
* *
* Please quote *
* ~~~~~~~~~~~~ *
* *
* The PredictProtein mail server is described in: *
* B Rost: PHD: predicting one-dimensional protein structure by pro- *
* file based neural networks. Meth. in Enzym., 1996, 266, 525-539. *
* *
* Additionally to be quoted for publications of PHDtopology output: *
* B Rost, R Casadio & P Fariselli: Refining neural network predic- *
* tions for helical transmembrane proteins by dynamic programming. *
* In: D States et al. (eds.) "The fourth international conference *
* Intelligent Systems for Molecular Biology (ISMB)", St. Louis, *
* U.S.A., Jun 1996, Menlo Park, CA: AAAI Press, in press. *
* *
* A more thorough evaluation of PHDtopology is to be found in: *
* B Rost, P Fariselli & R Casadio: Topology prediction for helical *
* transmembrane proteins at 86% accuracy. Preprint, EMBL, 69012 *
* Germany, PDG-03/96, 1996. *
* *
* *
****************************************************************************
* *
* Definition of topology *
* ~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The topology of integral membrane proteins with transmembrane helices *
* describes the orientation of the helices with respect to the membrane: *
* OUT: first residue (N-term) starting extra-cytoplasmic, i.e. outside *
* of the membrane *
* IN: first residue starting intra-cytoplasmic, i.e. inside. *
* *
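The definition above fixes only the side of the N-terminus. Because each transmembrane helix crosses the membrane once, the side of every following loop alternates. The sketch below illustrates this; the function name `loop_sides` and its arguments are hypothetical and are not part of PHDtopology.

```python
def loop_sides(topology, n_helices):
    """Return the membrane side of the N-terminus, of each loop, and of the C-terminus.

    topology  -- "IN" or "OUT", the side on which the first residue starts
    n_helices -- number of transmembrane helices (each one flips the side)
    """
    if topology not in ("IN", "OUT"):
        raise ValueError("topology must be 'IN' or 'OUT'")
    if n_helices < 0:
        raise ValueError("n_helices must be non-negative")
    flip = {"IN": "OUT", "OUT": "IN"}
    sides = [topology]
    for _ in range(n_helices):
        # Crossing one helix moves the chain to the other side of the membrane.
        sides.append(flip[sides[-1]])
    return sides


# Hypothetical example: three helices, N-terminus outside.
print(loop_sides("OUT", 3))
```

For a protein with an odd number of helices the two termini end up on opposite sides; with an even number they are on the same side.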
* *
****************************************************************************
* *
* Estimated Accuracy of Prediction *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The method was evaluated on 131 helical transmembrane proteins in *
* cross-validation experiments, i.e., such that no protein used for *
* setting up the method had more than 25% sequence identity to any *
* protein used for deriving the estimates for performance accuracy. *
* For all integral membrane proteins used for the evaluation, the *
* helix locations and the topology were known from experiment. *
* *
* Results of test on 131 proteins: *
* *
* +----------+------------------------------------------------------+ *
* | 539 | number of transmembrane helices (HTM's) observed | *
* | 552 | number of HTM's predicted | *
* | 533 | number of HTM's predicted correctly, i.e. with an | *
* | | overlap of more than 3 residues to observed HTM's | *
* +----------+------------------------------------------------------+ *
* | 99% | percentage of residues correctly predicted/observed | *
* | 97% | percentage of residues correctly predicted/predicted | *
* +----------+------------------------------------------------------+ *
* *
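The correctness criterion in the table (an overlap of more than 3 residues with an observed HTM) can be sketched as follows. Segments are assumed to be inclusive (start, end) residue positions, the helper names are hypothetical, and the example segments are made up. The sketch does not enforce a one-to-one pairing of predicted and observed helices, which the actual evaluation may do.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def count_correct(predicted, observed, min_overlap=3):
    """Count predicted helices that overlap some observed helix by MORE than min_overlap residues."""
    return sum(
        any(overlap(p, o) > min_overlap for o in observed)
        for p in predicted
    )


# Made-up segments: the second prediction overlaps by exactly 3 residues,
# which does not satisfy "more than 3".
observed = [(10, 30), (45, 66)]
predicted = [(12, 31), (64, 80)]
print(count_correct(predicted, observed))

# Helix-level ratios from the counts given in the table.
n_observed, n_predicted, n_correct = 539, 552, 533
print(round(100 * n_correct / n_observed))   # correct / observed
print(round(100 * n_correct / n_predicted))  # correct / predicted
```

The two helix-level ratios round to 99 and 97, the same values the table reports as percentages of residues; the source does not say whether this is a coincidence.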
* ++========++------------------------------------------------------+ *
* || 89% || percentage of proteins for which all HTM's were | *
* || || predicted correctly | *
* ++========++------------------------------------------------------+ *
* || 86% || percentage of proteins with correctly predicted | *
* || || topology | *
* ++========++------------------------------------------------------+ *
* *
* Note: The estimates for correctly predicting all HTM's (89%) and *
* for correctly predicting topology (86%) have an expected *
* error of 6% (two standard deviations of the binomial *
* distribution). In other words, given your protein, you can *
* estimate the chance that the prediction is correct for all *
* HTM's as 83%-95%, and the chance that the predicted topology *
* is correct as 80%-92%. *
* *
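The error quoted in the note can be checked with the standard deviation of a binomial proportion, sqrt(p(1-p)/n). The sketch below assumes this is the formula meant by "two standard deviations of binomial distribution"; the function name `two_sigma` is hypothetical.

```python
import math


def two_sigma(p, n):
    """Two standard deviations of a binomial proportion p estimated from n proteins."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n)


for label, p in (("all HTM's correct", 0.89), ("topology correct", 0.86)):
    err = two_sigma(p, 131)
    print(f"{label}: {100 * p:.0f}% +/- {100 * err:.1f}%")
```

For 131 proteins this gives roughly 5.5% and 6.1%, consistent with the quoted error of about 6%.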
*..........................................................................*
* *
* Eukaryotes: *
* The expected accuracy is higher than average for eukaryotic proteins: *
* 94% correct prediction of all HTM's, *
* 90% correct prediction of topology. *
* *
* Prokaryotes: *
* The expected accuracy is lower than average for prokaryotic proteins: *
* 76% correct prediction of all HTM's, *
* 73% correct prediction of topology. *
* *
* Viral proteins: *
* We evaluated PHDtopology on only five viral proteins. For all five, *
* prediction accuracy was 100%. *
* *
* Note: The estimates for prokaryotes are based on fewer proteins, thus *
* the estimated error is 18% (two standard deviations). *
* The result for the five viral proteins can, at best, be seen as *
* a trend, as five proteins are much too few for deriving general *
* estimates for prediction accuracy. *
* *
*..........................................................................*
* *
* Average length of transmembrane helices (in residues): *
* *
* +-----------+------------+----------+ *
* |           | predicted  | observed | *
* +-----------+------------+----------+ *
* | Lhelix =  | 20.5       | 22.3     | *
* +-----------+------------+----------+ *
* *
* *
****************************************************************************
* *
* Protein-specific reliability indices *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* We define two empirically chosen reliability indices: Ri(model), for *
* the correctness of the prediction of all helices, and Ri(top), for the *
* prediction of topology. Both indices are normalised to integer values *
* between 0 (low) and 9 (high). The following results are based on 131 *
* proteins. *
* *
* Reliability of predicting all HTM's correctly: *
* *
* +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | Ri(model) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 | *
* | | | | | | | | | | | *
* | Nprot | 131 | 117 | 83 | 66 | 56 | 40 | 25 | 17 | 9 | *
* | Ncorr | 117 | 108 | 79 | 65 | 55 | 39 | 25 | 17 | 9 | *
* | | | | | | | | | | | *
* | %prot | 100 | 89 | 63 | 50 | 42 | 30 | 19 | 12 | 6 | *
* | %correct | 89 | 92 | 95 | 98 | 98 | 97 | 100 | 100 | 100 | *
* +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+ *
* *
* Reliability of correctly predicting topology: *
* *
* +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | Ri(top) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 | *
* | | | | | | | | | | | *
* | Nprot | 131 | 124 | 109 | 97 | 83 | 52 | 21 | 12 | 5 | *
* | Ncorr | 113 | 110 | 99 | 89 | 79 | 49 | 19 | 12 | 5 | *
* | | | | | | | | | | | *
* | %prot | 100 | 94 | 83 | 74 | 63 | 39 | 16 | 9 | 3 | *
* | %correct | 86 | 88 | 90 | 91 | 95 | 94 | 90 | 100 | 100 | *
* +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+ *
* *
* Abbreviations: *
* Nprot cumulative number of proteins predicted at a reliability *
* index greater than or equal to n, with n = 0, ..., 9. *
* Ncorr cumulative number of proteins predicted correctly at a *
* reliability index greater than or equal to n, with n = 0, ..., 9. *
* %prot =100*(Nprot/131), i.e. percentage of all proteins predicted *
* at that reliability index or higher. *
* %correct =100*(Ncorr/Nprot), i.e. percentage of those proteins that *
* were predicted correctly. *
* *
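As a consistency check, the percentage rows of both tables can be recomputed from the Nprot and Ncorr rows. The tabulated %correct values are reproduced when Ncorr is divided by Nprot at the same threshold. The sketch assumes that percentages are truncated to integers rather than rounded, which is an inference from the numbers and is not stated in the source.

```python
N_TOTAL = 131  # proteins in the test set

# Nprot and Ncorr rows copied from the two tables.
MODEL = {"Nprot": [131, 117, 83, 66, 56, 40, 25, 17, 9],
         "Ncorr": [117, 108, 79, 65, 55, 39, 25, 17, 9]}
TOP = {"Nprot": [131, 124, 109, 97, 83, 52, 21, 12, 5],
       "Ncorr": [113, 110, 99, 89, 79, 49, 19, 12, 5]}


def percentages(table, n_total=N_TOTAL):
    """Return (%prot, %correct) rows, truncated to integers (assumed convention)."""
    pct_prot = [100 * n // n_total for n in table["Nprot"]]
    pct_correct = [100 * c // n for c, n in zip(table["Ncorr"], table["Nprot"])]
    return pct_prot, pct_correct


for name, table in (("Ri(model)", MODEL), ("Ri(top)", TOP)):
    pct_prot, pct_correct = percentages(table)
    print(name, "%prot   ", pct_prot)
    print(name, "%correct", pct_correct)
```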
* The tables above give cumulative results, e.g. 50% of all proteins *
* were predicted at a reliability index Ri(model) >= 3; for 98% of these *
* all transmembrane helices were predicted correctly. Similarly, 63% of *
* the proteins were predicted with an index Ri(top) >= 4; for 95% of *
* these the prediction for topology was correct. *
* *
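One way to read the cumulative tables for a single prediction is to look up the fraction of test proteins that were correct at or above the reported index. The helper below is a hypothetical sketch, not part of the PHDtopology output. It uses the thresholds exactly as printed in the table header (0 to 7, then 9) and falls back to the largest tabulated threshold not exceeding the given index.

```python
from bisect import bisect_right

THRESHOLDS = [0, 1, 2, 3, 4, 5, 6, 7, 9]  # column headers as printed in the tables

MODEL = {"Nprot": [131, 117, 83, 66, 56, 40, 25, 17, 9],
         "Ncorr": [117, 108, 79, 65, 55, 39, 25, 17, 9]}
TOP = {"Nprot": [131, 124, 109, 97, 83, 52, 21, 12, 5],
       "Ncorr": [113, 110, 99, 89, 79, 49, 19, 12, 5]}


def expected_accuracy(ri, table):
    """Fraction of test proteins predicted correctly at the largest tabulated threshold <= ri."""
    if not 0 <= ri <= 9:
        raise ValueError("reliability index must be between 0 and 9")
    i = bisect_right(THRESHOLDS, ri) - 1
    return table["Ncorr"][i] / table["Nprot"][i]


print(f"Ri(model) >= 3: {expected_accuracy(3, MODEL):.3f}")
print(f"Ri(top)   >= 4: {expected_accuracy(4, TOP):.3f}")
```

Because the counts are cumulative, the value describes all test proteins at or above the threshold, not only those with exactly that index.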
* Ri(model) and Ri(top) are related in the following sense. In our test *
* analysis, proteins for which the topology prediction was wrong despite *
* a relatively high reliability index (Ri(top) > 3) were in almost all *
* cases predicted with the wrong number of HTM's. *
* *
* *
****************************************************************************