Accuracy of PHDtopology

PHDtopology: Refined prediction of the location and topology for transmembrane helices

****************************************************************************
*                                                                          *
*      PredictProtein@EMBL-Heidelberg.DE                                   *
*      Prediction of helical transmembrane regions by PHDhtm		   *
*                                                                          *
*      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~               *
*       Refined prediction of the location and topology for		   *
*             transmembrane helices by PHDtopology			   *
*      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~               *
*                                                                          *
*      Author:             Burkhard Rost		                   *
*                          EMBL, Heidelberg, FRG                           *
*                          Meyerhofstrasse 1, 69 117 Heidelberg            *
*                          Internet: Rost@EMBL-Heidelberg.DE 		   *
*                                                                          *
*      All rights reserved.                                                *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*  Please quote								   *
*  ~~~~~~~~~~~~			                                           *
*                                                                          *
*  The PredictProtein mail server is described in:                         *
*     B Rost:  PHD: predicting one-dimensional  protein structure by pro-  *
*        file based neural networks. Meth. in Enzym., 1996, 266, 525-539.  *
*                                                                          *
*  Additionally to be quoted for publications of PHDtopology output:	   *
*     B Rost, R Casadio & P Fariselli:  Refining neural network predic-    *
*        tions for helical transmembrane proteins by dynamic programming.  *
*        In: D States et al. (eds.) "The fourth international conference   *
*        Intelligent Systems for Molecular Biology (ISMB)", St. Louis,     *
*        U.S.A., Jun 1996, Menlo Park, CA: AAAI Press, in press.           *
*                                                                          *
*  A more thorough evaluation of PHDtopology is to be found in:            *
*     B Rost, P Fariselli & R Casadio:  Topology prediction for helical	   *
*        transmembrane proteins at 86% accuracy.  Preprint, EMBL, 69012	   *
*        Germany, PDG-03/96, 1996.					   *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*  Definition of topology		                                   *
*  ~~~~~~~~~~~~~~~~~~~~~~		                                   *
*                                                                          *
*  The topology of integral membrane proteins with transmembrane helices   *
*  describes the orientation of the helices with respect to the membrane:  *
*  OUT:  the first residue (N-terminus) lies on the extra-cytoplasmic      *
*        side, i.e. outside.                                               *
*  IN:   the first residue (N-terminus) lies on the intra-cytoplasmic      *
*        side, i.e. inside.                                                *
*                                                                          *
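
The definition above fixes only the side of the N-terminus. Because every
transmembrane helix crosses the membrane once, the side of each following
loop alternates. The sketch below illustrates this; the function name
loop_sides and the labels "in"/"out" are hypothetical and not part of the
PHDtopology output.

```python
def loop_sides(topology, n_helices):
    """Side of each non-membrane segment, listed from N- to C-terminus.

    topology  -- "IN" or "OUT", the side of the first residue
    n_helices -- number of transmembrane helices
    """
    if topology not in ("IN", "OUT"):
        raise ValueError("topology must be 'IN' or 'OUT'")
    if n_helices < 0:
        raise ValueError("n_helices must be non-negative")
    # Each helix crosses the membrane once, so the sides alternate.
    sides = ("in", "out") if topology == "IN" else ("out", "in")
    return [sides[i % 2] for i in range(n_helices + 1)]


# A protein with topology OUT and seven helices has eight non-membrane
# segments; the last entry is the side of the C-terminus.
print(loop_sides("OUT", 7))
```
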
*                                                                          *
****************************************************************************
*                                                                          *
*  Estimated Accuracy of Prediction                                        *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                       *
*                                                                          *
*  The method was evaluated on 131 helical transmembrane proteins in 	   *
*  cross-validation experiments, i.e., such that no protein used for 	   *
*  setting up the method had more than 25% sequence identity to any 	   *
*  protein used for deriving the estimates for performance accuracy. 	   *
*  For all integral membrane proteins used for the evaluation, the helix  *
*  locations and the topology were known from experiment.                  *
*                                                                          *
*  Results of test on 131 proteins:				 	   *
*                                                                          *
*  +----------+------------------------------------------------------+	   *
*  |  539     |	number of transmembrane helices (HTM's) observed     |	   *
*  |  552     |	number of HTM's predicted			     |	   *
*  |  533     |	number of HTM's predicted correctly, i.e. with an    |	   *
*  |	      |	overlap of more than 3 residues to observed HTM's    |	   *
*  +----------+------------------------------------------------------+	   *
*  |   99%    |  percentage of observed HTM's predicted correctly    |     *
*  |   97%    |  percentage of predicted HTM's that are correct      |     *
*  +----------+------------------------------------------------------+	   *
*                                                                          *
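
As a quick consistency check, the two percentages in the table above can be
recomputed from the three helix counts. This is a sketch that assumes the
percentages are the ratios 533/539 and 533/552; the variable names are
illustrative only.

```python
observed = 539   # HTM's observed
predicted = 552  # HTM's predicted
correct = 533    # HTM's predicted correctly (overlap of more than 3 residues)

# Fraction of the observed helices that were found, and fraction of the
# predicted helices that correspond to an observed one.
pct_of_observed = 100.0 * correct / observed
pct_of_predicted = 100.0 * correct / predicted

print(f"{pct_of_observed:.1f}% of observed HTM's predicted correctly")
print(f"{pct_of_predicted:.1f}% of predicted HTM's correct")
```
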
*  ++========++------------------------------------------------------+ 	   *
*  ||  89%   || percentage of proteins for which all HTM's were      |	   *
*  ||        || predicted correctly				     |	   *
*  ++========++------------------------------------------------------+ 	   *
*  ||  86%   || percentage of proteins with correctly predicted      |	   *
*  ||        || topology					     |	   *
*  ++========++------------------------------------------------------+ 	   *
*                                                                          *
*  Note: The estimates for correctly predicting all HTM's (89%) and for    *
*        correctly predicting topology (86%) have an expected error of     *
*        6% (two standard deviations of a binomial distribution).  In      *
*        other words, given your protein, you can estimate the chance      *
*        that the prediction is correct for all HTM's as 83%-95%, and      *
*        that the prediction of topology is correct as 80%-92%.            *
*                                                                          *
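
The quoted error can be reproduced approximately from the binomial
distribution. The sketch below assumes the usual formula for the standard
deviation of a proportion p estimated from n cases, sqrt(p*(1-p)/n); the
function name is hypothetical.

```python
import math


def two_sigma_binomial(p, n):
    """Two standard deviations of a proportion p estimated from n cases."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n)


N_PROTEINS = 131  # size of the test set quoted above

for label, accuracy in [("all HTM's correct", 0.89),
                        ("topology correct", 0.86)]:
    err = two_sigma_binomial(accuracy, N_PROTEINS)
    print(f"{label}: {accuracy:.0%} with two-sigma error {err:.1%}")
```

Both values come out between 5% and 6.5%, in line with the rounded 6% given
in the note. The same function can be applied to smaller subsets, where the
error grows with 1/sqrt(n).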
*..........................................................................*
*                                                                          *
*  Eukaryotes:                                                             *
*  The expected accuracy is higher than average for eukaryotic proteins:   *
*      94%	correct prediction of all HTM's,			   *
*      90%	correct prediction of topology.				   *
*                                                                          *
*  Prokaryotes:                                                            *
*  The expected accuracy is lower than average for prokaryotic proteins:   *
*      76%	correct prediction of all HTM's,			   *
*      73%	correct prediction of topology.				   *
*                                                                          *
*  Viral proteins:                                                         *
*  We evaluated PHDtopology on only five viral proteins.  For all five,    *
*  the prediction accuracy was 100%.                                       *
*                                                                          *
*  Note: The estimates for prokaryotes are based on fewer proteins, thus   *
*        the estimated error is 18% (two standard deviations).		   *
*        The result for the five viral proteins can, at best, be seen as   *
*        a trend, as five proteins are much too few for deriving general   *
*        estimates for prediction accuracy.				   *
*                                                                          *
*..........................................................................*
*                                                                          *
*  Average length of transmembrane helices: 			           *
*                                                                          *
*  |           +------------+----------+                                   *
*  |           |  predicted | observed |                                   *
*  +-----------+------------+----------+                                   *
*  | Lhelix  = |    20.5    |   22.3   |                                   *
*  +-----------+------------+----------+                                   *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*  Protein-specific reliability indices                                    *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                    *
*                                                                          *
*  We empirically defined two reliability indices: Ri(model), for the      *
*  correctness of the prediction of all helices, and Ri(top), for the      *
*  prediction of topology.  Both indices are normalised to integer values  *
*  between 0 (low) and 9 (high).  The following results are based on 131   *
*  proteins.                                                               *
*                                                                          *
*  Reliability of predicting all HTM's correctly:                          *
*                                                                          *
*  +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+	   *
*  | Ri(model) |    0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   9 |	   *
*  |           |      |     |     |     |     |     |     |     |     |	   *
*  | Nprot     |  131 | 117 |  83 |  66 |  56 |  40 |  25 |  17 |   9 |	   *
*  | Ncorr     |  117 | 108 |  79 |  65 |  55 |  39 |  25 |  17 |   9 |	   *
*  |           |      |     |     |     |     |     |     |     |     |	   *
*  | %prot     |  100 |  89 |  63 |  50 |  42 |  30 |  19 |  12 |   6 |	   *
*  | %correct  |   89 |  92 |  95 |  98 |  98 |  97 | 100 | 100 | 100 |	   *
*  +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+	   *
*                                                                          *
*  Reliability of correctly predicting topology:                           *
*                                                                          *
*  +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+	   *
*  | Ri(top)   |    0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   9 |	   *
*  |           |      |     |     |     |     |     |     |     |     |	   *
*  | Nprot     |  131 | 124 | 109 |  97 |  83 |  52 |  21 |  12 |   5 |	   *
*  | Ncorr     |  113 | 110 |  99 |  89 |  79 |  49 |  19 |  12 |   5 |	   *
*  |           |      |     |     |     |     |     |     |     |     |	   *
*  | %prot     |  100 |  94 |  83 |  74 |  63 |  39 |  16 |   9 |   3 |	   *
*  | %correct  |   86 |  88 |  90 |  91 |  95 |  94 |  90 | 100 | 100 |	   *
*  +-----------+------+-----+-----+-----+-----+-----+-----+-----+-----+	   *
*                                                                          *
*  Abbreviations:                                                          *
*    Nprot     cumulative number of proteins predicted at a reliability    *
*              index greater than or equal to n (n = 0, ..., 9).           *
*    Ncorr     cumulative number of proteins predicted correctly at a      *
*              reliability index greater than or equal to n.               *
*    %prot     = 100*(Nprot/131), i.e. percentage of all proteins          *
*              predicted with at least that reliability index.             *
*    %correct  = 100*(Ncorr/Nprot), i.e. percentage of these proteins      *
*              predicted correctly.                                        *
*                                                                          *
*  The tables above give cumulative results, e.g. 50% of all proteins are  *
*  predicted at a reliability index Ri(model) >= 3; for 98% of these all   *
*  transmembrane helices are predicted correctly.  Similarly, 63% of the   *
*  proteins are predicted at an index Ri(top) >= 4; for 95% of these the   *
*  prediction of topology is correct.                                      *
*                                                                          *
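
The percentage rows of both reliability tables can be recomputed from the
Nprot and Ncorr rows. The sketch below assumes that %prot is taken relative
to all 131 proteins, that %correct is taken relative to Nprot in the same
column, and that values are truncated to whole percent; under these
assumptions the tabulated rows are reproduced. All names are illustrative.

```python
N_TOTAL = 131
RI_THRESHOLDS = [0, 1, 2, 3, 4, 5, 6, 7, 9]  # columns as tabulated

# Counts copied from the two reliability tables.
TABLES = {
    "Ri(model)": {
        "Nprot": [131, 117, 83, 66, 56, 40, 25, 17, 9],
        "Ncorr": [117, 108, 79, 65, 55, 39, 25, 17, 9],
    },
    "Ri(top)": {
        "Nprot": [131, 124, 109, 97, 83, 52, 21, 12, 5],
        "Ncorr": [113, 110, 99, 89, 79, 49, 19, 12, 5],
    },
}


def cumulative_percentages(nprot, ncorr, n_total=N_TOTAL):
    """Return the (%prot, %correct) rows, truncated to whole percent."""
    pct_prot = [100 * n // n_total for n in nprot]
    pct_correct = [100 * c // n for c, n in zip(ncorr, nprot)]
    return pct_prot, pct_correct


for name, rows in TABLES.items():
    pct_prot, pct_correct = cumulative_percentages(rows["Nprot"],
                                                   rows["Ncorr"])
    print(name)
    for ri, pp, pc in zip(RI_THRESHOLDS, pct_prot, pct_correct):
        print(f"  Ri >= {ri}: {pp:3d}% of proteins, {pc:3d}% of these correct")
```
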
*  Ri(model) and Ri(top) are related as follows.  In our test analysis,    *
*  proteins for which the topology prediction was wrong despite a          *
*  relatively high reliability index (Ri(top) > 3) were, in almost all     *
*  cases, predicted with the wrong number of HTM's.                        *
*                                                                          *
*                                                                          *
****************************************************************************
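
One way to use the reliability tables for a single protein is to look up the
cumulative accuracy at the largest tabulated threshold that does not exceed
the protein's index. The sketch below does this; the function name, the
keys "model" and "top", and the handling of an index of 8 (mapped to the
column for 7, because the tables have no column for 8) are assumptions made
for illustration.

```python
import bisect

RI_THRESHOLDS = [0, 1, 2, 3, 4, 5, 6, 7, 9]  # columns as tabulated

# %correct rows copied from the two reliability tables.
PCT_CORRECT = {
    "model": [89, 92, 95, 98, 98, 97, 100, 100, 100],
    "top": [86, 88, 90, 91, 95, 94, 90, 100, 100],
}


def expected_accuracy(kind, ri):
    """Cumulative %correct for proteins with reliability index >= threshold.

    kind -- "model" (all HTM's correct) or "top" (topology correct)
    ri   -- reliability index reported for the protein, 0..9
    """
    if kind not in PCT_CORRECT:
        raise ValueError("kind must be 'model' or 'top'")
    if not 0 <= ri <= 9:
        raise ValueError("reliability index must be between 0 and 9")
    # Largest tabulated threshold that is <= ri.
    idx = bisect.bisect_right(RI_THRESHOLDS, ri) - 1
    return PCT_CORRECT[kind][idx]


print(expected_accuracy("model", 3))  # group with Ri(model) >= 3
print(expected_accuracy("top", 4))    # group with Ri(top) >= 4
```

These figures are cumulative estimates from 131 proteins, so for the high
thresholds, which contain only a handful of proteins, they should be read as
trends rather than precise probabilities.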