The following information has been received by the server:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
________________________________________________________________________________
b.rost
EMBL, 69012 Heidelberg, Europe
rost@embl-heidelberg.de
# FASTA list
>name 1
EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG
AGGAPTLPETLNV
>name 2
EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG
PETLNV
>name 3
AAEDQENVKKPEKAAPAQQPRTRAGLAVLRAGNSRG
PETLNV
The sequence has been interpreted as:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
________________________________________________________________________________
>P1;
name 1
EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG
AGGAPTLPETLNV
>P1;
name 2
EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG
PETLNV
>P1;
name 3
AAEDQENVKKPEKAAPAQQPRTRAGLAVLRAGNSRG
PETLNV
________________________________________________________________________________
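
The reinterpretation shown above can be illustrated with a minimal Python sketch. It only mirrors the layout printed in this report (a ">P1;" marker line followed by the name on its own line); the function name fasta_to_pir_like is hypothetical, and this is not the server's actual parser.

```python
def fasta_to_pir_like(text):
    """Convert a FASTA list into the '>P1;' layout echoed in this report."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '# FASTA list' marker
        if line.startswith(">"):
            out.append(">P1;")            # marker line as printed above
            out.append(line[1:].strip())  # the name moves to its own line
        else:
            out.append(line)              # sequence lines pass through unchanged
    return "\n".join(out)


example = """# FASTA list
>name 1
EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG
AGGAPTLPETLNV
>name 2
EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG
PETLNV"""

print(fasta_to_pir_like(example))
```
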
The alignment that has been used as input to the network is:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- ------------------------------------------------------------
--- multiple sequence alignment
--- ------------------------------------------------------------
--- MAXHOM ALIGNMENT HEADER: ABBREVIATIONS FOR SUMMARY
--- ID : identifier of aligned (homologous) protein
--- STRID : PDB identifier (only for known structures)
--- PIDE : percentage of pairwise sequence identity
--- WSIM : percentage of weighted similarity
--- LALI : number of residues aligned
--- NGAP : number of insertions and deletions (indels)
--- LGAP : number of residues in all indels
--- LSEQ2 : length of aligned sequence
--- ACCNUM : SwissProt accession number
--- NAME : one-line description of aligned protein
---
--- ALIGNMENT HEADER: SUMMARY
ID STRID IDE WSIM LALI NGAP LGAP LEN2 ACCNUM NAME
t-pir-fast_1 100 96 42 1 7 42 P1; name 2
t-pir-fast_2 97 91 39 2 8 42 P1; name 3
________________________________________________________________________________
---
--- MAXHOM ALIGNMENT: IN MSF FORMAT
MSF of: /home/phd/tmp/t-pir-fast_16596.hssp from: 1 to: 49
/home/phd/tmp/t-pir-fast_16596.ret_msf MSF: 49 Type: P 15-Nov-95 05:49:5 Check: 3563 ..
Name: t-pir-fast_1 Len: 49 Check: 3035 Weight: 1.00
Name: t-pir-fast_1 Len: 49 Check: 5367 Weight: 1.00
Name: t-pir-fast_12 Len: 49 Check: 5161 Weight: 1.00
//
1 49
t-pir-fast_1 EFQEDQENVN PEKAAPAQQP RTRAGLAVLR AGNSRGAGGA PTLPETLNV
t-pir-fast_1 EFQEDQENVN PEKAAPAQQP RTRAGLAVLR AGNSRG.... ...PETLNV
t-pir-fast_12...EDQENvk PEKAAPAQQP RTRAGLAVLR AGNSRG.... ...PETLNV
________________________________________________________________________________
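
As a rough cross-check of the PIDE and LALI columns in the summary, the sketch below counts aligned and identical columns between two rows of the MSF block. The function name pairwise_stats is hypothetical. Treating '.' as a gap and lower-case letters as ordinary residues is an assumption of this sketch, not a statement about MaxHom's conventions, and NGAP/LGAP are deliberately not computed.

```python
def pairwise_stats(query, aligned):
    """Return (PIDE, LALI) for two alignment rows of equal length."""
    if len(query) != len(aligned):
        raise ValueError("alignment rows must have equal length")
    lali = 0       # columns in which both rows carry a residue
    identical = 0  # aligned columns with the same residue (case-insensitive)
    for a, b in zip(query, aligned):
        if a == "." or b == ".":
            continue
        lali += 1
        if a.upper() == b.upper():
            identical += 1
    pide = 100.0 * identical / lali if lali else 0.0
    return pide, lali


# Rows copied from the MSF block above, with the blanks between blocks removed.
row1 = "EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRGAGGAPTLPETLNV"
row2 = "EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRG.......PETLNV"
row3 = "...EDQENvkPEKAAPAQQPRTRAGLAVLRAGNSRG.......PETLNV"

for row in (row2, row3):
    pide, lali = pairwise_stats(row1, row)
    print(f"PIDE={pide:.0f} LALI={lali}")
```

Under these assumptions the two rows give the same PIDE and LALI values as the summary table (100 / 42 and 97 / 39).
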
****************************************************************************
* *
* *
* PredictProtein@EMBL-Heidelberg.DE *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Prediction of: *
* *
* - secondary structure, by PHDsec *
* - solvent accessibility, by PHDacc *
* - and helical transmembrane regions, by PHDhtm *
* *
* PHD: Profile fed neural network systems from HeiDelberg *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Author: Burkhard Rost *
* EMBL, Heidelberg, FRG *
* Meyerhofstrasse 1, 69 117 Heidelberg *
* Internet: Predict-Help@EMBL-Heidelberg.DE *
* *
* All rights reserved. *
* *
* *
****************************************************************************
* *
* *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* Secondary structure prediction by PHDsec: *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Author: Burkhard Rost *
* EMBL, Heidelberg, FRG *
* Meyerhofstrasse 1, 69 117 Heidelberg *
* Internet: Rost@EMBL-Heidelberg.DE *
* *
* All rights reserved. *
* *
* *
****************************************************************************
* *
* About the network method *
* ~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The network procedure is described in detail in: *
* 1) Rost, Burkhard; Sander, Chris: *
* Prediction of protein structure at better than 70% accuracy. *
* J. Mol. Biol., 1993, 232, 584-599. *
* *
* A brief description is given in: *
* Rost, Burkhard; Sander, Chris: *
* Improved prediction of protein secondary structure by use of *
* sequence profiles and neural networks. *
* Proc. Natl. Acad. Sci. U.S.A., 1993, 90, 7558-7562. *
* *
* The PHD mail server is described in: *
* 2) Rost, Burkhard; Sander, Chris; Schneider, Reinhard: *
* PHD - an automatic mail server for protein secondary structure *
* prediction. *
* CABIOS, 1994, 10, 53-60. *
* *
* The latest improvement steps (up to 72%) are explained in: *
* 3) Rost, Burkhard; Sander, Chris: *
* Combining evolutionary information and neural networks to predict *
* protein secondary structure. *
* Proteins, 1994, 19, 55-72. *
* *
* To be quoted for publications of PHD output: *
* Papers 1-3 for the prediction of secondary structure and the *
* prediction server. *
* *
****************************************************************************
* *
* About the input to the network *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The prediction is performed by a system of neural networks. *
* The input is a multiple sequence alignment. It is taken from an HSSP *
* file (produced by the program MaxHom): *
* Sander, Chris & Schneider, Reinhard: Database of Homology-Derived *
* Structures and the Structural Meaning of Sequence Alignment. *
* Proteins, 1991, 9, 56-68. *
* *
* For optimal results the alignment should contain sequences with varying *
* degrees of sequence similarity relative to the input protein. *
* The following is an ideal situation: *
* *
* +-----------------+----------------------+ *
* | sequence: | sequence identity | *
* +-----------------+----------------------+ *
* | target sequence | 100 % | *
* | aligned seq. 1 | 90 % | *
* | aligned seq. 2 | 80 % | *
* | ... | ... | *
* | aligned seq. 7 | 30 % | *
* +-----------------+----------------------+ *
* *
****************************************************************************
* *
* Estimated Accuracy of Prediction *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* A careful cross validation test on some 250 protein chains (in total *
* about 55,000 residues) with less than 25% pairwise sequence identity *
* gave the following results: *
* *
* ++================++-----------------------------------------+ *
* || Qtotal = 72.1% || ("overall three state accuracy") | *
* ++================++-----------------------------------------+ *
* *
* +----------------------------+-----------------------------+ *
* | Qhelix (% of observed)=70% | Qhelix (% of predicted)=77% | *
* | Qstrand(% of observed)=62% | Qstrand(% of predicted)=64% | *
* | Qloop (% of observed)=79% | Qloop (% of predicted)=72% | *
* +----------------------------+-----------------------------+ *
*..........................................................................*
* *
* These percentages are defined by: *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* | number of correctly predicted residues *
* |Qtotal = --------------------------------------- (*100)*
* | number of all residues *
* | *
* | no of res correctly predicted to be in helix *
* |Qhelix (% of obs) = -------------------------------------------- (*100)*
* | no of all res observed to be in helix *
* | *
* | *
* | no of res correctly predicted to be in helix *
* |Qhelix (% of pred)= -------------------------------------------- (*100)*
* | no of all residues predicted to be in helix *
* *
*..........................................................................*
* *
* Averaging over single chains *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The most reasonable way to compute the overall accuracies is the above *
* quoted percentage of correctly predicted residues. However, since the *
* user is mainly interested in the expected performance of the prediction *
* for a particular protein, the mean value when averaging over protein *
* chains might be of help as well. Computing first the three state *
* accuracy for each protein chain, and then averaging over 250 chains *
* yields the following average: *
* *
* +-------------------------------====--+ *
* | Qtotal/averaged over chains = 72.2% | *
* +-------------------------------====--+ *
* | standard deviation = 9.3% | *
* +-------------------------------------+ *
* *
*..........................................................................*
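
The difference between the per-residue and the per-chain average can be seen in a small sketch. The chain counts below are hypothetical toy numbers, not PHD results, and the use of the population standard deviation is an assumption, since the report does not say which estimator was used.

```python
from statistics import mean, pstdev

# Hypothetical (correctly predicted, total) residue counts for three chains.
chains = [(80, 100), (150, 250), (45, 50)]

# Per-residue accuracy: pool all residues first, as in Qtotal.
q_per_residue = 100.0 * sum(c for c, _ in chains) / sum(n for _, n in chains)

# Per-chain accuracy: compute Q for each chain, then average over chains.
q_each = [100.0 * c / n for c, n in chains]
q_per_chain = mean(q_each)
spread = pstdev(q_each)  # population standard deviation over chains

print(f"per residue: {q_per_residue:.2f}%")
print(f"per chain:   {q_per_chain:.2f}% +/- {spread:.2f}%")
```

Long chains dominate the pooled value, while every chain counts equally in the per-chain average, which is why the two numbers can differ.
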
* *
* Further measures of performance *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Matthews correlation coefficient: *
* *
* +---------------------------------------------+ *
* | Chelix = 0.63, Cstrand = 0.53, Cloop = 0.52 | *
* +---------------------------------------------+ *
*..........................................................................*
* *
* Average length of predicted secondary structure segments: *
* *
* . +------------+----------+ *
* . | predicted | observed | *
* +-----------+------------+----------+ *
* | Lhelix = | 10.3 | 9.3 | *
* | Lstrand = | 5.0 | 5.3 | *
* | Lloop = | 7.2 | 5.9 | *
* +-----------+------------+----------+ *
*..........................................................................*
* *
* The accuracy matrix in detail: *
* *
* +---------------------------------------+ *
* | number of residues with H, E, L | *
* +---------+------+------+------+--------+ *
* | |net H |net E |net L |sum obs | *
* +---------+------+------+------+--------+ *
* | obs H |12447 | 1255 | 3990 | 17692 | *
* | obs E | 949 | 7493 | 3750 | 12192 | *
* | obs L | 2604 | 2875 |19962 | 25441 | *
* +---------+------+------+------+--------+ *
* | sum Net |16000 |11623 |27702 | 55325 | *
* +---------+------+------+------+--------+ *
* *
* Note: This table is to be read in the following manner: *
* 12447 of all residues predicted to be in helix, were observed to *
* be in helix, 949 however belong to observed strands, 2604 to *
* observed loop regions. The term "observed" refers to the DSSP *
* assignment of secondary structure calculated from 3D coordinates *
* of experimentally determined structures (Dictionary of Secondary *
* Structure of Proteins: Kabsch & Sander (1983) Biopolymers, 22, *
* 2577-2637). *
* *
****************************************************************************
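
The percentages and Matthews coefficients quoted above can be recomputed from the accuracy matrix. The following Python sketch is one way to do so; the helper name per_state is hypothetical, and the one-state-versus-rest form of the Matthews coefficient is assumed here rather than taken from the report.

```python
from math import sqrt

# Accuracy matrix from the table above: rows = observed, columns = predicted,
# both in the order H (helix), E (strand), L (loop).
matrix = [
    [12447, 1255, 3990],
    [949, 7493, 3750],
    [2604, 2875, 19962],
]
states = "HEL"
total = sum(map(sum, matrix))

q_total = 100.0 * sum(matrix[i][i] for i in range(3)) / total


def per_state(i):
    """Return (% of observed, % of predicted, Matthews C) for state i."""
    tp = matrix[i][i]
    n_obs = sum(matrix[i])                  # row sum
    n_pred = sum(row[i] for row in matrix)  # column sum
    fn, fp = n_obs - tp, n_pred - tp
    tn = total - tp - fn - fp
    c = (tp * tn - fp * fn) / sqrt(n_pred * n_obs * (tn + fp) * (tn + fn))
    return 100.0 * tp / n_obs, 100.0 * tp / n_pred, c


for i, s in enumerate(states):
    obs, pred, c = per_state(i)
    print(f"{s}: %obs={obs:.1f} %pred={pred:.1f} C={c:.2f}")
print(f"Qtotal={q_total:.1f}")
```
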
* *
* Position-specific reliability index *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The network predicts the three secondary structure types using real *
* numbers from the output units. The prediction is assigned by choosing *
* the maximal unit ("winner takes all"). However, the real numbers *
* contain additional information. *
* E.g. the difference between the maximal and the second largest output *
* unit can be used to derive a "reliability index". This index is given *
* for each residue along with the prediction. The index is scaled to *
* have values between 0 (lowest reliability), and 9 (highest). *
* The accuracies (Qtot) to be expected for residues with values above a *
* particular value of the index are given below as well as the fraction *
* of such residues (%res): *
* *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | index| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | *
* | %res |100.0| 99.2| 90.4| 80.9| 71.6| 62.5| 52.8| 42.3| 29.8| 14.1| *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | | | | | | | | | | | | *
* | Qtot | 72.1| 72.3| 74.8| 77.7| 80.3| 82.9| 85.7| 88.5| 91.1| 94.2| *
* | | | | | | | | | | | | *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | H%obs| 70.4| 70.6| 73.7| 77.1| 80.1| 83.1| 86.0| 89.3| 92.5| 96.4| *
* | E%obs| 61.5| 61.7| 63.7| 66.6| 69.1| 71.7| 74.6| 77.0| 77.8| 68.1| *
* | | | | | | | | | | | | *
* | H%prd| 77.8| 78.0| 80.0| 82.6| 84.7| 86.9| 89.2| 91.3| 93.1| 95.4| *
* | E%prd| 64.5| 64.7| 67.8| 71.0| 74.2| 77.6| 81.4| 85.1| 89.8| 93.5| *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* *
* The above table gives the cumulative results, e.g. 62.5% of all *
* residues have a reliability of at least 5. The overall three-state *
* accuracy for this subset of almost two thirds of all residues is 82.9%. *
* For this subset, e.g., 83.1% of the observed helices are correctly *
* predicted, and 86.9% of all residues predicted to be in helix are *
* correct. *
* *
*..........................................................................*
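
The idea of the reliability index can be sketched as follows. The report does not state the exact scaling, so the factor of 10, the cap at 9, and the assumption that output values lie between 0 and 1 are all choices of this sketch; the function name predict_with_reliability is hypothetical.

```python
def predict_with_reliability(outputs):
    """Winner-takes-all state plus a reliability index from 0 to 9.

    `outputs` maps a state label to its output-unit value (assumed in [0, 1]).
    The scaling by 10 and the cap at 9 are assumptions of this sketch.
    """
    ranked = sorted(outputs.items(), key=lambda kv: kv[1], reverse=True)
    (state, top), (_, second) = ranked[0], ranked[1]
    index = min(9, int(10 * (top - second)))
    return state, index


# A clear helix call and a close call between helix and strand.
print(predict_with_reliability({"H": 0.82, "E": 0.05, "L": 0.13}))
print(predict_with_reliability({"H": 0.40, "E": 0.38, "L": 0.22}))
```

A large gap between the two strongest units yields a high index; nearly equal units yield an index close to 0 even though a state is still assigned.
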
* *
* The following table gives the non-cumulative quantities, i.e. the *
* values per reliability index range. These numbers answer the question: *
* how reliable is the prediction for all residues labeled with the *
* particular index i. *
* *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | index| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | *
* | %res | 8.8| 9.5| 9.3| 9.1| 9.7| 10.5| 12.5| 15.7| 14.1| *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | | | | | | | | | | | *
* | Qtot | 46.6| 50.6| 57.7| 62.6| 67.9| 74.2| 82.2| 88.3| 94.2| *
* | | | | | | | | | | | *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | H%obs| 36.8| 42.3| 49.5| 55.2| 61.7| 69.9| 78.8| 87.4| 96.4| *
* | E%obs| 44.7| 44.5| 52.1| 55.4| 60.9| 68.0| 75.9| 81.0| 68.1| *
* | | | | | | | | | | | *
* | H%prd| 49.9| 52.5| 60.3| 64.2| 69.2| 77.5| 85.4| 89.9| 95.4| *
* | E%prd| 41.7| 47.1| 53.6| 57.0| 64.0| 71.6| 78.8| 88.8| 93.5| *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* *
* For example, for residues with Relindex = 5, 64% of all predicted *
* beta-strand residues are correctly identified. *
* *
* *
****************************************************************************
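
The cumulative and the non-cumulative reliability tables for PHDsec are related by simple sums. The sketch below rebuilds cumulative values from the per-index table; the function names are hypothetical, and weighting Qtot by %res is an assumption that happens to agree with the printed cumulative numbers.

```python
# %res and Qtot per reliability index 1..9, copied from the non-cumulative table.
per_index = {1: 8.8, 2: 9.5, 3: 9.3, 4: 9.1, 5: 9.7,
             6: 10.5, 7: 12.5, 8: 15.7, 9: 14.1}
q_tot = {1: 46.6, 2: 50.6, 3: 57.7, 4: 62.6, 5: 67.9,
         6: 74.2, 7: 82.2, 8: 88.3, 9: 94.2}


def cumulative_fraction(threshold):
    """Percentage of residues with reliability index >= threshold."""
    return sum(p for i, p in per_index.items() if i >= threshold)


def cumulative_accuracy(threshold):
    """Qtot for residues with index >= threshold (weighted by %res)."""
    chosen = [i for i in per_index if i >= threshold]
    weight = sum(per_index[i] for i in chosen)
    return sum(per_index[i] * q_tot[i] for i in chosen) / weight


for t in (9, 8, 5, 1):
    print(t, round(cumulative_fraction(t), 1), round(cumulative_accuracy(t), 1))
```
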
* *
* *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* Solvent accessibility prediction by PHDacc: *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Author: Burkhard Rost *
* EMBL, Heidelberg, FRG *
* Meyerhofstrasse 1, 69 117 Heidelberg *
* Internet: Rost@EMBL-Heidelberg.DE *
* *
* All rights reserved. *
* *
* *
****************************************************************************
* *
* About the network method *
* ~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The network for prediction of secondary structure is described in *
* detail in: *
* Rost, Burkhard; Sander, Chris: *
* Prediction of protein structure at better than 70% accuracy. *
* J. Mol. Biol., 1993, 232, 584-599. *
* *
* The analysis of the prediction of solvent exposure is given in: *
* Rost, Burkhard; Sander, Chris: *
* Conservation and prediction of solvent accessibility in protein *
* families. Proteins, 1994, 20, 216-226. *
* *
* To be quoted for publications of PHD exposure prediction: *
* Both papers quoted above. *
* *
****************************************************************************
* *
* Definition of accessibility *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* For training the residue solvent accessibility the DSSP (Dictionary of *
* Secondary Structure of Proteins; Kabsch & Sander (1983) Biopolymers, 22,*
* 2577-2637) values of accessible surface area have been used. The *
* prediction provides values for the relative solvent accessibility. The *
* normalisation is the following: *
* *
* | ACCESSIBILITY (from DSSP in square Angstrom) *
* |RELATIVE_ACCESSIBILITY = ------------------------------------- * 100 *
* | MAXIMAL_ACC (amino acid type i) *
* *
* where MAXIMAL_ACC (i) is the maximal accessibility of amino acid type i.*
* The maximal values are: *
* *
* +----+----+----+----+----+----+----+----+----+----+----+----+ *
* | A | B | C | D | E | F | G | H | I | K | L | M | *
* | 106| 160| 135| 163| 194| 197| 84| 184| 169| 205| 164| 188| *
* +----+----+----+----+----+----+----+----+----+----+----+----+ *
* | N | P | Q | R | S | T | V | W | X | Y | Z | *
* | 157| 136| 198| 248| 130| 142| 142| 227| 180| 222| 196| *
* +----+----+----+----+----+----+----+----+----+----+----+ *
* *
* Notation: one letter code for amino acid, B stands for D or N; Z stands *
* for E or Q; and X stands for undetermined. *
* *
* The relative solvent accessibility can be used to estimate the number *
* of water molecules (W) in contact with the residue: *
* *
* W = ACCESSIBILITY /10 *
* *
* The prediction is given in 10 states for relative accessibility, with *
* *
* RELATIVE_ACCESSIBILITY = (PREDICTED_ACC * PREDICTED_ACC) *
* *
* where PREDICTED_ACC = 0 - 9. *
* *
****************************************************************************
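
The normalisation and the 10-state encoding defined above translate directly into code. The following is a minimal sketch; the function names are hypothetical, and the alanine value used in the example is an invented illustration, not a DSSP measurement.

```python
# Maximal accessibility per residue type, copied from the table above.
MAXIMAL_ACC = {
    "A": 106, "B": 160, "C": 135, "D": 163, "E": 194, "F": 197,
    "G": 84, "H": 184, "I": 169, "K": 205, "L": 164, "M": 188,
    "N": 157, "P": 136, "Q": 198, "R": 248, "S": 130, "T": 142,
    "V": 142, "W": 227, "X": 180, "Y": 222, "Z": 196,
}


def relative_accessibility(residue, accessibility):
    """Relative accessibility in percent for a DSSP accessibility value."""
    return 100.0 * accessibility / MAXIMAL_ACC[residue]


def state_to_relative(predicted_acc):
    """Map a predicted state 0-9 to relative accessibility in percent."""
    if not 0 <= predicted_acc <= 9:
        raise ValueError("PREDICTED_ACC must be between 0 and 9")
    return predicted_acc * predicted_acc


def water_contacts(accessibility):
    """Estimated number of water molecules: W = ACCESSIBILITY / 10."""
    return accessibility / 10.0


# Invented example: an alanine with an accessibility of 53.
print(relative_accessibility("A", 53))  # half of the alanine maximum
print(state_to_relative(6))             # state 6 stands for 36%
print(water_contacts(53))
```
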
* *
* Estimated Accuracy of Prediction *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* A careful cross validation test on some 238 protein chains (in total *
* about 62,000 residues) with less than 25% pairwise sequence identity *
* gave the following results: *
* *
* *
* Correlation *
* ........... *
* *
* The correlation between observed and predicted solvent accessibility *
* is: *
* *
* ----------- *
* corr = 0.53 *
* ----------- *
* *
* This value ought to be compared to the worst and best case prediction *
* scenario: random prediction (corr = 0.0) and homology modelling *
* (corr = 0.66). (Note: homology modelling yields a relatively accurate *
* prediction in 3D if, and only if, a significantly identical sequence *
* has a known 3D structure.) *
* *
* *
* 3-state accuracy *
* ................ *
* *
* Often the relative accessibility is projected onto, e.g., 3 states: *
* b = buried (here defined as < 9% relative accessibility), *
* i = intermediate ( 9% <= rel. acc. < 36% ), *
* e = exposed ( rel. acc. >= 36% ). *
* *
* A projection onto 3 states or 2 states (buried/exposed) enables the *
* compilation of a 3- and 2-state prediction accuracy. PHD reaches an *
* overall 3-state accuracy of: *
* Q3 = 57.5% *
* (compared to 35% for random prediction and 70% for homology modelling). *
* *
* In detail: *
* *
* +-----------------------------------+-------------------------+ *
* | Qburied (% of observed)=77% | Qb (% of predicted)=60% | *
* | Qintermediate (% of observed)= 9% | Qi (% of predicted)=44% | *
* | Qexposed (% of observed)=78% | Qe (% of predicted)=56% | *
* +-----------------------------------+-------------------------+ *
* *
* *
* 10-state accuracy *
* ................. *
* *
* The network predicts relative solvent accessibility in 10 states, with *
* state i (i = 0-9) corresponding to a relative solvent accessibility of *
* i*i %. The 10-state accuracy of the network is: *
* *
* Q10 = 24.5% *
* *
*..........................................................................*
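
Because state i stands for a relative accessibility of i*i %, the 3-state projection of a 10-state prediction is fixed by the thresholds given above (9% and 36%). A minimal sketch, with the hypothetical function name three_state:

```python
def three_state(relative_acc):
    """Project relative accessibility (in percent) onto b / i / e."""
    if relative_acc < 9:
        return "b"  # buried
    if relative_acc < 36:
        return "i"  # intermediate
    return "e"      # exposed


# Each 10-state output i stands for i*i percent relative accessibility.
projection = {i: three_state(i * i) for i in range(10)}
print(projection)
```

With these thresholds, states 0-2 are buried, states 3-5 intermediate, and states 6-9 exposed.
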
* *
* These percentages are defined by: *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* | number of correctly predicted residues *
* |Q3 = --------------------------------------- (*100)*
* | number of all residues *
* | *
* | no of res. correctly predicted to be buried *
* |Qburied (% of obs) = ------------------------------------------- (*100)*
* | no of all res. observed to be buried *
* | *
* | *
* | no of res. correctly predicted to be buried *
* |Qburied (% of pred)= ------------------------------------------- (*100)*
* | no of all residues predicted to be buried *
* *
*..........................................................................*
* *
* Averaging over single chains *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The most reasonable way to compute the overall accuracies is the above *
* quoted percentage of correctly predicted residues. However, since the *
* user is mainly interested in the expected performance of the prediction *
* for a particular protein, the mean value when averaging over protein *
* chains might be of help as well. Computing first the correlation *
* between observed and predicted accessibility for each protein chain, and *
* then averaging over all 238 chains yields the following average: *
* *
* +-------------------------------====--+ *
* | corr/averaged over chains = 0.53 | *
* +-------------------------------====--+ *
* | standard deviation = 0.11 | *
* +-------------------------------------+ *
* *
*..........................................................................*
* *
* Further details of performance accuracy *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The accuracy matrix in detail: *
* .............................. *
* *
* -------+----------------------------------------------------+----------- *
* \ PHD | 0 1 2 3 4 5 6 7 8 9 | SUM %obs *
* -------+----------------------------------------------------+----------- *
* OBS 0 | 8611 140 8 44 82 169 772 334 27 0 | 10187 16.6 *
* OBS 1 | 4367 164 0 50 106 231 738 346 44 3 | 6049 9.8 *
* OBS 2 | 3194 168 1 68 125 303 951 513 42 7 | 5372 8.7 *
* OBS 3 | 2760 159 8 80 136 327 1246 746 58 19 | 5539 9.0 *
* OBS 4 | 2312 144 2 72 166 396 1615 1245 124 19 | 6095 9.9 *
* OBS 5 | 1873 96 3 84 138 425 1979 1834 187 27 | 6646 10.8 *
* OBS 6 | 1387 67 1 60 80 278 2237 2627 231 51 | 7019 11.4 *
* OBS 7 | 1082 35 0 32 56 225 1871 3107 302 60 | 6770 11.0 *
* OBS 8 | 660 25 0 27 43 136 1206 2374 325 87 | 4883 7.9 *
* OBS 9 | 325 20 2 27 29 74 648 1159 366 214 | 2864 4.7 *
* -------+----------------------------------------------------+----------- *
* SUM |26571 1018 25 544 961 2564 13263 14285 1706 487 | *
* %pred | 43.3 1.7 0.0 0.9 1.6 4.2 21.6 23.3 2.8 0.8 | *
* -------+----------------------------------------------------+----------- *
* *
* Note: This table is to be read in the following manner: *
* 8611 of all residues predicted to have 0% relative accessibility *
* were observed with 0% relative accessibility. However, 325 of all *
* residues predicted to have 0% are observed as completely exposed *
* (obs = 9 -> rel. acc. >= 81%). The term "observed" refers to the *
* DSSP compilation of area of solvent accessibility calculated from *
* 3D coordinates of experimentally determined structures (Dictionary *
* of Secondary Structure of Proteins: Kabsch & Sander (1983) *
* Biopolymers, 22, 2577-2637). *
* *
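
The 3-state figures quoted earlier can be cross-checked against the 10-state accuracy matrix above. The sketch below groups the ten states into buried, intermediate and exposed using the i*i % encoding and the 9% / 36% thresholds; the grouping string and the helper name percent_of_observed are constructs of this sketch, not part of PHD.

```python
# Accuracy matrix from above: rows = observed state, columns = predicted state.
matrix = [
    [8611, 140, 8, 44, 82, 169, 772, 334, 27, 0],
    [4367, 164, 0, 50, 106, 231, 738, 346, 44, 3],
    [3194, 168, 1, 68, 125, 303, 951, 513, 42, 7],
    [2760, 159, 8, 80, 136, 327, 1246, 746, 58, 19],
    [2312, 144, 2, 72, 166, 396, 1615, 1245, 124, 19],
    [1873, 96, 3, 84, 138, 425, 1979, 1834, 187, 27],
    [1387, 67, 1, 60, 80, 278, 2237, 2627, 231, 51],
    [1082, 35, 0, 32, 56, 225, 1871, 3107, 302, 60],
    [660, 25, 0, 27, 43, 136, 1206, 2374, 325, 87],
    [325, 20, 2, 27, 29, 74, 648, 1159, 366, 214],
]

# State i stands for i*i percent: 0-2 buried, 3-5 intermediate, 6-9 exposed.
group = "bbbiiieeee"
total = sum(map(sum, matrix))

# A residue counts as correct in 3 states if both states fall in the same group.
correct = sum(matrix[o][p] for o in range(10) for p in range(10)
              if group[o] == group[p])
q3 = 100.0 * correct / total


def percent_of_observed(label):
    """Share of residues observed in `label` that are predicted in `label`."""
    rows = [o for o in range(10) if group[o] == label]
    hit = sum(matrix[o][p] for o in rows for p in range(10)
              if group[p] == label)
    return 100.0 * hit / sum(sum(matrix[o]) for o in rows)


print(f"Q3 = {q3:.1f}%")
print(f"Qburied (% of observed) = {percent_of_observed('b'):.0f}%")
print(f"Qexposed (% of observed) = {percent_of_observed('e'):.0f}%")
```
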
* *
* Accuracy for each amino acid: *
* ............................. *
* *
* +---+------------------------------+-----+-------+------+ *
* |AA | Q3 b%o b%p i%o i%p e%o e%p | Q10 | corr | N | *
* +---+------------------------------+-----+-------+------+ *
* | A | 59.0 87 60 2 38 66 57 | 31 | 0.530 | 5054 | *
* | C | 62.0 91 67 5 39 25 21 | 34 | 0.244 | 893 | *
* | D | 56.5 21 45 6 49 94 57 | 20 | 0.321 | 3536 | *
* | E | 60.8 9 40 3 41 98 61 | 21 | 0.347 | 3743 | *
* | F | 63.3 94 67 9 46 29 37 | 27 | 0.366 | 2436 | *
* | G | 52.1 75 51 1 31 67 53 | 22 | 0.405 | 4787 | *
* | H | 50.9 63 53 23 45 71 50 | 18 | 0.442 | 1366 | *
* | I | 64.9 95 68 6 41 30 38 | 34 | 0.360 | 3437 | *
* | K | 66.6 2 11 2 37 98 67 | 23 | 0.267 | 3652 | *
* | L | 61.6 93 65 8 44 31 40 | 31 | 0.368 | 5016 | *
* | M | 60.1 92 64 5 39 45 44 | 29 | 0.452 | 1371 | *
* | N | 55.5 45 45 8 38 87 59 | 17 | 0.410 | 2923 | *
* | P | 53.0 48 48 9 39 83 56 | 18 | 0.364 | 2920 | *
* | Q | 54.3 27 44 7 44 92 56 | 20 | 0.344 | 2225 | *
* | R | 49.9 15 47 36 47 76 51 | 18 | 0.372 | 2765 | *
* | S | 55.6 69 53 3 51 81 56 | 22 | 0.464 | 3981 | *
* | T | 51.8 61 51 8 38 78 53 | 21 | 0.432 | 3740 | *
* | V | 61.1 93 65 5 40 39 42 | 34 | 0.418 | 4156 | *
* | W | 56.2 85 62 20 49 29 27 | 21 | 0.318 | 891 | *
* | Y | 49.7 73 52 33 49 36 38 | 19 | 0.359 | 2301 | *
* +---+------------------------------+-----+-------+------+ *
* *
* Abbreviations: *
* *
* AA: amino acid in one-letter code *
* b%o, i%o, e%o: = Qburied, Qintermediate, Qexposed (% of observed), *
* i.e. percentage of correct prediction in each state, see above *
* b%p, i%p, e%p: = Qburied, Qintermediate, Qexposed (% of predicted), *
* i.e. probability of correct prediction in each state, see above *
* Q10: percentage of correctly predicted residues in each of the 10 *
* states of predicted relative accessibility. *
* corr: correlation between predicted and observed rel. acc. *
* N: number of residues in data set *
* *
* *
* Accuracy for different secondary structure: *
* ........................................... *
* *
* +--------+------------------------------+----+-------+-------+ *
* | type | Q3 b%o b%p i%o i%p e%o e%p |Q10 | corr | N | *
* +--------+------------------------------+----+-------+-------+ *
* | helix | 59.5 79 64 8 44 80 56 | 27 | 0.574 | 20100 | *
* | strand | 61.3 84 73 9 46 69 37 | 35 | 0.524 | 13356 | *
* | loop | 54.4 64 43 11 44 78 61 | 18 | 0.442 | 27968 | *
* +--------+------------------------------+----+-------+-------+ *
* *
* Abbreviations as before. *
* *
****************************************************************************
* *
* Position-specific reliability index *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The network predicts the 10 states for relative accessibility using real*
* numbers from the output units. The prediction is assigned by choosing *
* the maximal unit ("winner takes all"). However, the real numbers *
* contain additional information. *
* E.g. the difference between the maximal and the second largest output *
* unit (with the constraint that the second largest output is compiled *
* among all units at least 2 positions off the maximal unit) can be used *
* to derive a "reliability index". This index is given for each residue *
* along with the prediction. The index is scaled to have values between *
* 0 (lowest reliability), and 9 (highest). *
* The accuracies (Q3, corr, etc.) to be expected for residues with values *
* above a particular value of the index are given below as well as the *
* fraction of such residues (%res): *
* *
* +---+------------------------------+----+-------+-------+ *
* |RI | Q3 b%o b%p i%o i%p e%o e%p |Q10 | corr | %res | *
* +---+------------------------------+----+-------+-------+ *
* | 0 | 57.5 77 60 9 44 78 56 | 24 | 0.535 | 100.0 | *
* | 1 | 59.1 76 63 9 45 82 57 | 25 | 0.560 | 91.2 | *
* | 2 | 61.7 79 66 4 47 87 58 | 27 | 0.594 | 77.1 | *
* | 3 | 66.6 87 70 1 51 89 63 | 30 | 0.650 | 57.1 | *
* | 4 | 70.0 89 72 0 83 91 67 | 32 | 0.686 | 45.8 | *
* | 5 | 72.9 92 75 0 0 93 70 | 34 | 0.722 | 35.6 | *
* | 6 | 76.3 95 77 0 0 93 75 | 36 | 0.769 | 24.7 | *
* | 7 | 79.0 97 79 0 0 93 78 | 39 | 0.803 | 16.0 | *
* | 8 | 80.9 98 80 0 0 91 81 | 43 | 0.824 | 9.6 | *
* | 9 | 81.2 99 80 0 0 88 83 | 45 | 0.828 | 5.9 | *
* +---+------------------------------+----+-------+-------+ *
* *
* Abbreviations as before. *
* *
* The above table gives the cumulative results, e.g. 45.8% of all *
* residues have a reliability of at least 4. The correlation for this *
* most reliably predicted half of the residues is 0.686, i.e. a value *
* comparable to what could be expected if homology modelling were *
* possible. For this subset of 45.8% of all residues, 89% of the buried *
* residues are correctly predicted, and 72% of all residues predicted to *
* be buried are correct. *
* *
*..........................................................................*
* *
* The following table gives the non-cumulative quantities, i.e. the *
* values per reliability index range. These numbers answer the question: *
* how reliable is the prediction for all residues labeled with the *
* particular index i. *
* *
* +---+------------------------------+----+-------+-------+ *
* |RI | Q3 b%o b%p i%o i%p e%o e%p |Q10 | corr | %res | *
* +---+------------------------------+----+-------+-------+ *
* | 0 | 40.9 79 40 16 41 21 40 | 14 | 0.175 | 8.8 | *
* | 1 | 45.4 61 46 28 44 48 44 | 17 | 0.278 | 14.1 | *
* | 2 | 47.4 53 52 10 46 80 44 | 19 | 0.343 | 19.9 | *
* | 3 | 52.9 75 59 4 50 77 47 | 23 | 0.439 | 11.4 | *
* | 4 | 60.0 81 63 0 83 84 56 | 25 | 0.547 | 10.1 | *
* | 5 | 65.2 82 70 0 0 93 62 | 28 | 0.607 | 10.9 | *
* | 6 | 71.3 90 72 0 0 94 70 | 31 | 0.692 | 8.8 | *
* | 7 | 76.0 94 76 0 0 95 75 | 34 | 0.762 | 6.3 | *
* | 8 | 80.5 97 81 0 0 94 79 | 39 | 0.808 | 3.8 | *
* | 9 | 81.2 99 80 0 0 88 83 | 45 | 0.828 | 5.9 | *
* +---+------------------------------+----+-------+-------+ *
* *
* For example, for residues with RI = 4, 83% of all predicted intermediate *
* residues are correctly predicted as such. *
* *
* *
****************************************************************************
* *
* *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* Prediction of helical transmembrane segments by PHDhtm: *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Author: Burkhard Rost *
* EMBL, Heidelberg, FRG *
* Meyerhofstrasse 1, 69 117 Heidelberg *
* Internet: Rost@EMBL-Heidelberg.DE *
* *
* All rights reserved. *
* *
* *
****************************************************************************
* *
* About the network method *
* ~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The PHD mail server is described in: *
* Rost, Burkhard; Sander, Chris; Schneider, Reinhard: *
* PHD - an automatic mail server for protein secondary structure *
* prediction. *
* CABIOS, 1994, 10, 53-60. *
* *
* To be quoted for publications of PHDhtm output: *
* Rost, Burkhard; Casadio, Rita; Fariselli, Piero; Sander, Chris: *
* Prediction of helical transmembrane segments at 95% accuracy. *
* Protein Science, 1995, 4, 521-533. *
* *
****************************************************************************
* *
* Estimated Accuracy of Prediction *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* A cross validation test on 69 helical trans-membrane proteins (in total*
* about 30,000 residues) with less than 25% pairwise sequence identity *
* gave the following results: *
* *
* ++================++-----------------------------------------+ *
* || Qtotal = 94.7% || ("overall two state accuracy") | *
* ++================++-----------------------------------------+ *
* *
* +----------------------------+-----------------------------+ *
* | Qhelix (% of observed)=92% | Qhelix (% of predicted)=83% | *
* | Qloop (% of observed)=96% | Qloop (% of predicted)=97% | *
* +----------------------------+-----------------------------+ *
* *
*..........................................................................*
* *
* These percentages are defined by: *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* | number of correctly predicted residues *
* |Qtotal = --------------------------------------- (*100)*
* | number of all residues *
* | *
* | no of res correctly predicted to be in helix *
* |Qhelix (% of obs) = -------------------------------------------- (*100)*
* | no of all res observed to be in helix *
* | *
* | *
* | no of res correctly predicted to be in helix *
* |Qhelix (% of pred)= -------------------------------------------- (*100)*
* | no of all residues predicted to be in helix *
* *
*..........................................................................*
* *
* Further measures of performance *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Matthews correlation coefficient: *
* *
* +---------------------------------------------+ *
* | Chelix = 0.84, Cloop = 0.84 | *
* +---------------------------------------------+ *
*..........................................................................*
* *
* Average length of predicted secondary structure segments: *
* *
* | +------------+----------+ *
* | | predicted | observed | *
* +-----------+------------+----------+ *
* | Lhelix = | 24.6 | 22.2 | *
* +-----------+------------+----------+ *
*..........................................................................*
* *
* The accuracy matrix in detail: *
* *
* +---------------------------------+ *
* | number of residues with H, L | *
* +---------+------+-------+--------+ *
* | |net H | net L |sum obs | *
* +---------+------+-------+--------+ *
* | obs H | 5214 | 492 | 5706 | *
* | obs L | 1050 | 22423 | 23473 | *
* +---------+------+-------+--------+ *
* | sum Net | 6264 | 22915 | 29179 | *
* +---------+------+-------+--------+ *
* *
* Note: This table is to be read in the following manner: *
* 5214 of all residues predicted to be in a helical trans-membrane *
* region, were observed to be in the lipid bilayer, 1050 however *
* were observed on either side of the membrane, i.e. in *
* loop (or non-membrane) regions. The term "observed" refers to DSSP*
* assignment of secondary structure calculated from 3D coordinates *
* of experimentally determined structures (Dictionary of Secondary *
* Structure of Proteins: Kabsch & Sander (1983) Biopolymers, 22, *
* 2577-2637) where these were available. For all other proteins, *
* the assignment of trans-membrane segments has been taken from the *
* Swissprot data bank (Bairoch, A.; Boeckmann, B.: The SWISS-PROT *
* protein sequence data bank. Nucl. Acids Res. 20: 2019-2022, 1992).*
* *
*..........................................................................*
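The Matthews correlation coefficient and the overall two-state accuracy can be recomputed from the four counts in the accuracy matrix. The following is a minimal sketch, assuming the standard 2x2 definition of the coefficient; the function and variable names are illustrative and are not part of PHD.

```python
from math import sqrt

# Counts copied from the accuracy matrix (rows: observed, columns: predicted).
OBS_H_NET_H = 5214   # observed helix, predicted helix
OBS_H_NET_L = 492    # observed helix, predicted loop
OBS_L_NET_H = 1050   # observed loop, predicted helix
OBS_L_NET_L = 22423  # observed loop, predicted loop


def matthews(tp, fn, fp, tn):
    """Matthews correlation coefficient for one state of a 2x2 matrix."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def two_state_accuracy(tp, fn, fp, tn):
    """Percentage of residues predicted in their observed state."""
    return 100.0 * (tp + tn) / (tp + fn + fp + tn)


c_helix = matthews(OBS_H_NET_H, OBS_H_NET_L, OBS_L_NET_H, OBS_L_NET_L)
# For the loop state the roles of the two classes are swapped.
c_loop = matthews(OBS_L_NET_L, OBS_L_NET_H, OBS_H_NET_L, OBS_H_NET_H)
q_total = two_state_accuracy(OBS_H_NET_H, OBS_H_NET_L, OBS_L_NET_H, OBS_L_NET_L)

print(f"Chelix = {c_helix:.2f}, Cloop = {c_loop:.2f}, Qtot = {q_total:.1f}%")
```

With two states the coefficient is symmetric, which is why Chelix and Cloop are reported with the same value.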
* *
* Overlap between predicted and observed segments: *
* *
* +-----------------+---------------+----------------+ *
* | segment overlap | % of observed | % of predicted | *
* | Sov helix | 95.6% | 95.5% | *
* | Sov loop | 83.6% | 97.2% | *
* +-----------------+---------------+----------------+ *
* | Sov total | 86.0% | 96.8% | *
* +-----------------+---------------+----------------+ *
* *
* Definition of Sov in: Rost et al., JMB, 1994, 235, 13-26. *
* *
* As helical trans-membrane segments are longer than globular *
* helices, correctly predicted segments are easy to identify. PHDhtm *
* misses 5 out of 258 observed segments, predicts 6 where none is *
* observed, and in 3 cases the predicted helical segment overlaps *
* two observed regions. Thus, in total more than 95% of all segments *
* are correctly predicted. *
* *
*..........................................................................*
* *
* Entropy of prediction (information measure): *
* *
* +-----------------+ *
* | I = 0.64 | *
* +-----------------+ *
* *
* (For comparison: homology modelling of globular proteins in three *
* states: I=0.62.) *
* *
****************************************************************************
* *
* Position-specific reliability index *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* The network predicts two states, helical trans-membrane region and rest, *
* using two output units. The prediction is assigned by choosing the unit *
* with the maximal value ("winner takes all"). However, the real-valued *
* outputs of the units contain additional information. *
* E.g. the difference between the two output units can be used to derive *
* a "reliability index". This index is given for each residue along with *
* the prediction. The index is scaled to have values between 0 (lowest *
* reliability) and 9 (highest). *
* The accuracies (Qtot) to be expected for residues with an index at or *
* above a particular value are given below, as well as the fraction of *
* such residues (%res): *
* *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | index| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | *
* | %res |100.0| 98.8| 97.3| 95.9| 94.1| 92.3| 89.9| 86.2| 75.0| 66.8| *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | | | | | | | | | | | | *
* | Qtot | 94.7| 95.2| 95.6| 96.2| 96.7| 97.2| 97.7| 98.4| 99.4| 99.8| *
* | | | | | | | | | | | | *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* | H%obs| 91.8| 92.9| 93.8| 94.4| 95.0| 95.7| 96.2| 96.8| 95.5| 78.7| *
* | L%obs| 95.3| 95.7| 96.1| 96.6| 97.0| 97.5| 98.1| 98.8| 99.7|100.0| *
* | | | | | | | | | | | | *
* | H%prd| 82.7| 83.8| 85.0| 86.7| 88.1| 89.7| 91.4| 93.8| 96.3| 97.1| *
* | L%prd| 97.9| 98.3| 98.5| 98.7| 98.8| 99.0| 99.2| 99.4| 99.7| 99.9| *
* +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ *
* *
* The above table gives the cumulative results, e.g. 92.3% of all *
* residues have a reliability of at least 5. The overall two-state *
* accuracy for this subset is 97.2%. For this subset, e.g., 95.7% of *
* the observed helical trans-membrane residues are correctly predicted, *
* and 89.7% of all residues predicted to be in helical trans-membrane *
* segments are correctly predicted. *
* *
* *
* *
****************************************************************************
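The reliability index and the cumulative table can be sketched in a few lines. The exact scaling used by PHD is not given in this output; the sketch below assumes the index is the integer part of ten times the difference between the two output units, capped at 9. All function names are hypothetical. The two lists are copied from the %res and Qtot rows of the table.

```python
# Cumulative rows copied from the reliability table (position = index 0..9).
PCT_RES = [100.0, 98.8, 97.3, 95.9, 94.1, 92.3, 89.9, 86.2, 75.0, 66.8]
Q_TOT = [94.7, 95.2, 95.6, 96.2, 96.7, 97.2, 97.7, 98.4, 99.4, 99.8]


def predict_state(out_htm, out_rest):
    """Winner takes all: 'T' for trans-membrane helix, blank for the rest."""
    return "T" if out_htm > out_rest else " "


def reliability_index(out_htm, out_rest):
    """ASSUMED scaling: floor(10 * |difference|), capped at 9."""
    return min(9, int(10 * abs(out_htm - out_rest)))


def min_index_for_accuracy(target_q):
    """Smallest index whose cumulative Qtot reaches target_q, or None."""
    for index, q in enumerate(Q_TOT):
        if q >= target_q:
            return index
    return None


state = predict_state(0.8, 0.15)
rel = reliability_index(0.8, 0.15)
cutoff = min_index_for_accuracy(97.0)
print(repr(state), rel, cutoff, PCT_RES[cutoff])
```

Choosing a cutoff in this way trades coverage for accuracy: a higher index leaves fewer residues with a prediction, but those that remain are predicted more accurately.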
The resulting network (PHD) prediction is:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
________________________________________________________________________________
****************************************************************************
* *
* PredictProtein@EMBL-Heidelberg.DE *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* PHD: Profile fed neural network systems from HeiDelberg *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* Prediction of: *
* - secondary structure, by PHDsec *
* - solvent accessibility, by PHDacc *
* - and helical transmembrane regions, by PHDhtm *
* *
* Author: Burkhard Rost *
* EMBL, Heidelberg, FRG *
* Meyerhofstrasse 1, 69 117 Heidelberg *
* Internet: Predict-Help@EMBL-Heidelberg.DE *
* All rights reserved. *
* *
****************************************************************************
* *
* The network systems are described in: *
* *
* PHDsec: B Rost & C Sander: JMB, 1993, 232, 584-599. *
* B Rost & C Sander: Proteins, 1994, 19, 55-72. *
* PHDacc: B Rost & C Sander: Proteins, 1994, 20, 216-226. *
* PHDhtm: B Rost, R Casadio, P Fariselli & C Sander, *
* Prot. Science, 1995, 4, 521-533. *
* *
****************************************************************************
* *
* Some statistics *
* ~~~~~~~~~~~~~~~ *
* *
* Percentage of amino acids: *
* +--------------+--------+--------+--------+--------+--------+ *
* | AA: | A | P | G | E | R | *
* | % of AA: | 16.3 | 10.2 | 10.2 | 10.2 | 8.2 | *
* +--------------+--------+--------+--------+--------+--------+ *
* | AA: | Q | N | L | V | T | *
* | % of AA: | 8.2 | 8.2 | 8.2 | 6.1 | 6.1 | *
* +--------------+--------+--------+--------+--------+--------+ *
* | AA: | S | K | F | D | *
* | % of AA: | 2.0 | 2.0 | 2.0 | 2.0 | *
* +--------------+--------+--------+--------+--------+ *
* *
* Percentage of secondary structure predicted: *
* +--------------+--------+--------+--------+ *
* | SecStr: | H | E | L | *
* | % Predicted: | 16.3 | 0.0 | 83.7 | *
* +--------------+--------+--------+--------+ *
* *
* According to the following classes: *
* all-alpha: %H>45 and %E< 5; all-beta : %H<5 and %E>45 *
* alpha-beta : %H>30 and %E>20; mixed: rest, *
* this means that the predicted class is: mixed class *
* *
****************************************************************************
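The class assignment follows directly from the four rules listed in the box. A minimal sketch, with a hypothetical function name, applied to the percentages reported for this protein (16.3% H, 0.0% E):

```python
def structure_class(pct_helix, pct_strand):
    """Apply the class thresholds quoted in the statistics box."""
    if pct_helix > 45 and pct_strand < 5:
        return "all-alpha"
    if pct_helix < 5 and pct_strand > 45:
        return "all-beta"
    if pct_helix > 30 and pct_strand > 20:
        return "alpha-beta"
    return "mixed"


predicted_class = structure_class(16.3, 0.0)
print(predicted_class)
```

A protein with few predicted helices and no predicted strands fails the first three rules and therefore falls into the "mixed" class.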
* *
* PHD output for your protein *
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
* *
* Wed Nov 15 05:50:12 1995 *
* Jury on: 10 different architectures (version 5.94_317). *
* Note: differently trained architectures, i.e., different versions, can *
* result in different predictions. *
* *
****************************************************************************
* *
* About the protein *
* ~~~~~~~~~~~~~~~~~ *
* *
* HEADER /home/phd/tmp/t-pir-fast_16596.pir *
* COMPND *
* SOURCE *
* AUTHOR *
* SEQLENGTH 49 *
* NCHAIN 1 chain(s) in t-pir-fast_16596 data set *
* NALIGN 2 *
* (=number of aligned sequences in HSSP file) *
* *
****************************************************************************
* *
* WARNING *
* ~~~~~~~ *
* *
* Expected accuracy is about 72% if, and only if, the alignment contains *
* sufficient information. For your sequence, few homologues were *
* detected in the current version of Swissprot. This implies that *
* the expected accuracy is some percentage points lower! *
* *
****************************************************************************
* *
* Abbreviations: PHDsec *
* ~~~~~~~~~~~~~~~~~~~~~ *
* *
* sequence: *
* AA : amino acid sequence *
* secondary structure: *
* HEL: H=helix, E=extended (sheet), blank=other (loop) *
* PHD: Profile network prediction HeiDelberg *
* Rel: Reliability index of prediction (0-9) *
* detail: *
* prH: 'probability' for assigning helix *
* prE: 'probability' for assigning strand *
* prL: 'probability' for assigning loop *
* note: the 'probabilities' are scaled to the interval 0-9, e.g.,*
* prH=5 means that the first output node is 0.5-0.6 *
* subset: *
* SUB: a subset of the prediction, for all residues with an expected *
* average accuracy > 82% (tables in header) *
* note: for this subset the following symbols are used: *
* L: is loop (for which above " " is used) *
* ".": means that no prediction is made for this residue, as the *
* reliability is: Rel < 5 *
* *
* Abbreviations: PHDacc *
* ~~~~~~~~~~~~~~~~~~~~~ *
* *
* solvent accessibility: *
* 3st: relative solvent accessibility (acc) in 3 states: *
* b = 0-9%, i = 9-36%, e = 36-100%. *
* PHD: Profile network prediction HeiDelberg *
* Rel: Reliability index of prediction (0-9) *
* P_3: predicted relative accessibility in 3 states *
* note: for convenience a blank is used for intermediate (i). *
* 10st:relative accessibility in 10 states: *
* a value of n corresponds to a relative acc. of n*n % *
* subset: *
* SUB: a subset of the prediction, for all residues with an expected *
* average correlation > 0.69 (tables in header) *
* note: for this subset the following symbols are used: *
* "I": is intermediate (for which above " " is used) *
* ".": means that no prediction is made for this residue, as the *
* reliability is: Rel < 4 *
* *
* *
* Abbreviations: PHDhtm *
* ~~~~~~~~~~~~~~~~~~~~~ *
* *
* secondary structure: *
* HL: T=helical transmembrane region, blank=other (loop) *
* PHD: Profile network prediction HeiDelberg *
* PHDF:filtered prediction, i.e., transmembrane segments that are too *
* long are split, and those that are too short are deleted *
* Rel: Reliability index of prediction (0-9) *
* detail: *
* prH: 'probability' for assigning helical transmembrane region *
* prL: 'probability' for assigning loop *
* note: the 'probabilities' are scaled to the interval 0-9, e.g.,*
* prH=5 means that the first output node is 0.5-0.6 *
* subset: *
* SUB: a subset of the prediction, for all residues with an expected *
* average accuracy > 82% (tables in header) *
* note: for this subset the following symbols are used: *
* L: is loop (for which above " " is used) *
* ".": means that no prediction is made for this residue, as the *
* reliability is: Rel < 5 *
* *
****************************************************************************
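The relation between the 10-state and the 3-state accessibility can be illustrated with the rows of this output. The sketch below assumes that a value on one of the shared interval boundaries (9% and 36%) goes to the higher class; that choice is not stated in the abbreviations, but it is consistent with the "PHD acc" and "P_3 acc" rows printed for this protein. The function name is hypothetical.

```python
def three_state(n):
    """Map a 10-state value n (relative accessibility n*n %) to b, blank or e."""
    relative_acc = n * n  # in percent
    if relative_acc < 9:
        return "b"        # buried
    if relative_acc < 36:
        return " "        # intermediate, printed as a blank
    return "e"            # exposed


# "PHD acc" row of this output (10 states, one digit per residue).
phd_acc = "9077778767778007076760500000050060699999873780070"
p_3 = "".join(three_state(int(digit)) for digit in phd_acc)
print("|" + p_3 + "|")
```

For this protein the projected string is identical to the "P_3 acc" row, including the three blanks for intermediate residues.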
* *
* protein: t-pir-f length 49 *
* *
                  ....,....1....,....2....,....3....,....4....,....5....,....6
         AA      |EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRGAGGAPTLPETLNV|
         PHD sec |                      HHHHHHHH                   |
         Rel sec |9986568899424799989964366744423689999999998544469|
 detail:
         prH sec |0001210000232100010012567756643210000000000101000|
         prE sec |0000000000000000000000210122111000000000000122320|
         prL sec |9987778899656899989976211111235788999999998666679|
 subset: SUB sec |LLLLLLLLLL...LLLLLLLL..HHH.....LLLLLLLLLLLLL...LL|
 ACCESSIBILITY
 3st:    P_3 acc |ebeeeeeeeeeeebbebeeeeb bbbbbb bbebeeeeeeee eebbeb|
 10st:   PHD acc |9077778767778007076760500000050060699999873780070|
         Rel acc |7045437419446224052410143679716111135984870340230|
 subset: SUB acc |e.eee.ee.eeee..e.e.e...b.bbbb.b.....eeeeee..e....|
________________________________________________________________________________
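The SUB rows can be derived from the prediction and reliability rows using the thresholds given in the abbreviations (Rel < 5 for PHDsec, Rel < 4 for PHDacc). The sketch below is illustrative only; the function name is hypothetical. The helix in the "PHD sec" string is placed at residues 23-30, which is where prH exceeds both prE and prL in the detail rows.

```python
def subset(prediction, reliability, min_rel, blank_symbol):
    """Keep residues with Rel >= min_rel; mask the others with '.'.

    Blanks in the prediction row are written as blank_symbol
    ('L' for loop in PHDsec, 'I' for intermediate in PHDacc).
    """
    result = []
    for symbol, rel in zip(prediction, reliability):
        if int(rel) < min_rel:
            result.append(".")
        elif symbol == " ":
            result.append(blank_symbol)
        else:
            result.append(symbol)
    return "".join(result)


# Rows copied from the prediction above (49 residues each).
phd_sec = " " * 22 + "H" * 8 + " " * 19
rel_sec = "9986568899424799989964366744423689999999998544469"
p_3_acc = "ebeeeeeeeeeeebbebeeeeb bbbbbb bbebeeeeeeee eebbeb"
rel_acc = "7045437419446224052410143679716111135984870340230"

sub_sec = subset(phd_sec, rel_sec, min_rel=5, blank_symbol="L")
sub_acc = subset(p_3_acc, rel_acc, min_rel=4, blank_symbol="I")
print(sub_sec)
print(sub_acc)
```

Both derived strings agree with the "SUB sec" and "SUB acc" rows of this output. No "I" appears in the accessibility subset because all three intermediate residues have a reliability below 4.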