Database of homology-derived protein structures and the structural meaning of sequence alignment.
Chris Sander and Reinhard Schneider
Proteins, 1991, 9, 56-68
The database of known protein three-dimensional structures can be
significantly increased by the use of sequence homology, based on the
following observations. (1) The database of known sequences, currently
at more than 12000 proteins, is two orders of magnitude larger than the
database of known structures. (2) The currently most powerful method of
predicting protein structures is model building by homology.
(3) Structural homology can be inferred from the level of sequence
similarity. (4) The threshold of sequence similarity sufficient for
structural homology depends strongly on the length of the alignment.
Here, we first quantify the relation between sequence similarity,
structure similarity and alignment length by an exhaustive survey of
alignments between proteins of known structure and report a homology
threshold curve as a function of alignment length. We then produce a
database of homology-derived secondary structure of proteins (HSSP) by
aligning to each protein of known structure all sequences deemed
homologous on the basis of the threshold curve. For each known protein
structure, the derived database contains the aligned sequences,
secondary structure, sequence variability and sequence profile. Tertiary
structures of the aligned sequences are implied, but not modelled
explicitly. The database effectively increases the number of known
protein structures by a factor of five to more than 1800.
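
The length-dependent homology threshold described above can be sketched as a simple filter. The abstract does not state the curve itself; the constants below (290.15, -0.562, and the 24.8% plateau for long alignments) are the form commonly quoted for this curve and should be treated as an assumption here, as should the function names, the `margin` parameter, and the handling of very short alignments.

```python
def hssp_threshold(length: int) -> float:
    """Percent-identity cutoff for an alignment of `length` residues.

    Assumed functional form (not given in the abstract):
    290.15 * L**-0.562 for 10 < L < 80, and a flat 24.8% from L = 80 on.
    """
    if length < 1:
        raise ValueError("alignment length must be positive")
    if length <= 10:
        # Assumption: very short alignments are only accepted if identical.
        return 100.0
    if length < 80:
        return 290.15 * length ** -0.562
    return 24.8


def is_structurally_homologous(percent_identity: float,
                               length: int,
                               margin: float = 0.0) -> bool:
    """True if the alignment lies on or above the threshold curve.

    `margin` is a hypothetical safety offset in percentage points.
    """
    return percent_identity >= hssp_threshold(length) + margin
```

Under these assumptions, 30% identity is enough to imply structural homology over 100 aligned residues but not over 40, which illustrates why the threshold has to depend on alignment length.
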
The results may be useful in assessing the structural significance of
matches in sequence database searches, in deriving preferences and
patterns for structure prediction, in elucidating the structural role of
conserved residues and in modelling three-dimensional detail by homology.
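
As an illustration of the per-position sequence profile and sequence variability stored for each structure, the following minimal sketch derives both from a set of aligned sequences. It assumes equal-length, gap-padded alignment rows, uses Shannon entropy as the variability measure, and ignores gaps; these choices and all names (`column_profiles`, `column_entropy`, `GAP_CHARS`) are illustrative assumptions, not the database's actual definitions.

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}  # assumed gap symbols in the alignment rows


def column_profiles(aligned: list[str]) -> list[dict[str, float]]:
    """Relative residue frequencies at each alignment column, gaps ignored."""
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profiles = []
    for i in range(width):
        counts = Counter(row[i] for row in aligned if row[i] not in GAP_CHARS)
        total = sum(counts.values())
        # A column made only of gaps yields an empty profile.
        profiles.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profiles


def column_entropy(profile: dict[str, float]) -> float:
    """Shannon entropy (in nats) of one column; 0 means fully conserved."""
    return -sum(p * math.log(p) for p in profile.values() if p > 0)
```

A fully conserved column has entropy 0, so low-entropy positions are natural candidates when looking for the structural role of conserved residues.
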