Specification of MSF format
Example for MSF format when using the email interface
Example for MSF format when using the WWW interface
Compulsory features are:
- KEYWORDS
The keywords "MSF:", "Type:", and "Check:" must appear in a single line that ends with two dots (".."). The data is assumed to start after the dots; anything preceding that line may be omitted. The keywords have the following meaning:
- MSF:
= alignment length (length of longest sequence)
- Type:
= P for protein sequences,
= N for nucleotide sequences (not allowed for PredictProtein)
- Check:
Gives a checksum made up of the ASCII values of the sequence characters. This value can be used to check whether an alignment has been edited since it was created. PredictProtein does not make use of the explicit values given for Check. Unfortunately, the current format converter software produces errors if no number is given. Thus, invent any number you want and put it here!
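The keyword line described above can be recognised with a short script. The following is only a minimal sketch: the regular expression and the function name parse_keyword_line are illustrative assumptions, not part of PredictProtein or GCG, and the exact spacing in real files may differ.

```python
import re

# Hypothetical pattern for the keyword line: "MSF:", "Type:" and "Check:"
# in one line that ends with two dots. Text before "MSF:" is tolerated.
KEYWORD_LINE = re.compile(
    r"MSF:\s*(?P<length>\d+)\s+"
    r"Type:\s*(?P<type>[PN])\s+"
    r"Check:\s*(?P<check>\d+)\s*\.\.\s*$"
)


def parse_keyword_line(line):
    """Return (alignment length, type, check) or None if the line does not match."""
    match = KEYWORD_LINE.search(line)
    if match is None:
        return None
    return int(match.group("length")), match.group("type"), int(match.group("check"))


# The Check value is arbitrary here, as the specification permits.
print(parse_keyword_line(" MSF: 40  Type: P  Check: 1234  .."))
# (40, 'P', 1234)
```

The sketch accepts both P and N as type; for PredictProtein only P is allowed.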
- ALIGNMENT DESCRIPTION
After the dots, and preceding the alignment, there is the alignment description part.
- Sequence identifier
The sequence names (following the keyword "Name:") HAVE to be UNIQUE (different names for any sequence pair).
No blank is accepted within a sequence name. E.g. 'Id seq a' will be interpreted as: identifier = Id, and amino acids 1-4 = s, e, q, a.
A sequence name (e.g. 'Id_seq_0') may be at most 13 characters long.
- Len:, Check:, and Weight:
The fields "Len:", "Check:", and "Weight:" are not used by PredictProtein. However, the conversion software again requires that numbers (of any value) are given.
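The rules for sequence names can be checked before a file is submitted. The sketch below assumes that the limit of 13 characters applies to every sequence name; the function name check_names and the wording of the messages are made up for illustration.

```python
MAX_NAME_LENGTH = 13  # limit stated in the specification (assumed to apply to every name)


def check_names(names):
    """Return a list of problems found; an empty list means all names pass."""
    problems = []
    seen = set()
    for name in names:
        if any(ch.isspace() for ch in name):
            problems.append(f"{name!r}: contains a blank")
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"{name!r}: longer than {MAX_NAME_LENGTH} characters")
        if name in seen:
            problems.append(f"{name!r}: name is not unique")
        seen.add(name)
    return problems


for problem in check_names(["Id_seq_0", "Id seq a", "Id_seq_0"]):
    print(problem)
```

With the three names above, the sketch reports one blank and one duplicate.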
- //
Essential double slash: "//" terminates the header list. After this the alignment is expected to begin.
- ALIGNMENT
The rest of the file is interpreted as alignment. Any line not starting with a sequence identifier (as given in the header!) is ignored. If a line starts with a correct identifier, say Id_seq_n, EVERYTHING following the first word of this line is appended to the sequence Id_seq_n.
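The reading rules above can be summarised in a small parser. This is a sketch of the rules as stated here, not the converter actually used by PredictProtein; the function name parse_msf and the example data are invented, and blanks inside the alignment lines are assumed to be dropped.

```python
def parse_msf(text):
    """Return {sequence name: sequence} following the rules described above."""
    lines = text.splitlines()
    # Everything before the keyword line (the line ending with "..") is skipped.
    start = next(
        (i for i, line in enumerate(lines) if line.rstrip().endswith("..")), None
    )
    if start is None:
        raise ValueError("no keyword line ending with '..' found")

    sequences = {}
    in_alignment = False
    for line in lines[start + 1:]:
        words = line.split()
        if not words:
            continue
        if not in_alignment:
            if words[0] == "//":
                in_alignment = True          # header list is terminated
            elif words[0] == "Name:" and len(words) > 1:
                sequences[words[1]] = ""     # register the identifier
            continue
        # Alignment part: lines not starting with a known identifier are ignored.
        if words[0] in sequences:
            sequences[words[0]] += "".join(words[1:])
    return sequences


example = """\
 MSF: 12  Type: P  Check: 1234  ..

 Name: Id_seq_0  Len: 12  Check: 1  Weight: 1.00
 Name: Id_seq_1  Len: 12  Check: 2  Weight: 1.00

//

          positions 1 to 10
Id_seq_0  MKVLA AGIVG
Id_seq_1  MKVAA AGLVG

Id_seq_0  LL
Id_seq_1  LV
"""

print(parse_msf(example))
# {'Id_seq_0': 'MKVLAAGIVGLL', 'Id_seq_1': 'MKVAAAGLVGLV'}
```

The line "positions 1 to 10" is ignored because its first word is not an identifier from the header.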
General note:
MSF is the multiple sequence alignment format of the GCG sequence analysis package. If you have access to the GCG package, generating an MSF file is straightforward. If not, and if you have difficulties generating one according to the specification given above, please contact:
Predict-Help@embl-heidelberg.de.
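Without GCG, a file that follows the specification above can also be written by a short script. The following is a minimal sketch under several assumptions: the function name write_msf, the column layout, and the placeholder values for Check and Weight are illustrative choices (the specification only requires that some number is present), and the sequences are assumed to be aligned already.

```python
def write_msf(sequences, seq_type="P", width=50):
    """Build MSF text from {name: aligned sequence}; Check values are placeholders."""
    length = max(len(seq) for seq in sequences.values())
    out = [f" MSF: {length}  Type: {seq_type}  Check: 1  ..", ""]
    for name, seq in sequences.items():
        # Len:, Check: and Weight: only need to be numbers of any value.
        out.append(f" Name: {name}  Len: {len(seq)}  Check: 1  Weight: 1.00")
    out += ["", "//", ""]
    # Write the alignment in blocks of `width` characters per sequence.
    for start in range(0, length, width):
        for name, seq in sequences.items():
            out.append(f"{name:<15}{seq[start:start + width]}")
        out.append("")
    return "\n".join(out) + "\n"


print(write_msf({"Id_seq_0": "MKVLAAGIVG", "Id_seq_1": "MKVAAAGLVG"}, width=5))
```

The sketch does not validate the sequence names; they still have to be unique, free of blanks, and at most 13 characters long.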