The new MultiSeq version of VMD, currently under development, was originally created to allow biomedical researchers to study the evolutionary changes in sequence and structure of proteins across all three domains of life, from bacteria to humans. The comparative sequence and structure metrics and analysis tools introduced in the accompanying article by O’Donoghue and Luthey-Schulten (P. O’Donoghue and Z. Luthey-Schulten. “On the evolution of structure in aminoacyl-tRNA synthetases.” MMBR, 67(4):550-73. December, 2003.) are now part of this new version. In particular, the Luthey-Schulten group has included a recently developed structure-based measure of homology, $Q_H$ (see Appendix B), that takes into account the effect of insertions and deletions and has been shown to produce accurate structure-based phylogenetic trees. The STAMP structural alignment algorithm, kindly provided by our colleagues Russell and Barton, is included in our alpha release (R.B. Russell and G.J. Barton. “Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels.” Proteins: Struct. Func. Genet., 14:309-323. 1992.). We plan to offer biomedical researchers a tool to examine the changes in protein structure in the correct statistical framework. As a result, Multiple Alignment is an invaluable tool for relating protein structure to its function or malfunction.
MultiSeq with VMD, in its alpha release, is operable on the following platforms:
Since MultiSeq is an extension to VMD, VMD 1.8.3MultiSeq is required to run MultiSeq. Although BLAST is not necessary for the overall function of MultiSeq, it is highly recommended to have BLAST installed locally (i.e. accessible through file browsing on your local computer).
MultiSeq is accessed as an extension through VMD. To start MultiSeq, you must first launch the VMD 1.8.3MultiSeq application. Once VMD is up and running:
The main MultiSeq window will appear alongside the VMD Main, OpenGL Display, and terminal windows. Figure 8 shows the main MultiSeq window.
To begin analyzing proteins in MultiSeq, data from sequence files (for example, FASTA files) and structure files are required. Import Data allows you to load structure and sequence files, both locally and via a network connection. Although various structure and trajectory files, such as PDB and PSF, can be loaded via the New Molecule function of the VMD Main window, only Import Data allows you to load sequence files. Additionally, Import Data has BLAST searching capabilities if a local copy of BLAST is installed. Load Session is another way to input structure and sequence files into MultiSeq. Unlike Import Data, Load Session opens a previous session of MultiSeq with all of the sequence and structure files aligned, using the previous coloring and drawing methods. In short, Import Data loads only sequence and structure files into MultiSeq (raw data), whereas Load Session restores a whole MultiSeq session (manipulated data).
Note on structure files: The ASTRAL database (http://astral.stanford.edu) is a compendium of protein domain structures derived from the PDB database. It divides each protein structure into its domain components. For example, AspRS is divided into three separate PDB files: one containing the catalytic domain, one with the insertion domain, and one for the anticodon binding domain. The names of the files contain the PDB extension, the letter a for ASTRAL, and a number that indicates which domain of the original PDB file it is. The PDB is the single worldwide repository for the processing and distribution of 3-D structure data of large molecules of proteins and nucleic acids.
As noted before, MultiSeq sessions are previously saved sessions of MultiSeq, including structure and sequence files with alignments and various visual representations. To load a previously saved MultiSeq session, select the File menu and then Load Session. A file browser will appear, allowing you to select a file with the extension .multiseq; make sure it has a corresponding directory of the same name.
Structure files (see the VMD manual for supported formats) and sequence files can be loaded into MultiSeq via Import Data. PDB is a structure file format, whereas FASTA is a sequence file format. To load these files:
If you would like to load multiple files at once, you can separate the file names with commas.
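For readers unfamiliar with the sequence format mentioned above: a FASTA file is plain text in which each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The following is a minimal sketch of reading such records in Python. It is an illustration only, not part of MultiSeq; the function name `parse_fasta` and the example sequences are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for the records found in FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; the header is everything after '>'.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
    return records


# Hypothetical two-record example (not real protein sequences).
example = """>seqA hypothetical example
MKVLAA
GTW
>seqB hypothetical example
MKILSA
""".splitlines()

records = parse_fasta(example)
```

Each key of `records` is a header line without the leading `>`, and each value is the full sequence with line breaks removed.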
In Import Data, you can also search for sequences via BLAST. To do this, you must first have BLAST installed in a locally accessible location. To conduct a BLAST search:
After you hit the OK button, your computer will begin a BLAST search. This may take several minutes. When the search is done, a new window called BLAST Search Results will appear. The results do not immediately appear in the main MultiSeq window, because you may want to apply further filters to the retrieved sequences. The BLAST Search Results window is divided into three main parts: sequence viewer, filter options, and view options.
MultiSeq provides a unique working environment for the analysis of proteins. As with most pieces of software, MultiSeq has its own nomenclature and unique methods of analysis.
While working with the Sequence Viewer in MultiSeq, you may notice certain patterns or trends and want to place certain sequences next to each other to analyze such motifs. MultiSeq can group sequences by taxonomy, or you can define custom groupings using Create from Selection.
Whenever you load a sequence or structure into MultiSeq, an I button appears next to the protein’s ID. Clicking this button opens a new window called the Info Viewer, which displays information about the species the protein comes from.
As you browse the menus of MultiSeq, you will notice options for Selected Sequences or Marked Sequences. Selecting sequences means highlighting part of the sequence(s) in the sequence viewer using the mouse; the selection can cover an entire sequence or only a portion of it. Marking sequences, by contrast, lets you pick an entire sequence more easily by checking the box next to the protein ID.
MultiSeq uses the program STAMP to structurally align protein molecules. The STAMP algorithm minimizes the Cα distance between aligned residues of each molecule by applying globally optimal rigid-body rotations and translations. Note that you can only perform alignments on molecules that are structurally similar. If you try to align proteins that have no common structures, STAMP will have no means to align them. For further information about how the alignment is computed, please refer to the STAMP manual.
Highlight Style is an option for the OpenGL display. The style refers to the drawing method in VMD (for more information about drawing methods, please refer to the VMD manual). This option allows a user to highlight residues of a structure in the Sequence Display and see the same areas simultaneously highlighted in the OpenGL display.
Highlight Color is another option for the OpenGL display. Alongside Highlight Style, Highlight Color is the color or coloring method (for more information, please refer to the VMD manual) used in the OpenGL display when highlighting residues in the Sequence Display. The default Highlight Color is yellow.
The Phylogenetic Tree feature in MultiSeq helps in determining the structure-based and sequence-based relationships between the aligned domains of proteins. To do this, it uses a modification of $Q$ that accounts for both gapped and aligned regions. This new metric, $Q_H$, creates a structure-based phylogeny that is congruent to the sequence-based phylogenies.
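To make the idea of a tree built from pairwise similarity scores concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the distance between two proteins is taken as 1 − QH and that clusters are joined by average linkage (UPGMA). The manual does not state which tree-building method MultiSeq uses, so both choices, the function name `upgma`, and the example values are assumptions.

```python
from itertools import combinations


def upgma(names, qh):
    """Cluster proteins by average linkage on the distance 1 - QH.

    names: list of protein labels.
    qh:    dict mapping frozenset({a, b}) to the pairwise QH value.
    Returns the tree as nested tuples, e.g. ("C", ("A", "B")).
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    # ASSUMPTION: distance is defined as 1 - QH.
    dist = {pair: 1.0 - value for pair, value in qh.items()}
    while len(sizes) > 1:
        # Merge the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # The distance from the merged cluster to each remaining cluster
        # is the size-weighted average of the two old distances.
        for c in sizes:
            dac = dist.pop(frozenset((a, c)))
            dbc = dist.pop(frozenset((b, c)))
            dist[frozenset((merged, c))] = (na * dac + nb * dbc) / (na + nb)
        del dist[frozenset((a, b))]
        sizes[merged] = na + nb
    return next(iter(sizes))


# Hypothetical pairwise QH values for three proteins.
qh_values = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.4,
    frozenset(("B", "C")): 0.5,
}
tree = upgma(["A", "B", "C"], qh_values)
```

In this example A and B have the highest QH, so they are joined first and C attaches to that pair afterwards.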
The Residue Selection feature lets you analyze conservation, using different measures, and highlight residues in the Sequence Display and Structure Display simultaneously. Residue Selection allows you to examine the conservation on a per residue basis.
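As an illustration of what a per-residue conservation measure can look like, the sketch below computes, for each column of a multiple sequence alignment, the fraction of sequences that share the most common residue. This is one simple measure chosen for the example; it is not claimed to be the measure MultiSeq uses, and the function name `column_identity` and the example alignment are hypothetical.

```python
from collections import Counter


def column_identity(aligned_seqs, gap="-"):
    """Per-column fraction of sequences sharing the most common residue."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r != gap]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Gaps count against conservation: divide by all sequences.
        scores.append(top_count / len(column))
    return scores


# Hypothetical alignment of three short sequences.
alignment = ["MKV-A", "MKI-A", "MRVGA"]
scores = column_identity(alignment)
# Columns in which every sequence has the same residue.
conserved = [i for i, s in enumerate(scores) if s == 1.0]
```

Columns with a score of 1.0 are fully conserved; a threshold below 1.0 could be used to select partially conserved positions instead.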
You can save a session of MultiSeq, with all of the files, alignments, and visual representations, by going to the File menu and selecting Save Session. You will be prompted to save the session and can give it a unique name. Hit the OK button. Save Session will generate a file with a .multiseq extension along with a directory containing the files necessary to load the saved session into MultiSeq. Please note that the generated file and the directory must be in the same directory location for the session to load properly in the future.
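Because a saved session consists of a file and a companion directory that must sit side by side, it can be useful to check both before moving or loading a session. The sketch below does this in Python. It assumes the companion directory carries the session file's name without the .multiseq extension; the manual only says the two share a name, so that detail, like the function name `session_is_complete`, is an assumption.

```python
import tempfile
from pathlib import Path


def session_is_complete(session_file):
    """Check that a .multiseq file has its companion directory beside it."""
    path = Path(session_file)
    if path.suffix != ".multiseq" or not path.is_file():
        return False
    # ASSUMPTION: the directory is named like the session file without
    # its extension, e.g. demo.multiseq -> demo/
    return path.with_suffix("").is_dir()


# Demonstration in a temporary location with stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "demo.multiseq"
    session.write_text("")                      # placeholder session file
    before = session_is_complete(session)       # directory not yet present
    (Path(tmp) / "demo").mkdir()
    after = session_is_complete(session)        # file and directory present
```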
During a MultiSeq session, you may want to save various perspectives of the structural alignments you created. Often these images are generated by highlighting specific portions of the aligned protein sequences. If you would like to study your selections further, you can do so by generating your own PDB file(s). To begin this process:
The following equation is from the article “Evaluating protein structure-prediction schemes using energy landscape theory” by Eastwood et al.
$$Q = \frac{2}{(N-1)(N-2)} \sum_{i<j-1} \exp\left[-\frac{\left(r_{ij} - r_{ij}^{N}\right)^{2}}{2\sigma_{ij}^{2}}\right]$$
$r_{ij}$ is the distance between a pair of Cα atoms.
$r_{ij}^{N}$ is the Cα-Cα distance between residues $i$ and $j$ in the native state.
$\sigma_{ij}^{2} = |i-j|^{0.15}$ sets the standard deviation $\sigma_{ij}$, determining the width of the Gaussian function.
$N$ is the number of residues of the protein being considered.
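The Q formula above can be sketched in a few lines of Python. The function below takes two lists of Cα coordinates (the structure being scored and the native state) and sums the Gaussian terms over all pairs with i < j − 1. It assumes the width is given by σ_ij² = |i − j|^0.15; the function name `q_measure` and the example coordinates are hypothetical, and this is an illustration rather than the MultiSeq implementation.

```python
import math


def q_measure(coords, native):
    """Q between a structure and its native state, from C-alpha coordinates.

    coords, native: equal-length lists of (x, y, z) tuples.
    """
    n = len(coords)
    if n != len(native):
        raise ValueError("both structures need the same number of residues")
    if n < 3:
        raise ValueError("at least three residues are required")
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):  # pairs with i < j - 1
            r = math.dist(coords[i], coords[j])
            r_native = math.dist(native[i], native[j])
            # ASSUMPTION: sigma_ij^2 = |i - j|^0.15
            sigma_sq = abs(i - j) ** 0.15
            total += math.exp(-((r - r_native) ** 2) / (2.0 * sigma_sq))
    return 2.0 * total / ((n - 1) * (n - 2))


# Hypothetical four-residue chain laid out on a straight line.
native_state = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
stretched = [(2.0 * x, y, z) for x, y, z in native_state]

q_same = q_measure(native_state, native_state)
q_stretched = q_measure(stretched, native_state)
```

Comparing a structure with itself makes every exponential equal to 1, and since there are (N − 1)(N − 2)/2 pairs with i < j − 1, the normalization brings Q to exactly 1. Any deviation in pair distances lowers the value toward 0.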
The following text is in the article “On the evolution of structure in aminoacyl-tRNA
synthetases.” by O’Donoghue et al.
Homology Measure
We employ a structural homology measure which is based on the structural similarity measure, $Q$, developed by Wolynes, Luthey-Schulten, and coworkers in the field of protein folding. Our adaptation of $Q$ is referred to as $Q_H$, and the measure is designed to include the effects of the gaps on the aligned portion: $Q_H = \aleph\,(q_{aln} + q_{gap})$, where $\aleph$ is the normalization, specifically given below. $Q_H$ is composed of two components. $q_{aln}$ is identical in form to the unnormalized $Q$ measure of Eastwood et al. and accounts for the structurally aligned regions. The $q_{gap}$ term accounts for the structural deviations induced by insertions in each protein in an aligned pair:
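$$Q_H = \aleph \left[ q_{aln} + q_{gap} \right]$$
$$q_{aln} = \sum_{i<j-2} \exp\left[-\frac{\left(r_{ij} - r_{i'j'}\right)^{2}}{2\sigma_{ij}^{2}}\right]$$
$$q_{gap} = \sum_{g_a} \sum_{j}^{N_{aln}} \max\left\{ \exp\left[-\frac{\left(r_{g_a j} - r_{g'_a j'}\right)^{2}}{2\sigma_{g_a j}^{2}}\right], \exp\left[-\frac{\left(r_{g_a j} - r_{g''_a j'}\right)^{2}}{2\sigma_{g_a j}^{2}}\right] \right\} + \sum_{g_b} \sum_{j}^{N_{aln}} \max\left\{ \exp\left[-\frac{\left(r_{g_b j} - r_{g'_b j'}\right)^{2}}{2\sigma_{g_b j}^{2}}\right], \exp\left[-\frac{\left(r_{g_b j} - r_{g''_b j'}\right)^{2}}{2\sigma_{g_b j}^{2}}\right] \right\}$$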
The first term, $q_{aln}$, computes the unnormalized fraction of Cα-Cα pair distances that are the same or similar between two aligned structures. $r_{ij}$ is the spatial Cα-Cα distance between residues $i$ and $j$ in protein a, and $r_{i'j'}$ is the Cα-Cα distance between residues $i'$ and $j'$ in protein b. This term is restricted to aligned positions, i.e., where $i$ is aligned to $i'$ and $j$ is aligned to $j'$. The remaining terms account for the residues in gaps. $g_a$ and $g_b$ are the residues in insertions in proteins a and b, respectively. $g'_a$ and $g''_a$ are the aligned residues on either side of the insertion in protein a. The definition is analogous for $g'_b$ and $g''_b$.
The normalization $\aleph$ and the $\sigma_{ij}^{2}$ terms are computed as:
$$\aleph = \frac{1}{\frac{1}{2}(N_{aln}-1)(N_{aln}-2) + N_{aln} N_{gr} - n_{gaps} - 2 n_{cgaps}}$$
$$\sigma_{ij}^{2} = |i-j|^{0.15}$$
where $N_{aln}$ is the number of aligned residues, $N_{gr}$ is the number of residues appearing in gaps, and $n_{gaps}$ is the sum of the number of insertions in protein “a”, the number of insertions in protein “b”, and the number of simultaneous insertions (referred to as bulges or c-gaps). $n_{cgaps}$ is the number of c-gaps. Gap-to-gap contacts and intra-gap contacts do not enter into the computation, and terminal gaps are also ignored. $\sigma_{ij}^{2}$ is a slowly growing function of the sequence separation of residues $i$ and $j$, and this serves to stretch the spatial tolerance of similar contacts at larger sequence separations. $Q_H$ ranges from 0 to 1, where $Q_H = 1$ refers to identical proteins. If there are no gaps in the alignment, then $Q_H$ becomes $Q_{aln} = \aleph\, q_{aln}$, which is identical to the Q measure described before.