1 Introduction

The new MultiSeq version of VMD, currently under development, was originally created to allow biomedical researchers to study the evolutionary changes in sequence and structure of proteins across all three domains of life, from bacteria to humans. The comparative sequence and structure metrics and the analysis tools introduced in the accompanying article by O’Donoghue and Luthey-Schulten (P. O’Donoghue and Z. Luthey-Schulten. “On the Evolution of Structure in Aminoacyl-tRNA Synthetases.” MMBR, 67(4):550-573. December, 2003) are now part of this new version. In particular, the Luthey-Schulten group has included a recently developed structure-based measure of homology, QH (see Appendix B), that takes into account the effect of insertions and deletions and has been shown to produce accurate structure-based phylogenetic trees. The STAMP structural alignment algorithm, kindly provided by our colleagues Russell and Barton, is included in our alpha release (R.B. Russell and G.J. Barton. “Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels.” Proteins: Struct. Func. Genet., 14:309-323. 1992). We plan to offer biomedical researchers a tool to examine the changes in protein structure in the correct statistical framework. As a result, Multiple Alignment is an invaluable tool for relating protein structure to its function or malfunction.

1.1 Software Requirements

MultiSeq with VMD, in its alpha release, is operable on the following platforms:

Since MultiSeq is an extension to VMD, VMD 1.8.3MultiSeq is required to run MultiSeq. Although BLAST is not necessary for the overall function of MultiSeq, it is highly recommended to have BLAST installed locally (i.e. accessible through file browsing on your local computer).

2 The main MultiSeq Window

MultiSeq is accessed as an extension through VMD. To start MultiSeq, first start the VMD 1.8.3MultiSeq application. Once VMD is up and running:

  1. In the VMD main window, click on the Extensions Menu.
  2. In Extensions, select Analysis → Multiple Alignment.

The main MultiSeq window will appear alongside the VMD Main, OpenGL Display, and terminal windows. Figure 1 shows the main MultiSeq window.


Figure 1: Main MultiSeq Window


3 Using and Managing Data

To begin analyzing proteins in MultiSeq, data from sequence and structure files are required. Import Data allows you to load structure and sequence files, both locally and via a network connection. Although various structure and trajectory files, such as PDB and PSF, can be loaded via the New Molecule function of the VMD Main window, Import Data also allows you to load sequence files. Additionally, Import Data has BLAST searching capabilities if a local copy of BLAST is installed.

Load Session is another way to bring structure and sequence files into MultiSeq. Unlike Import Data, Load Session opens a previous MultiSeq session with all of its sequence and structure files aligned and with the previous coloring and drawing methods applied. In short, Import Data loads only sequence and structure files (raw data), whereas Load Session restores a whole MultiSeq session (manipulated data).

A note on structure files: the ASTRAL database (http://astral.stanford.edu) is a compendium of protein domain structures derived from the PDB database. It divides each protein structure into its domain components. For example, AspRS is divided into three separate PDB files: one containing the catalytic domain, one with the insertion domain, and one for the anticodon binding domain. The names of the files contain the PDB extension, the letter a for ASTRAL, and a number indicating which domain of the original PDB file it is. The PDB is the single worldwide repository for the processing and distribution of 3-D structure data of large molecules of proteins and nucleic acids.

3.1 MultiSeq Sessions

As noted before, MultiSeq sessions are previously saved states of MultiSeq, including structure and sequence files with alignments and various visual representations. To load a previously saved MultiSeq session, select the File menu and then Load Session. A file browser will appear, allowing you to select a file with the extension .multiseq. Make sure it has a corresponding directory of the same name.

3.2 Importing from files

Structure and sequence files can be loaded into MultiSeq via Import Data (see the VMD manual for supported structure formats). PDB files are structure files, whereas FASTA is a sequence file format. To load these files:

  1. Make sure From Files is selected as a Data Source.
  2. In the Filenames: dialogue, either type in the location of the file, or hit the browse button to locate the file.
  3. Alternatively, for PDB files only, type the PDB or SCOP ID in the Filenames: dialogue. This option requires a network connection so that your computer can obtain the files from PDB or ASTRAL directly.
  4. Once you have completed step 2 or 3, hit the OK button.

If you would like to load multiple files at once, separate the files with commas.
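As an illustration of the comma-separated Filenames: entry described above, the following Python sketch splits such an entry and guesses whether each item is a PDB ID, a SCOP ID, or a file path. This is not MultiSeq's own parser: the function name `classify_entries` and the two patterns (four characters starting with a digit for PDB IDs, seven characters starting with `d` for SCOP IDs) are assumptions made for the example.

```python
import re

# Assumed patterns, for illustration only (not taken from MultiSeq):
# a PDB ID is four characters beginning with a digit, e.g. "1asz";
# a SCOP domain ID is "d" + PDB ID + chain + domain, e.g. "d1asza1".
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
SCOP_ID = re.compile(r"^d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]$")


def classify_entries(field):
    """Split a comma-separated entry and label each item."""
    result = []
    for raw in field.split(","):
        entry = raw.strip()
        if not entry:
            continue  # ignore empty items such as a trailing comma
        if PDB_ID.match(entry):
            kind = "pdb_id"
        elif SCOP_ID.match(entry):
            kind = "scop_id"
        else:
            kind = "file"  # anything else is treated as a local path
        result.append((entry, kind))
    return result


if __name__ == "__main__":
    for entry, kind in classify_entries("1asz, d1asza1, /tmp/seqs.fasta"):
        print(entry, kind)
```

Items labeled as IDs would need a network connection to be fetched, while items labeled as files must exist locally.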


Figure 2: Main MultiSeq Window


From Files
selecting this option means you know the location and/or IDs of the files.
Filenames:
Within this dialog, you can either type in the location of the file with the help of the Browse button, or the ID of a PDB file.
From BLAST Search
selecting this option means you want to search for sequences based on certain criteria.
Search Profile:
You have three options: All Sequences, Marked Sequences, or Selected Sequences.
Database:
In this dialog, you can type in the local location of the database.
E Score:
Iterations:

3.3 Sequences and BLAST searching

In Import Data, you can also search for sequences via BLAST. To do this, you must first have BLAST installed in a locally accessible location. To conduct a BLAST search:

  1. Before you open the Import Data window, you have the option of either selecting a set of sequences, or a region within a sequence.
  2. Open Import Data and select From BLAST Search, and All Sequences, Marked Sequences, or Selected Regions.
  3. In the Database: dialog, either type the location of the database or use the Browse button to locate it.
  4. Select the E Score and Iterations, then hit the OK button.

After you hit the OK button, your computer will begin a BLAST search. This may take several minutes. When the search is done, a new window called BLAST Search Results will appear. The results do not immediately appear in the main MultiSeq window, because you may want to apply further filters to the retrieved sequences. The BLAST Search Results window is divided into three main parts: sequence viewer, filter options, and view options.


Figure 3: BLAST Search Results


Matches
Number of files BLAST has retrieved based on the criteria set in Import Data.
Filter Options
Further options to filter the files obtained by BLAST
E Score:
Redundancy Cutoff:
Superkingdom:
Kingdom:
Phylum:
View Options

4 Working in the Environment

MultiSeq provides a unique working environment for the analysis of proteins. As with most pieces of software, MultiSeq has its own nomenclature and unique methods of analysis.

4.1 Grouping

While working with the Sequence Viewer in MultiSeq, you may notice certain patterns or trends and want to place certain sequences closer to others to analyze such motifs. MultiSeq can group sequences by taxonomy, or you can define custom groupings using Create from Selection.

4.2 Managing Representations

4.3 Info Viewer

Whenever you load a sequence or structure into MultiSeq, an I button will appear next to the protein’s ID. Clicking this button opens a new window called the Info Viewer, which displays information about the species the protein comes from.

4.4 Selecting vs. Marking

As you browse the menus of MultiSeq you will notice options for Selected Sequences or Marked Sequences. Selecting is when you highlight a portion of the sequence(s) in the sequence viewer using the mouse; this can be either an entire sequence or a portion of one. Marking, by contrast, lets you more easily choose an entire sequence by simply checking the box next to the protein ID.

5 Performing Alignments

5.1 Structure Alignments

MultiSeq uses the program STAMP to structurally align protein molecules. The STAMP algorithm minimizes the Cα distance between aligned residues of each molecule by applying globally optimal rigid-body rotations and translations. Note that you can only perform alignments on molecules that are structurally similar: if you try to align proteins that have no common structures, STAMP will have no means to align them. If you would like further information about how the alignment occurs, please refer to the STAMP manual.


Figure 4: Main MultiSeq Window


Align the following:
Show Alignment Preview:
Number of passes (npass):
Whether one or two fits are to be performed. The idea is that the initial fit can be used with a conformation biased set of parameters to improve the initial fit prior to fitting using distance and conformation parameters. Default NPASS = 1
Similarity (scanscore):
Specifies how the Sc value (STAMP algorithm) is to be calculated. This depends on the particular application. As a general rule of thumb, use SCANSCORE=6 for large database scans, when you are scanning with a small domain, and wishing to find all examples of this domain - even within large structures. Use SCANSCORE=1 when you wish to obtain a set of transformations for a set of domains which you know are similar (and have defined fairly precisely as domains rather than the larger structure that they may be a part of). Default SCANSCORE = 6
Comparison residues (scanslide):
This is the number of residues that a query sequence is ’slid’ along a database sequence to derive each initial superimposition. Initially, the N-terminus of the query is aligned to the 1st residue of the database sequence. Once this fit has been performed, refined, and tested for good structural similarity, the N-terminus is aligned with the (1+SCANSLIDE)th position, and the process is repeated until the end of the database sequence has been reached. Default SCANSLIDE = 5
Slow scan:
If set to TRUE, then the SLOW method of getting the initial fits for scanning will be used (See chapter 1). Default SLOWSCAN = FALSE
Defaults:
resets the STAMP parameters to their original values
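
To make the Comparison residues (scanslide) parameter concrete, the following Python sketch lists the database positions at which the query N-terminus would be placed for successive initial fits. It is a reading of the description above, not STAMP's code. The function name `initial_fit_positions` is hypothetical, and the stopping rule (continue while the start position lies within the database sequence) is an assumption.

```python
def initial_fit_positions(database_length, scanslide=5):
    """Return the 1-based database positions tried for the initial fits.

    The first fit starts at residue 1, and each later fit starts
    `scanslide` residues further along, until the end of the database
    sequence is reached (assumed stopping rule).
    """
    if database_length < 1:
        raise ValueError("database_length must be at least 1")
    if scanslide < 1:
        raise ValueError("scanslide must be at least 1")
    return list(range(1, database_length + 1, scanslide))


if __name__ == "__main__":
    # With the default SCANSLIDE = 5 and a 23-residue database sequence.
    print(initial_fit_positions(23))
```

A smaller scanslide value tries more starting positions, which in this picture means more initial superimpositions to refine and test.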

5.2 Sequence Alignments


Figure 5:


Multiple Alignment
Align All Sequences
Align Marked Sequences
Profile Alignment
Align marked sequences to:
Show Alignment Preview
5.2.1 CLUSTALW

5.2.2 Profile Alignments

6 View Menu

6.1 Coloring

Options
None
Add Current Selection
Conservation
Q Score
Traditionally, Q has meant “the fraction of similar native contacts” between the aligned residues in two proteins, or in two different conformational states of the same protein (Eastwood, M.P., C. Hardin, Z. Luthey-Schulten, and P.G. Wolynes. “Evaluating protein structure-prediction schemes using energy landscape theory.” IBM J. Res. Dev. 45:475-497. 2001). When Q = 1, the structures are identical. A low Q (around 0.1) means the structures do not align well, or, in other words, only a small fraction of the Cα atoms superimpose. Homologs typically have a Q of about 0.4 or higher. Q per residue is the contribution from each residue to the overall average Q score. For more information see Appendix A.
Residue Type
Sequence Identity
Sequence Identity colors each amino acid in the aligned domains according to the degree of conservation within the alignment: blue means highly conserved, whereas red means very low or no conservation.
Import
Refresh Scale...
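
The idea behind the Sequence Identity coloring listed above can be sketched as a per-column score. The Python example below is an illustration only. It assumes that conservation is measured as the fraction of sequences sharing the most common residue in each column, which may differ from the measure MultiSeq actually uses, and the function name `column_identity` is hypothetical.

```python
from collections import Counter


def column_identity(alignment, gap="-"):
    """Return one score in [0, 1] per alignment column.

    The score is the count of the most common non-gap residue divided
    by the number of sequences (an assumed definition, for illustration).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        if not residues:
            scores.append(0.0)  # a column made only of gaps
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


if __name__ == "__main__":
    for score in column_identity(["ACD-", "ACE-", "AGEK"]):
        print(round(score, 2))
```

Scores near 1 would map to the blue end of the scale and scores near 0 to the red end.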

6.2 Sequence Display Options

6.3 Open GL Display Options

6.3.1 Highlight Style

Highlight style is an option for the OpenGL display. The style refers to the drawing method in VMD (for more information about drawing methods, please refer to the VMD manual). This option allows a user to highlight residues of a structure in the sequence display and see the same areas simultaneously highlighted in the OpenGL display.

6.3.2 Highlight Color

Highlight color is another option for the OpenGL display. It works alongside Highlight Style and sets the color or coloring method used in the OpenGL display when highlighting residues in the Sequence Display (for more information, please refer to the VMD manual). The default Highlight Color is yellow.

7 Phylogenetic Tree

The Phylogenetic Tree feature in MultiSeq helps in determining the structure and sequence-based relationships between the aligned domains of proteins.
 
To do this, it uses a modification of Q that accounts for both gapped and aligned regions. This new metric, QH, creates a structure-based phylogeny that is congruent with the sequence-based phylogenies.


Figure 6:


Create tree for:
Create the following trees
Structural tree using QH
Structural tree using RMSD
Sequence tree using CLUSTALW

8 Plot Data

9 Options Menu

9.1 Turning Sequence Editing On

9.2 Searching within sequences

10 Search Menu

10.1 Select Contact Shells


Figure 7:


Select residues in:
With a contact distance of:
That are in the following contact shell(s) for the currently selected residues

10.2 Select Non-Redundant Set


Figure 8: Main MultiSeq Window


Select from:
Using Structure QR
QH Cutoff:
Using Sequence QR
Identity Cutoff:
Gap Scale Factor:
Seed with selected sequences

10.3 Select Residues

The Residue Selection feature lets you analyze conservation, using different measures, and highlight residues in the Sequence Display and Structure Display simultaneously. Residue Selection allows you to examine the conservation on a per residue basis.


Figure 9:


Select residues in:
There are two options: either Where Sequence Identity is: or Where Q score is:. Where Sequence Identity is: is a sequence identity measure, whereas Where Q score is: is a structure measure.
Where Sequence Identity is:
If this option is selected you can select either the less than or equal to option, then a number between 0-99.
Where Q Score is:
If this option is selected you can select either the less than or equal to option, then a number between 0-99.

11 Saving Data from MultiSeq analysis

11.1 Save Session

You can save a session of MultiSeq, with all of the files, alignments, and visual representations, by going to the File menu and selecting Save Session. You will be prompted to save the session and can give it a unique name. Hit the OK button. Save Session will generate a file with a .multiseq extension along with a directory filled with the files needed to load the saved session into MultiSeq. Please note that the generated file and the directory must be kept in the same location for the session to load properly in the future.
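
Before moving or sharing a saved session, it can help to check that both pieces are still side by side. The Python sketch below does this under one stated assumption: that the companion directory carries the session file's name without the .multiseq extension. The manual only says the directory has "the same name", so adjust the check if your saved sessions are laid out differently. The function name `session_looks_complete` is hypothetical.

```python
from pathlib import Path


def session_looks_complete(session_file):
    """Check that a .multiseq file and its companion directory coexist.

    Assumption (not confirmed by the manual): for "run1.multiseq" the
    companion directory is named "run1" and sits in the same folder.
    """
    path = Path(session_file)
    if path.suffix != ".multiseq" or not path.is_file():
        return False
    companion = path.with_suffix("")  # same folder, extension removed
    return companion.is_dir()


if __name__ == "__main__":
    print(session_looks_complete("run1.multiseq"))
```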

11.2 Save to PostScript

11.3 Write PDB from selection...

During a MultiSeq session, you may want to save various perspectives of the structural alignments you created. Often these images are generated by highlighting specific portions of the aligned protein sequences. If you would like to study your selections further, you can do so by generating your own PDB file(s). To begin this process:

  1. Highlight the portions of the sequence that you want to examine in the Sequence Display of the MultiSeq window.
  2. In the pull-down menu at the top of the same window, go to View → Highlight Style. Several highlight styles will appear to choose from. Select one and make sure it appears in the OpenGL Display.
  3. Click on File → Write PDB from selection....
  4. The PDB file(s) will be saved in a directory that can be chosen by clicking on File → Choose Work Directory.... If you haven’t selected your work directory, you will be prompted to choose one when you click on Write PDB from selection....

11.4 Save work data into another directory...

12 Appendices

12.1 Appendix A: Q

The following equation is from the article “Evaluating protein structure-prediction schemes using energy landscape theory” by Eastwood, et al.

\[
Q = \frac{2}{(N-1)(N-2)} \sum_{i<j-1} \exp\left[-\frac{\left(r_{ij} - r_{ij}^{N}\right)^{2}}{2\sigma_{ij}^{2}}\right]
\]

$r_{ij}$ is the distance between a pair of Cα atoms.

$r_{ij}^{N}$ is the Cα-Cα distance between residues i and j in the native state.

$\sigma_{ij}$, defined by $\sigma_{ij}^{2} = |i-j|^{0.15}$, is the standard deviation, determining the width of the Gaussian function.

$N$ is the number of residues of the protein being considered.
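
The formula above can be evaluated directly from two sets of Cα coordinates. The following Python sketch is a plain transcription of the equation, not the code used inside MultiSeq. The function name `q_score` and the input format (two equal-length lists of (x, y, z) tuples, one Cα position per residue) are assumptions made for the example.

```python
import math


def q_score(coords, native):
    """Compute Q for two conformations given as lists of (x, y, z) tuples.

    Implements Q = 2 / ((N-1)(N-2)) * sum over i < j-1 of
    exp(-(r_ij - r_ij^N)^2 / (2 * sigma_ij^2)), with
    sigma_ij^2 = |i - j|^0.15.
    """
    n = len(coords)
    if n != len(native):
        raise ValueError("both structures must have the same number of residues")
    if n < 3:
        raise ValueError("at least three residues are needed")
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):  # i < j - 1: skip sequence neighbours
            r = math.dist(coords[i], coords[j])
            r_native = math.dist(native[i], native[j])
            sigma_sq = abs(i - j) ** 0.15
            total += math.exp(-((r - r_native) ** 2) / (2.0 * sigma_sq))
    return 2.0 * total / ((n - 1) * (n - 2))


if __name__ == "__main__":
    # Made-up coordinates: a straight chain compared with a stretched copy.
    native = [(3.8 * k, 0.0, 0.0) for k in range(6)]
    stretched = [(5.0 * k, 0.0, 0.0) for k in range(6)]
    print(q_score(native, native))
    print(q_score(stretched, native))
```

Because there are (N-1)(N-2)/2 residue pairs with i < j-1, identical structures give every exponential the value 1 and therefore Q = 1.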

12.2 Appendix B: QH

The following text is from the article “On the evolution of structure in aminoacyl-tRNA synthetases.” by O’Donoghue et al.

Homology Measure

 
We employ a structural homology measure which is based on the structural similarity measure, Q, developed by Wolynes, Luthey-Schulten, and coworkers in the field of protein folding. Our adaptation of Q is referred to as $Q_H$, and the measure is designed to include the effects of the gaps on the aligned portion: $Q_H = \aleph\,(q_{aln} + q_{gap})$, where $\aleph$ (aleph) is the normalization, specifically given below. $Q_H$ is composed of two components. $q_{aln}$ is identical in form to the unnormalized Q measure of Eastwood et al. and accounts for the structurally aligned regions. The $q_{gap}$ term accounts for the structural deviations induced by insertions in each protein in an aligned pair:

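\[
Q_H = \aleph \left[ q_{aln} + q_{gap} \right]
\]

\[
q_{aln} = \sum_{i<j-2} \exp\left[-\frac{\left(r_{ij} - r_{i'j'}\right)^{2}}{2\sigma_{ij}^{2}}\right]
\]

\[
\begin{aligned}
q_{gap} ={} & \sum_{g_a} \sum_{j}^{N_{aln}} \max\left\{
\exp\left[-\frac{\left(r_{g_a j} - r_{g'_a j'}\right)^{2}}{2\sigma_{g_a j}^{2}}\right],
\exp\left[-\frac{\left(r_{g_a j} - r_{g''_a j'}\right)^{2}}{2\sigma_{g_a j}^{2}}\right]
\right\} \\
& + \sum_{g_b} \sum_{j}^{N_{aln}} \max\left\{
\exp\left[-\frac{\left(r_{g_b j} - r_{g'_b j'}\right)^{2}}{2\sigma_{g_b j}^{2}}\right],
\exp\left[-\frac{\left(r_{g_b j} - r_{g''_b j'}\right)^{2}}{2\sigma_{g_b j}^{2}}\right]
\right\}
\end{aligned}
\]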

The first term, $q_{aln}$, computes the unnormalized fraction of Cα-Cα pair distances that are the same or similar between two aligned structures. $r_{ij}$ is the spatial Cα-Cα distance between residues i and j in protein a, and $r_{i'j'}$ is the Cα-Cα distance between residues i’ and j’ in protein b. This term is restricted to aligned positions, e.g., where i is aligned to i’ and j is aligned to j’. The remaining terms account for the residues in gaps. $g_a$ and $g_b$ are the residues in insertions in proteins a and b, respectively. $g'_a$ and $g''_a$ are the aligned residues on either side of the insertion in protein a. The definition is analogous for $g'_b$ and $g''_b$.

The normalization and the $\sigma_{ij}^{2}$ terms are computed as:

\[
\aleph = \frac{1}{\frac{1}{2}(N_{aln}-1)(N_{aln}-2) + N_{aln}N_{gr} - n_{gaps} - 2n_{cgaps}}
\]

\[
\sigma_{ij}^{2} = |i-j|^{0.15}
\]

where $N_{aln}$ is the number of aligned residues, $N_{gr}$ is the number of residues appearing in gaps, and $n_{gaps}$ is the sum of the number of insertions in protein “a”, the number of insertions in protein “b”, and the number of simultaneous insertions (referred to as bulges or c-gaps). $n_{cgaps}$ is the number of c-gaps. Gap-to-gap contacts and intra-gap contacts do not enter into the computation, and terminal gaps are also ignored. $\sigma_{ij}^{2}$ is a slowly growing function of the sequence separation of residues i and j, and this serves to stretch the spatial tolerance of similar contacts at larger sequence separations. $Q_H$ ranges from 0 to 1, where $Q_H = 1$ refers to identical proteins. If there are no gaps in the alignment, then $Q_H$ becomes $Q_{aln} = \aleph\, q_{aln}$, which is identical to the Q measure described before.
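
As a small worked check of the normalization, the Python sketch below evaluates the factor, read here as 1 / [½(N_aln − 1)(N_aln − 2) + N_aln·N_gr − n_gaps − 2·n_cgaps]. It illustrates the gap-free limit mentioned above and is not MultiSeq's implementation. The function name `qh_normalization` and the example counts are made up for illustration.

```python
def qh_normalization(n_aln, n_gr=0, n_gaps=0, n_cgaps=0):
    """Return the Q_H normalization factor for the given counts.

    n_aln:   number of aligned residues
    n_gr:    number of residues appearing in gaps
    n_gaps:  number of insertions in a, in b, plus c-gaps
    n_cgaps: number of c-gaps
    """
    denominator = (
        0.5 * (n_aln - 1) * (n_aln - 2)
        + n_aln * n_gr
        - n_gaps
        - 2 * n_cgaps
    )
    if denominator <= 0:
        raise ValueError("counts give a non-positive normalization denominator")
    return 1.0 / denominator


if __name__ == "__main__":
    # Without gaps the factor reduces to 2 / ((N - 1)(N - 2)), as for Q.
    print(qh_normalization(10))
    # Hypothetical alignment: 10 aligned residues, one 3-residue insertion.
    print(qh_normalization(10, n_gr=3, n_gaps=1))
```

With no gaps the factor equals 2 / ((N_aln − 1)(N_aln − 2)), which is the prefactor of Q in Appendix A.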