Combined clustering analysis

Example of Combined Cluster Analysis

by Daniel R. Roe

Cluster Analysis with CPPTRAJ

One way of determining structure populations from simulations is cluster analysis.
Clustering is a means of partitioning data so that data points inside a cluster
are more similar to each other than they are to points outside a cluster. In the
context of molecular simulation, this means grouping similar conformations together.
Similarity is determined by a distance metric – the smaller the distance, the more
similar the structures. One commonly used distance metric is coordinate RMSD.

It is very important to note that there is no one “correct” way to perform
cluster analysis. There are many different algorithms and distance metrics
available, and different combinations will work better for certain systems.
Cluster analysis usually requires some trial and error before satisfactory
results are obtained. This tutorial is just one example of cluster analysis.
For a more in-depth discussion on cluster analysis of MD trajectories, users
are encouraged to read the 2007 paper by
Shao et al.
on the subject.

In this example we will use CPPTRAJ
to perform both cluster analysis and
combined clustering analysis (as a means of ascertaining convergence). CPPTRAJ supports
a variety of clustering algorithms, distance metrics, clustering metrics, and
output options. This example will use trajectories of
the tetranucleotide rGACC that were generated using multidimensional
replica exchange (MREMD, 24 x Temperature, 8 x aMD). Although the
trajectories were originally generated with explicit TIP3P water, the solvent has
been removed to decrease the size of the trajectories.
This data and analysis was first reported (in part)
in
Roe et al., J. Phys. Chem. B, 2014, 118(13), pp 2543-3552
.

Files

Note that although input is provided in files, users are encouraged to use
the interactive mode to become better familiar with CPPTRAJ workflow and
command options.

Performing Cluster Analysis with CPPTRAJ

Step 1: Loading topology and trajectory

We will begin by performing cluster analysis on a single trajectory. Start CPPTRAJ
by typing ‘cpptraj’. Load the topology and trajectory with the following commands:

parm rGACC.nowat.parm7
trajin rGACC.MREMD1.nowat.nc.40

Since these trajectories contain ions and we are only interested in rGACC itself,
we can remove them (ensuring they do not show up in any output structures) with a
strip command. We will also use
the outprefix keyword so that the
corresponding stripped topology will be written as ‘noions.rGACC.nowat.parm7’.
Note that this step is optional and does not have to be done in order to perform clustering.

strip :Na+ outprefix noions

Step 2: Clustering Command

Now we will issue the clustering command. There are many options for clustering which can
be grouped into 4 general categories: Clustering Algorithm, Distance Metric, Output, and
Coordinate Output.

cluster C0 \
        dbscan minpoints 25 epsilon 0.9 sievetoframe \
        rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
        sieve 10 random \
        out cnumvtime.dat \
        sil Sil \
        summary summary.dat \
        info info.dat \
        cpopvtime cpopvtime.agr normframe \
        repout rep repfmt pdb \
        singlerepout singlerep.nc singlerepfmt netcdf \
        avgout Avg avgfmt restart

We will briefly look at each option with more detail:

  • C0: Cluster output data set(s) name.
  • CLUSTERING OPTIONS:

  • dbscan: Use the DBSCAN (density-based) clustering algorithm.
  • minpoints: Minimum # of points to form a cluster.
  • epsilon: Distance cutoff for forming cluster.
  • sievetoframe: Restore sieved frames by comparing to all cluster frames,
    not just centroid.
  • DISTANCE METRIC OPTIONS:

  • rms : Use RMSD of atoms in as distance metric.
  • sieve 10: Typically generating the pair-wise distance matrix (the distance of
    every frame to every other frame) is a very time and memory consuming
    part of the clustering calculation. “Sieving” is a way to reduce the
    expense of this step by using “total / 10” frames for initial clustering.
    The sieved frames are then added to those clusters as an additional step.
    Note that in most cases you will want to also use the
    random keyword
    to make the sieve random (instead of ordered); however it is not used
    here in order to make this example reproducible.
  • OUTPUT OPTIONS:

  • out <file>: Write cluster number versus time to ‘file’. Note that since the
    DBSCAN algorithm has a concept of “noise”, any noise frames will be
    assigned to cluster -1 (no cluster).
  • summary <file>: Write overall clustering summary to ‘file’.
  • info <file>: Write detailed cluster results (including DBI, pSF etc) to ‘file’.
  • cpopvtime <file> normframe: Write cluster population vs time to ‘file’,
    normalized by # frames.
  • COORDINATE OUTPUT OPTIONS:

  • repout <prefix> repfmt pdb: Write cluster representatives to ‘prefix.cX.fmt’ with
    PDB format, where X is the cluster number and ‘fmt’ is the
    default format extension.
  • singlerepout <file> singlerepfmt netcdf: Write all cluster representatives to
    ‘file’ with NetCDF format.
  • avgout <prefix> avgfmt restart: Write average over all frames in each cluster
    to ‘prefix.cX.fmt’ with Amber restart file format.

Once your input has been read in, type run
to begin trajectory processing and analysis. Be patient, as this may take about a minute
running on a fairly modern CPU. Once clustering has completed you can type
quit to exit CPPTRAJ. You will now have many
output files containing clustering results. If you check ‘summary.dat’ you should see
16 total clusters. The various output options are explained in detail in the Amber manual.

One particularly interesting output are the cluster populations versus time, written
to ‘cpopvtime.agr’ (xmgrace format). In particular one can see that at the beginning of the
trajectory the cluster populations are changing rapidly over time. As the run progresses
the cluster populations gradually stabilize until they reach their final values somewhere
around 70,000 frames. This is an indication that the cluster populations are reaching
an equilibrium and the results may be suitable for things like thermodynamic analysis.
However, a better way of ascertaining this is to compare the results to an independent
run.

Combined Cluster analysis with CPPTRAJ

“Convergence” is an important concept in molecular simulations. It is a way of gauging
how effectively the underlying conformational landscape is being sampled. A simulation
starting from a single point (i.e. conformation) may become kinetically trapped in a few
local energy wells; it is difficult to ascertain whether this has occurred from a single
simulation. However, if additional simulations of a system are started with different
initial conditions and/or different initial conformations (ideally both), eventually
they should sample the same conformations in the same populations; once this has occurred
the simulations are said to have “converged”. When two or more simulations have converged
and structure populations are no longer significantly changing, one can start to calculate
things like thermodynamic quantities with some degree of certainty.

Cluster analysis can be used to compare populations of structures in separate runs
as a means of determining convergence. However, it can be challenging to compare
clusters between different trajectory runs; they may either be present in each trajectory
in different populations or a conformation may occur in one trajectory but not the other.
In order to facilitate the comparison of clusters between different trajectories,
CPPTRAJ supports “combined clustering”, where cluster analysis is performed on
two (or more!) combined trajectories, then the results are partitioned based on the
original trajectories.

Step 1: Loading topology and trajectories

Since we are going to be combining the two trajectories before clustering, we need
to know how many frames are in each so we can tell the
cluster command where to partition the combined
clustering results. This can be done with cpptraj on the command line:

cpptraj -p rGACC.nowat.parm7 -y rGACC.MREMD1.nowat.nc.40 -tl

Here the ‘-p’ flag loads the topology, the ‘-y’ flag loads the trajectory, and the ‘-tl’
flag is used to report the number of frames. The output should be:

Frames: 83843

So we know we need to split the combined trajectories at frame 83843.

Now start CPPTRAJ by typing ‘cpptraj’. This will bring up the interactive prompt.
Load the topology and trajectories (and strip ions):

parm rGACC.nowat.parm7
trajin rGACC.MREMD1.nowat.nc.40
trajin rGACC.MREMD2.nowat.nc.40
strip :Na+ outprefix noions

Step 2: Clustering command for combined cluster analysis

Now enter the clustering command. Most lines will look similar to the previous run,
except we will be adding keywords for doing combined cluster analysis.

cluster C1 \
        dbscan minpoints 25 epsilon 0.9 sievetoframe \
        rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
        sieve 20 \
        out combined.cnumvtime.dat \
        info combined.info.dat \
        summarysplit split.dat splitframe 83843

There are two new keywords here:

  • summarysplit <file>: Write cluster information for split to file.
  • splitframe 83843:Split clustering at frame 83843 (where first traj stops).

Once your input has been read in, type run
to begin trajectory processing and analysis. Be patient, as this may take about a minute
and a half to two minutes running on a fairly modern CPU. Once clustering has completed
you can type quit to exit CPPTRAJ.

The main output from the combined clustering analysis is in ‘split.dat’. The first two
lines of this file describe how the splitting of the combined input frames was done and
how many total frames comprise each section:

# 1st <= 83843 < 2nd
# 1st= 83843  2nd= 108887

Next comes several columns:

  • #Cluster: Cluster number.
  • Total: Total number of frames in the cluster.
  • Frac: Fraction of total frames in the cluster.
  • C#: Grace-compatible cluster color. Colors start from 1. Clusters 14 and beyond
    are all assigned color 15.

  • Color: Name of the Grace-compatible color (using Grace's default scheme).
  • NumIn1st NumIn2nd: Number of frames of that cluster that fall into the first trajectory
    and second trajectory respectively.
  • Frac1 Frac2: Fraction of frames of that cluster that fall into the first trajectory
    and second trajectory respectively.
  • First1 First2: Frame at which the cluster was first encountered in the first trajectory
    and second trajectory respectively.

The cluster populations between the first and second trajectories agree quite well
(max absolute difference in fraction population about 0.015),
indicating that the two trajectories are relatively well-converged.

Copyright Daniel R. Roe, 2015