Combined clustering analysis
Example of Combined Cluster Analysis
Example of Combined Cluster Analysis
One way of determining structure populations from simulations is cluster analysis.
Clustering is a means of partitioning data so that data points inside a cluster
are more similar to each other than they are to points outside a cluster. In the
context of molecular simulation, this means grouping similar conformations together.
Similarity is determined by a distance metric – the smaller the distance, the more
similar the structures. One commonly used distance metric is coordinate RMSD.
It is very important to note that there is no one “correct” way to perform
cluster analysis. There are many different algorithms and distance metrics
available, and different combinations will work better for certain systems.
Cluster analysis usually requires some trial and error before satisfactory
results are obtained. This tutorial is just one example of cluster analysis.
For a more in-depth discussion on cluster analysis of MD trajectories, users
are encouraged to read the 2007 paper by
Shao et al.
on the subject.
In this example we will use CPPTRAJ
to perform both cluster analysis and
combined clustering analysis (as a means of ascertaining convergence). CPPTRAJ supports
a variety of clustering algorithms, distance metrics, clustering metrics, and
output options. This example will use trajectories of
the tetranucleotide rGACC that were generated using multidimensional
replica exchange (MREMD, 24 x Temperature, 8 x aMD). Although the
trajectories were originally generated with explicit TIP3P water, the solvent has
been removed to decrease the size of the trajectories.
This data and analysis was first reported (in part)
in
Roe et al., J. Phys. Chem. B, 2014, 118(13), pp 2543-3552.
Note that although input is provided in files, users are encouraged to use
the interactive mode to become better familiar with CPPTRAJ workflow and
command options.
We will begin by performing cluster analysis on a single trajectory. Start CPPTRAJ
by typing ‘cpptraj’. Load the topology and trajectory with the following commands:
parm rGACC.nowat.parm7 trajin rGACC.MREMD1.nowat.nc.40
Since these trajectories contain ions and we are only interested in rGACC itself,
we can remove them (ensuring they do not show up in any output structures) with a
strip command. We will also use
the outprefix keyword so that the
corresponding stripped topology will be written as ‘noions.rGACC.nowat.parm7’.
Note that this step is optional and does not have to be done in order to perform clustering.
strip :Na+ outprefix noions
Now we will issue the clustering command. There are many options for clustering which can
be grouped into 4 general categories: Clustering Algorithm, Distance Metric, Output, and
Coordinate Output.
cluster C0 \ dbscan minpoints 25 epsilon 0.9 sievetoframe \ rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \ sieve 10 random \ out cnumvtime.dat \ sil Sil \ summary summary.dat \ info info.dat \ cpopvtime cpopvtime.agr normframe \ repout rep repfmt pdb \ singlerepout singlerep.nc singlerepfmt netcdf \ avgout Avg avgfmt restart
We will briefly look at each option with more detail:
Once your input has been read in, type run
to begin trajectory processing and analysis. Be patient, as this may take about a minute
running on a fairly modern CPU. Once clustering has completed you can type
quit to exit CPPTRAJ. You will now have many
output files containing clustering results. If you check ‘summary.dat’ you should see
16 total clusters. The various output options are explained in detail in the Amber manual.
One particularly interesting output are the cluster populations versus time, written
to ‘cpopvtime.agr’ (xmgrace format). In particular one can see that at the beginning of the
trajectory the cluster populations are changing rapidly over time. As the run progresses
the cluster populations gradually stabilize until they reach their final values somewhere
around 70,000 frames. This is an indication that the cluster populations are reaching
an equilibrium and the results may be suitable for things like thermodynamic analysis.
However, a better way of ascertaining this is to compare the results to an independent
run.
“Convergence” is an important concept in molecular simulations. It is a way of gauging
how effectively the underlying conformational landscape is being sampled. A simulation
starting from a single point (i.e. conformation) may become kinetically trapped in a few
local energy wells; it is difficult to ascertain whether this has occurred from a single
simulation. However, if additional simulations of a system are started with different
initial conditions and/or different initial conformations (ideally both), eventually
they should sample the same conformations in the same populations; once this has occurred
the simulations are said to have “converged”. When two or more simulations have converged
and structure populations are no longer significantly changing, one can start to calculate
things like thermodynamic quantities with some degree of certainty.
Cluster analysis can be used to compare populations of structures in separate runs
as a means of determining convergence. However, it can be challenging to compare
clusters between different trajectory runs; they may either be present in each trajectory
in different populations or a conformation may occur in one trajectory but not the other.
In order to facilitate the comparison of clusters between different trajectories,
CPPTRAJ supports “combined clustering”, where cluster analysis is performed on
two (or more!) combined trajectories, then the results are partitioned based on the
original trajectories.
Since we are going to be combining the two trajectories before clustering, we need
to know how many frames are in each so we can tell the
cluster command where to partition the combined
clustering results. This can be done with cpptraj on the command line:
cpptraj -p rGACC.nowat.parm7 -y rGACC.MREMD1.nowat.nc.40 -tl
Here the ‘-p’ flag loads the topology, the ‘-y’ flag loads the trajectory, and the ‘-tl’
flag is used to report the number of frames. The output should be:
Frames: 83843
So we know we need to split the combined trajectories at frame 83843.
Now start CPPTRAJ by typing ‘cpptraj’. This will bring up the interactive prompt.
Load the topology and trajectories (and strip ions):
parm rGACC.nowat.parm7 trajin rGACC.MREMD1.nowat.nc.40 trajin rGACC.MREMD2.nowat.nc.40 strip :Na+ outprefix noions
Now enter the clustering command. Most lines will look similar to the previous run,
except we will be adding keywords for doing combined cluster analysis.
cluster C1 \ dbscan minpoints 25 epsilon 0.9 sievetoframe \ rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \ sieve 20 \ out combined.cnumvtime.dat \ info combined.info.dat \ summarysplit split.dat splitframe 83843
There are two new keywords here:
Once your input has been read in, type run
to begin trajectory processing and analysis. Be patient, as this may take about a minute
and a half to two minutes running on a fairly modern CPU. Once clustering has completed
you can type quit to exit CPPTRAJ.
The main output from the combined clustering analysis is in ‘split.dat’. The first two
lines of this file describe how the splitting of the combined input frames was done and
how many total frames comprise each section:
# 1st <= 83843 < 2nd # 1st= 83843 2nd= 108887
Next comes several columns:
The cluster populations between the first and second trajectories agree quite well
(max absolute difference in fraction population about 0.015),
indicating that the two trajectories are relatively well-converged.
Copyright Daniel R. Roe, 2015