cluster
Cluster input frames using the specified clustering algorithm and distance metric.
cluster [crdset <crd set>] [data <dset0> [,<dset1>...]] [nocoords] [<name>] [<Algorithm>] [<Metric>] [<Pairwise>] [<Sieve>] [<BestRep>] [<Output>] [<Coord. Output>] [<Graph>] [readinfo {infofile <info file> | cnvtset <dataset>}] [useframesincache]
Algorithm Args: :
[{hieragglo|dbscan|kmeans|dpeaks}]
[hieragglo [epsilon <e>][clusters <n>][linkage | averagelinkage | complete][epsilonplot <file>]] [dbscan minpoints <n> epsilon <e> [kdist <k> [kfile <prefix>]]] [kmeans clusters <n> [randompoint [kseed <seed>]] [maxit <iterations>]] [dpeaks epsilon <e> [noise] [dvdfile <density_vs_dist_file>][choosepoints {manual | auto}][distancecut <distcut>][densitycut <densitycut>][runavg <runavg_file>][deltafile <file>][gauss]]
Metric Args:
[{dme | rms | srmsd | qrmsd} [mass] [nofit] [<mask>] ] [{euclid | manhattan}] [wgt <list>]
Pairwise Args:
[pairwise <name> [pairdistfile <file>]] [pwrecalc] [loadpairdist] [savepairdist] [pairwisecache {mem | disk | none}]
Sieve Args:
[sieve <#> [sieveseed <#>] [random] [includesieveincalc] [includesieved_cdist] [{sievetoframe|sievetocentroid|closestcentroid}] [repsilon <restore epsilon>]]
BestRep Args:
[bestrep {cumulative|centroid|cumulative_nosieve}] [savenreps <#>]
Output Args:
[out <cnumvtime> [gracecolor]] [noinfo|info <file>] [summary <file>][summarysplit <splitfile>] [splitframe <comma-separated frame list>] [clustersvtime <file> [cvtwindow <#>]] [sil <prefix> [silidx {idx|frm}]] [metricstats <file>] [cpopvtime <file> [{normpop|normframe}]] [lifetime]
Coordinate Output Args:
[clusterout <trajfileprefix> [clusterfmt <trajformat>]][singlerepout <trajfilename> [singlerepfmt <trajformat>]][repout <repprefix> [repfmt <trajformat>] [repframe]][avgout <avgprefix> [avgfmt <trajformat>]][assignrefs [refcut <rms>] [refmask <mask>]]
Graph Args:
[{drawgraph|drawgraph3d} [draw_tol <tolerance>] [draw_maxit <iterations]]
[crdset <crd set>]
Name of COORDS data set to cluster on and/or use for coordinate output. If not specified the default COORDS set will be generated and used unless nocoords has been specified.
[data <dset0>[,<dset1>,...]
Distance between frames calculated using specified data set(s). Currently 1D scalar sets and COORDS sets are supported.
[nocoords]
Do not use a COORDS data set; distance metrics that require coordinates and coordinate output will be disabled.
[<name>]
Data set Name for generated cluster data sets.
[readinfo]
Use previous cluster results to set up initial clusters. Clustering will continue if possible (i.e. this can be used to restart clustering).
infofile <file>
Cluster info file to read clusters from.
cnvtset <dataset>
Cluster number vs time data set to use to generate initial clusters.
[useframesincache]
If a pairwise cache is specified, cluster on the frames stored in the cache.
Algorithms:
hieragglo
(Default) Use hierarchical agglomerative (bottom-up) approach.
[epsilon <e>]
Finish clustering when minimum distance between clusters is greater than <e>.
[clusters <n>]
Finish clustering when <n> clusters remain.
[linkage]
Single-linkage; use the shortest distance between members of two clusters.
[averagelinkage]
Average-linkage (default); use the average distance between members of two clusters.
[complete]
Complete-linkage; use the maximum distance between members of two clusters.
[epsilonplot <file>]
Write number of clusters vs epsilon to <file>.
[includesieved_cdist]
Include sieved frames in final cluster distance calculation (may be very slow).
dbscan
Use DBSCAN clustering algorithm of Ester et al.
minpoints <n>
Minimum number of points required to form a cluster.
epsilon <e>
Distance cutoff between points for forming a cluster.
[sievetoframe]
When restoring sieved frames, compare frame to every frame in a cluster instead of the centroid; slower but more accurate.
[kdist <k>]
Generate K-dist plot for help in determining DBSCAN parameters (see below).
[kfile <prefix>]
Prefix for K-dist plot file.
dpeaks
Use the density peaks algorithm of Rodriguez and Laio.
epsilon <e>
Cutoff for determining local density in Angstroms.
[noise]
If specified, treat all points within epsilon of another cluster as noise.
[dvdfile <density_vs_dist_file>]
File to write density versus minimum distance to point with next highest density. This can be used to determine
appropriate cutoffs for distance and density in a subsequent step with choosepoints
manual.
[choosepoints {manual | auto}]
Specify whether clusters will be chosen based on specified distance/density cutoffs, or automatically. If not specified
only the density vs distance file will be written and no clustering will be performed. Currently manual is recommended.
[distancecut <distcut>] [densitycut <densitycut>]
If choosepoints manual, points with minimum distance greather than or equal to <distcut> and density
greater than or equal to <densitycut> will be chosen.
[runavg <runavg file>]
If choosepoints automatic, the calculated running average of density versus distance will be written to <runavg file>.
[deltafile <file>]
If choosepoints automatic, distance minus the running average for each point will be written to this file.
[gauss]
Calculate density with Gaussian kernels instead of using discrete density.
kmeans
Use K-means clustering algorithm.
clusters <n>
Finish clustering when number of clusters is <n>.
[randompoint]
Randomize initial set of points used (recommended).
[kseed <seed>]
Random number generator seed for randompoint.
[maxit <iteration>]
Algorithm will run until frames no longer change clusters of <iteration> iterations are reached (default 100).
readtxt|readinfo
No clustering – read in previous cluster results.
infofile <file>
Cluster info file to read.
Distance Metric Options:
[rms | srmsd[<mask>]]
(Default rms) Distance between frames calculated via
best-fit coordinate RMSD using atoms in <mask>. If srmsd specified use symmetry-corrected RMSD.
[mass]
Mass-weight the RMSD.
[nofit]
Do not fit structures onto each other prior to calculating RMSD.
dme [<mask>]
Distance between frames calculated using distance-RMSD (aka DME, distrmsd) using atoms in <mask>.
[data <dset0>[,<dset1>,...]
Distance between frames calculated using specified data set(s) (Euclidean distance).
[sieve <#>]
Perform clustering only for every <#> frame. After clustering, all other frames will be added to clusters.
[random]
When sieve is specified, select initial frames to cluster randomly.
[sieveseed <#>]
Seed for random sieving; if not set the wallclock time will be used.
[pairdist <file>]
File to use for loading/saving pairwise distances.
[loadpairdist]
Load pairwise distances from <file> (CpptrajPairDist if pairdist not specified).
[savepairdist]
Save pairwise distances from <file> (CpptrajPairDist if pairdist not specified). NOTE: If sieving was performed only the calculated
distances are saved.
[pairwisecache {mem | disk | none}]
Cache pairwise distance data in memory (default), to disk, or disable pairwise caching. No caching will save
memory but be extremely slow. Caching to disk will likely be slow unless writing to a fast storage device (e.g. SSD) – data is saved to a file named ’CpptrajPairwiseCache’.
[includesieveincalc]
Include sieved frames when calculating within-cluster average (may be very slow).
[pwrecalc]
If a loaded pairwise distance file does not match the current setup, force recalculation.
Pairwise Distance Matrix Options:
[pairdist <name>]
Pairwise cache DataSet/File name to use for loading/saving pairwise distances.
[pairdistfile <file>]
File name to use for pairwise cache; if not specified and ‘pairdist’ specified, uses ‘pairdist’.
[pwrecalc]
If the loaded pairwise distance matrix does not match the current setup, force recalculation.
[loadpairdist]
Load pairwise distances from file specified by pairdist (CpptrajPairDist if pairdist not specified).
[savepairdist]
Save pairwise distances to file specified by pairdist (CpptrajPairDist if pairdist not specified). NOTE: If sieving was performed only the calculated distances are saved.
[pairwisecache {mem | disk | none}]
Cache pairwise distance data in memory (default), to disk, or disable pairwise caching. No caching will save memory but be extremely slow. Caching to disk will likely be slow unless writing to a fast storage device (e.g. SSD) – data is saved to a file named ‘CpptrajPairwiseCache’.
Sieving Options:
[sieve <#>]
Perform clustering only for every <#> frame. After clustering, all other frames will be added to clusters.
[random]
When sieve is specified, select initial frames to cluster randomly.
[sieveseed <#>]
Seed for random sieving; if not set the wallclock time will be used.
[includesieved_cdist]
Include sieved frames in final cluster distance calculation (may be very slow).
[includesieveincalc]
Include sieved frames when calculating within-cluster average (may be very slow).
[sievetoframe]
When restoring sieved frames, compare frame to every frame in a cluster using a cutoff of <restore epsilon> (default is algorithm epsilon when using DPeaks/DBscan) instead of the centroid; slower but more accurate.
[sievetocentroid]
When restoring sieved frames, compare frame to cluster centroid using a cutoff of <restore epsilon> (default is algorithm epsilon when using DPeaks/DBscan). Default method for DPeaks/DBSCAN.
[closestcentroid]
When restoring sieved frames, add each frame to its closest centroid. Default method for hieragglo/kmeans.
[repsilon <restore epsilon>]
Epsilon to use for sievetoframe/sievetocentroid (default is algorithm epsilon when using DPeaks/DBscan).
Best Representative Options:
[bestrep {cumulative|centroid|cumulative_nosieve}]
Method for choosing cluster representative frames.
cumulative
Choose by lowest cumulative distance to all other frames in cluster. Default when not sieving.
centroid
Choose by lowest distance to cluster centroid. Default when sieving.
cumulative_nosieve
Choose by lowest cumulative distance to all other frames, ignoring sieved frames.
[savenreps <#>]
Number of best representative frames to choose (default 1).
Output Options:
[out <cnumvtime>]
Write cluster # vs frame to <cnumvtime>. Algorithms that calculate noise (e.g. DBSCAN) will assign noise points a value of -1.
[gracecolor]
Instead of cluster # vs frame, write cluster# + 1 (corresponding to colors used by XMGRACE) vs frame. Cluster #s larger than 15 are given the
same color. Algorithms that calculate noise (e.g. DBSCAN) will assign noise points a color of 0 (blank).
[summary <summaryfile>]
Summarize each cluster with format ’#Cluster Frames Frac AvgDist Stdev Centroid AvgCDist’:
#Cluster
Cluster number starting from 0 (0 is most populated).
Frames #
of frames in cluster.
Frac
Size of cluster as fraction of total trajectory.
AvgDist
Average distance between points in the cluster.
Stdev
Standard deviation of points in the cluster.
Centroid
Frame # of structure in cluster that has the lowest cumulative distance to every other point. If multiple representatives are being saved this column is replaced with two columns for each representative, ‘Rep’ (representative frame #) and ‘RepScore’ (score according to current best representative metric).
AvgCDist
Average distance of this cluster to every other cluster.
[info <infofile>]
Write ptraj-like cluster information to <infofile>. This file has format:
#Clustering: <X> clusters <N> frames
#Cluster <I> has average-distance-to-centroid <AVG>
…
#DBI: <DBI>
#pSF: <PSF>
#Algorithm: <algorithm-specific info>
<Line for cluster 0>
…
#Representative frames: <representative frame list>
Where <X> is the number of clusters, <N> is the number of frames clustered, <I> ranges from 0 to <X>-1, <AVG> is the average distance of all frames in
that cluster to the centroid, <DBI> is the Davies-Bouldin Index, <pSF> is the pseudo-F statistic, and <representative frame list> contains the frame
# of the representative frame (i.e. closest to the centroid) for each cluster. Each cluster has a line made up of characters (one for each frame) where ’.’ means ’not in cluster’ and ’X’ means ’in cluster’.
[noinfo]
Suppress printing of cluster info.
[summarysplit <splitfile>]
Summarize each cluster based on which of its frames fall in portions of the trajectory specified by splitframe with format ’#Cluster
Total Frac C# Color NumInX … FracX … FirstX’:
#Cluster
Cluster number starting from 0 (0 is most populated).
Total #
of frames in cluster.
Frac
Size of cluster as a fraction of the total trajectory.
C#
Grace color number.
Color
Text description of the color (based on standard XMGRACE coloring).
NumInX
Number of frames in Xth portion of the trajectory.
FracX
Fraction of frames in Xth portion of the trajectory.
FirstX
Frame in the Xth portion of the trajectory where the cluster is first observed.
[splitframe <frame>]
For summarysplit, frame or comma-separated list of frames to split the trajectory at, e.g. ’100,200,300’.
[clustersvtime <filename>]
Write number of unique clusters observed in a given time window to <filename>.
[cvtwindow <windowsize>]
Window size for clustersvtime output.
[sil <prefix>]
Write average cluster silhouette value for each cluster to ‘<prefix>.cluster.dat’ and cluster silhouette value for each individual frame to ‘<prefix>.frame.dat’.
[silidx {idx|frm}]
Choose what indices to write to the cluster silhouette frame file: idx (the default) specifies the sorted index (starting from 0), frm specifies the actual frame number.
[metricstats <file>]
When more than one metric in use, print the fraction contribution of each metric to the total distance. This information can be used in conjunction with the wgt
keyword to adjust the contribution of each metric to the total distance. It is written to <file>
with format:
#Metric FracAv FracSD Avg SD Min Max Description
Where #Metric is the metric number, FracAv and FracSD are the average and standard devation of the fraction contribution of that metric to the total distance (taking into account distance type and weights), Avg, SD, Min, and Max are the average, standard devation, minimum, and maximum of the unmodified distance contribution from that metric, and Description is the metric description. This may be slow for large numbers of frames, so it is advisable to run this on a smaller (potentially sieved) number of frames
[cpopvtime <file> [normpop | normframe]]
Write cluster population vs time to <file>
; if normpop
specified normalize each cluster to 1.0; if normframe
specified normalize cluster populations by number of frames.
[lifetime]
Create a DataSet with aspect [Lifetime]
for each cluster; for each frame, have 1 if the cluster is present and 0 otherwise. Can be used with lifetime analysis
Coordinate Output Options:
clusterout <trajfileprefix>
Write frames in each cluster to files named <trajfileprefix>.cX, where X is the cluster number.
clusterfmt <trajformat>
Format keyword for clusterout (default Amber Trajectory).
singlerepout <trajfilename>
Write all representative frames to single trajectory named <trajfilename>.
singlerepfmt <trajformat>
Format keyword for singlerepout (default Amber Trajectory).
repout <repprefix>
Write representative frames to separate files named <repprefix>.X.<ext>, where X is the cluster number and <ext> is a format-specific filename extension.
repfmt <trajformat>
Format keyword for repout (default Amber Trajectory).
repframe
Include representative frame number in repout filename.
avgout <avgprefix>
Write average structure for each cluster to separate files named <avgprefix>.X.<ext>, where X is the cluster number and <ext> is a
format-specific filename extension.
avgfmt <trajformat>
Format keyword for avgout.
assignrefs
In summary/summarysplit, assign clusters to loaded reference structures if RMSD to that reference is less than specified cutoff. This will be printed in summary and summarysplit files as 2 extra columns: ‘Name’ (reference name) and ‘RMS’ (RMS to cluster centroid).
[refcut <rms>]
RMSD cutoff in Angstroms.
[refmask <mask>]
Mask to use for RMSD calculation. If not specified the default mask is all heavy atoms.
DataSet Created:
<name>
Cluster number vs time (color number if gracecolor specified).
<name> [DBI]
Hold final Davies-Bouldin index.
<name> [PSF]
Hold final pseudo-F value.
<name> [SSRSST]
Hold final SSR/SST value. [NCVT] (clustersvtime only). Number of unique clusters observed over time.
<name> [Pop]:<X>
Cluster X population vs time; index X corresponds to cluster number.
<name> [Lifetime]:<X>
(lifetime only). For each cluster X, contain 1 if cluster present that frame, 0 otherwise.
Note: cluster population vs time data sets are not generated until the analysis has been run.
Cluster input frames using the specified input data sets (can be any combination of coordinates/COORDS and/or 1D scalar data) with the specified clustering algorithm. For COORDS sets, the distance metric can be RMSD, symmetry-corrected RMSD, or DME. When multiple data sets are present, the total distance can be determined either via the Euclidean (default) or Manhattan method.
In order to speed up clustering of large trajectories, the sieve
keyword can be used. In addition, subsequent clustering calculations can be sped up by writing/reading calculated pair distances between each frame to/from a file specified by pairdist
(or “CpptrajPairDist” if pairdist
not specified).
Example: cluster on a specific distance:
distance endToEnd :1 :255 cluster data endToEnd clusters 10 epsilon 3.0 summary summary.dat info info.dat
Example: two clustering commands on the CA atoms of residues 2-10 using average-linkage, stopping when either 3 clusters are reached or the minimum distance between clusters is 4.0 for the first, and 8 clusters or minimum distance 2.0 for the second. The first command will write the cluster number vs time to cnumvtime.dat
and a summary of each cluster to avg.summary.dat
. The second clustering command will use the pairwise distance matrix from the first to speed things up:
cluster C1 :2-10 clusters 3 epsilon 4.0 info C1.info out cnumvtime.dat summary avg.summary.dat cluster C2 :2-10 clusters 8 epsilon 2.0 info C2.info pairdist PW
Clustering Metrics
The Davies-Bouldin Index (DBI, reported in the info le) measures sum over all clusters of the within cluster scatter to the between cluster separation; the smaller the DBI, the better. The DBI is dened as the average, for all clusters X, of fred, where fred(X) = max, across other clusters Y, of (Cx + Cy)/dXY. Here Cx is the average distance from points in X to the centroid, similarly Cy, and dXY is the distance between cluster centroids.
The pseudo-F statistic (pSF, reported in the info le) is another measure of clustering goodness. It is intended to capture the ‘tightness’ of clusters, and is in essence a ratio of the mean sum of squares between groups to the mean sum of squares within group. Higher values of pseudo-F are good. Generally, one selects a cluster-count that gives a peak in the pseudo-f statistic. Formula: A/B, where A = (T – P)/(G-1), and B = P / (n-G). Here n is the number of points, G is the number of clusters, T is the total distance from the all-data centroid, and P is the sum (for all clusters) of the distances from the cluster centroid.
The SSR/SST (reported in the info file) is the ratio of the sum of squares regression (SSR or between sum of squares) and the total sum of squares (SST). The SSR is calculated via the sum of the squared distances of all points within a given cluster to its centroid, and summed together for all clusters. The total sum of squares is the sum of squared distances for all frames to the overall mean. The ratio lies between 0 and 1 and is supposed to give the fraction of explained variance by the data. The ratio should increase with cluster count. There should be a point at which adding more clusters does not substantially increase SSR/SST, i.e. the point where increasing the cluster count does not add new information and should not increase further.
The cluster silhouette (sil/silidx keywords) is a measure of how well each point fits within a cluster. Values of 1 indicate the point is very similar to other points in the cluster, i.e. it is well-clustered. Values of -1 indicate the point is dissimilar and may t better in a neighboring cluster. Values of 0 indicate the point is on a border between two clusters. The sil <prefix> keyword will write two les. The first, <prefix>.cluster.dat, which has the format:
#Cluster <Si> StdDev
Where #Cluster
is the cluster number, <Si>
is the average silhouette value for all frames in the cluster, and StdDev
is the standard deviation of the silhouette value for all frames in the cluster. The second, <prefix>.frame.dat, will contain the silhouette value for each frame grouped by cluster, with indices controlled by the silidx keyword (default sorted by ascending silhouette value), e.g.:
#C0 Silhouette #C0 Silhouette 0 -0.135988 1 -0.0266746 2 -0.0167628 3 0.0609673 4 0.0649603 5 0.0835595 #C1 Silhouette 6 0.319039 7 0.319785 8 0.348833 9 0.358286 0 0.1376 9 0.1376
The last two lines will contain the overall average silhouette value twice, one at the lowest index and one at the highest. The le is formatted in this way to make it easy to visualize each cluster silhouette relative to the average value in e.g. the XMGRACE plotting program. If the clustering results in one or more cluster with silhouette values completely below the average line, the clustering is likely poor.
Hints for setting DBSCAN parameters with ’kdist’
It is not always obvious what parameters to set for DBSCAN. You can get a rough idea of what to set ’mindist’ and ’epsilon’ to by generating a so-called “K-dist” plot with the ’kidst <k>’ option. The K-dist plot shows for each point (X axis) the Kth farthest distance (Y axis), sorted by decreasing distance. You supply the same
distance metric and sieve parameters you want to use for the actual clustering, but nothing else. For example:
cluster C0 dbscan kdist 4 rms :1-4@CA sieve 10 loadpairdist pairdist CpptrajPairDist
The K-dist plot will be named <prefix>.<k>.dat, with the default prefix being ’Kdist’ (in this case the file name would be Kdist.4.dat). The K-dist plot usually looks like a curve with an initially steep slope that gradually decreases. Around where the initial part of the curve starts to flatten out (indicating an increas in density) is around where epsilon should be set; minpoints is set to whatever <k> was. It has been suggested that the shape of the K-dist curve doesn’t change too much after Kdist=4, but users are encouraged to experiment.
Using ’dpeaks’ clustering
The ’dpeaks’ (density peaks) algorithm attempts to find clusters by identifying points in high density regions which are far from other points of high density[reference]. There are two ways these points can be chosen. The first and recommended way is manually. In this method, clustering if first run with choosepoints not specified to generate a plot containing density versus minimum distance to point with next highest density (the decision graph). Appropriate cut offs for distance and density can then be chosen based on visual inspection; cutoffs should be chosen so that they select points that have both a high density and a high distance to point with next highest density. Clustering can then be run again with distancecut and densitycut set.
The second way is automatically; CPPTRAJ will attempt to identify outliers in the density vs distance plot based on distance from the running average. Although this only requires a single pass, this method of choosing points is not well-tested and currently not recommended.
The Binary pairwise matrix file format
When NetCDF is not present, the pairwise matrix file will be written in binary. The exact format depends on what version of CPPTRAJ generated the file (since earlier versions had no concept of ‘sieve’). The CpptrajPairDist file starts with a 4 byte header containing the characters ‘C’ ‘T’ ‘M’ followed by the version
number. A quick way to figure out the version is to use the linux od
command to output the first 4 bytes as hexadecimal, e.g.:
$ od -t x1 -N 4 CpptrajPairDist 0000000 43 54 4d 02
So the CpptrajPairDist file version in the above example is 2. The next few numbers describe the matrix size and depend on the version.
Version 0: Two 4-byte integers: # of rows and # of elements.
Version 1: Two 8-byte unsigned integers (equivalent to size_t on most systems): # of rows and # of elements.
Version 2: Three 8 byte unsigned integers: original # of rows, actual # of rows, and sieve value.
This is followed by the actual matrix data, stored as a single array of floats (4 bytes). For versions 1 and 2 the number of elements is explicitly stored. For version 2, to calculate the number of matrix elements you need to read:
Elements = (actual_rows * (actual_rows - 1)) / 2
The cluster pair-distance matrix is an upper-right triangle matrix without the diagonal (in row-major order), so the rst element is the distance between elements 0 and 1, the second is between elements 0 and 2, etc.
In version 2 files, if the sieve value is greater than 1 that means original_rows> actual_rows and there is an additional array of characters original_nrows long, with ‘T’ if the row is being ignored (i.e. it was sieved out) and ‘F’ if the row is active (i.e. is active in the actual pairwise-distance matrix).
The code that cpptraj uses to read in CpptrajPairDist files is in ClusterMatrix:: LoadFile() (ClusterMatrix.cpp).
The NetCDF pairwise matrix format
The default way to write pairwise matrix files as of version 6.0.0 is with NetCDF. This will be set up with the following parameters:
Attributes:
Conventions CPPTRAJ_CMATRIX
Version <version string>
MetricDescription <description of the distance metric used to create matrix>
Dimensions:
n_original_frames Number of frames originally in the set (i.e. before sieving).
n_rows Number of rows in the upper-triangle pairwise matrix.
msize Actual size of the matrix; should be (nRows_ * (nRows_-1)) / 2;
Variables:
sieve (integer, no dimension). Sieve value.
matrix (float, dimension msize). The pairwise matrix, flattened to 1 dimension. Index calc is: i1 = i + 1; index = (((nRows_ * i) – ((i1 * i) / 2)) + j – i1);
actual frames (integer, dimension n_rows). The actual frame numbers for which pairwise distances were calculated.
Examples of the use of the cluster command are available in this recipe