cluster

Cluster input frames using the specified clustering algorithm and distance metric.

cluster [crdset <crd set> | nocoords]

Algorithms:

[hieragglo [epsilon <e>] [clusters <n>] [linkage|averagelinkage|complete]
[epsilonplot <file>] [includesieved_cdist]]
[dbscan minpoints <n> epsilon <e> [sievetoframe] [kdist <k> [kfile <prefix>]]]
[dpeaks epsilon <e> [noise] [dvdfile <density_vs_dist_file>] [choosepoints {manual | auto}] [distancecut <distcut>] [densitycut <densitycut>] [runavg <runavg file>] [deltafile <file>] [gauss]]
[kmeans clusters <n> [randompoint [kseed <seed>]] [maxit <iteration>]]
[{readtxt|readinfo} infofile <file>]

Distance options:

{[[rms | srmsd] [<mask>] [mass] [nofit]] | [dme [<mask>]] |
[data <dset0>[,<dset1>,...]]}
[sieve <#> [random [sieveseed <#>]]] [loadpairdist] [savepairdist] [pairdist <file>]
[pairwisecache {mem | disk | none}] [includesieveincalc] [pwrecalc]

Output options:

[out <cnumvtime>] [gracecolor] [summary <summaryfile>] [info <infofile>]
[summarysplit <splitfile>] [splitframe <comma-separated frame list>]
[bestrep {cumulative|centroid|cumulative_nosieve}] [savenreps <#>]
[clustersvtime <filename> cvtwindow <window size>]
[cpopvtime <file> [normpop | normframe] [lifetime]]
[sil <silhouette file prefix>] [assignrefs [refcut <rms>] [refmask <mask>]]

Coordinate output options:

[ clusterout <trajfileprefix> [clusterfmt <trajformat>] ]
[ singlerepout <trajfilename> [singlerepfmt <trajformat>] ]
[ repout <repprefix> [repfmt <repfmt>] [repframe] ]
[ avgout <avgprefix> [avgfmt <avgfmt>] ]

[crdset <crd set>] Name of previously generated COORDS data set. If not specified, the default COORDS set will be used unless nocoords has been specified.
[nocoords] Do not use a COORDS data set; distance metrics that require coordinates and coordinate output will be disabled.

Algorithms:
hieragglo (Default) Use hierarchical agglomerative (bottom-up) approach.
[epsilon <e>] Finish clustering when minimum distance between clusters is greater than <e>.
[clusters <n>] Finish clustering when <n> clusters remain.
[linkage] Single-linkage; use the shortest distance between members of two clusters.
[averagelinkage] Average-linkage (default); use the average distance between members of two clusters.
[complete] Complete-linkage; use the maximum distance between members of two clusters.
[epsilonplot <file>] Write number of clusters vs epsilon to <file>.
[includesieved_cdist] Include sieved frames in final cluster distance calculation (may be very slow).

dbscan Use DBSCAN clustering algorithm of Ester et al.[643]
minpoints <n> Minimum number of points required to form a cluster.
epsilon <e> Distance cutoff between points for forming a cluster.
[sievetoframe] When restoring sieved frames, compare frame to every frame in a cluster instead of the centroid; slower but more accurate.
[kdist <k>] Generate K-dist plot for help in determining DBSCAN parameters (see below).
[kfile <prefix>] Prefix for K-dist plot file.

dpeaks Use the density peaks algorithm of Rodriguez and Laio[644]
epsilon <e> Cutoff for determining local density in Angstroms.
[noise] If specified, treat all points within epsilon of another cluster as noise.
[dvdfile <density_vs_dist_file>] File to write density versus minimum distance to point with next highest density. This can be used to determine
appropriate cutoffs for distance and density in a subsequent step with choosepoints manual.
[choosepoints {manual | auto}] Specify whether clusters will be chosen based on specified distance/density cutoffs, or automatically. If not specified
only the density vs distance file will be written and no clustering will be performed. Currently manual is recommended.
[distancecut <distcut>] [densitycut <densitycut>] If choosepoints manual, points with minimum distance greater than or equal to <distcut> and density
greater than or equal to <densitycut> will be chosen.
[runavg <runavg file>] If choosepoints auto, the calculated running average of density versus distance will be written to <runavg file>.
[deltafile <file>] If choosepoints auto, distance minus the running average for each point will be written to this file.
[gauss] Calculate density with Gaussian kernels instead of using discrete density.

kmeans Use K-means clustering algorithm.
clusters <n> Finish clustering when number of clusters is <n>.
[randompoint] Randomize initial set of points used (recommended).
[kseed <seed>] Random number generator seed for randompoint.
[maxit <iteration>] Algorithm will run until frames no longer change clusters or <iteration> iterations are reached (default 100).

readtxt|readinfo No clustering – read in previous cluster results.
infofile <file> Cluster info file to read.

Distance Metric Options:
[rms | srmsd[<mask>]] (Default rms) Distance between frames calculated via
best-fit coordinate RMSD using atoms in <mask>. If srmsd specified use symmetry-corrected RMSD.
[mass] Mass-weight the RMSD.
[nofit] Do not fit structures onto each other prior to calculating RMSD.
dme [<mask>] Distance between frames calculated using distance-RMSD (aka DME, distrmsd) using atoms in <mask>.
[data <dset0>[,<dset1>,...]] Distance between frames calculated using specified data set(s) (Euclidean distance).
[sieve <#>] Perform clustering only for every <#>th frame. After clustering, all other frames will be added to clusters.
[random] When sieve is specified, select initial frames to cluster randomly.
[sieveseed <#>] Seed for random sieving; if not set the wallclock time will be used.
[pairdist <file>] File to use for loading/saving pairwise distances.
[loadpairdist] Load pairwise distances from <file> (CpptrajPairDist if pairdist not specified).
[savepairdist] Save pairwise distances to <file> (CpptrajPairDist if pairdist not specified). NOTE: If sieving was performed only the calculated
distances are saved.
[pairwisecache {mem | disk | none}] Cache pairwise distance data in memory (default), to disk, or disable pairwise caching. No caching will save
memory but be extremely slow. Caching to disk will likely be slow unless writing to a fast storage device (e.g. SSD) – data is saved to a file named ’CpptrajPairwiseCache’.
[includesieveincalc] Include sieved frames when calculating within-cluster average (may be very slow).
[pwrecalc] If a loaded pairwise distance file does not match the current setup, force recalculation.

Output Options:
[out <cnumvtime>] Write cluster # vs frame to <cnumvtime>. Algorithms that calculate noise (e.g. DBSCAN) will assign noise points a value of -1.
[gracecolor] Instead of cluster # vs frame, write cluster# + 1 (corresponding to colors used by XMGRACE) vs frame. Cluster #s larger than 15 are given the
same color. Algorithms that calculate noise (e.g. DBSCAN) will assign noise points a color of 0 (blank).
[summary <summaryfile>] Summarize each cluster with format ’#Cluster Frames Frac AvgDist Stdev Centroid AvgCDist’:
#Cluster Cluster number starting from 0 (0 is most populated).
Frames # of frames in cluster.
Frac Size of cluster as fraction of total trajectory.
AvgDist Average distance between points in the cluster.
Stdev Standard deviation of points in the cluster.
Centroid Frame # of structure in cluster that has the lowest cumulative distance to every other point.
AvgCDist Average distance of this cluster to every other cluster.
[info <infofile>] Write ptraj-like cluster information to <infofile>. This file has format:
#Clustering: <X> clusters <N> frames
#Cluster <I> has average-distance-to-centroid <AVG>

#DBI: <DBI>
#pSF: <PSF>
#Algorithm: <algorithm-specific info>
<Line for cluster 0>

#Representative frames: <representative frame list>
Where <X> is the number of clusters, <N> is the number of frames clustered, <I> ranges from 0 to <X>-1, <AVG> is the average distance of all frames in
that cluster to the centroid, <DBI> is the Davies-Bouldin Index, <pSF> is the pseudo-F statistic, and <representative frame list> contains the frame
# of the representative frame (i.e. closest to the centroid) for each cluster. Each cluster has a line made up of characters (one for each frame) where ’.’ means ’not in cluster’ and ’X’ means ’in cluster’.
[summarysplit <splitfile>] Summarize each cluster based on which of its frames fall in portions of the trajectory specified by splitframe with format ’#Cluster
Total Frac C# Color NumInX … FracX … FirstX’:
#Cluster Cluster number starting from 0 (0 is most populated).
Total # of frames in cluster.
Frac Size of cluster as a fraction of the total trajectory.
C# Grace color number.
Color Text description of the color (based on standard XMGRACE coloring).
NumInX Number of frames in Xth portion of the trajectory.
FracX Fraction of frames in Xth portion of the trajectory.
FirstX Frame in the Xth portion of the trajectory where the cluster is first observed.
[splitframe <frame>] For summarysplit, frame or comma-separated list of frames to split the trajectory at, e.g. ’100,200,300’.
[bestrep {cumulative|centroid|cumulative_nosieve}] Method for choosing cluster representative frames.
cumulative Choose by lowest cumulative distance to all other frames in cluster. Default when not sieving.
centroid Choose by lowest distance to cluster centroid. Default when sieving.
cumulative_nosieve Choose by lowest cumulative distance to all other frames, ignoring sieved frames.
[savenreps <#>] Number of best representative frames to choose (default 1).
[clustersvtime <filename>] Write number of unique clusters observed in a given time window to <filename>.
[cvtwindow <windowsize>] Window size for clustersvtime output.
[cpopvtime <file> [normpop | normframe]] Write cluster population vs time to <file>; if normpop specified normalize each cluster to 1.0; if normframe specified normalize cluster populations by number of frames.
[sil <prefix>] Write average cluster silhouette value for each cluster to ’<prefix>.cluster.dat’ and cluster silhouette value for each individual
frame to ’<prefix>.frame.dat’.
assignrefs In summary/summarysplit, assign clusters to loaded representative structures if RMSD to that reference is less than specified cutoff.
[refcut <rms>] RMSD cutoff in Angstroms.
[refmask <mask>] Mask to use for RMSD calculation.
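
The per-cluster membership lines of the info file described above (one character per frame, ’X’ for ’in cluster’ and ’.’ for ’not in cluster’) can be turned back into a cluster number for each frame. The following Python sketch is illustrative only: the function name parse_cluster_info is hypothetical, and it assumes that every line not starting with ’#’ is a membership line, that the lines appear in cluster order starting from 0, and that a frame belonging to no cluster should be reported as -1.

```python
def parse_cluster_info(text):
    """Return a list with the cluster number of each frame (-1 if in no cluster).

    Assumption: membership lines are the non-empty lines that do not start
    with '#', listed in cluster order starting from cluster 0.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rows.append(line)
    if not rows:
        return []
    assignment = [-1] * len(rows[0])
    for cluster_num, row in enumerate(rows):
        for frame, ch in enumerate(row):
            if ch == "X":
                assignment[frame] = cluster_num
    return assignment


# Made-up sample content that follows the layout described for the info file.
sample = """#Clustering: 2 clusters 5 frames
#Cluster 0 has average-distance-to-centroid 1.0
#Cluster 1 has average-distance-to-centroid 2.0
XX..X
..XX.
"""
frame_to_cluster = parse_cluster_info(sample)
```

Note that the returned list is indexed from 0, so any comparison with frame numbers reported elsewhere should account for the numbering used there.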

Coordinate Output Options:
clusterout <trajfileprefix> Write frames in each cluster to files named <trajfileprefix>.cX, where X is the cluster number.
clusterfmt <trajformat> Format keyword for clusterout (default Amber Trajectory).
singlerepout <trajfilename> Write all representative frames to single trajectory named <trajfilename>.
singlerepfmt <trajformat> Format keyword for singlerepout (default Amber Trajectory).
repout <repprefix> Write representative frames to separate files named <repprefix>.X.<ext>, where X is the cluster number and <ext> is a format-specific filename extension.
repfmt <trajformat> Format keyword for repout (default Amber Trajectory).
repframe Include representative frame number in repout filename.
avgout <avgprefix> Write average structure for each cluster to separate files named <avgprefix>.X.<ext>, where X is the cluster number and <ext> is a
format-specific filename extension.
avgfmt <trajformat> Format keyword for avgout.

DataSet Aspects:
[Pop] Cluster population vs time; index corresponds to cluster number.

Note cluster population vs time data sets are not generated until the analysis has been run.

In order to speed up clustering of large trajectories, the sieve keyword can be used. In addition, subsequent clustering calculations can be sped up by writing/reading calculated pair distances between each frame to/from a file specified by pairdist (or “CpptrajPairDist” if pairdist not specified).

Example: cluster on a specific distance:

distance endToEnd :1 :255
cluster data endToEnd clusters 10 epsilon 3.0 summary summary.dat info info.dat

Example: cluster on the CA atoms of residues 2-10 using average-linkage, stopping when either 3 clusters are reached or the minimum distance between clusters is 4.0, writing the cluster number vs time to “cnumvtime.dat” and a summary of each cluster to “avg.summary.dat”:

cluster C1 :2-10@CA clusters 3 epsilon 4.0 out cnumvtime.dat summary avg.summary.dat

Clustering Metrics
The Davies-Bouldin Index (DBI) measures the ratio of within-cluster scatter to between-cluster separation, averaged over all clusters; the smaller the DBI, the better. The DBI is defined as the average, over all clusters X, of R(X), where R(X) = max, across all other clusters Y, of (Cx + Cy)/dXY. Here Cx is the average distance from points in X to the centroid of X, Cy is the same quantity for Y, and dXY is the distance between the centroids of X and Y.
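
The DBI definition above can be written out as a short Python sketch. This is not CPPTRAJ code: the function name davies_bouldin is hypothetical, points are assumed to be plain coordinate tuples compared with Euclidean distance, and centroids are assumed to be coordinate means. At least two clusters with distinct centroids are required.

```python
import math


def davies_bouldin(points, labels):
    """Average over clusters X of max over Y != X of (Cx + Cy) / dXY."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    centroids = {}
    scatter = {}  # Cx: average distance of members to their centroid
    for label, members in clusters.items():
        dim = len(members[0])
        center = tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        centroids[label] = center
        scatter[label] = sum(math.dist(m, center) for m in members) / len(members)

    total = 0.0
    for x in clusters:
        total += max(
            (scatter[x] + scatter[y]) / math.dist(centroids[x], centroids[y])
            for y in clusters
            if y != x
        )
    return total / len(clusters)


# Two tight, well-separated clusters give a small DBI.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
dbi = davies_bouldin(pts, [0, 0, 1, 1])
```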

The pseudo-F statistic (pSF) is another measure of clustering goodness. It is intended to capture the ’tightness’ of clusters, and is in essence a ratio of the mean sum of squares between groups to the mean sum of squares within groups. High values are good. Generally, one selects the cluster count that gives a peak in the pseudo-F statistic. Formula: A/B, where A = (T - P)/(G - 1), and B = P/(n - G). Here n is the number of points, G is the number of clusters, T is the total distance from the all-data centroid, and P is the sum (for all clusters) of the distances from the cluster centroid.
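
As a worked illustration of the A/B formula above, the following Python sketch computes the pSF for a small set of points. It is a sketch under stated assumptions rather than the CPPTRAJ implementation: the function name pseudo_f is hypothetical, points are coordinate tuples, centroids are coordinate means, and T and P are accumulated as squared Euclidean distances (chosen to match the ’sum of squares’ wording). It requires 1 < G < n and clusters that are not all perfectly tight (P > 0).

```python
def pseudo_f(points, labels):
    """pSF = A / B with A = (T - P) / (G - 1) and B = P / (n - G)."""

    def centroid(members):
        dim = len(members[0])
        return tuple(sum(m[i] for m in members) / len(members) for i in range(dim))

    def sqdist(a, b):
        # Assumption: squared Euclidean distance.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    overall = centroid(points)
    total = sum(sqdist(p, overall) for p in points)  # T

    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    ngroups = len(clusters)  # G

    within = 0.0  # P
    for members in clusters.values():
        center = centroid(members)
        within += sum(sqdist(m, center) for m in members)

    a = (total - within) / (ngroups - 1)
    b = within / (n - ngroups)
    return a / b


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
psf_good = pseudo_f(pts, [0, 0, 1, 1])  # T = 104, P = 4, so A = 100 and B = 2
psf_bad = pseudo_f(pts, [0, 1, 0, 1])
```

In this example the grouping that keeps nearby points together gives the much larger value, in line with ’high values are good’.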

The cluster silhouette is a measure of how well each point fits within a cluster. Values of 1 indicate the point is very similar to other points in the cluster, i.e. it is well-clustered. Values of -1 indicate the point is dissimilar and may fit better in a neighboring cluster. Values of 0 indicate the point is on a border between two clusters.
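
The text above does not give a formula for the silhouette, so the following Python sketch uses the commonly cited definition s = (b - a)/max(a, b), where a is the average distance from a point to the other members of its own cluster and b is the smallest average distance from that point to the members of any other cluster. Whether CPPTRAJ computes it in exactly this way is an assumption; the function name silhouettes is hypothetical, points are coordinate tuples compared with Euclidean distance, and a point that is alone in its cluster is given 0 by convention. At least two clusters are required.

```python
import math


def silhouettes(points, labels):
    """Return one silhouette value per point, in input order."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # convention for single-member clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        values.append((b - a) / max(a, b))
    return values


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
well_clustered = silhouettes(pts, [0, 0, 1, 1])
poorly_clustered = silhouettes(pts, [0, 1, 0, 1])
```

With the sensible grouping every value is close to 1, while the mixed-up grouping produces negative values for every point.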

Hints for setting DBSCAN parameters with ’kdist’
It is not always obvious what parameters to set for DBSCAN. You can get a rough idea of what to set ’minpoints’ and ’epsilon’ to by generating a so-called “K-dist” plot with the ’kdist <k>’ option. The K-dist plot shows for each point (X axis) the distance to its Kth nearest neighbor (Y axis), sorted by decreasing distance. You supply the same distance metric and sieve parameters you want to use for the actual clustering, but nothing else. For example:

cluster C0 dbscan kdist 4 rms :1-4@CA sieve 10 loadpairdist pairdist CpptrajPairDist

The K-dist plot will be named <prefix>.<k>.dat, with the default prefix being ’Kdist’ (in this case the file name would be Kdist.4.dat). The K-dist plot usually looks like a curve with an initially steep slope that gradually decreases. Epsilon should be set near the point where the initial part of the curve starts to flatten out (indicating an increase in density); minpoints is set to whatever <k> was. It has been suggested that the shape of the K-dist curve doesn’t change too much beyond k=4, but users are encouraged to experiment.
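
To make the shape of the K-dist curve concrete, here is a small Python sketch that builds the sorted curve for a handful of points. It assumes the usual DBSCAN convention that the K-dist of a point is the distance to its Kth nearest neighbor (not counting the point itself) and that points are coordinate tuples compared with Euclidean distance; the function name kdist_curve is hypothetical and the sketch does not reproduce CPPTRAJ’s file output.

```python
import math


def kdist_curve(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    curve = []
    for i, p in enumerate(points):
        neighbor_dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        curve.append(neighbor_dists[k - 1])
    return sorted(curve, reverse=True)


# Three points close together plus one isolated point.
pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
curve_k1 = kdist_curve(pts, 1)
curve_k2 = kdist_curve(pts, 2)
```

The isolated point produces the large value at the start of each curve; the values after the drop are the ones that suggest a reasonable epsilon for the dense region.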

Using ’dpeaks’ clustering
The ’dpeaks’ (density peaks) algorithm attempts to find clusters by identifying points in high density regions which are far from other points of high density[644]. There are two ways these points can be chosen. The first and recommended way is manually. In this method, clustering is first run with choosepoints not specified to generate a plot containing density versus minimum distance to point with next highest density (the decision graph). Appropriate cutoffs for distance and density can then be chosen based on visual inspection; cutoffs should be chosen so that they select points that have both a high density and a high distance to point with next highest density. Clustering can then be run again with choosepoints manual and with distancecut and densitycut set.

The second way is automatically; CPPTRAJ will attempt to identify outliers in the density vs distance plot based on distance from the running average. Although this only requires a single pass, this method of choosing points is not well-tested and currently not recommended.
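
The manual procedure can be illustrated with a Python sketch that computes the two decision-graph quantities and then applies distance and density cutoffs. This is a simplified illustration, not the CPPTRAJ implementation: the function names are hypothetical, points are coordinate tuples compared with Euclidean distance, density is the discrete count of other points strictly closer than epsilon (the gauss variant is not shown), and a point with no denser point is assigned its largest distance to any other point.

```python
import math


def decision_graph(points, epsilon):
    """Return (density, delta) lists for a density-peaks decision graph."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] < epsilon)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        to_denser = [dist[i][j] for j in range(n) if density[j] > density[i]]
        # Assumption: the densest point(s) get the largest distance to any point.
        delta.append(min(to_denser) if to_denser else max(dist[i]))
    return density, delta


def choose_points(density, delta, distcut, densitycut):
    """Indices with delta >= distcut and density >= densitycut."""
    return [
        i
        for i in range(len(density))
        if delta[i] >= distcut and density[i] >= densitycut
    ]


# A small group around 1.0 and a larger group around 11.0.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (10.5,), (11.0,), (11.5,), (12.0,)]
density, delta = decision_graph(pts, 1.5)
centers = choose_points(density, delta, distcut=5.0, densitycut=2)
```

Only the middle point of each group combines a high density with a large distance to any denser point, so those two points are the ones selected by the cutoffs.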

The CpptrajPairDist file format
The CpptrajPairDist file is binary; the exact format depends on what version of cpptraj generated the file (since earlier versions had no concept of ’sieve’). The CpptrajPairDist file starts with a 4 byte header containing the characters ’C’ ’T’ ’M’ followed by the version number. A quick way to figure out the version is to use the linux ’od’ command to output the first 4 bytes as hexadecimal, e.g.:

$ od -t x1 -N 4 CpptrajPairDist
0000000 43 54 4d 02

So the CpptrajPairDist file version in the above example is 2. The next few numbers describe the matrix size and depend on the version.

Version 0: Two 4-byte integers: # of rows and # of elements.
Version 1: Two 8-byte unsigned integers (equivalent to size_t on most systems): # of rows and # of elements.
Version 2: Three 8 byte unsigned integers: original # of rows, actual # of rows, and sieve value.

This is followed by the actual matrix data, stored as a single array of floats (4 bytes each). For versions 0 and 1 the number of elements is explicitly stored. For version 2, the number of matrix elements must be calculated from the actual number of rows:

Elements = (actual_rows * (actual_rows - 1)) / 2

The cluster pair-distance matrix is an upper-right triangle matrix without the diagonal (in row-major order), so the first element is the distance between elements 0 and 1, the second is between elements 0 and 2, etc. In version 2 files, if the sieve value is greater than 1 that means original_rows > actual_rows and there is an additional array of characters original_rows long, with ’T’ if the row is being ignored (i.e. it was sieved out) and ’F’ if the row is active (i.e. it is present in the actual pairwise-distance matrix).

The code that CPPTRAJ uses to read in CpptrajPairDist files is in ClusterMatrix::LoadFile() (ClusterMatrix.cpp).
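
For readers who want to inspect a file without CPPTRAJ, the header layout described above can be decoded with a short Python sketch. It covers only what this section states: the 4-byte ’CTM’ + version header, the size fields for each version, and the row-major upper-triangle indexing. The function names are hypothetical, the byte order is an assumption (little-endian by default, adjustable through the byteorder argument), and the sieve-status character array is not parsed because its position in the file is not specified here. ClusterMatrix::LoadFile() remains the authoritative reference.

```python
import struct


def read_pairdist_header(data, byteorder="<"):
    """Decode the CpptrajPairDist header from the leading bytes of a file."""
    if data[:3] != b"CTM":
        raise ValueError("missing 'CTM' magic bytes")
    version = data[3]
    if version == 0:
        rows, elements = struct.unpack_from(byteorder + "ii", data, 4)
        original_rows, sieve, header_bytes = rows, 1, 4 + 8
    elif version == 1:
        rows, elements = struct.unpack_from(byteorder + "QQ", data, 4)
        original_rows, sieve, header_bytes = rows, 1, 4 + 16
    elif version == 2:
        original_rows, rows, sieve = struct.unpack_from(byteorder + "QQQ", data, 4)
        elements = rows * (rows - 1) // 2  # not stored in version 2
        header_bytes = 4 + 24
    else:
        raise ValueError("unsupported version %d" % version)
    return {
        "version": version,
        "original_rows": original_rows,
        "actual_rows": rows,
        "sieve": sieve,
        "elements": elements,
        "header_bytes": header_bytes,
    }


def pair_index(i, j, nrows):
    """Flat index of pair (i, j) in a row-major upper triangle without diagonal."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i
    return i * nrows - i * (i + 1) // 2 + (j - i - 1)


# Synthetic version 2 header: 100 original rows, 10 actual rows, sieve 10.
example = b"CTM\x02" + struct.pack("<QQQ", 100, 10, 10)
header = read_pairdist_header(example)
```

For a real file, pass the first 28 bytes (or the whole file contents read in binary mode) to read_pairdist_header.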