Parallelization

CPPTRAJ has many levels of parallelization.

These levels can be enabled via the '-mpi', '-openmp', and/or '-cuda' configure flags for MPI, OpenMP, and CUDA parallelization respectively. At the highest level, trajectory and ensemble reads are parallelized with MPI. In addition, certain time-consuming actions have been parallelized with OpenMP and/or CUDA.

Note that any combination of the '-openmp', '-cuda', and '-mpi' flags may be used to generate a hybrid MPI/OpenMP/CUDA binary; however, this may require additional runtime setup (e.g. setting OMP_NUM_THREADS for OpenMP) to work properly and avoid oversubscribing cores.
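
For example, a hybrid MPI/OpenMP run might be launched as sketched below. The binary name (cpptraj.MPI is typical of Amber installations), the MPI launcher, and the process/thread counts are assumptions that depend on the local build and machine:

    export OMP_NUM_THREADS=4                 # 4 OpenMP threads per MPI process
    mpirun -np 2 cpptraj.MPI -i cpptraj.in   # 2 MPI processes x 4 threads = 8 cores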

MPI Trajectory Parallelization

CPPTRAJ has two levels of MPI parallelization for reading input trajectories. The first is for trajin trajectory input, where the total number of input frames is divided as evenly as possible among all MPI threads. For example, given two trajectories of 1000 frames each and 4 MPI threads, thread 0 reads frames 1-500 of trajectory 1, thread 1 reads frames 501-1000 of trajectory 1, thread 2 reads frames 1-500 of trajectory 2, and thread 3 reads frames 501-1000 of trajectory 2. The second is for ensemble trajectory input, where the reading/processing/writing of each member of the ensemble is divided among MPI threads. The number of MPI threads must be a multiple of the ensemble size.
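
As a minimal sketch of the trajin case above (the topology and trajectory file names and the cpptraj.MPI binary name are assumptions, not part of the original example):

    mpirun -np 4 cpptraj.MPI -p topol.parm7 -i rms.in

where rms.in contains:

    trajin traj1.nc
    trajin traj2.nc
    rms ToFirst first :CA out rmsd.dat
    run

Each MPI thread reads and processes its assigned block of frames, with data sets combined before output is written.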

If the number of threads is greater than the ensemble size, the processing of each ensemble member will be divided among the MPI threads assigned to it. For example, given an ensemble of 4 trajectories and 8 threads, threads 0 and 1 are assigned to the first ensemble trajectory, threads 2 and 3 are assigned to the second ensemble trajectory, and so on.

When using ensemble mode in parallel it is recommended that the ensemblesize command be used prior to any ensemble command, as this will make setup far more efficient. In order to use the MPI version, Amber/cpptraj should be configured with the '-mpi' flag. You can tell if cpptraj has been compiled with MPI as it will print 'MPI' in the title, and/or by calling 'cpptraj --defines' and looking for '-DMPI'.
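
As a sketch of an ensemble run (the replica file names, ensemble size, and cpptraj.MPI binary name are assumptions), an input file such as

    ensemblesize 4
    ensemble rep.crd.001
    rms ToFirst first :CA out ensrms.dat
    run

could be run with 'mpirun -np 4 cpptraj.MPI -p topol.parm7 -i ensemble.in', giving one MPI thread per ensemble member.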

OpenMP Parallelization

Some of the more time-consuming actions/analyses in cpptraj have been parallelized with OpenMP to take advantage of machines with multiple cores. In order to use OpenMP parallelization Amber/cpptraj should be configured with the '-openmp' flag. You can easily tell if CPPTRAJ has been compiled with OpenMP as it will print 'OpenMP' in the title, and/or by calling 'cpptraj --defines' and looking for '-D_OPENMP'.

The following actions/analyses have been OpenMP parallelized:
2drms/rms2d
atomiccorr
checkstructure
closest
cluster (pair-wise distance calculation and sieved frame restore only)
dssp/secstruct
gist (non-bonded calculation)
kde
mask (distance-based masks only)
matrix (coordinate covariance matrices only)
minimage
radial
replicatecell
rmsavgcorr
spam
surf
velocityautocorr
volmap
watershell
wavelet

By default, OpenMP cpptraj will use all available cores. The number of OpenMP threads can be controlled by setting the OMP_NUM_THREADS environment variable.
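
For example, to restrict an OpenMP build to 8 threads (the thread count and the cpptraj.OMP binary name are assumptions; plain 'cpptraj' may be the OpenMP binary depending on the installation):

    export OMP_NUM_THREADS=8   # use at most 8 OpenMP threads
    cpptraj.OMP -i cpptraj.in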

CUDA Parallelization

Some time-consuming actions in cpptraj have been parallelized with CUDA to take advantage of machines with NVIDIA GPUs. In order to use CUDA parallelization Amber/cpptraj should be configured with the '-cuda' flag. You can easily tell if cpptraj has been compiled with CUDA as it will print 'CUDA' and details on the current graphics device in the title, and/or by calling 'cpptraj --defines' and looking for '-DCUDA'.

The following actions have been CUDA parallelized:
closest
watershell
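
As a sketch, the closest action could be run on a CUDA build as follows (the file names, atom mask, and cpptraj.cuda binary name are assumptions; the input itself is the same as for a non-CUDA run):

    cpptraj.cuda -p topol.parm7 -i closest.in

where closest.in contains:

    trajin traj1.nc
    closest 10 :1-20 first
    trajout closest10.nc
    run

No input changes are needed; the CUDA-parallelized portion of the action is used automatically when the binary has been built with '-cuda'.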