*k*-means NANI Tutorial
=======================
.. contents::
   :local:
   :depth: 2

Overview
--------
This clustering tutorial applies to datasets from **all** applications
(2D fingerprints, mass spectrometry imaging data, etc.). Molecular
Dynamics trajectories are treated differently: if a step is only
for Molecular Dynamics trajectories, it will be marked as such. Otherwise,
the step applies to all datasets.
Tutorial
--------
1. Clone the repository
~~~~~~~~~~~~~~~~~~~~~~~
Clone the MDANCE repository if you haven't already.
.. code:: bash

   $ git clone https://github.com/mqcomplab/MDANCE.git
   $ cd MDANCE/scripts/nani

2. Input Preparations
~~~~~~~~~~~~~~~~~~~~~
Preparation for Molecular Dynamics Trajectory
Prepare a valid topology file (e.g. ``.pdb``, ``.prmtop``), a trajectory
file (e.g. ``.dcd``, ``.nc``), and the atom selection. This step
converts a Molecular Dynamics trajectory to a NumPy ndarray. **Make sure
the trajectory is already aligned and/or centered if needed!**
The `Preprocessing Notebook <../examples/preprocessing.html>`__
contains a step-by-step tutorial for preparing the input for NANI.
A copy of this notebook can be found in ``$PATH/MDANCE/scripts/inputs/preprocessing.ipynb``.
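
As a minimal sketch of the target array layout (not the notebook's actual
code), the snippet below assumes the selected atom coordinates are already
loaded as an array of shape (frames, atoms, 3) and flattens each frame into
one row, giving the (number of samples, number of features) shape described
for other datasets. The random ``coords`` array and the sizes used are
placeholders for illustration only.

.. code:: python

   import numpy as np

   n_frames, n_atoms = 100, 50  # placeholder sizes, not from a real system

   # Stand-in for coordinates read from a trajectory: (frames, atoms, xyz)
   rng = np.random.default_rng(0)
   coords = rng.normal(size=(n_frames, n_atoms, 3))

   # One row per frame, one column per Cartesian coordinate
   traj_2d = coords.reshape(n_frames, n_atoms * 3)

   print(traj_2d.shape)

With this layout, the first three columns of a row hold the x, y, z
coordinates of the first selected atom in that frame.
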
Preparation for all other datasets (OPTIONAL)
This step is **optional**. If you are using a metric other than the
default, mean-square deviation (MSD), you will need to normalize
the dataset. Otherwise, you can skip this step.
``normalize.py`` will
normalize the dataset. The following parameters must be specified in the
script:
::

   # System info - EDIT THESE
   data_file = '../data/2D/blob_disk.csv'
   array = np.genfromtxt(data_file, delimiter=',')
   output_base_name = 'output_base_name'

Inputs
^^^^^^
System info
'''''''''''
| ``data_file`` is your input file with a 2D array.
| ``array`` is the dataset loaded from ``data_file``. The loading step can be changed to match your file format. However, ``array`` must be array-like with shape (number of samples, number of features).
| ``output_base_name`` is the base name for the output file. The output file will be saved as ``output_base_name.npy``.
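
The snippet below sketches one common way to normalize such a dataset:
per-feature min-max scaling to the range [0, 1]. This scheme is an
assumption for illustration, and the one used by ``normalize.py`` may
differ. The small in-memory ``array`` stands in for data loaded from
``data_file``.

.. code:: python

   import numpy as np

   # Stand-in dataset: 3 samples, 2 features
   array = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [4.0, 40.0]])

   col_min = array.min(axis=0)
   col_max = array.max(axis=0)

   # Use a span of 1 for constant columns to avoid dividing by zero
   span = np.where(col_max > col_min, col_max - col_min, 1.0)
   normalized = (array - col_min) / span

   print(normalized)

The result could then be written with ``np.save`` under the chosen
``output_base_name``.
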
3. NANI Screening
~~~~~~~~~~~~~~~~~
``screen_nani.py`` will
run NANI for a range of cluster counts and calculate cluster quality metrics.
For best results, we recommend screening a wide range of
cluster counts. The following parameters must be specified in the
script:
::

   # System info
   input_traj_numpy = '../../data/md/backbone.npy'
   N_atoms = 50
   sieve = 1

   # NANI parameters
   output_dir = 'outputs'
   init_type = 'strat_all'
   metric = 'MSD'
   start_n_clusters = 2
   end_n_clusters = 30

.. _system-info-1:
Inputs
^^^^^^
System info
'''''''''''
| ``input_traj_numpy`` is the NumPy array prepared in step 2; otherwise, it is your loaded dataset.
| ``N_atoms`` is the number of atoms used in the clustering. **For all non-Molecular Dynamics datasets, set** ``N_atoms=1``.
| ``sieve`` takes every ``sieve``\ th frame from the trajectory for analysis.
NANI parameters
''''''''''''''''
| ``output_dir`` is the directory to store the clustering results.
| ``init_type`` is the seed selector to use (See ``mdance.cluster.nani.KmeansNANI`` for details).
| ``metric`` is the metric used to calculate the similarity between frames (See ``mdance.tools.bts.extended_comparisons`` for details).
| ``start_n_clusters`` is the starting number of clusters for screening. **This number must be at least 2**.
| ``end_n_clusters`` is the ending number for screening.
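
To illustrate how these parameters interact, the sketch below applies
``sieve`` with NumPy slicing and lists the cluster counts a screening run
would visit. Treating ``end_n_clusters`` as inclusive is an assumption
here, and the random array replaces the real ``input_traj_numpy`` file.

.. code:: python

   import numpy as np

   N_atoms = 50
   sieve = 2
   start_n_clusters = 2
   end_n_clusters = 30

   # Stand-in for np.load(input_traj_numpy): 100 frames, N_atoms * 3 features
   rng = np.random.default_rng(0)
   traj = rng.normal(size=(100, N_atoms * 3))

   # Keep every sieve-th frame
   sieved = traj[::sieve]

   # Cluster counts to screen (end assumed inclusive)
   cluster_counts = list(range(start_n_clusters, end_n_clusters + 1))

   print(sieved.shape, cluster_counts[0], cluster_counts[-1])

A larger ``sieve`` shrinks the number of frames clustered at every cluster
count, which reduces the cost of a wide screening range.
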
Execution
^^^^^^^^^
Make sure your working directory is ``$PATH/MDANCE/scripts/nani``.

.. code:: bash

   $ python screen_nani.py

Outputs
^^^^^^^
A CSV file containing the number of clusters and the corresponding number
of iterations, Calinski-Harabasz score, Davies-Bouldin score, and
average mean-square deviation for that seed selector.
4. Analysis of NANI Screening Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The clustering screening results will be analyzed using the
Davies-Bouldin index (DB). There are two criteria to select the number
of clusters:
1. lowest DB
2. maximum 2nd derivative of DB.
The `analysis notebook <../examples/analysis_db.html>`__
contains a step-by-step tutorial for analyzing the clustering screening results.
A copy of this notebook can be found in ``$PATH/MDANCE/scripts/nani/analysis_db.ipynb``.
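
The two selection criteria can be sketched in a few lines of plain Python.
This is an illustrative sketch rather than the notebook's code: the
``db_scores`` values are made up, and the second derivative is approximated
with a central finite difference, which assumes consecutive integer cluster
counts.

.. code:: python

   # Hypothetical screening results: number of clusters -> DB index
   db_scores = {2: 1.40, 3: 1.10, 4: 1.05, 5: 0.60, 6: 0.95, 7: 0.90, 8: 0.92}

   ks = sorted(db_scores)

   # Criterion 1: the cluster count with the lowest DB
   k_lowest = min(ks, key=lambda k: db_scores[k])

   # Criterion 2: the cluster count with the largest second derivative,
   # DB(k-1) - 2*DB(k) + DB(k+1), defined only for interior points
   second_deriv = {
       k: db_scores[k - 1] - 2 * db_scores[k] + db_scores[k + 1]
       for k in ks[1:-1]
   }
   k_curvature = max(second_deriv, key=second_deriv.get)

   print(k_lowest, k_curvature)

A large positive second derivative marks a sharp dip in DB relative to its
neighbors, so the two criteria can agree, as they do for this toy data, or
point to different candidates worth inspecting.
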
5. Cluster Assignment
~~~~~~~~~~~~~~~~~~~~~
``assign_labels.py``
will assign a cluster label to each frame using *k*-means clustering with the
chosen initialization method. The following parameters must be specified in the
script:
::

   # System info - EDIT THESE
   input_traj_numpy = '../../data/md/backbone.npy'
   N_atoms = 50
   sieve = 1

   # K-means params - EDIT THESE
   n_clusters = 6
   init_type = 'strat_all'
   metric = 'MSD'
   n_structures = 11
   output_dir = 'outputs'

.. _inputs-1:
Inputs
^^^^^^
.. _system-info-2:
System info
'''''''''''
| ``input_traj_numpy`` is the NumPy array prepared in step 2; otherwise, it is your loaded dataset.
| ``N_atoms`` is the number of atoms used in the clustering. **For all non-Molecular Dynamics datasets, set** ``N_atoms=1``.
| ``sieve`` takes every ``sieve``\ th frame from the trajectory for analysis.
*k*-means params
''''''''''''''''
| ``n_clusters`` is the number of clusters for labeling.
| ``init_type`` is the seed selector to use (See ``mdance.cluster.nani.KmeansNANI`` for details).
| ``metric`` is the metric used to calculate the similarity between frames (See ``mdance.tools.bts.extended_comparisons`` for details).
| ``n_structures`` is the number of frames to extract from each cluster.
| ``output_dir`` is the directory to store the clustering results.
.. _execution-1:
Execution
^^^^^^^^^
Make sure your working directory is ``$PATH/MDANCE/scripts/nani``.

.. code:: bash

   $ python assign_labels.py

.. _outputs-1:
Outputs
^^^^^^^
* csv file containing the indices of the best frames in each cluster.
* csv file containing the cluster labels for each frame.
* csv file containing the population of each cluster.
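
As an illustration of how the last two outputs relate, the sketch below
derives cluster populations from per-frame labels. The ``labels`` list is
made up, and nothing is assumed about the actual CSV layout written by
``assign_labels.py``.

.. code:: python

   from collections import Counter

   # Hypothetical cluster label for each frame
   labels = [0, 2, 1, 0, 0, 2, 1, 0]

   # Number of frames assigned to each cluster
   population = Counter(labels)

   # Fraction of all frames in each cluster
   n_frames = len(labels)
   fractions = {c: population[c] / n_frames for c in sorted(population)}

   print(dict(population), fractions)
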
6. Extract frames for each cluster (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`postprocessing.ipynb <../examples/postprocessing.html>`__
uses the indices from the previous step to extract the designated frames
from the original trajectory for each cluster.
A copy of this notebook can be found in ``$PATH/MDANCE/scripts/outputs/postprocessing.ipynb``.
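
The idea of index-based extraction can be sketched with NumPy alone. Unlike
the notebook, which works on the original trajectory, this sketch selects
rows from a flattened coordinate array and reshapes them back to
(frames, atoms, 3). The ``best_frames`` dictionary and the random ``traj``
array are hypothetical stand-ins.

.. code:: python

   import numpy as np

   N_atoms = 4

   # Stand-in for the flattened trajectory: 20 frames, N_atoms * 3 features
   rng = np.random.default_rng(0)
   traj = rng.normal(size=(20, N_atoms * 3))

   # Hypothetical best-frame indices for each cluster
   best_frames = {0: [3, 7, 12], 1: [0, 18]}

   # Select the rows for each cluster and restore the per-atom layout
   extracted = {
       cluster: traj[idx].reshape(len(idx), N_atoms, 3)
       for cluster, idx in best_frames.items()
   }

   for cluster, frames in extracted.items():
       print(cluster, frames.shape)
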
Further Reading
~~~~~~~~~~~~~~~
For more information on the NANI algorithm, please refer to the NANI
paper cited below.
Please Cite
~~~~~~~~~~~
.. code:: bibtex
@article{chen_k-means_2024,
title = {k-{Means} {NANI}: {An} {Improved} {Clustering} {Algorithm} for {Molecular} {Dynamics} {Simulations}},
volume = {20},
copyright = {https://doi.org/10.15223/policy-029},
issn = {1549-9618, 1549-9626},
shorttitle = {k-{Means} {NANI}},
url = {https://pubs.acs.org/doi/10.1021/acs.jctc.4c00308},
doi = {10.1021/acs.jctc.4c00308},
abstract = {One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex data sets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. K-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse data sets and be used as a standalone tool or as part of our MDANCE clustering package.},
language = {en},
number = {13},
urldate = {2024-07-09},
journal = {Journal of Chemical Theory and Computation},
author = {Chen, Lexin and Roe, Daniel R. and Kochert, Matthew and Simmerling, Carlos and Miranda-Quintana, Ramón Alain},
month = jul,
year = {2024},
pages = {5583--5597},
}