HELM - Hierarchical Extended Linkage Method

Overview 

HELM is a hierarchical agglomerative clustering algorithm that uses n-ary similarity to merge clusters at each level. It adapts traditional hierarchical clustering to be more efficient and scalable, reducing an $O(N^2)$ algorithm to $O(N)$. It specializes in recognizing dominant conformations within an ensemble and is often used alongside NANI to balance efficiency and precision.
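To illustrate how an n-ary quantity can avoid pairwise comparisons, the sketch below computes the mean squared deviation over all ordered pairs of samples from column sums alone, then checks it against the brute-force pairwise result. The function names and the random data are hypothetical, and this is not claimed to be the exact routine MDANCE uses.

```python
import numpy as np


def nary_msd(frames: np.ndarray) -> float:
    """Mean squared deviation over all ordered pairs, using only column sums."""
    n = len(frames)
    c_sum = frames.sum(axis=0)           # per-feature sum
    sq_sum = (frames ** 2).sum(axis=0)   # per-feature sum of squares
    return float(np.sum(2 * (n * sq_sum - c_sum ** 2)) / n ** 2)


def pairwise_msd(frames: np.ndarray) -> float:
    """Same quantity computed the expensive way, over every pair of samples."""
    diffs = frames[:, None, :] - frames[None, :, :]
    return float((diffs ** 2).sum(axis=2).mean())


# Hypothetical data: 200 samples with 6 features each
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 6))

fast = nary_msd(frames)
slow = pairwise_msd(frames)
```

The two values agree because the sum of squared differences over all pairs expands into the sum of squares and the square of the sum, both of which need a single pass over the data.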

This clustering tutorial applies to datasets from all applications (2D fingerprints, mass spectrometry imaging data, etc.). Molecular Dynamics trajectories are handled differently. Steps that apply only to Molecular Dynamics trajectories are marked as such; all other steps apply to every dataset.

Tutorial 

1. Clone the repository 

Clone the MDANCE repository if you haven’t already.

$ git clone https://github.com/mqcomplab/MDANCE.git
$ cd MDANCE/scripts/helm

2. Input Preparations 

Preparation for Molecular Dynamics Trajectory

Prepare a valid topology file (e.g. .pdb, .prmtop), trajectory file (e.g. .dcd, .nc), and the atom selection. This step will convert a Molecular Dynamics trajectory to a numpy ndarray. Make sure the trajectory is already aligned and/or centered if needed!

The preprocessing notebook contains a step-by-step tutorial for preparing the input for HELM.

A copy of this notebook can be found in $PATH/MDANCE/scripts/inputs/preprocessing.ipynb.
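As a rough picture of what the conversion produces, the sketch below flattens per-frame coordinates into a 2D array with one row per frame. It assumes the layout (number of frames, 3 x number of atoms); the random coordinates are a hypothetical stand-in for a real trajectory, and the notebook remains the reference for the actual procedure.

```python
import numpy as np

# Hypothetical stand-in for coordinates read from an aligned trajectory:
# shape (n_frames, n_atoms, 3), i.e. x, y, z for every selected atom.
n_frames, n_atoms = 5, 50
coords = np.random.default_rng(0).normal(size=(n_frames, n_atoms, 3))

# Flatten each frame into one row: x1, y1, z1, x2, y2, z2, ...
traj_numpy = coords.reshape(n_frames, n_atoms * 3)
```

With this layout, N_atoms can be recovered as the number of columns divided by 3.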

Preparation for all other datasets (OPTIONAL)

This step is optional. If you are using a metric other than the mean-square deviation (MSD), which is the default, you will need to normalize the dataset. Otherwise, you can skip this step.

normalize.py will normalize the dataset. The following parameters need to be specified in the script:

# System info - EDIT THESE
data_file = '../data/2D/blob_disk.csv'
array = np.genfromtxt(data_file, delimiter=',')
output_base_name = 'output_base_name'

Inputs

System info

data_file is your input file containing a 2D array.
array is the dataset loaded from data_file. The loading step can be changed to suit your file format, but array must be array-like with shape (number of samples, number of features).
output_base_name is the base name for the output file. The output file will be saved as output_base_name.npy.
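For orientation, here is a minimal sketch of one common normalization, per-feature min-max scaling to the range [0, 1]. Whether normalize.py scales per feature or over the whole array is an assumption here, and the small array is made up for illustration.

```python
import numpy as np

# Made-up dataset with shape (number of samples, number of features)
array = np.array([[0.0, 10.0],
                  [5.0, 20.0],
                  [10.0, 40.0]])

col_min = array.min(axis=0)
col_range = array.max(axis=0) - col_min
col_range[col_range == 0] = 1.0   # avoid dividing by zero for constant features

# Every feature now spans [0, 1]
normalized = (array - col_min) / col_range
```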

3. Prior Clustering (Optional)

HELM requires a set of initial clusters to start with. These can come from any clustering method; NANI, covered elsewhere in this tutorial set, is one example. All you need is a file of cluster labels in a format similar to this:

#frame,cluster
0,0
1,0
2,1
3,1
4,2
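If your initial labels come from another clustering tool, a file in this format can be written with NumPy. The sketch below writes to an in-memory buffer so it runs without touching disk; the label values are the ones from the example above, and the variable names are illustrative.

```python
import io

import numpy as np

labels = np.array([0, 0, 1, 1, 2])                 # one cluster label per frame
table = np.column_stack((np.arange(len(labels)), labels))

buffer = io.StringIO()                             # replace with a file path to save to disk
np.savetxt(buffer, table, fmt='%d', delimiter=',',
           header='frame,cluster', comments='#')
text = buffer.getvalue()

# Reading it back: the '#' header line is skipped as a comment
loaded = np.loadtxt(io.StringIO(text), delimiter=',', dtype=int)
```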

4. HELM clustering 

scripts/helm/intra/run_helm_intra.py will run HELM clustering on the dataset. The following parameters need to be specified in the script:

# System info - EDIT THESE
input_traj_numpy = data.sim_traj_numpy
cluster_labels = '../labels_60.csv'
sieve = 1
N_atoms = 50                                    # Number of atoms in the system

# HELM params - EDIT THESE
metric = 'MSD'                                  # Default
N0 = 60                                         # How many clusters to start with
final_target = 1                                # How many clusters to end with
align_method = None                             # Default
save_pairwise_sim = False                       # Default
merging_scheme = 'inter'                        # {'inter', 'intra'}

Inputs

System info

input_traj_numpy: The trajectory from step 2.
cluster_labels: The cluster labels from step 3.
sieve: Read every sieve-th frame from the trajectory.
N_atoms: The number of atoms used in the clustering.

HELM params

metric: The metric used to calculate the similarity between frames (See mdance.tools.bts.extended_comparisons for details).
N0: The number of clusters to start with.
final_target: The number of clusters to end with.
align_method (optional): The method to align the clusters. Default is None.
save_pairwise_sim (optional): A boolean variable to indicate whether to save the pairwise similarity matrix. Default is False.
merging_scheme (optional): The merging scheme to use, one of {inter, intra}. inter merges the clusters with the lowest inter-cluster distance. intra merges the clusters with the lowest intra-cluster distance. Default is inter.
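To make the merging idea concrete, the toy sketch below performs one agglomerative step: it evaluates every pair of clusters and merges the pair whose union has the lowest MSD. This criterion is an assumption chosen for illustration; the function names are hypothetical and the code is not MDANCE's implementation of either merging scheme.

```python
from itertools import combinations

import numpy as np


def msd(frames: np.ndarray) -> float:
    """Mean squared deviation of a set of samples, from column sums."""
    n = len(frames)
    c_sum = frames.sum(axis=0)
    sq_sum = (frames ** 2).sum(axis=0)
    return float(np.sum(2 * (n * sq_sum - c_sum ** 2)) / n ** 2)


def merge_once(clusters: list) -> list:
    """Merge the pair of clusters whose union has the lowest MSD."""
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda pair: msd(np.vstack((clusters[pair[0]], clusters[pair[1]]))),
    )
    merged = np.vstack((clusters[i], clusters[j]))
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]


# Three made-up clusters: two close together, one far away
a = np.array([[0.0, 0.0], [0.1, 0.0]])
b = np.array([[0.2, 0.0], [0.3, 0.1]])
c = np.array([[10.0, 10.0], [10.1, 10.0]])

result = merge_once([a, b, c])
```

Repeating merge_once until the number of clusters reaches final_target mimics running from N0 initial clusters down to the requested endpoint.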

Execution

$ python run_helm_intra.py

Outputs

Pickle file containing the clustering results.
CSV file containing the Calinski-Harabasz and Davies-Bouldin scores for each number of clusters.

5. Get the optimal number of clusters 

The clustering screening results will be analyzed using the Davies-Bouldin index (DB). There are two criteria to select the number of clusters:

lowest DB
maximum 2nd derivative of DB.

The analysis notebook contains a step-by-step tutorial for analyzing the clustering screening results.
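The two selection criteria can be sketched with NumPy as follows. The DB scores are made up for illustration, and the second derivative is approximated with a central finite difference, which assumes the numbers of clusters are evenly spaced.

```python
import numpy as np

n_clusters = np.array([2, 3, 4, 5, 6, 7])
db = np.array([1.9, 1.0, 0.9, 0.8, 0.75, 0.74])   # made-up Davies-Bouldin scores

# Criterion 1: lowest DB
best_by_min = int(n_clusters[np.argmin(db)])

# Criterion 2: maximum second derivative of DB (defined for interior points only)
second_deriv = db[:-2] - 2 * db[1:-1] + db[2:]
best_by_curvature = int(n_clusters[1:-1][np.argmax(second_deriv)])
```

In this made-up series the DB score keeps falling slowly, so the lowest-DB criterion picks the largest number of clusters, while the second-derivative criterion picks the point where the improvement levels off.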

6. Cluster Assignment 

assign_labels_intra.py will assign cluster labels to the trajectory. The following parameters need to be specified in the script:

# System info - EDIT THESE
input_traj_numpy = data.sim_traj_numpy
N_atoms = 50
sieve = 1

# HELM params - EDIT THESE
n_clusters = 10
pre_cluster_labels = '../labels_60.csv'
pickle_file = 'inter-helm.pkl'
metric = 'MSD'                                                      # Default
extract_type = 'top'                                                # Default
n_structures = 11                                                   # Default

Inputs

System info

input_traj_numpy: The normalized trajectory from step 2.
N_atoms: The number of atoms used in the clustering.
sieve: Read every sieve-th frame from the trajectory.

HELM params

n_clusters: The number of clusters to assign labels to. Use the optimal number of clusters from the analysis in step 5.
pre_cluster_labels: The cluster labels from step 3.
pickle_file: The clustering results from step 4.
metric: The metric used to calculate the similarity between frames (See mdance.tools.bts.extended_comparisons for details).
extract_type: The type of extraction method to use. {top, random}. top means to extract the top n_structures from each cluster. random means to extract n_structures random structures from each cluster.
n_structures: The number of structures to extract from each cluster.
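The difference between the two extraction types can be sketched as below. Here top is assumed to mean the frames closest to the cluster mean, which may not match how MDANCE ranks frames; pick_frames and its seed parameter are hypothetical names introduced for this example.

```python
import numpy as np


def pick_frames(data, labels, cluster, n_structures, extract_type='top', seed=0):
    """Return frame indices for one cluster, either most central or random."""
    members = np.flatnonzero(labels == cluster)
    n = min(n_structures, len(members))
    if extract_type == 'top':
        center = data[members].mean(axis=0)
        dist = ((data[members] - center) ** 2).sum(axis=1)
        return members[np.argsort(dist)[:n]]       # closest to the cluster mean first
    if extract_type == 'random':
        rng = np.random.default_rng(seed)
        return rng.choice(members, size=n, replace=False)
    raise ValueError("extract_type must be 'top' or 'random'")


# Made-up data: five frames with one feature, assigned to two clusters
data = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])

top_frames = pick_frames(data, labels, cluster=0, n_structures=1)
random_frames = pick_frames(data, labels, cluster=1, n_structures=5,
                            extract_type='random')
```

When a cluster has fewer members than n_structures, this sketch returns all of its members.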

Execution

$ python assign_labels_intra.py

Outputs

helm_cluster_labels.csv: Contains the cluster labels for each frame.
helm_best_frames_indices.csv: Contains the indices of the best or random frames to extract from each cluster.
helm_summary.csv: Contains the summary of the clustered population.
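A population summary can also be recomputed from the per-frame labels. The sketch below counts frames per cluster and converts the counts to fractions; the labels are made up, and the exact columns of helm_summary.csv are not assumed.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])      # made-up cluster label per frame

clusters, counts = np.unique(labels, return_counts=True)
fractions = counts / counts.sum()

# One line per cluster: cluster id, number of frames, fraction of all frames
for cluster, count, fraction in zip(clusters, counts, fractions):
    print(f'{cluster},{count},{fraction:.3f}')
```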

7. Extract frames for each cluster (Optional)

postprocessing.ipynb will use the indices from the previous step to extract the designated frames from the original trajectory for each cluster.

A copy of this notebook can be found in $PATH/MDANCE/scripts/outputs/postprocessing.ipynb.