eQual - Extended Quality Clustering
Overview
eQual is a quality clustering algorithm that uses a radial threshold to grow each cluster so as to maximize the similarity between its members. It is an extension of the Radial Threshold Clustering algorithm (Daura and Oscar Conchillo-Solé). eQual improves on it with new seed selection methods and tie-breaking criteria.
eQual first selects a seed to start the clustering. It then grows the cluster by adding the neighbors that lie within a threshold of the seed. The quantity compared against this threshold is the mean-square deviation from the seed. Growth continues until no neighbors remain within the threshold, and the chosen neighbors are then removed from the original dataset. A new iteration begins by selecting a medoid and its neighbors from the remaining data. If the user requests multiple medoids, the medoid that proposes the densest and most similar cluster is chosen. The process repeats, and the algorithm terminates when the dataset is empty.
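The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the MDANCE implementation: the function names are hypothetical, candidate seeds are assumed to be the points closest to the mean of the remaining data (a stand-in for the real seed selectors), and "densest and most similar" is approximated as "most members, then lowest mean MSD".

```python
import numpy as np


def msd_to_seed(data, seed_idx):
    """Mean-square deviation of every row of `data` from row `seed_idx`."""
    diff = data - data[seed_idx]
    return np.mean(diff ** 2, axis=1)


def radial_threshold_clustering(data, threshold, n_seeds=3):
    """Simplified radial-threshold loop (illustration only).

    Returns a list of index arrays, one per cluster, in the order found.
    """
    data = np.asarray(data, dtype=float)
    remaining = np.arange(len(data))  # indices still available
    clusters = []
    while remaining.size > 0:
        subset = data[remaining]
        # Assumption: candidate seeds are the points closest to the mean
        # of the remaining data.
        dist_to_mean = np.mean((subset - subset.mean(axis=0)) ** 2, axis=1)
        candidates = np.argsort(dist_to_mean, kind="stable")[:n_seeds]

        best_members, best_key = None, None
        for cand in candidates:
            msd = msd_to_seed(subset, cand)
            members = np.flatnonzero(msd <= threshold)
            # Prefer the densest proposal; break ties by lower mean MSD.
            key = (-members.size, msd[members].mean())
            if best_key is None or key < best_key:
                best_members, best_key = members, key

        # Record the winning cluster and remove its members from the pool.
        clusters.append(remaining[best_members])
        remaining = np.delete(remaining, best_members)
    return clusters


# Two well-separated toy groups in 2D
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0]])
clusters = radial_threshold_clustering(points, threshold=0.5)
```

Because a seed always lies within the threshold of itself, every iteration removes at least one point, so the loop terminates once the dataset is empty.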
This clustering tutorial applies to datasets from any application (2D fingerprints, mass spectrometry imaging data, etc.). Molecular Dynamics trajectories are treated differently. Any step that applies only to Molecular Dynamics trajectories is marked as such; all other steps apply to every dataset.
Tutorial
1. Clone the repository
Clone the MDANCE repository if you haven’t already.
$ git clone https://github.com/mqcomplab/MDANCE.git
$ cd MDANCE/scripts/equal
2. Input Preparations
Preparation for Molecular Dynamics Trajectory
Prepare a valid topology file (e.g. .pdb, .prmtop), trajectory
file (e.g. .dcd, .nc), and the atom selection. This step will
convert a Molecular Dynamics trajectory to a numpy ndarray. Make sure
the trajectory is already aligned and/or centered if needed!
The Preprocessing Notebook contains a step-by-step tutorial for preparing the input for eQual.
A copy of this notebook can be found in $PATH/MDANCE/scripts/inputs/preprocessing.ipynb.
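The notebook is the reference for this step. As a rough picture of the target layout only, the sketch below flattens per-frame coordinates into one row per frame. It assumes the coordinates have already been read into an array of shape (n_frames, n_atoms, 3) and that the clustering input is a 2D array of shape (n_frames, n_atoms * 3); the helper name flatten_trajectory is hypothetical, and synthetic data stands in for a real trajectory.

```python
import numpy as np


def flatten_trajectory(coords):
    """Reshape (n_frames, n_atoms, 3) coordinates to (n_frames, n_atoms * 3)."""
    coords = np.asarray(coords, dtype=float)
    if coords.ndim != 3 or coords.shape[2] != 3:
        raise ValueError("expected an array of shape (n_frames, n_atoms, 3)")
    n_frames, n_atoms, _ = coords.shape
    return coords.reshape(n_frames, n_atoms * 3)


# Synthetic stand-in for coordinates read from an aligned trajectory
rng = np.random.default_rng(0)
coords = rng.normal(size=(4, 50, 3))
traj_numpy = flatten_trajectory(coords)  # one row per frame
```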
Preparation for all other datasets (OPTIONAL)
This step is optional. If you are using a metric other than the default, mean-square deviation (MSD), you will need to normalize the dataset. Otherwise, you can skip this step.
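To make "normalize" concrete, here is a minimal sketch of min-max scaling to the range [0, 1]. Whether normalize.py uses exactly this scheme, and whether it scales globally or per feature, is an assumption; the function name min_max_normalize is hypothetical.

```python
import numpy as np


def min_max_normalize(array):
    """Scale all values to [0, 1] using the global minimum and maximum."""
    array = np.asarray(array, dtype=float)
    lo, hi = array.min(), array.max()
    if hi == lo:
        # Constant input: nothing to scale, return zeros of the same shape.
        return np.zeros_like(array)
    return (array - lo) / (hi - lo)


# Toy dataset with shape (number of samples, number of features)
raw = np.array([[2.0, 4.0], [6.0, 10.0], [4.0, 8.0]])
normed = min_max_normalize(raw)
```

The normalized array could then be written with np.save under the chosen base name.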
normalize.py will normalize the dataset. The following parameters must be specified in the script:
# System info - EDIT THESE
data_file = '../data/2D/blob_disk.csv'
array = np.genfromtxt(data_file, delimiter=',')
output_base_name = 'output_base_name'
Inputs
System info
data_file is your input file with a 2D array.
array is the dataset loaded from data_file. This step can be changed according to the file format you have. However, array must be array-like with shape (number of samples, number of features).
output_base_name is the base name for the output file. The output file will be saved as output_base_name.npy.
3. eQual Screening
scripts/equal/screen_equal.py will screen eQual clustering over multiple thresholds and report the optimal threshold. For the best results, we recommend screening eQual with a wide range of threshold values. Depending on the number of samples or features, consider sieving the data when screening a wide threshold range. For large datasets, please submit this as a job instead of running it on the command line. The following parameters must be specified in the script:
# System info - EDIT THESE
input_traj_numpy = data.sim_traj_numpy
N_atoms = 50
sieve = 1
# eQUAL params - EDIT THESE
metric = 'MSD' # Default
n_seeds = 3
check_sim = True # Default
reject_lowd = True # Default
sim_threshold = 16
min_samples = 10 # Default
# thresholds params- EDIT THESE
start_threshold = 5
end_threshold = 6
step = 0.1
save_clusters = False # Default False
Inputs
System info
eQual params
See mdance.tools.bts.extended_comparisons for details on the metric.
Radial threshold screening params
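As an illustration of how the screening parameters might combine, the sketch below builds a threshold grid from start_threshold, end_threshold, and step, and thins a dataset with sieve. Two assumptions are made: that end_threshold is exclusive, and that sieve means "keep every sieve-th frame". Both helper names are hypothetical and are not part of screen_equal.py.

```python
import numpy as np


def threshold_grid(start, end, step):
    """Evenly spaced thresholds from `start` up to, but not including, `end`."""
    n_steps = int(round((end - start) / step))
    # Build from an integer count to avoid accumulating float error.
    return [round(start + i * step, 10) for i in range(n_steps)]


def sieve_frames(data, sieve):
    """Keep every `sieve`-th row of `data` (sieve=1 keeps everything)."""
    return np.asarray(data)[::sieve]


thresholds = threshold_grid(5, 6, 0.1)
frames = np.arange(20).reshape(10, 2)   # 10 toy frames, 2 features each
sieved = sieve_frames(frames, sieve=2)  # frames 0, 2, 4, 6, 8
```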
Execution
$ python screen_equal.py
Outputs
a csv with the number of clusters and the cluster populations for each threshold value.
a csv with the Calinski-Harabasz (CH) score and Davies-Bouldin (DB) score (two cluster quality indices) for each threshold value.
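For readers unfamiliar with the Davies-Bouldin index, the sketch below computes it from its standard definition with Euclidean distances: the average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid separation. Lower values indicate tighter, better-separated clusters. This is written from the textbook formula and is not taken from screen_equal.py, which may compute the score differently.

```python
import numpy as np


def davies_bouldin(data, labels):
    """Davies-Bouldin index with Euclidean distances (lower is better)."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    if ids.size < 2:
        raise ValueError("need at least two clusters")
    centroids = np.array([data[labels == k].mean(axis=0) for k in ids])
    # Scatter: mean distance of each cluster's members to its centroid.
    scatter = np.array([
        np.linalg.norm(data[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centroids)
    ])
    # Separation: pairwise distances between centroids.
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # exclude each cluster paired with itself
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    return ratios.max(axis=1).mean()


toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
db = davies_bouldin(toy, toy_labels)
```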
4. eQual Screening Analysis
The clustering screening results will be analyzed using the Davies-Bouldin index (DB). There are two criteria to select the number of clusters:
lowest DB
maximum 2nd derivative of DB.
$PATH/MDANCE/scripts/equal/analysis.ipynb will analyze the eQual screening results.
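The two selection criteria can be sketched with a finite-difference second derivative. The example assumes evenly spaced thresholds and at least three screened values; the function name pick_thresholds and the DB scores are made up for illustration and do not come from analysis.ipynb.

```python
import numpy as np


def pick_thresholds(thresholds, db_scores):
    """Return (threshold with lowest DB, threshold with largest 2nd derivative)."""
    thresholds = np.asarray(thresholds, dtype=float)
    db = np.asarray(db_scores, dtype=float)
    if db.size < 3:
        raise ValueError("need at least three thresholds")
    lowest = thresholds[np.argmin(db)]
    # Central second difference, defined only at interior points.
    second = db[:-2] - 2.0 * db[1:-1] + db[2:]
    sharpest = thresholds[1:-1][np.argmax(second)]
    return lowest, sharpest


# Made-up DB scores for five thresholds
lowest, sharpest = pick_thresholds(
    [5.0, 5.1, 5.2, 5.3, 5.4],
    [1.0, 0.9, 0.5, 0.6, 0.4],
)
```

The two criteria can disagree, as they do here: the lowest DB sits at the end of the range, while the second derivative picks out the sharpest local dip.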
5. Assign labels to the frames
scripts/equal/assign_labels.py will assign a cluster to each frame. The following parameters must be specified in the script:
# System info - EDIT THESE
input_traj_numpy = data.sim_traj_numpy
N_atoms = 50
sieve = 1
# eQUAL params - EDIT THESE
metric = 'MSD' # Default
n_seeds = 3 # Default
check_sim = True # Default
reject_lowd = True # Default
sim_threshold = 16
min_samples = 10 # Default
# extract params- EDIT THESE
threshold = 5.80
n_structures = 11 # Default
sorted_by = 'frame' # Default
open_clusters = None # Default
Inputs - New parameters
Execution
$ python assign_labels.py
Outputs
6. Extract frames for each cluster (Optional)
postprocessing.ipynb will use the indices from the last step to extract the designated frames from the original trajectory for each cluster.
A copy of this notebook can be found in $PATH/MDANCE/scripts/outputs/postprocessing.ipynb.
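The idea of extracting frames per cluster can be sketched with plain NumPy indexing. The example assumes a hypothetical labels array with one cluster label per frame; the actual output format of assign_labels.py and the logic of the notebook may differ.

```python
import numpy as np


def frames_by_cluster(labels):
    """Map each cluster label to the indices of the frames assigned to it."""
    labels = np.asarray(labels)
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}


# Hypothetical labels for a five-frame trajectory with six features per frame
labels = np.array([0, 1, 0, 2, 1])
traj = np.arange(30, dtype=float).reshape(5, 6)

index_map = frames_by_cluster(labels)
# Select the rows (frames) that belong to cluster 0.
cluster0_frames = traj[index_map[0]]
```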
Further Reading
For more information on the eQual algorithm, please refer to the eQual paper.
Please Cite
@article{chen_extended_2025,
title = {Extended {Quality} ({eQual}): {Radial} {Threshold} {Clustering} {Based} on n-ary {Similarity}},
issn = {1549-9596},
url = {https://doi.org/10.1021/acs.jcim.4c02341},
doi = {10.1021/acs.jcim.4c02341},
journal = {Journal of Chemical Information and Modeling},
author = {Chen, Lexin and Smith, Micah and Roe, Daniel R. and Miranda-Quintana, Ramón Alain},
month = may,
year = {2025},
note = {Publisher: American Chemical Society},
}