eQual - Extended Quality Clustering
===================================

.. contents::
   :local:
   :depth: 2

Overview
--------

eQual is a quality clustering algorithm that uses a radial threshold to grow each cluster, maximizing the similarity between the members of a cluster. It is an extension of the Radial Threshold Clustering algorithm (`Daura and Oscar Conchillo-Solé `_). eQual improves on it with new seed selection methods and tie-breaking criteria.

eQual first selects a seed to start the clustering. It then grows the cluster by adding the neighbors that lie within a threshold of the seed. The threshold is applied to the mean-square deviation (MSD) from the seed. The iteration continues until no neighbors are left within the threshold, and the chosen neighbors are then removed from the original dataset. A new iteration begins and selects a medoid and its neighbors from the remaining data. If the user requests multiple medoids, the medoid that proposes the densest and most similar cluster is chosen. The process repeats, and the algorithm terminates when the dataset is empty.

This clustering tutorial applies to datasets from all applications (2D fingerprints, mass spectrometry imaging data, etc.). Molecular Dynamics trajectories are treated differently. Steps that apply only to Molecular Dynamics trajectories are marked as such; all other steps apply to every dataset.

Tutorial
--------

1. Clone the repository
~~~~~~~~~~~~~~~~~~~~~~~

Clone the MDANCE repository if you haven't already.

.. code:: bash

   $ git clone https://github.com/mqcomplab/MDANCE.git
   $ cd MDANCE/scripts/equal

2. Input Preparations
~~~~~~~~~~~~~~~~~~~~~

.. raw:: html

.. raw:: html

Preparation for Molecular Dynamics Trajectory

.. raw:: html

Prepare a valid topology file (e.g. ``.pdb``, ``.prmtop``), a trajectory file (e.g. ``.dcd``, ``.nc``), and the atom selection. This step converts a Molecular Dynamics trajectory to a NumPy ndarray. **Make sure the trajectory is already aligned and/or centered if needed!**

The `Preprocessing Notebook <../examples/preprocessing.html>`__ contains a step-by-step tutorial to prepare the input for NANI. A copy of this notebook can be found in ``$PATH/MDANCE/scripts/inputs/preprocessing.ipynb``.

.. raw:: html

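
As a rough illustration of the array layout produced by the Molecular Dynamics preparation, the sketch below flattens per-frame coordinates into a 2D array with one row per frame. The ``(n_frames, n_atoms * 3)`` layout and the helper name ``flatten_frames`` are assumptions made for illustration, and the coordinates are synthetic; the preprocessing notebook remains the reference procedure.

.. code:: python

   import numpy as np


   def flatten_frames(coords):
       """Flatten (n_frames, n_atoms, 3) coordinates to (n_frames, n_atoms * 3)."""
       coords = np.asarray(coords, dtype=float)
       if coords.ndim != 3 or coords.shape[2] != 3:
           raise ValueError("expected an array of shape (n_frames, n_atoms, 3)")
       n_frames, n_atoms, _ = coords.shape
       return coords.reshape(n_frames, n_atoms * 3)


   # Synthetic stand-in for an aligned trajectory: 4 frames, 50 atoms.
   rng = np.random.default_rng(0)
   fake_traj = rng.normal(size=(4, 50, 3))
   traj_2d = flatten_frames(fake_traj)
   print(traj_2d.shape)

With this layout, each row holds the x, y, z coordinates of the selected atoms for one frame, which is why ``N_atoms`` has to be supplied separately in the later steps.
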
.. raw:: html
.. raw:: html

Preparation for all other datasets (OPTIONAL)

.. raw:: html

This step is **optional**. If you are using a metric other than the default mean-square deviation (MSD), you will need to normalize the dataset. Otherwise, you can skip this step.

`normalize.py `__ will normalize the dataset. The following parameters need to be specified in the script:

::

   # System info - EDIT THESE
   data_file = '../data/2D/blob_disk.csv'
   array = np.genfromtxt(data_file, delimiter=',')
   output_base_name = 'output_base_name'

Inputs
^^^^^^

System info
'''''''''''

| ``data_file`` is your input file with a 2D array.
| ``array`` is the dataset loaded from ``data_file``. This step can be changed according to your file format. However, ``array`` must be array-like with shape (number of samples, number of features).
| ``output_base_name`` is the base name for the output file. The output file will be saved as ``output_base_name.npy``.

.. raw:: html

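
To give a feel for what normalizing the dataset can look like, here is a minimal sketch of per-feature min-max scaling with NumPy. This is one common scheme chosen for illustration; the function name ``min_max_normalize`` is hypothetical and ``normalize.py`` may use a different scheme.

.. code:: python

   import numpy as np


   def min_max_normalize(array):
       """Scale every feature (column) to the [0, 1] range."""
       array = np.asarray(array, dtype=float)
       col_min = array.min(axis=0)
       col_range = array.max(axis=0) - col_min
       # Constant columns have zero range; map them to 0 instead of dividing by 0.
       safe_range = np.where(col_range == 0, 1.0, col_range)
       return (array - col_min) / safe_range


   # Small array in the shape (number of samples, number of features).
   data = np.array([[1.0, 10.0, 5.0],
                    [2.0, 20.0, 5.0],
                    [3.0, 40.0, 5.0]])
   normalized = min_max_normalize(data)
   print(normalized)

The result could then be written with ``np.save`` to obtain a ``.npy`` file like the one described above.
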

3. eQual Screening
~~~~~~~~~~~~~~~~~~

`scripts/equal/screen_equal.py `_ will run eQual clustering over multiple thresholds and report the optimal threshold. For the best result, we recommend screening eQual over a wide range of threshold values. *Depending on the number of samples or features, consider sieving over a wide threshold range. For large datasets, please submit this as a job instead of running it on the command line.*

The following parameters need to be specified in the script:

::

   # System info - EDIT THESE
   input_traj_numpy = data.sim_traj_numpy
   N_atoms = 50
   sieve = 1

   # eQUAL params - EDIT THESE
   metric = 'MSD'                  # Default
   n_seeds = 3
   check_sim = True                # Default
   reject_lowd = True              # Default
   sim_threshold = 16
   min_samples = 10                # Default

   # thresholds params- EDIT THESE
   start_threshold = 5
   end_threshold = 6
   step = 0.1
   save_clusters = False           # Default False

.. _system-info-2:

Inputs
^^^^^^

System info
'''''''''''

| `input_traj_numpy` is the numpy array prepared in step 2; otherwise, it is your loaded dataset.
| `N_atoms` is the number of atoms used in the clustering. **For all non-Molecular Dynamics datasets, this is 1.**
| `sieve` takes every sieve-th frame from the trajectory for analysis.

eQual params
^^^^^^^^^^^^

| `metric` is the metric used to calculate the similarity between frames (see ``mdance.tools.bts.extended_comparisons`` for details).
| `n_seeds` is the number of seeds selected per iteration. If `n_seeds` is greater than 1, multiple clusters will be proposed, and the cluster with the highest density and greatest similarity among its members will be selected. Run time increases with more seeds.
| `check_sim` is a boolean that enables checking the similarity of the seed to the cluster.
| `reject_lowd` is a boolean that enables rejecting low-density clusters. `sim_threshold` needs to be specified.
| `sim_threshold` is the similarity threshold used to reject less compact clusters.
| `min_samples` is the minimum cluster size, used to reject low-density clusters. Default is 10.

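
To make the roles of the threshold, `n_seeds`, and `min_samples` more concrete, the following is a simplified sketch of a greedy radial-threshold loop. It is not the MDANCE implementation: it uses plain pairwise MSD to a single seed instead of n-ary similarity, picks candidate seeds by closeness to the centroid (an assumption), and does not model `check_sim`, `reject_lowd`, or `sim_threshold`. The function names are hypothetical.

.. code:: python

   import numpy as np


   def msd_to_seed(data, seed, n_atoms=1):
       """Mean-square deviation of every row from the seed row."""
       return np.sum((data - data[seed]) ** 2, axis=1) / n_atoms


   def radial_threshold_sketch(data, threshold, n_seeds=1, min_samples=1, n_atoms=1):
       """Greedy radial-threshold clustering; returns a list of index arrays."""
       data = np.asarray(data, dtype=float)
       remaining = np.arange(len(data))
       clusters = []
       while remaining.size > 0:
           subset = data[remaining]
           # Candidate seeds: the points closest to the centroid of what is left.
           dist_to_center = np.sum((subset - subset.mean(axis=0)) ** 2, axis=1)
           candidates = np.argsort(dist_to_center, kind="stable")[:n_seeds]
           best = None
           for seed in candidates:
               msd = msd_to_seed(subset, seed, n_atoms)
               members = np.flatnonzero(msd <= threshold)
               # Prefer the larger proposal; break ties with the tighter one.
               key = (-members.size, msd[members].mean())
               if best is None or key < best[0]:
                   best = (key, members)
           members = best[1]
           if members.size < min_samples:
               break  # in this sketch, leftover points stay unclustered
           clusters.append(remaining[members])
           remaining = np.delete(remaining, members)
       return clusters


   # Two well-separated synthetic blobs of 20 and 15 points.
   rng = np.random.default_rng(0)
   blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
   blob_b = rng.normal(loc=5.0, scale=0.1, size=(15, 2))
   points = np.vstack([blob_a, blob_b])
   clusters = radial_threshold_sketch(points, threshold=2.0, n_seeds=3, min_samples=5)
   print([len(c) for c in clusters])

A small threshold tends to split the data into many tight clusters, while a large one merges distinct groups, which is why a range of threshold values is screened rather than a single value.
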

Radial threshold screening params
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| `start_threshold` is the starting value of the `r_threshold` screening range.
| `end_threshold` is the ending value of the `r_threshold` screening range.
| `step` is the increment of the `r_threshold` screening range.
| `save_clusters` is a boolean that enables saving the cluster dictionary. Default is False.

Execution
^^^^^^^^^

.. code:: bash

   $ python screen_equal.py

Outputs
^^^^^^^

- A CSV file with the number of clusters and the cluster populations for each threshold value.
- A CSV file with the Calinski-Harabasz (CH) score and the Davies-Bouldin (DB) score (two cluster quality indices) for each threshold value.

4. eQual Screening Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The clustering screening results will be analyzed using the Davies-Bouldin index (DB). There are two criteria to select the number of clusters:

1. Lowest DB.
2. Maximum second derivative of DB.

`$PATH/MDANCE/scripts/equal/analysis.ipynb `_ will analyze the eQual screening results.

5. Assign labels to the frames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`scripts/equal/assign_labels.py `_ will assign a cluster to each frame.

The following parameters need to be specified in the script:

::

   # System info - EDIT THESE
   input_traj_numpy = data.sim_traj_numpy
   N_atoms = 50
   sieve = 1

   # eQUAL params - EDIT THESE
   metric = 'MSD'                  # Default
   n_seeds = 3                     # Default
   check_sim = True                # Default
   reject_lowd = True              # Default
   sim_threshold = 16
   min_samples = 10                # Default

   # extract params- EDIT THESE
   threshold = 5.80
   n_structures = 11               # Default
   sorted_by = 'frame'             # Default
   open_clusters = None            # Default

.. _system-info-3:

Inputs - New parameters
^^^^^^^^^^^^^^^^^^^^^^^

| `threshold` is the desired threshold value to use for clustering. If `None`, the best threshold value is read from `param_file`.
| `n_structures` is the number of structures closest to the medoid to extract from each cluster.
| `sorted_by` is the sorting method for the cluster labels, one of {'frame', 'cluster'}.
  Either frames or clusters can be sorted in ascending order. Default is 'frame'.
| `open_clusters` is the cluster dictionary file to open. If `None`, the clustering algorithm will be run.

Execution
^^^^^^^^^

.. code:: bash

   $ python assign_labels.py

Outputs
^^^^^^^

| `best_frames_indices.csv` contains the *n* (`n_structures`) most representative frames for each of the top clusters (`top_num_cluster`).
| `frame_vs_cluster.csv` contains the cluster assignment for each frame.
| `sorted_by="frame"` will sort `frame_vs_cluster.csv` by ascending frame number. `sorted_by="cluster"` will sort it by ascending cluster number.

6. Extract frames for each cluster (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`postprocessing.ipynb <../examples/postprocessing.html>`__ will use the indices from the previous step to extract the designated frames from the original trajectory for each cluster. A copy of this notebook can be found in ``$PATH/MDANCE/scripts/outputs/postprocessing.ipynb``.

Further Reading
~~~~~~~~~~~~~~~

For more information on the eQual algorithm, please refer to the `eQual paper `__.

Please Cite

.. code:: bibtex

   @article{chen_extended_2025,
       title = {Extended {Quality} ({eQual}): {Radial} {Threshold} {Clustering} {Based} on n-ary {Similarity}},
       issn = {1549-9596},
       url = {https://doi.org/10.1021/acs.jcim.4c02341},
       doi = {10.1021/acs.jcim.4c02341},
       journal = {Journal of Chemical Information and Modeling},
       author = {Chen, Lexin and Smith, Micah and Roe, Daniel R. and Miranda-Quintana, Ramón Alain},
       month = may,
       year = {2025},
       note = {Publisher: American Chemical Society},
   }