eQual - Extended Quality Clustering
===================================

.. contents::
   :local:
   :depth: 2

Overview
--------
eQual is a quality clustering algorithm that uses a radial threshold to grow each cluster, maximizing the similarity between the members of a cluster. It is an extension of the Radial Threshold Clustering algorithm (`Daura and Conchillo-Solé <https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.2c01079>`_). eQual improves on it with new seed selection methods and tie-breaking criteria.

eQual first selects a seed to start the clustering. It then grows the cluster by adding the neighbors that lie within a threshold of the seed. The threshold is expressed as a mean-square deviation from the seed. Once no neighbors remain within the threshold, the chosen members are removed from the dataset. A new iteration then begins, selecting a seed and its neighbors from the remaining data. If the user requests multiple seeds, the seed that proposes the densest and most similar cluster is chosen. The process repeats until the dataset is empty.
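
The loop described above can be illustrated with a minimal sketch. This is not the MDANCE implementation: the function name ``radial_threshold_sketch`` is hypothetical, the seed is simply the point closest to the mean of the remaining data (a stand-in for eQual's seed selection methods), and tie-breaking and multiple seeds are left out.

.. code:: python

   import numpy as np

   def radial_threshold_sketch(data, threshold):
       """Toy radial threshold clustering; NOT the MDANCE implementation."""
       data = np.asarray(data, dtype=float)
       remaining = np.arange(len(data))  # indices not yet assigned to a cluster
       clusters = []
       while remaining.size > 0:
           points = data[remaining]
           # Assumed seed choice: the remaining point closest to the mean.
           centroid = points.mean(axis=0)
           seed = np.argmin(((points - centroid) ** 2).mean(axis=1))
           # Mean-square deviation of every remaining point from the seed.
           msd = ((points - points[seed]) ** 2).mean(axis=1)
           members = msd <= threshold
           clusters.append(remaining[members])
           # Remove the clustered points before the next iteration.
           remaining = remaining[~members]
       return clusters

Because the seed always has a deviation of zero from itself, every iteration removes at least one point, so the loop terminates for any non-negative threshold.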

This tutorial applies to datasets from any application (2D fingerprints, mass spectrometry imaging data, etc.). Molecular Dynamics trajectories are treated differently: a step that applies only to Molecular Dynamics trajectories is marked as such. Every other step applies to all datasets.

Tutorial
--------

1. Clone the repository
~~~~~~~~~~~~~~~~~~~~~~~

Clone the MDANCE repository if you haven't already.

.. code:: bash

   $ git clone https://github.com/mqcomplab/MDANCE.git
   $ cd MDANCE/scripts/equal


2. Input Preparations
~~~~~~~~~~~~~~~~~~~~~

.. raw:: html

   <details>

.. raw:: html

   <summary>

Preparation for Molecular Dynamics Trajectory

.. raw:: html

   </summary>

Prepare a valid topology file (e.g. ``.pdb``, ``.prmtop``), trajectory
file (e.g. ``.dcd``, ``.nc``), and the atom selection. This step will
convert a Molecular Dynamics trajectory to a numpy ndarray. **Make sure
the trajectory is already aligned and/or centered if needed!**

The `Preprocessing Notebook <../examples/preprocessing.html>`__
contains a step-by-step tutorial to prepare the input for eQual.

A copy of this notebook can be found in ``$PATH/MDANCE/scripts/inputs/preprocessing.ipynb``.

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Preparation for all other datasets (OPTIONAL)

.. raw:: html

   </summary>

This step is **optional**. If you are using a metric other than the
default mean-square deviation (MSD), you will need to normalize
the dataset. Otherwise, you can skip this step.

`normalize.py <https://github.com/mqcomplab/MDANCE/blob/main/scripts/inputs/normalize.py>`__ will
normalize the dataset. Specify the following parameters in the
script:

::

   # System info - EDIT THESE
   data_file = '../data/2D/blob_disk.csv'
   array = np.genfromtxt(data_file, delimiter=',')
   output_base_name = 'output_base_name'

Inputs
^^^^^^

System info
'''''''''''

| ``data_file`` is your input file containing a 2D array.
| ``array`` is the dataset loaded from ``data_file``. Adapt this step to your file format; however, ``array`` must be array-like with shape (number of samples, number of features).
| ``output_base_name`` is the base name for the output file. The output file will be saved as ``output_base_name.npy``. 
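
As an illustration of what normalizing an array of shape (number of samples, number of features) can look like, here is a minimal sketch that uses column-wise min-max scaling. This is an assumption for illustration only: this page does not state which scheme ``normalize.py`` applies, and ``min_max_normalize`` is a hypothetical helper.

.. code:: python

   import numpy as np

   def min_max_normalize(array):
       """Scale every feature (column) to the range [0, 1]."""
       array = np.asarray(array, dtype=float)
       col_min = array.min(axis=0)
       col_range = array.max(axis=0) - col_min
       # Constant columns have zero range; divide by 1 so they map to 0.
       col_range[col_range == 0] = 1.0
       return (array - col_min) / col_range

   # Three samples, two features; the second feature is constant.
   normalized = min_max_normalize([[1, 10], [3, 10], [5, 10]])
   # np.save('output_base_name.npy', normalized)  # mirrors the output naming above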

.. raw:: html

   </details>

3. eQual Screening
~~~~~~~~~~~~~~~~~~
`scripts/equal/screen_equal.py <https://github.com/mqcomplab/MDANCE/blob/master/scripts/equal/screen_equal.py>`_ runs eQual clustering over multiple thresholds and reports the optimal threshold. For best results, we recommend screening eQual over a wide range of threshold values.
*Depending on the number of samples or features, consider sieving when screening a wide threshold range. For large datasets, submit the screening as a job instead of running it on the command line.*
Specify the following parameters in the script:

::

    # System info - EDIT THESE
    input_traj_numpy = data.sim_traj_numpy
    N_atoms = 50
    sieve = 1

    # eQUAL params - EDIT THESE
    metric = 'MSD'                                                      # Default
    n_seeds = 3
    check_sim = True                                                    # Default
    reject_lowd = True                                                  # Default
    sim_threshold = 16
    min_samples = 10                                                    # Default

    # thresholds params- EDIT THESE
    start_threshold = 5
    end_threshold = 6
    step = 0.1
    save_clusters = False                                                # Default False

.. _system-info-2:

Inputs
^^^^^^
System info
'''''''''''

| `input_traj_numpy` is the numpy array prepared in step 2; otherwise, it is your loaded dataset.
| `N_atoms` is the number of atoms used in the clustering. **For all non-Molecular Dynamics datasets, this is 1.** 
| `sieve` takes every sieve-th frame from the trajectory for analysis. 
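
To make the effect of ``sieve`` concrete, the sketch below assumes that sieving amounts to a strided slice over the frame axis. That is an assumption about the behavior described above, not a statement about how the script implements it.

.. code:: python

   import numpy as np

   # Ten toy "frames" with one feature each; real input has shape
   # (number of frames, number of features).
   frames = np.arange(10).reshape(10, 1)

   sieve = 3
   sieved = frames[::sieve]  # keeps frames 0, 3, 6 and 9

With ``sieve = 1`` every frame is kept.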

eQual params
''''''''''''

| `metric` is the metric used to calculate the similarity between frames (See ``mdance.tools.bts.extended_comparisons`` for details). 
| `n_seeds` is the number of seeds selected per iteration. If `n_seeds` is greater than 1, multiple clusters are proposed, and the densest cluster with the most similar members is selected. Run time increases with more seeds.
| `check_sim` is a boolean to check the similarity of the seed to the cluster.
| `reject_lowd` is a boolean to reject low-density clusters. `sim_threshold` must be specified.
| `sim_threshold` is the similarity threshold used to reject less compact clusters.
| `min_samples` is the minimum cluster size used to reject low-density clusters. Default is 10.

Radial threshold screening params
'''''''''''''''''''''''''''''''''

| `start_threshold` is the starting value of the `r_threshold` screening range.
| `end_threshold` is the ending value of the `r_threshold` screening range.
| `step` is the increment of the `r_threshold` screening range.
| `save_clusters` is a boolean to save the cluster dictionary. Default is False.
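
The three range parameters define a grid of thresholds to screen. The sketch below builds such a grid while avoiding floating-point drift. Whether ``screen_equal.py`` includes ``end_threshold`` itself is not stated here, so the ``include_end`` flag and the ``threshold_grid`` helper are hypothetical.

.. code:: python

   import numpy as np

   def threshold_grid(start, end, step, include_end=False):
       """Return evenly spaced thresholds from start towards end."""
       # Count the steps with rounding so that 5 -> 6 in steps of 0.1
       # gives exactly 10 intervals despite floating-point error.
       n_values = int(round((end - start) / step))
       if include_end:
           n_values += 1
       return np.round(start + step * np.arange(n_values), 10)

   grid = threshold_grid(5, 6, 0.1)  # 5.0, 5.1, ..., 5.9

A finer ``step`` or a wider range increases the number of clustering runs proportionally, which is worth keeping in mind for large datasets.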

Execution
^^^^^^^^^
.. code:: bash

    $ python screen_equal.py

Outputs
^^^^^^^
- a csv with the number of clusters and the cluster populations for each threshold value.
- a csv with the Calinski-Harabasz (CH) score and Davies-Bouldin (DB) score (two cluster quality indices) for each threshold value.

4. eQual Screening Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The clustering screening results will be analyzed using the
Davies-Bouldin index (DB). There are two criteria to select the number
of clusters: 

1. lowest DB
2. maximum 2nd derivative of DB.

`$PATH/MDANCE/scripts/equal/analysis.ipynb <https://github.com/mqcomplab/MDANCE/blob/master/scripts/equal/analysis_db.ipynb>`_ will analyze the eQual screening results. 
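
The two criteria can be sketched with plain NumPy. This is an illustration under stated assumptions, not the notebook's code: ``pick_thresholds`` is a hypothetical helper, the thresholds are assumed to be evenly spaced, the second derivative is approximated by a central finite difference, and the DB values are made up.

.. code:: python

   import numpy as np

   def pick_thresholds(thresholds, db_scores):
       """Return (threshold with lowest DB, threshold with max 2nd derivative)."""
       thresholds = np.asarray(thresholds, dtype=float)
       db_scores = np.asarray(db_scores, dtype=float)
       lowest_db = thresholds[np.argmin(db_scores)]
       # Central second difference db[i-1] - 2*db[i] + db[i+1];
       # it is only defined for interior points, hence the index offset of 1.
       second_diff = db_scores[:-2] - 2 * db_scores[1:-1] + db_scores[2:]
       max_second_derivative = thresholds[1 + np.argmax(second_diff)]
       return lowest_db, max_second_derivative

   # Made-up DB scores for five thresholds.
   thresholds = [5.0, 5.1, 5.2, 5.3, 5.4]
   db_scores = [1.0, 0.5, 0.45, 0.44, 0.3]
   by_lowest, by_curvature = pick_thresholds(thresholds, db_scores)

The two criteria need not agree. In this toy data the lowest DB is at the last threshold, while the sharpest bend in the curve is at the second one.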

5. Assign labels to the frames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`scripts/equal/assign_labels.py <https://github.com/mqcomplab/MDANCE/blob/master/scripts/equal/assign_labels.py>`_ will assign a cluster label to each frame. Specify the following parameters in the script:

::

    # System info - EDIT THESE
    input_traj_numpy = data.sim_traj_numpy
    N_atoms = 50
    sieve = 1

    # eQUAL params - EDIT THESE
    metric = 'MSD'                                                      # Default 
    n_seeds = 3                                                         # Default
    check_sim = True                                                    # Default
    reject_lowd = True                                                  # Default
    sim_threshold = 16
    min_samples = 10                                                    # Default

    # extract params- EDIT THESE
    threshold = 5.80
    n_structures = 11                                                   # Default
    sorted_by = 'frame'                                                 # Default
    open_clusters = None                                                # Default

.. _system-info-3:

Inputs - New parameters
^^^^^^^^^^^^^^^^^^^^^^^

| `threshold` is the desired threshold value to use for clustering. If `None`, it will use the best threshold value by reading `param_file`.
| `n_structures` is the number of structures closest to the medoid to extract from each cluster.
| `sorted_by` is the sorting method for the cluster labels, one of {'frame', 'cluster'}: sort in ascending order by frame or by cluster. Default is 'frame'.
| `open_clusters` is the cluster dictionary file to open. If `None`, it will run the clustering algorithm.

Execution
^^^^^^^^^
.. code:: bash

    python assign_labels.py

Outputs
^^^^^^^

| `best_frames_indices.csv` contains the `n_structures` most representative frames for each of the top clusters (`top_num_cluster`).
| `frame_vs_cluster.csv` contains cluster assignment per frame. 
| `sorted_by="frame"` will sort `frame_vs_cluster.csv` by ascending frame number. `sorted_by="cluster"` will sort by ascending cluster number. 
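
The two sort orders can be pictured with the sketch below. It is an illustration only: ``sort_assignments`` is a hypothetical helper, and breaking ties by frame number within a cluster is an assumption, not documented behavior of ``assign_labels.py``.

.. code:: python

   import numpy as np

   def sort_assignments(frames, clusters, sorted_by="frame"):
       """Return rows of (frame, cluster) in the requested order."""
       frames = np.asarray(frames)
       clusters = np.asarray(clusters)
       if sorted_by == "frame":
           order = np.argsort(frames, kind="stable")
       elif sorted_by == "cluster":
           # lexsort uses the LAST key as the primary key:
           # sort by cluster, then by frame within each cluster.
           order = np.lexsort((frames, clusters))
       else:
           raise ValueError("sorted_by must be 'frame' or 'cluster'")
       return np.column_stack((frames[order], clusters[order]))

   by_frame = sort_assignments([2, 0, 1, 3], [1, 0, 1, 0], sorted_by="frame")
   by_cluster = sort_assignments([2, 0, 1, 3], [1, 0, 1, 0], sorted_by="cluster")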

6. Extract frames for each cluster (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`postprocessing.ipynb <../examples/postprocessing.html>`__
uses the indices from the previous step to extract the designated frames
from the original trajectory for each cluster.

A copy of this notebook can be found in ``$PATH/MDANCE/scripts/outputs/postprocessing.ipynb``.


Further Reading
~~~~~~~~~~~~~~~

For more information on the eQual algorithm, please refer to the `eQual
paper <https://pubs.acs.org/doi/10.1021/acs.jcim.4c02341>`__.

Please Cite

.. code:: bibtex

   @article{chen_extended_2025,
      title = {Extended {Quality} ({eQual}): {Radial} {Threshold} {Clustering} {Based} on n-ary {Similarity}},
      issn = {1549-9596},
      url = {https://doi.org/10.1021/acs.jcim.4c02341},
      doi = {10.1021/acs.jcim.4c02341},
      journal = {Journal of Chemical Information and Modeling},
      author = {Chen, Lexin and Smith, Micah and Roe, Daniel R. and Miranda-Quintana, Ramón Alain},
      month = may,
      year = {2025},
      note = {Publisher: American Chemical Society},
   }