MDANCE: Molecular Dynamics Analysis with *N*-ary Clustering Ensembles
=====================================================================

.. image:: https://raw.githubusercontent.com/mqcomplab/MDANCE/main/docs/_static/mdance.png
   :width: 300
   :height: auto
   :align: center

A transformative framework for analyzing molecular dynamics simulations through advanced clustering algorithms

.. contents:: Table of Contents
   :local:
   :depth: 2

The Problem
-----------

Molecular Dynamics (MD) simulations generate terabytes of conformational data, but extracting meaningful biological insights remains challenging. Traditional clustering methods struggle with:

- **Exponential complexity** - MD datasets are massive.
- **Poor initialization** - leading to suboptimal clustering.
- **Pathway ambiguity** - difficulty identifying dominant biological pathways.
- **Native structure prediction** - accurately identifying biologically relevant states.
- **Pairwise similarity limitations** - traditional methods only compare pairs of objects, causing performance bottlenecks.
- **Stochastic variability** - lack of reproducibility across clustering runs.

Our Solution
------------

MDANCE introduces a novel *n*-ary similarity framework that transforms how we analyze MD trajectories. Our algorithms provide:

- **Linear scaling** - from O(*N²*) to O(*N*) complexity.
- **Deterministic results** - reproducible science.
- **Biological relevance** - algorithms designed for structural biology.
- **Unprecedented accuracy** - validated against experimental structures.
- **Extended similarity techniques** - swift identification of high and low-density regions in linear time.

Key Features
------------

🪄 NANI - Smart *k*-means Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Breakthrough**: Deterministic centroid initialization using *n*-ary comparisons to identify high-density regions and select diverse initial conformations.

**Key Advantages**:

- Solves the seed selection challenge in *k*-means clustering.
- Creates compact, well-separated clusters that accurately find metastable states.
- Provides consistent cluster populations across replicates.
- Dramatically reduces runtime: **clusters 1.5 million HP35 frames in ~40 minutes**.

🧩 HELM - Scalable Hierarchical Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Breakthrough**: Combines *k*-means efficiency with hierarchical flexibility using *n*-ary difference functions.

**Performance**:

- Retains *k*-means computational efficiency while enabling arbitrary partitions.
- Successfully analyzes simulations with over 1.5 million frames.
- **Achieves in ~34 minutes what traditional HAC requires 29 hours for 1.5 million frames.**
- Builds hierarchy without expensive pairwise distance matrices.

🌳 DIVINE - Deterministic Divisive Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Breakthrough**: Top-down hierarchical clustering framework that recursively splits clusters based on *n*-ary similarity principles.

**Key Features**:

- Completely avoids O(*N²*) pairwise distance matrices.
- Deterministic anchor initialization with NANI.
- Multiple cluster selection criteria including weighted variance metric.
- Single-pass design enables efficient resolution exploration.
- Matches or exceeds bisecting *k*-means quality with reduced runtime.

🌿 mdBIRCH - Online Clustering for MD Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Innovation**: Adapts BIRCH CF-tree to molecular dynamics data with RMSD-calibrated merge tests.

**Key Capabilities**:

- Online clustering that processes frames as they arrive.
- Merge test calibrated directly to RMSD for physical interpretability.
- Completely avoids pairwise distance matrices.
- Scales near-linearly with number of frames.
- Two practical protocols: RMSD-anchored runs and blind sweep analysis.
- Processes hundreds of thousands of frames on a single CPU core in seconds.

🔍 SHINE - Pathway Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Transformative**: Hierarchical clustering that identifies dominant biological pathways from enhanced sampling data.

**Key Advantages**:

- Streamlines analysis of pathway ensembles from multiple MD simulations.
- Integrates *n*-ary similarity with cheminformatics-inspired tools.
- Identifies most representative pathway within each pathway class.
- Provides insight into dominant biomolecular transformation mechanisms.
- Lower computational cost than Fréchet distance approaches.
- Successfully applied to alanine dipeptide and adenylate kinase systems.

🎯 eQual - O(*N*) Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Innovation**: Transforms O(*N²*) Radial Threshold Clustering into O(*N*) algorithm with novel seed selection and tie-breaking.

**Key Features**:

- Uses *k*-means++ for efficient seed selection.
- Employs extended similarity indices for deterministic results.
- Eliminates memory-intensive pairwise RMSD matrices.
- Produces compact and well-separated clusters matching RTC quality.

📊 CADENCE - Density-Based Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Novelty**: Bridges the gap between efficient *k*-means and robust density-based clustering using *n*-ary similarity framework.

**Key Advantages**:

- Swiftly pinpoints high and low-density regions in linear O(*N*) time.
- Enables focused exploration of rare events.
- Identifies most representative conformations efficiently.
- Overcomes limitations of pairwise similarity operations.

🏆 PRIME - Native Structure Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Game Changer**: Predicts native protein structures from simulation data with unprecedented accuracy. 

.. table:: PRIME Validation
   :widths: 60 40

   ============================================  ==================================
   **Scientific Validation**: PRIME (Protein     .. image:: https://raw.githubusercontent.com/mqcomplab/MDANCE/main/docs/img/2k2e.png
      Retrieval via Integrative Molecular           :width: 300
      Ensembles) perfectly mapped all structural    :alt: 2k2e
      motifs in benchmark studies and consistently  
      identified native structures within 2Å RMSD   
      of experimental data.                        *Superposition of native structure using PRIME (yellow) and experimental native structures (blue) of 2k2e.*
   ============================================  ==================================

Algorithm Comparison
~~~~~~~~~~~~~~~~~~~~

.. list-table:: Algorithm Comparison
   :widths: 15 15 20 25 25
   :header-rows: 1

   * - Algorithm
     - Complexity
     - Type
     - Key Feature
     - Best Use Case
   * - NANI
     - O(*N*)
     - Initialization
     - Deterministic centroids
     - *k*-means improvement
   * - HELM
     - O(*N*)
     - Hybrid hierarchical
     - *k*-means + hierarchical fusion
     - Large-scale analysis
   * - DIVINE
     - O(*N*)
     - Divisive hierarchical
     - Top-down splitting
     - Multi-resolution analysis
   * - mdBIRCH
     - O(*N*)
     - Online clustering
     - Streaming data processing
     - Large-scale trajectories
   * - SHINE
     - O(*N*)
     - Hierarchical
     - Pathway analysis
     - Enhanced sampling
   * - eQual
     - O(*N*)
     - Flat clustering
     - Linear RTC replacement
     - General purpose
   * - CADENCE
     - O(*N*)
     - Density-based
     - *n*-ary density estimation
     - Rare event detection
   * - PRIME
     - O(*N*)
     - Post-processing
     - Native structure prediction
     - Structure validation

Quick Start
-----------

Installation
~~~~~~~~~~~~

.. code-block:: bash

   pip install mdance

Basic Usage
~~~~~~~~~~~

.. code-block:: python

   import mdance
   import numpy as np

   # Load your MD trajectory data
   data = np.load('trajectory.npy')

   # Use NANI for optimal clustering initialization
   from mdance.cluster.nani import KmeansNANI
   nani = KmeansNANI(data, n_clusters=5, metric='MSD')
   optimal_centroids = nani.initiate_kmeans()

   # Cluster with standard *k*-means
   from sklearn.cluster import KMeans
   kmeans = KMeans(5, init=optimal_centroids[:5], n_init=1)
   labels = kmeans.fit_predict(data)

Tutorials
~~~~~~~~~

- `NANI Tutorial <https://mdance.readthedocs.io/en/latest/tutorials/nani.html>`_ - Smart *k*-means initialization.
- `HELM Tutorial <https://mdance.readthedocs.io/en/latest/tutorials/helm.html>`_ - Scalable hierarchical clustering.
- `DIVINE Scripts <https://github.com/mqcomplab/MDANCE/tree/main/scripts/divine>`_ - Deterministic divisive clustering. 1-``run_divine.py``, 2-``analysis_db.ipynb``, 3-``assign_labels.py``.
- `mdBIRCH Script <https://github.com/mqcomplab/MDANCE/blob/main/scripts/mdbirch/run_mdbirch.py>`_ - Online clustering for streaming MD data.
- `SHINE Script <https://github.com/mqcomplab/MDANCE/blob/main/scripts/shine/run_shine.py>`_ - Pathway analysis.
- `eQual Tutorial <https://mdance.readthedocs.io/en/latest/tutorials/equal.html>`_ - Linear-time clustering.
- CADENCE Tutorial - Density-based clustering (to be added).
- `PRIME Tutorial <https://mdance.readthedocs.io/en/latest/tutorials/prime.html>`_ - Native structure retrieval.

Publications
------------

Our methods are backed by peer-reviewed research:

- **NANI**: `J. Chem. Theory Comput. 2024 <https://pubs.acs.org/doi/10.1021/acs.jctc.4c00308>`_
- **HELM**: `J. Chem. Inf. Model. 2025 <https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c01287>`_
- **DIVINE**: `BioRxiv 2025 <https://www.biorxiv.org/content/10.1101/2025.06.20.660828v1>`_
- **mdBIRCH**: `BioRxiv 2025 <https://www.biorxiv.org/content/10.1101/2025.11.05.686879v1.abstract>`_
- **SHINE**: `J. Chem. Inf. Model. 2025 <https://pubs.acs.org/doi/abs/10.1021/acs.jcim.5c00240>`_
- **eQual**: `J. Chem. Inf. Model. 2025 <https://pubs.acs.org/doi/abs/10.1021/acs.jcim.4c02341>`_
- **CADENCE**: `J. Chem. Inf. Model. 2025 <https://pubs.acs.org/doi/10.1021/acs.jcim.5c00392>`_
- **PRIME**: `J. Chem. Theory Comput. 2024 <https://pubs.acs.org/doi/10.1021/acs.jctc.4c00362>`_

Impact
------

MDANCE is enabling researchers to:

- Accelerate drug discovery by rapidly identifying biologically relevant conformations.
- Understand disease mechanisms through precise pathway analysis.
- Validate computational models against experimental structures.
- Scale analyses to massive simulation datasets.

Contributing
------------

We welcome collaborations and contributions! Whether you're a:

- Computational biologist with novel analysis needs.
- Method developer interested in extending our framework.
- Structural biologist with challenging datasets.

Get involved:

- Open an issue for bug reports or feature requests.
- Submit a pull request for improvements.
- Reach out to discuss research collaborations.

Funding
-------

This research was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM150620.

----

*MDANCE: Transforming how we understand molecular dynamics through innovative n-ary clustering frameworks*