MDANCE: Molecular Dynamics Analysis with N-ary Clustering Ensembles

A transformative framework for analyzing molecular dynamics simulations through advanced clustering algorithms

The Problem

Molecular Dynamics (MD) simulations generate terabytes of conformational data, but extracting meaningful biological insights remains challenging. Traditional clustering methods struggle with:

  • Quadratic complexity - pairwise methods scale as O(N²), and MD datasets are massive.

  • Poor initialization - leading to suboptimal clustering.

  • Pathway ambiguity - difficulty identifying dominant biological pathways.

  • Native structure prediction - accurately identifying biologically relevant states.

  • Pairwise similarity limitations - traditional methods only compare pairs of objects, causing performance bottlenecks.

  • Stochastic variability - lack of reproducibility across clustering runs.

Our Solution

MDANCE introduces a novel n-ary similarity framework that transforms how we analyze MD trajectories. Our algorithms provide:

  • Linear scaling - from O(N²) to O(N) complexity.

  • Deterministic results - reproducible science.

  • Biological relevance - algorithms designed for structural biology.

  • Unprecedented accuracy - validated against experimental structures.

  • Extended similarity techniques - swift identification of high and low-density regions in linear time.
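The linear-scaling claim rests on a simple identity: the mean squared deviation (MSD) over all pairs in a set can be recovered from per-column sums, so no pairwise matrix is needed. The following is a minimal NumPy sketch of that identity, not MDANCE's own implementation; the function names are illustrative.

```python
import numpy as np


def msd_pairwise(X):
    """Mean squared distance over all pairs of rows: O(N^2) reference."""
    n = len(X)
    total = 0.0
    for i in range(n):
        diff = X[i + 1:] - X[i]
        total += np.sum(diff ** 2)
    return total / (n * (n - 1) / 2)


def msd_nary(X):
    """Same quantity from column sums only: one O(N) pass over the frames."""
    n = len(X)
    col_sum = X.sum(axis=0)
    col_sq_sum = (X ** 2).sum(axis=0)
    # sum_{i<j} |x_i - x_j|^2 = n * sum_i |x_i|^2 - |sum_i x_i|^2
    pair_total = np.sum(n * col_sq_sum - col_sum ** 2)
    return pair_total / (n * (n - 1) / 2)


rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 30))  # 200 frames, 10 atoms x 3 coordinates
print(np.isclose(msd_pairwise(frames), msd_nary(frames)))
```

Because the n-ary form only needs running sums, it can also be updated incrementally as frames are added or removed.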

Key Features

🪄 NANI - Smart k-means Initialization

Breakthrough: Deterministic centroid initialization using n-ary comparisons to identify high-density regions and select diverse initial conformations.

Key Advantages:

  • Solves the seed selection challenge in k-means clustering.

  • Creates compact, well-separated clusters that accurately find metastable states.

  • Provides consistent cluster populations across replicates.

  • Dramatically reduces runtime: clusters 1.5 million HP35 frames in ~40 minutes.
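To make the idea of deterministic seeding concrete, here is a minimal sketch under stated assumptions. It ranks frames by complementary MSD (the MSD of the set with that frame left out), keeps the top fraction, and spreads the seeds with a plain max-min (farthest-point) rule. The max-min step is a stand-in chosen for brevity, and `complementary_msd`, `deterministic_seeds`, and the `percentage` default are illustrative names and values, not the MDANCE API.

```python
import numpy as np


def complementary_msd(X):
    """MSD of the set with each frame left out, for all frames in O(N)."""
    n = len(X)
    col_sum = X.sum(axis=0)
    col_sq_sum = (X ** 2).sum(axis=0)
    # Subtract each frame's own contribution from the global sums.
    rest_sum = col_sum - X
    rest_sq_sum = col_sq_sum - X ** 2
    m = n - 1
    pair_total = np.sum(m * rest_sq_sum - rest_sum ** 2, axis=1)
    return pair_total / (m * (m - 1) / 2)


def deterministic_seeds(X, k, percentage=10):
    """Pick k seed frames with no random number generator involved."""
    comp = complementary_msd(X)
    n_top = max(k, int(len(X) * percentage / 100))
    # Leaving out a frame near the bulk of the data removes only short
    # distances, so the remaining set has a higher MSD.
    central = np.argsort(-comp, kind="stable")[:n_top]
    chosen = [central[0]]
    min_d2 = np.sum((X[central] - X[chosen[0]]) ** 2, axis=1)
    while len(chosen) < k:
        # Max-min rule: take the candidate farthest from all chosen seeds.
        nxt = central[int(np.argmax(min_d2))]
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, np.sum((X[central] - X[nxt]) ** 2, axis=1))
    return X[chosen]


rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 6))
seeds = deterministic_seeds(frames, k=4)
```

Running the function twice on the same array returns the same seeds, which is what makes downstream cluster populations reproducible.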

🧩 HELM - Scalable Hierarchical Clustering

Breakthrough: Combines k-means efficiency with hierarchical flexibility using n-ary difference functions.

Performance:

  • Retains k-means computational efficiency while enabling arbitrary partitions.

  • Successfully analyzes simulations with over 1.5 million frames.

  • Processes 1.5 million frames in ~34 minutes, a task that takes traditional HAC about 29 hours.

  • Builds hierarchy without expensive pairwise distance matrices.
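The following sketch illustrates how a hierarchy can be built on top of an over-partitioned k-means result without a frame-by-frame distance matrix. Each cluster is reduced to three summary statistics, and clusters are merged greedily. The Ward-style merge cost used here is a stand-in for HELM's n-ary difference functions, and all function names are hypothetical.

```python
import numpy as np


def cluster_stats(X, labels):
    """Summarize each cluster as (count, column sums, total squared norm)."""
    stats = {}
    for lab in np.unique(labels):
        members = X[labels == lab]
        stats[int(lab)] = (len(members), members.sum(axis=0),
                           float(np.sum(members ** 2)))
    return stats


def sse(stat):
    """Sum of squared deviations from the centroid, from the summary alone."""
    n, s, sq = stat
    return sq - float(s @ s) / n


def merge(a, b):
    """Summaries are additive, so merging never revisits the frames."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def merge_down(stats, n_final):
    """Greedily merge the pair whose union adds the least SSE."""
    stats = dict(stats)
    history = []
    while len(stats) > n_final:
        keys = sorted(stats)
        best = None
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                cost = (sse(merge(stats[a], stats[b]))
                        - sse(stats[a]) - sse(stats[b]))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        stats[a] = merge(stats[a], stats.pop(b))
        history.append((a, b))
    return stats, history


rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
X = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])
# Over-partitioned start: each blob is split into two arbitrary halves.
# In practice this partition would come from a k-means run.
labels = np.repeat(np.arange(6), 20)
final_stats, history = merge_down(cluster_stats(X, labels), n_final=3)
```

The pair search is quadratic in the number of clusters, not in the number of frames, so the cost stays small as long as the starting partition has far fewer clusters than frames.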

🌳 DIVINE - Deterministic Divisive Clustering

Breakthrough: Top-down hierarchical clustering framework that recursively splits clusters based on n-ary similarity principles.

Key Features:

  • Completely avoids O(N²) pairwise distance matrices.

  • Deterministic anchor initialization with NANI.

  • Multiple cluster selection criteria, including a weighted variance metric.

  • Single-pass design enables efficient resolution exploration.

  • Matches or exceeds bisecting k-means quality with reduced runtime.
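A top-down scheme of this kind can be sketched in a few lines. The version below makes several simplifying assumptions: "weighted variance" is taken to mean cluster size times variance (the sum of squared deviations), each split is a 2-means run, and the two anchors are chosen by a farthest-point rule rather than by NANI. The function names are illustrative only.

```python
import numpy as np


def two_means(X, n_iter=20):
    """Deterministic 2-means split; returns a boolean mask or None."""
    center = X.mean(axis=0)
    # Anchors: the frame farthest from the mean, then the frame farthest from it.
    a = X[np.argmax(np.sum((X - center) ** 2, axis=1))]
    b = X[np.argmax(np.sum((X - a) ** 2, axis=1))]
    mask = None
    for _ in range(n_iter):
        new_mask = np.sum((X - b) ** 2, axis=1) < np.sum((X - a) ** 2, axis=1)
        if new_mask.all() or not new_mask.any():
            break  # degenerate split, keep the last valid one
        if mask is not None and np.array_equal(mask, new_mask):
            break  # converged
        mask = new_mask
        a, b = X[~mask].mean(axis=0), X[mask].mean(axis=0)
    return mask


def divisive(X, n_clusters):
    """Split the highest-scoring cluster until n_clusters is reached.

    Returns the labels at every level, so one pass yields all resolutions.
    """
    labels = np.zeros(len(X), dtype=int)
    levels = [labels.copy()]
    for new_label in range(1, n_clusters):
        scores = []
        for lab in range(new_label):
            members = X[labels == lab]
            # Size x variance, i.e. sum of squared deviations from the mean.
            scores.append(np.sum((members - members.mean(axis=0)) ** 2))
        target = int(np.argmax(scores))
        idx = np.flatnonzero(labels == target)
        mask = two_means(X[idx])
        if mask is None:
            break
        labels[idx[mask]] = new_label
        levels.append(labels.copy())
    return levels


X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0],
              [11.0, 0.0], [100.0, 0.0], [101.0, 0.0]])
levels = divisive(X, n_clusters=3)
```

`levels[k - 1]` holds the partition into `k` clusters, so different resolutions can be compared without rerunning the clustering.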

🌿 mdBIRCH - Online Clustering for MD Data

Innovation: Adapts the BIRCH CF-tree to molecular dynamics data with RMSD-calibrated merge tests.

Key Capabilities:

  • Online clustering that processes frames as they arrive.

  • Merge test calibrated directly to RMSD for physical interpretability.

  • Completely avoids pairwise distance matrices.

  • Scales near-linearly with the number of frames.

  • Two practical protocols: RMSD-anchored runs and blind sweep analysis.

  • Processes hundreds of thousands of frames on a single CPU core in seconds.
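The core of a BIRCH-style online method is the clustering feature (CF): a count, a linear sum, and a sum of squared norms that can absorb a new frame in constant time. The sketch below uses a flat list of CFs instead of a CF-tree and a radius threshold as the merge test, so it illustrates the idea and is not mdBIRCH itself. The class and function names are hypothetical. For frames stored as flattened, pre-aligned Cartesian coordinates, dividing the radius by the square root of the atom count gives a per-atom RMSD-like value, which is the assumption behind reading the threshold in RMSD terms.

```python
import numpy as np


class ClusteringFeature:
    """BIRCH-style summary: count, linear sum, sum of squared norms."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = np.zeros(dim)
        self.sq_sum = 0.0

    def radius_if_added(self, x):
        """Root-mean-square distance to the centroid if x were absorbed."""
        n = self.n + 1
        ls = self.linear_sum + x
        ss = self.sq_sum + float(x @ x)
        return np.sqrt(max(ss / n - float(ls @ ls) / n ** 2, 0.0))

    def add(self, x):
        self.n += 1
        self.linear_sum += x
        self.sq_sum += float(x @ x)

    @property
    def centroid(self):
        return self.linear_sum / self.n


def online_cluster(frames, threshold):
    """Process frames one at a time: absorb into the nearest CF or open a new one."""
    features, labels = [], []
    for x in frames:
        if features:
            d2 = [np.sum((cf.centroid - x) ** 2) for cf in features]
            j = int(np.argmin(d2))
            if features[j].radius_if_added(x) <= threshold:
                features[j].add(x)
                labels.append(j)
                continue
        cf = ClusteringFeature(len(x))
        cf.add(x)
        features.append(cf)
        labels.append(len(features) - 1)
    return np.array(labels), features


frames = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0],
                   [10.0, 10.1], [0.0, 0.1]])
labels, features = online_cluster(frames, threshold=1.0)
```

Each frame is seen once and only the summaries are kept in memory, which is what allows this style of clustering to run on streaming data.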

🔍 SHINE - Pathway Analysis

Transformative: Hierarchical clustering that identifies dominant biological pathways from enhanced sampling data.

Key Advantages:

  • Streamlines analysis of pathway ensembles from multiple MD simulations.

  • Integrates n-ary similarity with cheminformatics-inspired tools.

  • Identifies the most representative pathway within each pathway class.

  • Provides insight into dominant biomolecular transformation mechanisms.

  • Lower computational cost than Fréchet distance approaches.

  • Successfully applied to alanine dipeptide and adenylate kinase systems.

🎯 eQual - O(N) Clustering

Innovation: Transforms O(N²) Radial Threshold Clustering (RTC) into an O(N) algorithm with novel seed selection and tie-breaking.

Key Features:

  • Uses k-means++ for efficient seed selection.

  • Employs extended similarity indices for deterministic results.

  • Eliminates memory-intensive pairwise RMSD matrices.

  • Produces compact and well-separated clusters matching RTC quality.

📊 CADENCE - Density-Based Clustering

Novelty: Bridges the gap between efficient k-means and robust density-based clustering using the n-ary similarity framework.

Key Advantages:

  • Swiftly pinpoints high and low-density regions in linear O(N) time.

  • Enables focused exploration of rare events.

  • Identifies the most representative conformations efficiently.

  • Overcomes limitations of pairwise similarity operations.

🏆 PRIME - Native Structure Prediction

Game Changer: Predicts native protein structures from simulation data with unprecedented accuracy.

Algorithm Comparison

| Algorithm | Complexity | Type                  | Key Feature                   | Best Use Case             |
|-----------|------------|-----------------------|-------------------------------|---------------------------|
| NANI      | O(N)       | Initialization        | Deterministic centroids       | k-means improvement       |
| HELM      | O(N)       | Hybrid hierarchical   | k-means + hierarchical fusion | Large-scale analysis      |
| DIVINE    | O(N)       | Divisive hierarchical | Top-down splitting            | Multi-resolution analysis |
| mdBIRCH   | O(N)       | Online clustering     | Streaming data processing     | Large-scale trajectories  |
| SHINE     | O(N)       | Hierarchical          | Pathway analysis              | Enhanced sampling         |
| eQual     | O(N)       | Flat clustering       | Linear RTC replacement        | General purpose           |
| CADENCE   | O(N)       | Density-based         | n-ary density estimation      | Rare event detection      |
| PRIME     | O(N)       | Post-processing       | Native structure prediction   | Structure validation      |

Quick Start

Installation

pip install mdance

Basic Usage

import mdance
import numpy as np

# Load your MD trajectory data
data = np.load('trajectory.npy')

# Use NANI for optimal clustering initialization
from mdance.cluster.nani import KmeansNANI
nani = KmeansNANI(data, n_clusters=5, metric='MSD')
optimal_centroids = nani.initiate_kmeans()

# Cluster with standard k-means
from sklearn.cluster import KMeans
kmeans = KMeans(5, init=optimal_centroids[:5], n_init=1)
labels = kmeans.fit_predict(data)
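Once `labels` is available, a common next step is to check cluster populations and pull one representative frame per cluster. The helper below is a hypothetical NumPy-only sketch and is not part of the `mdance` package. It takes the frame nearest each cluster mean as the representative and runs on a small made-up array so that it is self-contained.

```python
import numpy as np


def summarize_clusters(data, labels):
    """Return {label: population} and {label: index of frame nearest the cluster mean}."""
    populations, representatives = {}, {}
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        center = data[idx].mean(axis=0)
        d2 = np.sum((data[idx] - center) ** 2, axis=1)
        populations[int(lab)] = len(idx)
        representatives[int(lab)] = int(idx[np.argmin(d2)])
    return populations, representatives


# Toy stand-ins for the trajectory array and the k-means labels.
toy_data = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                     [10.0, 10.0], [10.0, 12.0]])
toy_labels = np.array([0, 0, 0, 1, 1])
populations, representatives = summarize_clusters(toy_data, toy_labels)
```

With the Quick Start variables, the call would be `summarize_clusters(data, labels)`; the returned indices can then be used to extract the matching frames from the original trajectory.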

Tutorials

Publications

Our methods are backed by peer-reviewed research:

Impact

MDANCE is enabling researchers to:

  • Accelerate drug discovery by rapidly identifying biologically relevant conformations.

  • Understand disease mechanisms through precise pathway analysis.

  • Validate computational models against experimental structures.

  • Scale analyses to massive simulation datasets.

Contributing

We welcome collaborations and contributions! Whether you’re a:

  • Computational biologist with novel analysis needs.

  • Method developer interested in extending our framework.

  • Structural biologist with challenging datasets.

Get involved:

  • Open an issue for bug reports or feature requests.

  • Submit a pull request for improvements.

  • Reach out to discuss research collaborations.

Funding

This research was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM150620.


MDANCE: Transforming how we understand molecular dynamics through innovative n-ary clustering frameworks