PRIME Tutorial

Overview

This clustering tutorial is meant for datasets Molecular Dynamics Trajectory. PRIME assumes a MD trajectory that has a well-sampled ensemble of conformations. The PRIME algorithm predicts the native structure of a protein from simulation or clustering data. These methods perfectly mapped all the structural motifs in the studied systems and required unprecedented linear scaling.

Tutorial

The following tutorial will guide you through the process of determining the native structure of a biomolecule using the PRIME algorithm. If you do not have clustered data. Please refer to other clustering algorithms such as NANI to cluster your data, follow all steps.

1. Clone the MDANCE Repository

First things first, clone the MDANCE repository if you haven’t already.

$ git clone https://github.com/mqcomplab/MDANCE.git
$ cd MDANCE/scripts/prime

2. Cluster Normalization

normalize.py With already clustered data, this script will normalize the trajectory data between \([0,1]\) using the Min-Max Normalization.

# System info - EDIT THESE
input_top = '../../examples/md/aligned_tau.pdb'
unnormed_cluster_dir = '../outputs/labels_*'
output_dir = 'normed_clusters'
output_base_name = 'normed_clusttraj'
atomSelection = 'resid 3 to 12 and name N CA C O H'
n_clusters = 6

Inputs

System info
input_top is the topology file used in the clustering.
unnormed_cluster_dir is the directory where the clustering files are located from step 3.
output_dir is the directory where the normalized clustering files will be saved.
output_base_name is the base name for the output files.
atomSelection is the atom selection used in the clustering.
n_clusters is the number of clusters used in the PRIME. If number less than total number of cluster, it will take top n number of clusters.

Execution

Make sure your pwd is $PATH/MDANCE/scripts/prime.

$ python normalize.py

Outputs

normed_clusttraj.c*.npy files, normalized clustering files.
normed_data.npy, appended all normed files together.

3. Similarity Calculations

prime_sim generates a similarity dictionary from running PRIME.

-h Help with the argument options.
-m Methods, {pairwise, union, medoid, outlier} (required).
-n Number of clusters (required).
-i Similarity index, {RR or SM} (required).
-t Fraction of outliers to trim in decimals (default is None).
-w Weighing clusters by frames it contains (default is True).
-d Directory where the normed_clusttraj.c*.npy files are located (required)
-s Location where summary file is located with population of each cluster (required)

Execution

Make sure your pwd is $PATH/MDANCE/scripts/prime.

$ prime_sim -m union -n 6 -i SM -t 0.1  -d normed_clusters -s ../nani/outputs/summary_6.csv

To generate a similarity dictionary using data in normed_clusters (make sure you are in the prime directory) using the union method (2.2 in Fig 2) and Sokal Michener index. In addition, 10% of the outliers were trimmed.

Outputs

w_union_SM_t10.txt file with the similarity dictionary.
The result is a dictionary organized as followes:
{
    "frame_0": [
        0.7,    # cluster 1 similarity.
        0.9,    # cluster 2 similarity.
        ...,
        0.8     # average similarity of all above similarities.
    ]
}

4. Representative Frames

prime_rep will determine the native structure of the protein using the similarity dictionary generated in step 5.

-h for help with the argument options.
-m methods (for one method, None for all methods).
-s folder to access for w_union_SM_t10.txt file.
-i similarity index (required)
-t Fraction of outliers to trim in decimals (default is None).
-d directory where the normed_clusttraj.c* files are located (required if method is None)

Execution

Make sure your pwd is $PATH/MDANCE/scripts/prime.

$ prime_rep -m union -s outputs -d normed_clusters -t 0.1 -i SM

Outputs

w_rep_SM_t10_union.txt file with the representative frames index.

Further Reading

For more information on the PRIME algorithm, please refer to the PRIME paper.

Please Cite

@article{chen_protein_2024,
    title = {Protein Retrieval via Integrative Molecular Ensembles (PRIME) through Extended Similarity Indices},
    issn = {1549-9618},
    url = {https://doi.org/10.1021/acs.jctc.4c00362},
    doi = {10.1021/acs.jctc.4c00362},
    journal = {Journal of Chemical Theory and Computation},
    author = {Chen, Lexin and Mondal, Arup and Perez, Alberto and Miranda-Quintana, Ramón Alain},
    month = jul,
    year = {2024},
    note = {Publisher: American Chemical Society},
}
Alternative text

Fig 2. Six techniques of protein refinement. Blue is top cluster.