mdance.cluster

mdance.cluster.nani

class mdance.cluster.nani.KmeansNANI(data, n_clusters, metric, N_atoms, init_type='strat_all', **kwargs)[source]

Bases: object

k-means NANI clustering algorithm (N-Ary Natural Initialization).

Valid values for init_type (k is the number of clusters):

strat_all: A number of bins is computed from the specified percentage of the data. Stratified sampling is then applied, and the first k points from the stratified data are selected as the initial centers.
strat_reduced: Identifies high-density regions using complementary similarity, selecting a specified percentage of points. Applies stratified sampling to this subset using a number of bins based on the subset size, and selects the first k points as initial centers.
comp_sim: Identifies high-density regions using complementary similarity, selecting a specified percentage of the data. From this subset, diversity selection (with comp_sim as the sampling method) is used to choose the first k points as the initial centers.
div_select: Applies diversity selection (using comp_sim as the sampling method) on the specified percentage of points. The first k points are the initial centers.
quota: Uses quota sampling to select initial centers based on complementary similarity values divided into bins.
k-means++: Selects the initial centers based on the greedy k-means++ algorithm.
random: Selects the initial centers randomly.
vanilla_kmeans++: Selects the initial centers based on the vanilla k-means++ algorithm.

Parameters:

data (array-like of shape (n_samples, n_features)) – A feature array.
n_clusters (int) – Number of clusters.
metric (str) – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
init_type (str, default='strat_all') – Type of initiator selection for initiating k-means. It must be one of the options listed above.
percentage (int, default=10) – Percentage of the dataset to be used for the initial selection of the initial centers. (**kwargs)

labels

An array of the labels of each point.

Type:: array-like of shape (n_samples,)

centers

An array of the cluster centers.

Type:: array-like of shape (n_clusters, n_features)

n_iter

Number of iterations until convergence.

Type:: int

cluster_dict

Dictionary of the clusters and their corresponding indices.

Type:: dict

compute_scores(labels)[source]

Computes the Davies-Bouldin and Calinski-Harabasz scores.

Parameters:: labels (array-like of shape (n_samples,)) – Cluster labels.
Returns:: Davies-Bouldin and Calinski-Harabasz scores.
Return type:: tuple

create_cluster_dict(labels)[source]

Creates a dictionary with the labels as keys and the indices of the data as values.

Parameters:: labels (array-like of shape (n_samples,)) – Cluster labels.
Returns:: Dictionary with the labels as keys and the indices of the data as values.
Return type:: dict
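
As an illustration of the mapping described above, here is a minimal standard-library sketch. The function name labels_to_cluster_dict is hypothetical and is not part of mdance; it only shows the label-to-indices structure.

```python
from collections import defaultdict


def labels_to_cluster_dict(labels):
    """Map each cluster label to the list of sample indices that carry it.

    Hypothetical helper for illustration; not the mdance implementation.
    """
    cluster_dict = defaultdict(list)
    for index, label in enumerate(labels):
        cluster_dict[label].append(index)
    return dict(cluster_dict)


labels = [0, 1, 0, 2, 1]
cluster_dict = labels_to_cluster_dict(labels)
print(cluster_dict)  # {0: [0, 2], 1: [1, 4], 2: [3]}
```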

execute_kmeans_all()[source]

Function to complete all steps of NANI for all different init_type options.

Returns:: Labels, centers and number of iterations.
Return type:: tuple

initiate_kmeans(**kwargs)[source]

Initializes the k-means algorithm with the selected initiators.

Raises:: ValueError – If the number of initiators is less than the number of clusters.
Returns:: The initial centers for k-means of shape (n_clusters, n_features).
Return type:: numpy.ndarray

kmeans_clustering(initiators)[source]

Executes the k-means algorithm with the selected initiators.

Parameters:: initiators ({numpy.ndarray, 'k-means++', 'random'}) – Method for selecting initial centers. k-means++ selects initial centers in a smart way to speed up convergence. random selects initial centers randomly. numpy.ndarray selects initial centers based on the input array.
Returns:: Labels, centers and number of iterations.
Return type:: tuple
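
To make the (labels, centers, number of iterations) return value concrete, the following is a minimal sketch of k-means started from an explicit array of initiators. It is a plain Lloyd iteration written with the standard library only; the function name and the max_iter parameter are assumptions for illustration, and it is not the code mdance runs internally.

```python
def kmeans_from_initiators(data, initiators, max_iter=100):
    """Plain Lloyd iteration starting from the given initial centers.

    Hypothetical sketch; returns (labels, centers, n_iter).
    """
    centers = [list(center) for center in initiators]
    labels = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(
                range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(point, centers[k])),
            )
            for point in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(data, labels) if label == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers, n_iter


data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
labels, centers, n_iter = kmeans_from_initiators(data, [[1, 2], [4, 2]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [[1.0, 2.0], [4.0, 2.0]]
```

Because the result depends only on the starting centers, a deterministic initiator selection gives a reproducible clustering.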

write_centroids(centers, n_iter)[source]

Writes the centroids to a file.

Parameters:

centers (array-like of shape (n_clusters, n_features)) – Centroids of the clusters.
n_iter (int) – Number of iterations until convergence.

mdance.cluster.nani.compute_scores(data, labels)[source]

Computes the Calinski-Harabasz and Davies-Bouldin scores.

Parameters:

data (array-like of shape (n_samples, n_features)) – Input dataset.
labels (array-like of shape (n_samples,)) – Cluster labels.

Returns:: Calinski-Harabasz and Davies-Bouldin scores (in that order).
Return type:: tuple

mdance.cluster.equal

class mdance.cluster.equal.ExtendedQuality(data, threshold, metric, N_atoms, seed_method='greedy', n_seeds=1, check_sim=False, reject_lowd=True, **kwargs)[source]

Bases: object

The extended quality clustering algorithm is an extension of the radial threshold algorithm. It grows clusters from seeds and can reject low-density clusters.

Parameters:

data (array-like of shape (n_samples, n_features)) – A feature array.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int) – Number of atoms in the system used for normalization. N_atoms=1 for non-Molecular Dynamics datasets.
threshold (float) – The distance between the seed of the subcluster and a new sample must be less than the threshold.
n_seeds ({float, int}) – Number of seeds to be used per iteration. Default is 1. float: Real number in (0, 1), interpreted as a fraction of the total data. int: Number of seeds.
seed_method ({'comp_sim', 'greedy', 'medoid', 'mini_batch_kmeans', 'vanilla'}) – Method used to select the initial seeds.
check_sim (bool, default=False) – If True, validates the proposed cluster against a similarity threshold to ensure it meets acceptable criteria.
reject_lowd (bool, default=True) – If True, will reject low-density clusters if they are below the minimum cluster size.
align_method ({'uni', 'kron', None}, optional) – Alignment method used for the data. Default is None, which means no alignment. ‘uni’ is a uniform alignment method. ‘kron’ is a Kronecker alignment method.
percentage (int, default=10) – Percentage of the dataset to be used for the initial selection of the initial centers. (**kwargs)
sim_threshold (float) – The largest similarity value that is acceptable for the proposed cluster. (**kwargs)
min_samples ({float, int}, default=10) – Minimum number of data points required in a cluster. (**kwargs) float: Real number in (0, 1), interpreted as a fraction of the total data. int: Number of data points.
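
The float-or-int convention used by n_seeds and min_samples can be sketched as follows. The helper name resolve_count is hypothetical, and truncating the product toward zero is an assumption; the library may round differently.

```python
def resolve_count(value, n_total):
    """Turn a fraction in (0, 1) or an integer count into an absolute count.

    Hypothetical helper; truncation toward zero is an assumption.
    """
    if isinstance(value, float):
        if not 0 < value < 1:
            raise ValueError("a float value must lie strictly between 0 and 1")
        return int(value * n_total)
    return int(value)


print(resolve_count(0.25, 200))  # 50 points, i.e. 25% of 200
print(resolve_count(10, 200))    # 10 points, taken literally
```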

calculate_best_frames(clusters, n_structures=10, sorted_by='frame')[source]

Extract the best n structures for each cluster.

Parameters:

clusters (dict) – Dictionary of the clusters and their corresponding indices.
n_structures (int, default=10) – Number of structures to be extracted for each cluster.
sorted_by ({'frame', 'similarity'}, default='frame') – Sort the best structures by frame number or similarity value.

Returns:

Array of the best frames.

Return type:

numpy.ndarray

calculate_populations(clusters)[source]

Calculate the populations of the clusters.

Returns:: Key: cluster number, value: cluster population.
Return type:: dict

comp_sim_seeds()[source]

Selects the initial centers based on the diversity in the high density region of the data using the n-ary similarity.

Returns:: (n_seeds, n_features) array of the initial seeds.
Return type:: numpy.ndarray

Notes

A complementary similarity is calculated for each point in the dataset. Next, the top n% of the points are selected for diversity selection. The first n_seeds number of points are selected as the seeds.

find_best_frames_indices(best_frames, sieve)[source]

Find the indices of the best frames.

Parameters:

best_frames (numpy.ndarray) – Array of the best frames.
sieve (int) – The sieve value used to select the frames.

Returns:

Array of the best frames indices.

Return type:

numpy.ndarray

find_medoids()[source]

Finds the seeds by selecting the medoids using the complementary similarity.

Returns:: (n_seeds, n_features) array of the initial seeds.
Return type:: numpy.ndarray

Notes

A complementary similarity is calculated for each point in the dataset. Then, the first n_seeds number of points are selected as the seeds.

greedy_seeds()[source]

Select the initial centers using the greedy k-means++ algorithm (Arthur and Vassilvitskii, 2007).

Returns:: (n_seeds, n_features) array of the initial seeds.
Return type:: numpy.ndarray

grow_clusters()[source]

The heart of the ExtendedQuality algorithm.

Returns:: Key: iteration number, value: numpy.ndarray of the cluster members.
Return type:: dict

Notes

1. Initial seeds are selected using the method in the seed_method attribute.
2. Each seed proposes a cluster by adding available objects within the radial threshold.
3. The winner seed cluster is the most dense cluster. If there are multiple, the one with the lowest similarity is chosen.
4. Objects in the winner seed cluster are removed from the data.
5. If check_sim is True, clusters above the similarity threshold are rejected.
6. If reject_lowd is True, clusters below min_samples are rejected.
7. Repeat steps 1-6 until there are fewer than 2 objects left in the data, because it is not possible to determine seeds with 2 or fewer objects.
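
The loop described in the notes above can be sketched in a heavily simplified form. This is not the mdance implementation: every remaining point acts as a seed (instead of using seed_method), Euclidean distance replaces the n-ary metric, ties go to the first proposal rather than the lowest similarity, and the check_sim step is omitted. The function name is hypothetical.

```python
def grow_clusters_sketch(data, threshold, min_samples=2):
    """Greedy radial-threshold clustering (simplified illustration)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    remaining = list(range(len(data)))
    clusters = {}
    accepted = 0
    while len(remaining) >= 2:
        # Each seed proposes the remaining points strictly inside its radius.
        proposals = [
            [j for j in remaining if sq_dist(data[i], data[j]) < threshold ** 2]
            for i in remaining
        ]
        winner = max(proposals, key=len)  # most populated proposal wins
        # Remove the winner's members from the pool of available objects.
        remaining = [j for j in remaining if j not in winner]
        if len(winner) >= min_samples:  # reject low-density clusters
            clusters[accepted] = winner
            accepted += 1
    return clusters


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]]
print(grow_clusters_sketch(points, threshold=2))  # {0: [0, 1, 2], 1: [3, 4]}
```

The isolated point at index 5 is never assigned, which mirrors how a threshold-based method can leave outliers unclustered.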

labels(clusters, sieve)[source]

Assigns labels to the clusters.

Parameters:: clusters (dict) – Dictionary of the clusters and their corresponding indices.
Returns:: Array of the cluster labels.
Return type:: numpy.ndarray

mini_batch_kmeans_seeds()[source]

Select the initial centers using the mini-batch k-means algorithm.

Returns:: (n_seeds, n_features) array of the initial seeds.
Return type:: numpy.ndarray

run()[source]

Run the ExtendedQuality algorithm.

Returns:: Key: iteration number, value: numpy.ndarray of the cluster members.
Return type:: dict

vanilla_seeds()[source]

Select the initial centers using the vanilla k-means++ algorithm.

Returns:: (n_seeds, n_features) array of the initial seeds.
Return type:: numpy.ndarray
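
For reference, vanilla k-means++ seeding picks the first seed uniformly at random and every later seed with probability proportional to its squared distance from the nearest seed already chosen. The sketch below uses only the standard library; the function name and the rng parameter are assumptions for illustration, not the mdance API.

```python
import random


def kmeanspp_seeds(data, n_seeds, rng=None):
    """Vanilla k-means++ seeding (D^2 sampling), illustrative sketch.

    Assumes data holds at least n_seeds distinct points.
    """
    rng = rng or random.Random()

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    seeds = [rng.choice(data)]
    while len(seeds) < n_seeds:
        # Weight of each point: squared distance to its nearest chosen seed.
        weights = [min(sq_dist(point, seed) for seed in seeds) for point in data]
        seeds.append(rng.choices(data, weights=weights, k=1)[0])
    return seeds


points = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]]
seeds = kmeanspp_seeds(points, 3, rng=random.Random(0))
print(seeds)
```

Points that are already seeds have weight zero, so they cannot be drawn a second time.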

mdance.cluster.equal.compute_scores(results)[source]

Compute the Calinski-Harabasz (CH) and Davies-Bouldin (DB) scores for the clusters.

Parameters:

data (array-like of shape (n_samples, n_features)) – Input dataset.
results (dict) – Dictionary of the clusters and their corresponding indices.

Returns:

A tuple of the CH and DB scores in that order.

Return type:

tuple

Notes

Labels are assigned based on the number of clusters. If there is only one cluster, the CH and DB scores cannot be calculated and None is returned.

Example

>>> import numpy as np
>>> from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
>>> from mdance.cluster.equal import compute_scores
>>> data = np.array([[1, 2], [1, 4], [1, 0],
...                  [4, 2], [4, 4], [4, 0]])
>>> results = {0: [0, 1, 2], 1: [3, 4, 5]}
>>> ch, db = compute_scores(data, results)
>>> print(ch, db)
3.375 0.8888888888888888
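
The two values in the example above can be reproduced by hand. The following sketch computes the standard Calinski-Harabasz and Davies-Bouldin definitions with the standard library only; the function name ch_db_scores is hypothetical, and it assumes that results maps each cluster to the indices of its members, as in the example.

```python
from math import sqrt


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sq_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ch_db_scores(data, results):
    """Return (CH, DB) for clusters given as {label: [indices]}, or None."""
    clusters = [[data[i] for i in indices] for indices in results.values()]
    k = len(clusters)
    if k < 2:
        return None  # the scores are undefined for a single cluster
    n = sum(len(cluster) for cluster in clusters)
    centers = [centroid(cluster) for cluster in clusters]
    overall = centroid([p for cluster in clusters for p in cluster])

    # Calinski-Harabasz: between-cluster over within-cluster dispersion.
    between = sum(len(c) * sq_distance(m, overall) for c, m in zip(clusters, centers))
    within = sum(sq_distance(p, m) for c, m in zip(clusters, centers) for p in c)
    ch = (between / (k - 1)) / (within / (n - k))

    # Davies-Bouldin: mean over clusters of the worst scatter-to-separation ratio.
    scatter = [
        sum(sqrt(sq_distance(p, m)) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    db = sum(
        max(
            (scatter[i] + scatter[j]) / sqrt(sq_distance(centers[i], centers[j]))
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ) / k
    return ch, db


data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
results = {0: [0, 1, 2], 1: [3, 4, 5]}
ch, db = ch_db_scores(data, results)
print(ch, db)  # approximately 3.375 and 0.889, matching the example above
```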

mdance.cluster.helm

class mdance.cluster.helm.HELM(cluster_dict, metric, N_atoms, merge_scheme='inter', n_clusters=None, eps=None, trim_start=False, align_method=None, min_samples=0.01, **kwargs)[source]

Bases: object

Hierarchical Extended Linkage Method (HELM) is a class that performs hierarchical clustering of clusters. It uses the n-ary similarity framework to merge clusters based on the HELM merge schemes.

Parameters:

cluster_dict (dict) – dictionary of clusters following the format in the Notes section.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
merge_scheme (str, default='inter') – The merge scheme to use when merging two clusters. Options are intra, inter, and half.
n_clusters (int) – Number of clusters to terminate the clustering process.
eps (float) – epsilon MSD value to terminate the clustering process.
trim_start (bool, default=False) – If True, the initial clusters are trimmed based on the trim_val or trim_k.
trim_val (float, default=None) – If trim_start is True, then this value is used to trim the initial clusters.
trim_k (int, default=None) – If trim_start is True, then this value is used to trim the initial clusters.
align_method (str, default=None) – If uni, the clusters are aligned using the uniform alignment method. If kron, the clusters are aligned using the Kronecker alignment method.
input_top (str, default=None) – The topology file of the MD system.
input_traj (str, default=None) – The trajectory file of the MD system.
save_pairwise_sim (bool, default=False) – If True, the pairwise similarity matrix is saved.
link (str, default=None) – The linkage algorithm to use. See the Linkage Methods for full descriptions.

Notes

The cluster_dict dictionary should be in the following format: Clusters = {N1: clustersN1, N2: clustersN2, ...}

Nk (int) – Number of clusters in the k-th iteration.
clustersNk (list of lists) – Contains the info about clusters in the k-th iteration: clustersNk = [C1k, C2k, ...]
Cik (list of lists) – Contains information about the i-th cluster in the k-th iteration: Cik = [Indicesik, (c_sumik, sq_sumik), Nik]
Indicesik (list) – Cluster indices of merged clusters. For example, [0, 1] means clusters 0 and 1 are merged.
c_sumik (array-like of shape (n_features,)) – A feature array of the column-wise sum of the data.
sq_sumik (array-like of shape (n_features,)) – A feature array of the column-wise sum of the squared data.
Nik (int) – Number of elements in the cluster.
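
A small sketch may help to picture this format. The helper names below are hypothetical and are not part of mdance. The merge step assumes the obvious bookkeeping: because c_sum, sq_sum, and N are all sums over members, a merged entry is obtained by adding the two entries element-wise.

```python
def make_cluster_entry(index, points):
    """Build [Indices, (c_sum, sq_sum), N] for one initial cluster."""
    c_sum = [sum(col) for col in zip(*points)]
    sq_sum = [sum(x * x for x in col) for col in zip(*points)]
    return [[index], (c_sum, sq_sum), len(points)]


def merge_entries(a, b):
    """Combine two entries by concatenating indices and adding the sums."""
    c_sum = [x + y for x, y in zip(a[1][0], b[1][0])]
    sq_sum = [x + y for x, y in zip(a[1][1], b[1][1])]
    return [a[0] + b[0], (c_sum, sq_sum), a[2] + b[2]]


c0 = make_cluster_entry(0, [[1, 2], [2, 2]])
c1 = make_cluster_entry(1, [[8, 7], [8, 8]])
merged = merge_entries(c0, c1)

# Keys are the number of clusters present at that stage.
cluster_dict = {2: [c0, c1], 1: [merged]}
print(merged)  # [[0, 1], ([19, 19], [133, 121]), 4]
```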

calc(previous_clusters, i, j)[source]

Calculates the similarity between two clusters

Parameters:

previous_clusters (list of lists) – Contains the info about clusters in *k*th iteration clustersNk = [C1k, C2k, ...]
i (int) – Index of the first cluster
j (int) – Index of the second cluster

Returns:

similarity between two clusters using the HELM merge schemes

Return type:

float

gen_cluster_dists(previous_clusters)[source]

Generates pairwise similarity matrix for the initial clusters

Parameters:: previous_clusters (list of lists) – Contains the info about clusters in *k*th iteration clustersNk = [C1k, C2k, ...]
Returns:: pairwise similarity matrix
Return type:: array-like

gen_link_matrix()[source]

Generates the linkage matrix only for ward linkage

Returns:: linkage matrix
Return type:: array-like

gen_new_cluster(previous_clusters)[source]

Generates new cluster by merging two most similar clusters.

Parameters:: previous_clusters (list of lists) – Contains the info about clusters in *k*th iteration clustersNk = [C1k, C2k, ...]
Returns:: Contains the info about clusters in *k*th iteration clustersNk = [C1k, C2k, ...]
Return type:: list of lists

initial_pairwise_matrix()[source]

Generates pairwise similarity matrix for the initial clusters

Parameters:: previous_clusters (list of lists) – Contains the info about clusters in *k*th iteration clustersNk = [C1k, C2k, ...]
Returns:: pairwise similarity matrix
Return type:: array-like

link_matrix_to_cluster_dict()[source]

run()[source]

Performs HELM clustering of initial clusters.

Returns:: dictionary of clusters
Return type:: dict

Notes

N is the number of clusters in the k-th iteration.

trim_clusters()[source]

Trims the initial clusters based on the trim_val or trim_k.

Returns:: dictionary of clusters
Return type:: dict

mdance.cluster.helm.compute_scores(list_list, data)[source]

Computes Calinski-Harabasz and Davies-Bouldin scores of clusters using random labeling.

Returns:: list of tuples of Calinski-Harabasz and Davies-Bouldin scores of clusters
Return type:: list

mdance.cluster.helm.z_matrix(cluster_dict)[source]

Converts the cluster dictionary to a linkage matrix for plotting dendrogram.

Parameters:: cluster_dict (dict) – dictionary of clusters following the format in the Notes section.
Returns:: linkage matrix
Return type:: array-like

mdance.cluster.shine

class mdance.cluster.shine.Shine(trajs, metric, N_atoms, t, criterion, link='ward', merge_scheme='intra', sampling='diversity', **kwargs)[source]

Bases: object

SHINE (Sampling Hierarchical Intrinsic N-ary Ensembles) is a class that performs hierarchical clustering on a set of pathways. It uses the n-ary similarity framework to sample/calculate the pairwise distances between the trajectories. The class also provides a method to generate a dendrogram plot of the clustering results.

Parameters:

trajs (list) – List of tuples containing (idx, traj) where idx is the trajectory index and traj is the trajectory data
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
link (str, default='ward') – The linkage algorithm to use. See the Linkage Methods for full descriptions.
t (scalar) – For criteria ‘inconsistent’, ‘distance’ or ‘monocrit’, this is the threshold to apply when forming flat clusters. For ‘maxclust’ or ‘maxclust_monocrit’ criteria, this is the maximum number of clusters requested. See scipy.cluster.hierarchy.fcluster for the full description.
criterion (str, optional) – The criterion to use in forming flat clusters. This can be any of the following values: ‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’. See scipy.cluster.hierarchy.fcluster for the full description.
merge_scheme (str, default='intra') – The scheme to merge the distances between the trajectories. Possible values are 'intra', 'inter', 'semi_sum', 'max', 'min', 'haus'.
sampling (str, default='diversity') – The sampling scheme to use. Possible values are 'diversity', 'quota', None.
frame_cutoff (int, default=50) – Minimum number of frames to perform sampling.
frac (float, default=0.2) – Fraction of the data to be sampled from each trajectory.

gen_msdmatrix()[source]

Generates the MSD matrix for the trajectories using the merge scheme.

Returns:: distances – The MSD pairwise distances between the trajectories using the merge scheme.
Return type:: array-like of shape (n_samples, n_samples)

group_consecutive_indices(indices)[source]

Group consecutive indices into ranges for labels method

Parameters:: indices (list) – List of indices to group
Returns:: Grouped indices as a string
Return type:: str
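
One way to group consecutive indices into ranges is sketched below. The exact string format produced by the library is not stated here, so the "start-end" form joined by commas is an assumption, as is the function name.

```python
def group_consecutive(indices):
    """Collapse runs of consecutive integers into 'start-end' pieces.

    Hypothetical sketch; the output format is an assumption.
    """
    ordered = sorted(set(indices))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for value in ordered[1:]:
        if value == prev + 1:
            prev = value  # still inside the current run
            continue
        ranges.append((start, prev))
        start = prev = value
    ranges.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(group_consecutive([0, 1, 2, 5, 7, 8]))  # 0-2, 5, 7-8
```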

labels(condensed=True)[source]

Generate custom labels for the dendrogram plot

Returns:: custom_labels – List of custom labels for the clusters
Return type:: list

plot()[source]

Generates the dendrogram plot of the clustering results.

Returns:: ax – The dendrogram plot
Return type:: matplotlib.axes._subplots.AxesSubplot

process_trajs()[source]

Generates the trajectory dictionary and applies the sampling scheme.

Returns:: pathways – Dictionary containing the sampled trajectories
Return type:: dict

run()[source]

Performs the hierarchical agglomerative clustering on the trajectories.

Returns:

link_matrix (ndarray) – The hierarchical clustering encoded as a linkage matrix.
clusters (ndarray) – An array of length n. T[i] is the flat cluster number to which original observation i belongs.

mdance.cli

mdance.cli.prime_sim

mdance.cli.prime_sim.main()[source]

Main function to run the command line interface for calculating frame similarities.

Return type:: txt file with dictionary of similarities.

References

Chen, L., Mondal, A., Perez, A. & Miranda-Quintana, R.A. “Protein Retrieval via Integrative Molecular Ensembles (PRIME) through Extended Similarity Indices.” Journal of Chemical Theory and Computation 2024 20 (14), 6303-6315

Examples

$ prime_sim -m union -n 6 -i SM -t 0.1 -d normed_clusters -s ../nani/outputs/summary_6.csv

mdance.cli.prime_rep

mdance.cli.prime_rep.main()[source]

Main function to run the command line interface for generating method max values.

Return type:: txt file with method max values.

Examples

$ prime_rep -m union -s outputs -d normed_clusters -t 0.1 -i SM

mdance.data

mdance.data.top

Sample topology files for testing Molecular Dynamics systems. Works with mdance.data.traj.

mdance.data.traj

Sample trajectory files for testing Molecular Dynamics systems. Works with mdance.data.top.

mdance.data.sim_traj_numpy

Numpy array of sample trajectory data for testing Molecular Dynamics systems. Extracted from mdance.data.top and mdance.data.traj.

mdance.data.cc_sim

Complementary similarity of sample trajectory data for testing Molecular Dynamics systems.

mdance.data.trimmed_sim

Trimmed outliers of sample trajectory data for testing Molecular Dynamics systems.

mdance.data.blob_disk

2D dataset with multiple blobs in the shape of a disk for testing 2D systems.

mdance.data.diamonds

2D dataset in the shape of nine diamonds for testing 2D systems.

mdance.data.ellipses

2D dataset in the shape of multiple ellipses for testing 2D systems.

mdance.inputs

mdance.inputs.preprocess

mdance.inputs.preprocess.gen_traj_numpy(prmtopFileName, trajFileName, atomSel, verbose=True)[source]

Reads in a trajectory and returns a 2D numpy array of the coordinates of the selected atoms.

Parameters:

prmtopFileName (str) – The file path of the topology file.
trajFileName (str) – The file path of the trajectory file.
atomSel (str) – The atom selection string. For example, resid 3:12 and name N H CA C O. View details in the MDAnalysis documentation.

Returns:

traj_numpy – The 2D numpy array of shape (n_frames, n_atoms*3) containing the coordinates of the selected atoms.

Return type:

np.ndarray

Examples

>>> from mdance.inputs.preprocess import gen_traj_numpy
>>> traj_numpy = gen_traj_numpy('aligned_tau.pdb', 'aligned_tau.dcd',
...                                 'resid 3:12 and name N CA C')
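
The (n_frames, n_atoms*3) layout of the returned array can be pictured with a short standard-library sketch. The function name is hypothetical, and the x, y, z ordering within each atom is an assumption about the layout.

```python
def flatten_frames(frames):
    """Flatten per-frame (n_atoms, 3) coordinates into rows of n_atoms * 3.

    Hypothetical helper; assumes each atom is stored as (x, y, z).
    """
    return [[coord for atom in frame for coord in atom] for frame in frames]


frames = [
    [(0.0, 0.1, 0.2), (1.0, 1.1, 1.2)],  # frame 0: two atoms
    [(0.5, 0.6, 0.7), (1.5, 1.6, 1.7)],  # frame 1: two atoms
]
flat = flatten_frames(frames)
print(len(flat), len(flat[0]))  # 2 6
```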

mdance.inputs.preprocess.normalize_file(file, break_line=None, norm_type=None)[source]

Normalize a single file and output the normalized data to a new file.

Parameters:

file (str) – The file path of the input data.
output (str) – The file path of the output data.
break_line (int) – The number of columns per line of the input file (must be n-1, because the first line is ignored).
norm_type (str) – The type of normalization to use. Can be v2 or v3.
min (float or None, optional) – The minimum value to use for normalization. If not provided, the minimum value of the input data is used. Defaults to None.
max (float or None, optional) – The maximum value to use for normalization. If not provided, the maximum value of the input data is used. Defaults to None.
avg (float or None, optional) – The average value to use for normalization. If not provided, the average value of the input data is used. Defaults to None.

Returns:

The minimum, maximum, and average values of the input data.

Return type:

tuple

mdance.outputs

mdance.outputs.postprocess

mdance.outputs.postprocess.numpy_array_to_crd_traj(matrix, num_columns=10)[source]

Convert a numpy array to an AMBER CRD trajectory.

Parameters:

matrix (array_like of shape (n_samples, n_features)) – The data to be converted.
num_columns (int, optional) – The number of columns per line. Defaults to 10.

Returns:: The string representation of the trajectory.
Return type:: str
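
As a rough picture of the output, the AMBER coordinate trajectory format writes each value in a fixed 8-character field with three decimals, ten values per line. The sketch below follows that layout with the standard library only. The function name is hypothetical, and starting every frame on a new line with no title line is an assumption about what the library emits.

```python
def to_crd_text(matrix, num_columns=10):
    """Format rows of coordinates as fixed-width 8.3f fields.

    Hypothetical sketch; not the mdance implementation.
    """
    lines = []
    for frame in matrix:
        # Split each frame into chunks of num_columns values per line.
        for start in range(0, len(frame), num_columns):
            chunk = frame[start:start + num_columns]
            lines.append("".join(f"{value:8.3f}" for value in chunk))
    return "\n".join(lines) + "\n"


frame = [1.0, 2.5, -3.25] + [0.0] * 9  # 12 values -> one full line plus 2
text = to_crd_text([frame])
print(text, end="")
```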

mdance.outputs.postprocess.unnormalize_data(norm_data, min, max)[source]

Unnormalize data from 0 to 1 to the original range.

Parameters:

norm_data (array_like of shape (n_samples, n_features)) – The normalized data.
min (float) – The minimum value of the original range.
max (float) – The maximum value of the original range.

Returns:

The unnormalized data.

Return type:

array_like of shape (n_samples, n_features)
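
Assuming the data was scaled with the usual min-max formula, x_norm = (x - min) / (max - min), the inverse is x = x_norm * (max - min) + min. A minimal sketch with hypothetical names:

```python
def unnormalize(norm_data, min_value, max_value):
    """Map values from [0, 1] back to [min_value, max_value].

    Hypothetical sketch; assumes min-max scaling was used to normalize.
    """
    span = max_value - min_value
    return [[value * span + min_value for value in row] for row in norm_data]


restored = unnormalize([[0.0, 0.5, 1.0]], min_value=-2.0, max_value=6.0)
print(restored)  # [[-2.0, 2.0, 6.0]]
```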

mdance.prime

mdance.prime.sim_calc

class mdance.prime.sim_calc.FrameSimilarity(cluster_folder=None, summary_file=None, trim_frac=None, n_clusters=None, weighted_by_frames=True, n_ary='RR', weight='nw')[source]

Bases: object

A class to calculate the similarity between clusters.

Parameters:

cluster_folder (str) – The path to the folder containing the normalized cluster files.
summary_file (str) – The path to the summary file containing the number of frames for each cluster.
trim_frac (float) – The fraction of outliers to trim from the top cluster.
n_clusters (int) – The number of clusters to analyze.
weighted_by_frames (bool) – Whether to weight the similarity values by the number of frames in the cluster.
n_ary ({'RR', 'SM'}) – The n_ary similarity metric to use.
weight ({'nw', 'w', 'fraction'}) – The weight to use for the similarity metric.

Returns:

A dictionary containing the average similarity between each pair of clusters.

Return type:

dict

References

Chen, L., Mondal, A., Perez, A. & Miranda-Quintana, R.A. “Protein Retrieval via Integrative Molecular Ensembles (PRIME) through Extended Similarity Indices.” Journal of Chemical Theory and Computation 2024 20 (14), 6303-6315

Examples

>>> from mdance.prime.sim_calc import FrameSimilarity
>>> sim = FrameSimilarity(cluster_folder="path/to/cluster_folder", summary_file="path/to/summary_file",
...                       trim_frac=0.1, n_clusters=10, weighted_by_frames=True, n_ary='RR', weight='nw')
>>> sim.calculate_pairwise()
>>> sim.calculate_union()
>>> sim.calculate_medoid()
>>> sim.calculate_outlier()

calculate_medoid()[source]

Calculates the pairwise similarity between every frame in c0 and the medoid of each cluster. The pairwise similarity value between each frame in c0 and the medoid of each cluster is calculated using similarity indices.

Returns:: A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
Return type:: dict

calculate_outlier()[source]

Calculates the pairwise similarity between every frame in c0 and the outlier of each cluster. The pairwise similarity value between each frame in c0 and the outlier of each cluster is calculated using similarity indices.

Returns:: A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
Return type:: dict

calculate_pairwise()[source]

Calculates pairwise similarity between each cluster and all other clusters. The similarity score is calculated as the average of pairwise similarity values between each frame in the cluster and the top c0 cluster.

Returns:: A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
Return type:: dict

calculate_union()[source]

Calculates the extended similarity between the union of frames in c0 and cluster k. The similarity score is calculated as the union similarity between all frames in the cluster and the top c0 cluster.

Returns:: A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
Return type:: dict

mdance.prime.sim_calc.weight_dict(file_path=None, summary_file=None, dict=None, n_clusters=None)[source]

Calculates frame-weighted similarity values by the number of frames in each cluster.

Parameters:

file_path (str) – The path to the json file containing the unweighted similarity values.
summary_file (str) – The path to the summary file containing the number of frames for each cluster.
dict (dict) – A dictionary containing the unweighted similarity values.
n_clusters (int) – The number of clusters to analyze.

Returns:

A dictionary containing the frame-weighted similarity values between each pair of clusters.

Return type:

dict

mdance.prime.rep_frames

mdance.prime.rep_frames.calculate_max_key(dict)[source]

Find the key with the max value

Parameters:: dict (dict) – Dictionary with keys as strings and values as lists of floats
Returns:: Key with the max value
Return type:: int

mdance.prime.rep_frames.gen_all_methods_max(sim_folder='nw', norm_folder='v3_norm', weighted_by_frames=True, trim_frac=0.1, n_ary='RR', weight='nw', output_name='rep')[source]

Generate the representative frame for each method.

Parameters:

sim_folder (str) – Name of the folder containing the similarity matrices
norm_folder (str) – Name of the folder containing the normalized data
weighted_by_frames (bool) – Similarity is weighted by frames
trim_frac (float) – Fraction of outliers to trim
n_ary ({'RR', 'SM'}) – The n_ary similarity metric to use.
weight ({'nw', 'w'}) – The weight to use.
output_name (str) – Name of the output file

Returns:

File containing the frame number with max values by method: medoid_all, medoid_c0, medoid_c0(trimmed), pairwise, union, medoid, outlier

Return type:

file

mdance.prime.rep_frames.gen_one_method_max(method, sim_folder='nw', norm_folder='v3_norm', weighted_by_frames=True, trim_frac=0.1, n_ary='RR', weight='nw', output_name='rep')[source]

Generate the representative frame for the selected method.

Parameters:

method ({'medoid_all', 'medoid_c0', 'medoid_c0(trimmed)', 'pairwise', 'union', 'medoid', 'outlier'}) – Method to use
sim_folder (str) – Name of the folder containing the similarity matrices
norm_folder (str) – Name of the folder containing the normalized data
weighted_by_frames (bool) – Similarity is weighted by frames
trim_frac (float) – Fraction of outliers to trim
n_ary ({'RR', 'SM'}) – The n_ary similarity metric to use.
weight ({'nw', 'w'}) – The weight to use.
output_name (str) – Name of the output file

Raises:

ValueError – Invalid method. Choose from medoid_all, medoid_c0, medoid_c0(trimmed), pairwise, union, medoid, outlier.

Returns:

File containing the frame number with max values by method: medoid_all, medoid_c0, medoid_c0(trimmed), pairwise, union, medoid, outlier.

Return type:

file

mdance.tools

mdance.tools.bts

mdance.tools.bts.align_traj(data, N_atoms, align_method=None)[source]

Aligns a trajectory using uniform or Kronecker alignment.

Parameters:

data (array-like of shape (n_samples, n_features)) – A feature array.
N_atoms (int) – Number of atoms in the system.
align_method ({'uni', 'kron'}, default=None) – Alignment method. uni or uniform: Uniform alignment. kron or kronecker: Kronecker alignment.

Raises:

ValueError – if align_method is not uni, kron, or None.

Returns:

matrix of aligned data.

Return type:

numpy.ndarray

References

Klem, H., Hocky, G. M., and McCullagh M., “Size-and-Shape Space Gaussian Mixture Models for Structural Clustering of Molecular Dynamics Trajectories”. Journal of Chemical Theory and Computation 2022 18 (5), 3218-3230

mdance.tools.bts.calculate_comp_sim(matrix, metric, N_atoms=1)[source]

O(N) Complementary similarity calculation for n-ary objects.

Parameters:

matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.

Returns:

Array of complementary similarities for each object.

Return type:

numpy.ndarray

Examples

>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.calculate_comp_sim(X, metric='MSD', N_atoms=1)
array([31.   , 34.375, 36.75 , 27.75 , 23.875])
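
The values in this example can be reproduced in linear time from two running sums. The sketch below is written with the standard library only; the MSD formula in it was inferred from the example output above rather than taken from the library source, and the function names are hypothetical.

```python
def msd_from_sums(c_sum, sq_sum, n, n_atoms=1):
    """Mean squared deviation of n objects from their column sums.

    Formula inferred from the documented example (an assumption).
    """
    total = sum(2 * (n * sq - c * c) for c, sq in zip(c_sum, sq_sum))
    return total / n ** 2 / n_atoms


def comp_sim(matrix, n_atoms=1):
    """Complementary similarity: MSD of the data with one object left out."""
    n = len(matrix)
    c_sum = [sum(col) for col in zip(*matrix)]
    sq_sum = [sum(x * x for x in col) for col in zip(*matrix)]
    values = []
    for row in matrix:
        # Removing one object only requires subtracting it from the sums.
        rest_c = [c - x for c, x in zip(c_sum, row)]
        rest_sq = [s - x * x for s, x in zip(sq_sum, row)]
        values.append(msd_from_sums(rest_c, rest_sq, n - 1, n_atoms))
    return values


X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]]
print(comp_sim(X))  # [31.0, 34.375, 36.75, 27.75, 23.875]
```

The sums are computed once, so each leave-one-out value costs O(n_features) and the whole calculation scales linearly with the number of objects.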

mdance.tools.bts.calculate_medoid(matrix, metric, N_atoms=1)[source]

O(N) medoid calculation for n-ary objects.

Parameters:

matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.

Returns:

The index of the medoid in the dataset.

Return type:

int

Examples

>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.calculate_medoid(X, metric='MSD', N_atoms=1)
2

mdance.tools.bts.calculate_outlier(matrix, metric, N_atoms=1)[source]

O(N) Outlier calculation for n-ary objects.

Parameters:

matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating distance between n objects in an array. It must be one of the options allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.

Returns:

The index of the outlier in the dataset.

Return type:

int

Examples

>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.calculate_outlier(X, metric='MSD', N_atoms=1)
4
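
The medoid and outlier examples are consistent with ranking the complementary similarities of the same dataset. The sketch below assumes, for the MSD metric, that the medoid is the object whose removal leaves the largest MSD and the outlier is the one whose removal leaves the smallest; this is an interpretation of the documented outputs, not the library's code.

```python
import numpy as np

# Complementary similarities (MSD, N_atoms=1) from the calculate_comp_sim example.
comp_sims = np.array([31.0, 34.375, 36.75, 27.75, 23.875])

# Assumption: for a dissimilarity metric such as MSD, removing the medoid
# leaves the most spread-out set, and removing the outlier leaves the tightest.
medoid_index = int(np.argmax(comp_sims))
outlier_index = int(np.argmin(comp_sims))
print(medoid_index, outlier_index)  # 2 4
```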

mdance.tools.bts.diversity_selection(matrix, percentage: int, metric, N_atoms=1, method='strat', start='medoid')[source]

O(N) method of selecting the most diverse subset of a data matrix using the complementary similarity.

Parameters:

matrix (array-like of shape (n_samples, n_features)) – A feature array.
percentage (int) – If method='strat', percentage indicates how many bins of stratified data will be generated. If method='comp_sim', percentage indicates the percentage of data to be selected.
metric (str) – The metric to use when calculating distance between n objects in an array. It must be one of the options allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the system used for normalization. N_atoms=1 for non-Molecular Dynamics datasets.
method ({'strat', 'comp_sim'}, default='strat') – The method to use for diversity selection. strat: stratified sampling. comp_sim: maximizing the MSD between the selected objects and the rest of the data.
start ({'medoid', 'outlier', 'random', list}, default='medoid') – The initial seed for diversity selection. Either one of the named options or a list of indices is a valid input.

Raises:

ValueError – If start is not medoid, outlier, random, or a list.
ValueError – If percentage is too high.

Returns:

List of indices of the diversity selected data.

Return type:

list

Examples

>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [2, 9], [1, 8], [2, 7]])
>>> bts.diversity_selection(X, percentage=30, metric='MSD', N_atoms=1)
[7 4]
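
For method='comp_sim', the description above suggests a greedy procedure: start from a seed and repeatedly add the object that maximizes the MSD of the selected set. The sketch below illustrates that idea under stated assumptions. The helper names are hypothetical, the MSD form is the one implied by the extended_comparison example, the seed is passed as an explicit index, and the result is not claimed to match the output of bts.diversity_selection.

```python
import numpy as np

def msd_condensed(c_sum, sq_sum, n, n_atoms=1):
    """Assumed MSD form computed from column sums and column sums of squares."""
    return 2 * np.sum(sq_sum / n - (c_sum / n) ** 2, axis=-1) / n_atoms

def greedy_diverse_subset(matrix, percentage, start_index, n_atoms=1):
    """Hypothetical sketch of a comp_sim-style greedy selection."""
    matrix = np.asarray(matrix, dtype=float)
    n_select = int(len(matrix) * percentage / 100)
    selected = [start_index]
    c_sum = matrix[start_index].copy()
    sq_sum = matrix[start_index] ** 2
    while len(selected) < n_select:
        candidates = np.setdiff1d(np.arange(len(matrix)), selected)
        # MSD of the selected set plus each candidate, from running sums.
        scores = msd_condensed(c_sum + matrix[candidates],
                               sq_sum + matrix[candidates] ** 2,
                               len(selected) + 1, n_atoms)
        best = int(candidates[np.argmax(scores)])
        selected.append(best)
        c_sum += matrix[best]
        sq_sum += matrix[best] ** 2
    return selected

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [2, 9], [1, 8], [2, 7]])
subset = greedy_diverse_subset(X, percentage=50, start_index=0)
print(subset)
```

Keeping running sums of the selected objects means each candidate is scored without revisiting the objects already chosen.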

mdance.tools.bts.equil_align(indices, sieve, input_top, input_traj, mdana_atomsel, cpptraj_atomsel, ref_index)[source]

Aligns the frames in the trajectory to the reference frame.

Parameters:

indices (list) – List of indices of the data points in the cluster.
input_top (str) – Path to the input topology file.
input_traj (str) – Path to the input trajectory file.
mdana_atomsel (str) – Atom selection string for MDAnalysis.
cpptraj_atomsel (str) – Atom selection string for cpptraj.
ref_index (int) – Index of the reference frame.

Returns:

aligned_traj_numpy – Numpy array of the aligned trajectory.

Return type:

numpy.ndarray

mdance.tools.bts.extended_comparison(matrix, data_type='full', metric='MSD', N=None, N_atoms=1, **kwargs)[source]

O(N) Extended comparison function for n-ary objects.

Valid values for metric are:

MSD: Mean Square Deviation.

Extended or Instant Similarity Metrics:

AC: Austin-Colwell, BUB: Baroni-Urbani-Buser,
CTn: Consonni-Todeschini n, Fai: Faith,
Gle: Gleason, Ja: Jaccard,
Ja0: Jaccard 0-variant, JT: Jaccard-Tanimoto,
RT: Rogers-Tanimoto, RR: Russell-Rao,
SM: Sokal-Michener, SSn: Sokal-Sneath n.

Parameters:

matrix (array-like of shape (n_samples, n_features), or tuple/list of length 1 or 2) – A feature array of shape (n_samples, n_features) if data_type='full'. Otherwise, a tuple or list of length 1 (c_sum) or 2 (c_sum, sq_sum) if data_type='condensed'.
data_type ({'full', 'condensed'}, default='full') – Type of input data.
metric (str, default='MSD') – The metric to use when calculating distance between n objects in an array. It must be one of the metrics listed above.
N (int, optional, default=None) – Number of data points.
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
c_threshold (int, default=None) – Coincidence threshold for calculating extended similarity. It must be one of the options allowed by mdance.tools.esim.calculate_counters().
w_factor ({'fraction', 'power_n'}, default='fraction') – The type of weight function for calculating extended similarity. It must be one of the options allowed by mdance.tools.esim.calculate_counters().

Raises:

TypeError – If matrix is not a numpy.ndarray or a tuple/list of length 2.

Returns:

Extended comparison value.

Return type:

float

Examples

>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.extended_comparison(X, data_type='full', metric='MSD', N_atoms=1)
32.8
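
The 'full' and 'condensed' input types carry the same information for MSD: the condensed form keeps only the column sums (c_sum) and the column sums of squares (sq_sum). The sketch below computes the example value from those two vectors. The formula is an assumption inferred from the 32.8 output, not code taken from the library.

```python
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]], dtype=float)
N, N_atoms = len(X), 1

# Condensed representation: column sums and column sums of squares.
c_sum = X.sum(axis=0)
sq_sum = (X ** 2).sum(axis=0)

# Assumed MSD form; it reproduces the value of the example above.
msd = float(2 * np.sum(sq_sum / N - (c_sum / N) ** 2) / N_atoms)
print(round(msd, 6))  # 32.8
```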

mdance.tools.bts.get_new_index_n(matrix, metric, selected_condensed, n, select_from_n, **kwargs)[source]

Extract the new index to add to the list of selected indices.

Parameters:

matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating distance between n objects in an array. It must be one of the options allowed by mdance.tools.bts.extended_comparison().
selected_condensed (array-like of shape (n_features,)) – Condensed sum of the selected fingerprints.
n (int) – Number of selected objects.
select_from_n (array-like of shape (n_samples,)) – Array of indices to select from.
sq_selected_condensed (array-like of shape (n_features,), optional) – Condensed sum of the squared selected fingerprints. (**kwargs)
N_atoms (int, optional) – Number of atoms in the system used for normalization. N_atoms=1 for non-Molecular Dynamics datasets. (**kwargs)

Returns:

index of the new fingerprint to add to the selected indices.

Return type:

int

mdance.tools.bts.mean_sq_dev(matrix, N_atoms)[source]

O(N) Mean square deviation (MSD) calculation for n-ary objects.

Parameters:

matrix (array-like of shape (n_samples, n_features)) – A feature array.
N_atoms (int) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.

Returns:

normalized MSD value.

Return type:

float
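
One way to see why an O(N) MSD is possible: the mean squared Euclidean distance over all ordered pairs of objects equals twice the summed per-feature variance, which needs only column sums. The check below uses random data and N_atoms=1. The linear formula is the assumed MSD form implied by the extended_comparison example, not the library's source.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
N = len(X)

# O(N) route: assumed MSD form from column sums and sums of squares.
msd_linear = 2 * np.sum((X ** 2).sum(axis=0) / N - (X.sum(axis=0) / N) ** 2)

# O(N^2) route: mean squared Euclidean distance over all N*N ordered pairs.
diffs = X[:, None, :] - X[None, :, :]
msd_pairwise = np.sum(diffs ** 2) / N ** 2

print(np.isclose(msd_linear, msd_pairwise))
```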

mdance.tools.isim

Miranda Quintana Group - University of Florida

iSIM: instant similarity

Please, cite the original paper on iSIM:

López-Pérez, K., Kim, T.D. & Miranda-Quintana, R.A. Digital Discovery 3, 1160–1171 (2024). https://doi.org/10.1039/D4DD00041B

mdance.tools.isim.calculate_comp_sim(data, n_ary='RR')[source]

Calculate the complementary similarity for RR, JT, or SM

Parameters:

data (np.ndarray) – Array of arrays, each sub-array contains the binary object
n_objects (int) – Number of objects, only necessary if the columnwise sum is the input data.
n_ary (str) – String with the initials of the desired similarity index to calculate the iSIM from. Only RR, JT, or SM are available. For other indexes use gen_sim_dict.

Returns:

comp_sims – 1D array with the complementary similarities of all the molecules in the set.

Return type:

numpy.ndarray

mdance.tools.isim.calculate_counters(data, n_objects=None, k=1)[source]

Calculate 1-similarity, 0-similarity, and dissimilarity counters

Parameters:

data (np.ndarray) – Array of arrays, each sub-array contains the binary object OR Array with the columnwise sum, if so specify n_objects.
n_objects (int) – Number of objects, only necessary if the columnwise sum is the input data.
k (int) – Integer indicating the 1/k power used to approximate the average of the similarity values elevated to 1/k.

Returns:

counters – Dictionary with the weighted and non-weighted counters.

Return type:

dict

mdance.tools.isim.calculate_isim(data, n_objects=None, n_ary='RR')[source]

Calculate the iSIM index for RR, JT, or SM

Parameters:

data (np.ndarray) – Array of arrays, each sub-array contains the binary object OR Array with the columnwise sum, if so specify n_objects
n_objects (int) – Number of objects, only necessary if the columnwise sum is the input data.
n_ary (str) – String with the initials of the desired similarity index to calculate the iSIM from. Only RR, JT, or SM are available. For other indexes use gen_sim_dict.

Returns:

isim – iSIM index for the specified similarity index.

Return type:

float
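
As an illustration of how an index can be obtained from the columnwise sum alone, the sketch below computes a Russell-Rao (RR) value for binary fingerprints and compares it with the average over all pairs. The helper isim_rr is hypothetical and written from the definition of RR; it is not the body of calculate_isim.

```python
import numpy as np
from itertools import combinations

def isim_rr(fingerprints):
    """Hypothetical helper: Russell-Rao similarity from column sums."""
    fingerprints = np.asarray(fingerprints)
    n_objects, n_features = fingerprints.shape
    c_total = fingerprints.sum(axis=0)
    # Pairs of objects that share a 1 in each column: k * (k - 1) / 2.
    on_on_pairs = np.sum(c_total * (c_total - 1) / 2)
    total_pairs = n_objects * (n_objects - 1) / 2
    return on_on_pairs / (total_pairs * n_features)

fps = np.array([[1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 1]])
# Reference: average pairwise Russell-Rao, shared ones / n_features per pair.
pairwise = np.mean([np.sum(a & b) / fps.shape[1] for a, b in combinations(fps, 2)])
print(isim_rr(fps), pairwise)
```

Both routes give the same number here; the column-sum route avoids enumerating the pairs.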

mdance.tools.isim.calculate_medoid(data, n_ary='RR')[source]

mdance.tools.isim.calculate_outlier(data, n_ary='RR')[source]

mdance.tools.isim.gen_sim_dict(data, n_objects=None, k=1)[source]

Calculate a dictionary containing all the available similarity indexes

Parameters:

See calculate_counters().

Returns:

sim_dict – Dictionary with the weighted and non-weighted similarity indexes.

Return type:

dict

mdance.tools.esim

Miranda Quintana Group - University of Florida

eSIM: extended similarity indices

Please, cite the original papers on the n-ary indices:

Miranda-Quintana, R.A., Bajusz, D., Rácz, A. & Héberger, K. J Cheminform 13, 32 (2021). https://doi.org/10.1186/s13321-021-00505-3

Miranda-Quintana, R.A., Rácz, A., Bajusz, D. & Héberger, K. J Cheminform 13, 33 (2021). https://doi.org/10.1186/s13321-021-00504-4

class mdance.tools.esim.SimilarityIndex(data, n_objects=None, c_threshold=None, n_ary='RR', w_factor='fraction', weight='nw', return_dict=False)[source]

Bases: object

O(N) similarity index calculation for a set.

Parameters:

data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}) – Coincidence threshold.
n_ary (str) – String with the initials of the desired similarity index.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
return_dict (bool) – If True, returns a dictionary with all the similarity indices.

Returns:

The similarity index.

Return type:

float or dict

__call__()[source]

The method invoked when an instance of the class is called.

Returns:

float or dict: The similarity index or a dictionary with all the similarity indices.

ac_nw()[source]

ac_w()[source]

bub_nw()[source]

bub_w()[source]

ct1_nw()[source]

ct1_w()[source]

ct2_nw()[source]

ct2_w()[source]

ct3_nw()[source]

ct3_w()[source]

ct4_nw()[source]

ct4_w()[source]

fai_nw()[source]

fai_w()[source]

gen_sim_dict()[source]: Generates a dictionary of all similarity indices.

gle_nw()[source]

gle_w()[source]

ja0_nw()[source]

ja0_w()[source]

ja_nw()[source]

ja_w()[source]

jt_nw()[source]

jt_w()[source]

rr_nw()[source]

rr_nw_nw()[source]

rr_w()[source]

rt_nw()[source]

rt_w()[source]

sm_nw()[source]

sm_nw_nw()[source]

sm_w()[source]

ss1_nw()[source]

ss1_w()[source]

ss2_nw()[source]

ss2_w()[source]

mdance.tools.esim.calc_comp_sim(data, c_threshold=None, n_ary='RR', w_factor='fraction', weight='nw', c_total=None)[source]

O(N) complementary similarity calculation for a set.

Parameters:

data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}) – Coincidence threshold.
n_ary (str) – String with the initials of the desired similarity index to calculate the complementary similarities from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.

Raises:

ValueError – If the dimensions of the objects and columnwise sum differ.

Returns:

A list with the complementary similarities of all the molecules in the set.

Return type:

list

mdance.tools.esim.calc_medoid(data, n_ary='RR', w_factor='fraction', weight='nw', c_total=None)[source]

O(N) medoid calculation for a set.

Parameters:

data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_ary (str) – string with the initials of the desired similarity index to calculate the medoid from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.

Raises:

ValueError – If the dimensions of the objects and columnwise sum differ.

Returns:

The index of the medoid.

Return type:

int

mdance.tools.esim.calc_outlier(data, n_ary='RR', w_factor='fraction', weight='nw', c_total=None)[source]

O(N) outlier calculation for a set.

Parameters:

data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_ary (str) – String with the initials of the desired similarity index to calculate the outlier from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.

Raises:

ValueError – If the dimensions of the objects and columnwise sum differ.

Returns:

The index of the outlier.

Return type:

int

mdance.tools.esim.calculate_counters(c_total, n_objects, c_threshold=None, w_factor='fraction')[source]

Calculate 1-similarity, 0-similarity, and dissimilarity counters.

Parameters:

c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int, float}, default=None) – Coincidence threshold.
- None : Default, c_threshold = n_objects % 2
- dissimilar : c_threshold = ceil(n_objects / 2)
- int : Integer number < n_objects
- float : Real number in the (0, 1) interval. Indicates the fraction of the total data that will serve as threshold.
w_factor ({'fraction', 'power_n'}, default='fraction') – Type of weight function that will be used.
- fraction : similarity = d[k]/n
  dissimilarity = 1 - (d[k] - n_objects % 2)/n_objects
- power_n : similarity = n**-(n_objects - d[k])
  dissimilarity = n**-(d[k] - n_objects % 2)
- other values : similarity = dissimilarity = 1

Returns:

counters – Dictionary with the weighted and non-weighted counters.

Return type:

dict
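
The role of c_threshold can be illustrated with a small, non-weighted sketch. It assumes d[k] = |2 * c_total[k] - n_objects| and that a column counts as 1-similar or 0-similar only when that quantity exceeds the threshold; this rule, the function name, and the dictionary keys are assumptions for illustration and may differ from the library. Float thresholds and the weight functions are left out.

```python
import math
import numpy as np

def classify_columns(c_total, n_objects, c_threshold=None):
    """Hypothetical sketch of non-weighted counters; not the library's code."""
    c_total = np.asarray(c_total)
    if c_threshold is None:
        c_threshold = n_objects % 2
    elif c_threshold == "dissimilar":
        c_threshold = math.ceil(n_objects / 2)
    # d[k]: how far each column is from an even split of ones and zeros.
    d = np.abs(2 * c_total - n_objects)
    # Assumed rule: a column counts as similar when d[k] exceeds the threshold.
    one_sim = (2 * c_total - n_objects) > c_threshold
    zero_sim = (n_objects - 2 * c_total) > c_threshold
    dissim = d <= c_threshold
    return {"a": int(one_sim.sum()), "d": int(zero_sim.sum()),
            "dis": int(dissim.sum())}

counts = classify_columns([4, 0, 2, 3], n_objects=4)
print(counts)
```

With four objects and column sums [4, 0, 2, 3], the default threshold is 0, so only the evenly split column (sum 2) is counted as dissimilar.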

mdance.tools.esim.gen_sim_dict(c_total, n_objects, c_threshold=None, w_factor='fraction')[source]

Generate a dictionary with the similarity indices.

Valid values for metric are:

MSD: Mean Square Deviation.

Extended or Instant Similarity Metrics:

AC: Austin-Colwell, BUB: Baroni-Urbani-Buser,
CTn: Consonni-Todeschini n, Fai: Faith,
Gle: Gleason, Ja: Jaccard,
Ja0: Jaccard 0-variant, JT: Jaccard-Tanimoto,
RT: Rogers-Tanimoto, RR: Russell-Rao,
SM: Sokal-Michener, SSn: Sokal-Sneath n.

Parameters:

c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}) – Coincidence threshold.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.

Returns:

Dictionary with the similarity indices.

Return type:

dict