mdance.cluster
mdance.cluster.nani
- class mdance.cluster.nani.KmeansNANI(data, n_clusters, metric, N_atoms, init_type='strat_all', **kwargs)[source]
Bases: object
k-means NANI (N-Ary Natural Initialization) clustering algorithm.
Valid values for init_type (k is the number of clusters):
- strat_all: A number of bins is computed based on a specified percentage of the data. Stratified sampling is then applied, and the first k points from the stratified data are selected as the initial centers.
- strat_reduced: Identifies high-density regions using complementary similarity, selecting a specified percentage of points. Applies stratified sampling to this subset using a number of bins based on the subset size, and selects the first k points as initial centers.
- comp_sim: Identifies high-density regions using complementary similarity, selecting a specified percentage of the data. From this subset, diversity selection (with comp_sim as the sampling method) is used to choose the first k points as the initial centers.
- div_select: Applies diversity selection (using comp_sim as the sampling method) to a specified percentage of the points. The first k points are the initial centers.
- quota: Uses quota sampling to select initial centers based on complementary similarity values divided into bins.
- k-means++: Selects the initial centers based on the greedy k-means++ algorithm.
- random: Selects the initial centers randomly.
- vanilla_kmeans++: Selects the initial centers based on the vanilla k-means++ algorithm.
- Parameters:
data (array-like of shape (n_samples, n_features)) – A feature array.
n_clusters (int) – Number of clusters.
metric (str) – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
init_type (str, default='strat_all') – Type of initiator selection for initiating k-means. It must be an option allowed by mdance.cluster.nani.KmeansNANI.
percentage (int, default=10) – Percentage of the dataset to be used for the initial selection of the initial centers. (**kwargs)
- labels
An array of the labels of each point.
- Type:
array-like of shape (n_samples,)
- centers
An array of the cluster centers.
- Type:
array-like of shape (n_clusters, n_features)
- n_iter
Number of iterations until convergence.
- Type:
int
- cluster_dict
Dictionary of the clusters and their corresponding indices.
- Type:
dict
- compute_scores(labels)[source]
Computes the Davies-Bouldin and Calinski-Harabasz scores.
- Parameters:
labels (array-like of shape (n_samples,)) – Cluster labels.
- Returns:
Davies-Bouldin and Calinski-Harabasz scores.
- Return type:
tuple
- create_cluster_dict(labels)[source]
Creates a dictionary with the labels as keys and the indices of the data as values.
- Parameters:
labels (array-like of shape (n_samples,)) – Cluster labels.
- Returns:
Dictionary with the labels as keys and the indices of the data as values.
- Return type:
dict
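A minimal sketch of the mapping that create_cluster_dict describes, written with NumPy only. The helper name cluster_dict_from_labels is hypothetical and the library's own implementation may store the indices differently.

```python
import numpy as np

def cluster_dict_from_labels(labels):
    """Map each cluster label to the indices of the points carrying it."""
    labels = np.asarray(labels)
    # np.unique returns the sorted distinct labels; flatnonzero gives their positions.
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}

labels = [0, 1, 0, 2, 1, 0]
clusters = cluster_dict_from_labels(labels)
for label, indices in clusters.items():
    print(label, indices.tolist())
# 0 [0, 2, 5]
# 1 [1, 4]
# 2 [3]
```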
- execute_kmeans_all()[source]
Completes all steps of NANI for all the different init_type options.
- Returns:
Labels, centers and number of iterations.
- Return type:
tuple
- initiate_kmeans(**kwargs)[source]
Initializes the k-means algorithm with the selected initiators.
- Raises:
ValueError – If the number of initiators is less than the number of clusters.
- Returns:
The initial centers for k-means of shape (n_clusters, n_features).
- Return type:
numpy.ndarray
- kmeans_clustering(initiators)[source]
Executes the k-means algorithm with the selected initiators.
- Parameters:
initiators ({numpy.ndarray, 'k-means++', 'random'}) – Method for selecting initial centers. k-means++ selects initial centers in a smart way to speed up convergence. random selects initial centers randomly. A numpy.ndarray supplies the initial centers directly.
- Returns:
Labels, centers and number of iterations.
- Return type:
tuple
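To illustrate what running k-means from a supplied array of initiators means, here is a plain Lloyd iteration in NumPy. It is a sketch under stated assumptions, not the library's code: the function name lloyd_kmeans and the max_iter parameter are hypothetical, and squared Euclidean distance is assumed.

```python
import numpy as np

def lloyd_kmeans(data, initiators, max_iter=300):
    """Plain Lloyd iterations starting from the given centers."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(initiators, dtype=float).copy()
    labels = np.zeros(len(data), dtype=int)
    for n_iter in range(1, max_iter + 1):
        # Assign every point to its nearest center (squared Euclidean distance).
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; empty clusters stay put.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = data[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return labels, centers, n_iter

data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels, centers, n_iter = lloyd_kmeans(data, initiators=data[[0, 3]])
print(labels.tolist(), n_iter)  # [0, 0, 0, 1, 1, 1] 1
```

The return value mirrors the (labels, centers, number of iterations) tuple documented above. Good initiators matter because they decide how many of these iterations are needed.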
mdance.cluster.equal
- class mdance.cluster.equal.ExtendedQuality(data, threshold, metric, N_atoms, seed_method='greedy', n_seeds=1, check_sim=False, reject_lowd=True, **kwargs)[source]
Bases: object
Extended quality clustering algorithm, an extension of the radial threshold algorithm. It grows clusters from seeds and can reject low-density clusters.
- Parameters:
data (array-like of shape (n_samples, n_features)) – A feature array.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int) – Number of atoms in the system used for normalization. N_atoms=1 for non-Molecular Dynamics datasets.
threshold (float) – The distance between the seed of the subcluster and a new sample must be less than the threshold.
n_seeds ({float, int}) – Number of seeds to be used per iteration. Default is 1. float: Real number between (0, 1). Indicates the % of the total data. int: Number of seeds.
seed_method ({'comp_sim', 'greedy', 'medoid', 'mini_batch_kmeans', 'vanilla'}) – Method used to select the initial seeds.
check_sim (bool, default False) – If True, validates the proposed cluster against a similarity threshold to ensure it meets acceptable criteria.
reject_lowd (bool, default True) – If True, will reject low-density clusters if they are below the minimum cluster size.
align_method ({'uni', 'kron', None}, optional) – Alignment method used for the data. Default is None, which means no alignment. ‘uni’ is a uniform alignment method. ‘kron’ is a Kronecker alignment method.
percentage (int, default=10) – Percentage of the dataset to be used for the initial selection of the initial centers. (**kwargs)
sim_threshold (float) – The largest similarity value that is acceptable for the proposed cluster. (**kwargs)
min_samples ({float, int}, default=10) – Minimum number of data points required in a cluster. (**kwargs) float: Real number between (0, 1). Indicates the % of the total data. int: Number of data points.
- calculate_best_frames(clusters, n_structures=10, sorted_by='frame')[source]
Extract the best n structures for each cluster.
- Parameters:
clusters (dict) – Dictionary of the clusters and their corresponding indices.
n_structures (int, default=10) – Number of structures to be extracted for each cluster.
sorted_by ({'frame', 'similarity'}, default='frame') – Sort the best structures by frame number or similarity value.
- Returns:
Array of the best frames.
- Return type:
numpy.ndarray
- calculate_populations(clusters)[source]
Calculate the populations of the clusters.
- Returns:
Key: cluster number, value: cluster population.
- Return type:
dict
- comp_sim_seeds()[source]
Selects the initial centers based on the diversity in the high-density region of the data using the n-ary similarity.
- Returns:
(n_seeds, n_features) array of the initial seeds.
- Return type:
numpy.ndarray
Notes
A complementary similarity is calculated for each point in the dataset. Next, the top n% of the points are selected for diversity selection. The first n_seeds points are selected as the seeds.
- find_best_frames_indices(best_frames, sieve)[source]
Find the indices of the best frames.
- Parameters:
best_frames (numpy.ndarray) – Array of the best frames.
sieve (int) – The sieve value used to select the frames.
- Returns:
Array of the best frames indices.
- Return type:
numpy.ndarray
- find_medoids()[source]
Finds the seeds by selecting the medoids using the complementary similarity.
- Returns:
(n_seeds, n_features) array of the initial seeds.
- Return type:
numpy.ndarray
Notes
A complementary similarity is calculated for each point in the dataset. Then, the first n_seeds points are selected as the seeds.
- greedy_seeds()[source]
Select the initial centers using the greedy k-means++ algorithm. (Arthur and Vassilvitskii, 2007).
- Returns:
(n_seeds, n_features) array of the initial seeds.
- Return type:
numpy.ndarray
- grow_clusters()[source]
The heart of the ExtendedQuality algorithm.
- Returns:
Key: iteration number, value: numpy.ndarray of the cluster members.
- Return type:
dict
Notes
1. Initial seeds are selected using the method in the seed_method attribute.
2. Each seed proposes a cluster by adding available objects within the radial threshold.
3. The winner seed cluster is the densest cluster. If there are multiple, the one with the lowest similarity is chosen.
4. Objects in the winner seed cluster are removed from the data.
5. If check_sim is True, clusters above the similarity threshold are rejected.
6. If reject_lowd is True, clusters below min_samples are rejected.
7. Repeat steps 1-6 until there are fewer than 2 objects left in the data, because it is not possible to determine seeds with 2 or fewer objects.
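The loop in these notes can be sketched in a much-simplified form. The assumptions are significant: every remaining point acts as a candidate seed (instead of the seed_method strategies), squared Euclidean distance stands in for the configured metric, ties are broken by lowest index, and the check_sim step is omitted. The function name grow_clusters_sketch is hypothetical.

```python
import numpy as np

def grow_clusters_sketch(data, threshold, min_samples=2):
    """Simplified radial-threshold growth; every remaining point is a seed."""
    data = np.asarray(data, dtype=float)
    available = np.arange(len(data))  # indices not yet assigned to a cluster
    clusters = {}
    iteration = 0
    while len(available) >= 2:
        pts = data[available]
        # Squared distance between every candidate seed and every available point.
        d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        within = d2 < threshold
        # Winner: the seed whose proposed cluster holds the most points.
        winner = within.sum(axis=1).argmax()
        members = available[within[winner]]
        if len(members) < min_samples:
            break  # even the densest proposal is too small, so stop
        clusters[iteration] = members
        available = np.setdiff1d(available, members)  # remove the winner's objects
        iteration += 1
    return clusters

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]]
clusters = grow_clusters_sketch(data, threshold=4.0)
print({k: v.tolist() for k, v in clusters.items()})
# {0: [0, 1, 2], 1: [3, 4]}
```

The isolated point at (50, 50) is left unassigned, which is the behaviour the low-density rejection step is meant to produce.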
- labels(clusters, sieve)[source]
Assigns labels to the clusters.
- Parameters:
clusters (dict) – Dictionary of the clusters and their corresponding indices.
- Returns:
Array of the cluster labels.
- Return type:
numpy.ndarray
- mini_batch_kmeans_seeds()[source]
Select the initial centers using the mini-batch k-means algorithm.
- Returns:
(n_seeds, n_features) array of the initial seeds.
- Return type:
numpy.ndarray
- mdance.cluster.equal.compute_scores(results)[source]
Compute the Calinski-Harabasz (CH) and Davies-Bouldin (DB) scores for the clusters.
- Parameters:
data (array-like of shape (n_samples, n_features)) – Input dataset.
results (dict) – Dictionary of the clusters and their corresponding indices.
- Returns:
A tuple of the CH and DB scores in that order.
- Return type:
tuple
Notes
- Labels are assigned based on number of clusters. If there is only one cluster,
the CH and DB scores cannot be calculated and None is returned.
Example
>>> import numpy as np
>>> from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
>>> from mdance.cluster.equal import compute_scores
>>> data = np.array([[1, 2], [1, 4], [1, 0],
...                  [4, 2], [4, 4], [4, 0]])
>>> results = {0: [0, 1, 2], 1: [3, 4, 5]}
>>> ch, db = compute_scores(data, results)
>>> print(ch, db)
3.375 0.8888888888888888
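The two numbers in the example can be cross-checked by hand. The sketch below computes both scores from their textbook definitions using NumPy only. It is not the library's implementation; the function name ch_db_scores is hypothetical, and returning (None, None) for a single cluster is an assumption about how "None is returned" is realised.

```python
import numpy as np

def ch_db_scores(data, clusters):
    """Calinski-Harabasz and Davies-Bouldin scores from a {label: indices} dict."""
    data = np.asarray(data, dtype=float)
    n, k = len(data), len(clusters)
    if k < 2:
        return None, None  # both scores need at least two clusters
    overall_mean = data.mean(axis=0)
    centroids, spreads = [], []
    between = within = 0.0
    for indices in clusters.values():
        pts = data[list(indices)]
        c = pts.mean(axis=0)
        centroids.append(c)
        between += len(pts) * np.sum((c - overall_mean) ** 2)
        within += np.sum((pts - c) ** 2)
        # Mean distance of the members to their centroid (DB cluster spread).
        spreads.append(np.linalg.norm(pts - c, axis=1).mean())
    ch = (between / (k - 1)) / (within / (n - k))
    db = 0.0
    for i in range(k):
        # Worst-case similarity ratio of cluster i against every other cluster.
        db += max((spreads[i] + spreads[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i)
    return ch, db / k

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
results = {0: [0, 1, 2], 1: [3, 4, 5]}
ch, db = ch_db_scores(data, results)
```

Here ch evaluates to 3.375 and db to 8/9, matching the documented output. A higher CH and a lower DB both indicate better-separated clusters.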
mdance.cluster.helm
- class mdance.cluster.helm.HELM(cluster_dict, metric, N_atoms, merge_scheme='inter', n_clusters=None, eps=None, trim_start=False, align_method=None, min_samples=0.01, **kwargs)[source]
Bases: object
Hierarchical Extended Linkage Method (HELM) is a class that performs hierarchical clustering of clusters. It uses the n-ary similarity framework to merge clusters based on the HELM merge schemes.
- Parameters:
cluster_dict (dict) – dictionary of clusters following the format in the Notes section.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
merge_scheme (str, default='inter') – The merge scheme to use when merging two clusters. Options are intra, inter, and half.
n_clusters (int) – Number of clusters at which to terminate the clustering process.
eps (float) – Epsilon MSD value at which to terminate the clustering process.
trim_start (bool, default=False) – If True, the initial clusters are trimmed based on trim_val or trim_k.
trim_val (float, default=None) – If trim_start is True, this value is used to trim the initial clusters.
trim_k (int, default=None) – If trim_start is True, this value is used to trim the initial clusters.
align_method (str, default=None) – If uni, the clusters are aligned using the uniform alignment method. If kron, the clusters are aligned using the Kronecker alignment method.
input_top (str, default=None) – The topology file of the MD system.
input_traj (str, default=None) – The trajectory file of the MD system.
save_pairwise_sim (bool, default=False) – If True, the pairwise similarity matrix is saved.
link (str, default=None) – The linkage algorithm to use. See the Linkage Methods for full descriptions.
Notes
The cluster_dict dictionary should be in the following format:
Clusters = {N1: clustersN1, N2: clustersN2, ...}
- Nk : int
Number of clusters in the k-th iteration.
- clustersNk : list of lists
Contains the info about the clusters in the k-th iteration. clustersNk = [C1k, C2k, ...]
- Cik : list of lists
Contains information about the i-th cluster in the k-th iteration. Cik = [Indicesik, (c_sumik, sq_sumik), Nik]
- Indicesik : list
Cluster indices of merged clusters. For example, [0, 1] means clusters 0 and 1 are merged.
- c_sumik : array-like of shape (n_features,)
A feature array of the column-wise sum of the data.
- sq_sumik : array-like of shape (n_features,)
A feature array of the column-wise sum of the squared data.
- Nik : int
Number of elements in the cluster.
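As an illustration of this layout, the sketch below builds two cluster entries from raw frames and merges them. The helper names make_entry and merge_entries are hypothetical. The point is that a merge only has to add the condensed sums and counts, so the original frames are not needed again.

```python
import numpy as np

def make_entry(index, frames):
    """Build [Indices, (c_sum, sq_sum), N] for one initial cluster."""
    frames = np.asarray(frames, dtype=float)
    return [[index], (frames.sum(axis=0), (frames ** 2).sum(axis=0)), len(frames)]

def merge_entries(a, b):
    """Merge two entries by adding their condensed sums and counts."""
    indices = a[0] + b[0]
    c_sum = a[1][0] + b[1][0]
    sq_sum = a[1][1] + b[1][1]
    return [indices, (c_sum, sq_sum), a[2] + b[2]]

c0 = make_entry(0, [[1, 2], [2, 2], [2, 3]])
c1 = make_entry(1, [[8, 7], [8, 8]])
merged = merge_entries(c0, c1)

# Keys are the number of clusters present at each level of the hierarchy.
cluster_dict = {2: [c0, c1], 1: [merged]}
print(merged[0], merged[1][0].tolist(), merged[1][1].tolist(), merged[2])
# [0, 1] [21.0, 22.0] [137.0, 130.0] 5
```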
- calc(previous_clusters, i, j)[source]
Calculates the similarity between two clusters
- Parameters:
previous_clusters (list of lists) – Contains the info about the clusters in the k-th iteration. clustersNk = [C1k, C2k, ...]
i (int) – Index of the first cluster
j (int) – Index of the second cluster
- Returns:
similarity between two clusters using the HELM merge schemes
- Return type:
float
- gen_cluster_dists(previous_clusters)[source]
Generates pairwise similarity matrix for the initial clusters
- Parameters:
previous_clusters (list of lists) – Contains the info about the clusters in the k-th iteration. clustersNk = [C1k, C2k, ...]
- Returns:
pairwise similarity matrix
- Return type:
array-like
- gen_link_matrix()[source]
Generates the linkage matrix only for ward linkage
- Returns:
linkage matrix
- Return type:
array-like
- gen_new_cluster(previous_clusters)[source]
Generates new cluster by merging two most similar clusters.
- initial_pairwise_matrix()[source]
Generates pairwise similarity matrix for the initial clusters
- Parameters:
previous_clusters (list of lists) – Contains the info about the clusters in the k-th iteration. clustersNk = [C1k, C2k, ...]
- Returns:
pairwise similarity matrix
- Return type:
array-like
mdance.cluster.shine
- class mdance.cluster.shine.Shine(trajs, metric, N_atoms, t, criterion, link='ward', merge_scheme='intra', sampling='diversity', **kwargs)[source]
Bases: object
SHINE (Sampling Hierarchical Intrinsic N-ary Ensembles) is a class that performs hierarchical clustering on a set of pathways. It uses the n-ary similarity framework to sample/calculate the pairwise distances between the trajectories. The class also provides a method to generate a dendrogram plot of the clustering results.
- Parameters:
trajs (list) – List of tuples containing (idx, traj) where idx is the trajectory index and traj is the trajectory data
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
link (str, default='ward') – The linkage algorithm to use. See the Linkage Methods for full descriptions.
t (scalar) – For the criteria ‘inconsistent’, ‘distance’ or ‘monocrit’, this is the threshold to apply when forming flat clusters. For the ‘maxclust’ or ‘maxclust_monocrit’ criteria, this is the maximum number of clusters requested. See fcluster for the full description.
criterion (str, optional) – The criterion to use in forming flat clusters. This can be any of the following values: ‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’. See fcluster for the full description.
merge_scheme (str, default='intra') – The scheme used to merge the distances between the trajectories. Possible values are 'intra', 'inter', 'semi_sum', 'max', 'min', 'haus'.
sampling (str, default='diversity') – The sampling scheme to use. Possible values are 'diversity', 'quota', None.
frame_cutoff (int, default=50) – Minimum number of frames to perform sampling.
frac (float, default=0.2) – Fraction of the data to be sampled from each trajectory.
- gen_msdmatrix()[source]
Generates the MSD matrix for the trajectories using the merge scheme.
- Returns:
distances – The MSD pairwise distances between the trajectories using the merge scheme.
- Return type:
array-like of shape (n_samples, n_samples)
- group_consecutive_indices(indices)[source]
Group consecutive indices into ranges for the labels method.
- Parameters:
indices (list) – List of indices to group
- Returns:
Grouped indices as a string
- Return type:
str
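One way such a grouping could work is sketched below. The exact string format the method returns is not documented here, so the "start-end" notation, the comma separator, and the function name group_consecutive are all assumptions.

```python
def group_consecutive(indices):
    """Collapse integers into a compact string such as '0-2, 5, 7-8'."""
    ordered = sorted(set(indices))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for value in ordered[1:]:
        if value != prev + 1:
            # The run ended at prev; store it and start a new one.
            ranges.append((start, prev))
            start = value
        prev = value
    ranges.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

print(group_consecutive([0, 1, 2, 5, 7, 8]))  # 0-2, 5, 7-8
```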
- labels(condensed=True)[source]
Generate custom labels for the dendrogram plot
- Returns:
custom_labels – List of custom labels for the clusters
- Return type:
list
- plot()[source]
Generates the dendrogram plot of the clustering results.
- Returns:
ax – The dendrogram plot
- Return type:
matplotlib.axes._subplots.AxesSubplot
mdance.cli
mdance.cli.prime_sim
- mdance.cli.prime_sim.main()[source]
Main function to run the command line interface for calculating frame similarities.
- Return type:
txt file with dictionary of similarities.
References
Chen, L., Mondal, A., Perez, A. & Miranda-Quintana, R.A. “Protein Retrieval via Integrative Molecular Ensembles (PRIME) through Extended Similarity Indices.” Journal of Chemical Theory and Computation 2024 20 (14), 6303-6315
Examples
$ prime_sim -m union -n 6 -i SM -t 0.1 -d normed_clusters -s ../nani/outputs/summary_6.csv
mdance.cli.prime_rep
mdance.data
mdance.data.top
Sample topology files for testing Molecular Dynamics systems. Works with mdance.data.traj.
mdance.data.traj
Sample trajectory files for testing Molecular Dynamics systems. Works with mdance.data.top.
mdance.data.sim_traj_numpy
Numpy array of sample trajectory data for testing Molecular Dynamics systems. Extracted from mdance.data.top and mdance.data.traj.
mdance.data.cc_sim
Complementary similarity of sample trajectory data for testing Molecular Dynamics systems.
mdance.data.trimmed_sim
Trimmed outliers of sample trajectory data for testing Molecular Dynamics systems.
mdance.data.blob_disk
2D dataset of multiple blobs arranged in the shape of a disk, for testing 2D systems.
mdance.data.diamonds
2D dataset in the shape of nine diamonds, for testing 2D systems.
mdance.data.ellipses
2D dataset in the shape of multiple ellipses, for testing 2D systems.
mdance.inputs
mdance.inputs.preprocess
- mdance.inputs.preprocess.gen_traj_numpy(prmtopFileName, trajFileName, atomSel, verbose=True)[source]
Reads in a trajectory and returns a 2D numpy array of the coordinates of the selected atoms.
- Parameters:
prmtopFileName (str) – The file path of the topology file.
trajFileName (str) – The file path of the trajectory file.
atomSel (str) – The atom selection string. For example, resid 3:12 and name N H CA C O. View details in the MDAnalysis documentation.
- Returns:
traj_numpy – The 2D numpy array of shape (n_frames, n_atoms*3) containing the coordinates of the selected atoms.
- Return type:
np.ndarray
Examples
>>> from mdance.inputs.preprocess import gen_traj_numpy
>>> traj_numpy = gen_traj_numpy('aligned_tau.pdb', 'aligned_tau.dcd', 'resid 3:12 and name N CA C')
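The (n_frames, n_atoms*3) layout can be pictured with synthetic coordinates. This sketch does not read any trajectory file. It assumes that each row is the row-major flattening of an (n_atoms, 3) coordinate block, which is the natural reading of the shape above but is not confirmed by it.

```python
import numpy as np

n_frames, n_atoms = 4, 3
# Synthetic coordinates: one (x, y, z) triple per atom per frame.
coords = np.arange(n_frames * n_atoms * 3, dtype=float).reshape(n_frames, n_atoms, 3)

# Flatten each frame to x1, y1, z1, x2, y2, z2, ...
traj_numpy = coords.reshape(n_frames, n_atoms * 3)
print(traj_numpy.shape)  # (4, 9)

# N_atoms for the clustering classes can be recovered from the column count.
N_atoms = traj_numpy.shape[1] // 3
```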
- mdance.inputs.preprocess.normalize_file(file, break_line=None, norm_type=None)[source]
Normalize a single file and output the normalized data to a new file.
- Parameters:
file (str) – The file path of the input data.
output (str) – The file path of the output data.
break_line (int) – The number of columns per line of the input file (use n-1, because the first line is ignored).
norm_type (str) – The type of normalization to use. Can be v2 or v3.
min (float or None, optional) – The minimum value to use for normalization. If not provided, the minimum value of the input data is used. Defaults to None.
max (float or None, optional) – The maximum value to use for normalization. If not provided, the maximum value of the input data is used. Defaults to None.
avg (float or None, optional) – The average value to use for normalization. If not provided, the average value of the input data is used. Defaults to None.
- Returns:
The minimum, maximum, and average values of the input data.
- Return type:
tuple
mdance.outputs
mdance.outputs.postprocess
- mdance.outputs.postprocess.numpy_array_to_crd_traj(matrix, num_columns=10)[source]
Convert a numpy array to an AMBER CRD trajectory.
- Parameters:
array (array_like of shape (n_samples, n_features)) – The data to be converted.
num_columns (int, optional) – The number of columns per line. Defaults to 10.
- Returns:
The string representation of the trajectory.
- Return type:
str
- mdance.outputs.postprocess.unnormalize_data(norm_data, min, max)[source]
Rescale data normalized to the range 0 to 1 back to the original range.
- Parameters:
norm_data (array_like of shape (n_samples, n_features)) – The normalized data.
min (float) – The minimum value of the original range.
max (float) – The maximum value of the original range.
- Returns:
The unnormalized data.
- Return type:
array_like of shape (n_samples, n_features)
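A minimal sketch of the inverse mapping, assuming the data were normalized with plain min-max scaling, (x - min) / (max - min). The v2 and v3 schemes mentioned for normalize_file are not described here and may differ. The function name min_max_unnormalize is hypothetical.

```python
import numpy as np

def min_max_unnormalize(norm_data, min_value, max_value):
    """Map values in [0, 1] back to [min_value, max_value]."""
    norm_data = np.asarray(norm_data, dtype=float)
    return norm_data * (max_value - min_value) + min_value

original = np.array([[-2.0, 0.0], [3.0, 8.0]])
lo, hi = original.min(), original.max()
normalized = (original - lo) / (hi - lo)   # forward min-max scaling
restored = min_max_unnormalize(normalized, lo, hi)
print(np.allclose(restored, original))  # True
```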
mdance.prime
mdance.prime.sim_calc
- class mdance.prime.sim_calc.FrameSimilarity(cluster_folder=None, summary_file=None, trim_frac=None, n_clusters=None, weighted_by_frames=True, n_ary='RR', weight='nw')[source]
Bases: object
A class to calculate the similarity between clusters.
- Parameters:
cluster_folder (str) – The path to the folder containing the normalized cluster files.
summary_file (str) – The path to the summary file containing the number of frames for each cluster.
trim_frac (float) – The fraction of outliers to trim from the top cluster.
n_clusters (int) – The number of clusters to analyze.
weighted_by_frames (bool) – Whether to weight the similarity values by the number of frames in the cluster.
n_ary ({'RR', 'SM'}) – The n_ary similarity metric to use.
weight ({'nw', 'w', 'fraction'}) – The weight to use for the similarity metric.
- Returns:
A dictionary containing the average similarity between each pair of clusters.
- Return type:
dict
References
Chen, L., Mondal, A., Perez, A. & Miranda-Quintana, R.A. “Protein Retrieval via Integrative Molecular Ensembles (PRIME) through Extended Similarity Indices.” Journal of Chemical Theory and Computation 2024 20 (14), 6303-6315
Examples
>>> from mdance.prime.sim_calc import FrameSimilarity
>>> sim = FrameSimilarity(cluster_folder="path/to/cluster_folder", summary_file="path/to/summary_file",
...                       trim_frac=0.1, n_clusters=10, weighted_by_frames=True, n_ary='RR', weight='nw')
>>> sim.calculate_pairwise()
>>> sim.calculate_union()
>>> sim.calculate_medoid()
>>> sim.calculate_outlier()
- calculate_medoid()[source]
Calculates the pairwise similarity between every frame in c0 and the medoid of each cluster. The pairwise similarity value between each frame in c0 and the medoid of each cluster is calculated using similarity indices.
- Returns:
A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
- Return type:
dict
- calculate_outlier()[source]
Calculates the pairwise similarity between every frame in c0 and the outlier of each cluster. The pairwise similarity value between each frame in c0 and the outlier of each cluster is calculated using similarity indices.
- Returns:
A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
- Return type:
dict
- calculate_pairwise()[source]
Calculates pairwise similarity between each cluster and all other clusters. The similarity score is calculated as the average of pairwise similarity values between each frame in the cluster and the top c0 cluster.
- Returns:
A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
- Return type:
dict
- calculate_union()[source]
Calculates the extended similarity between the union of the frames in c0 and cluster k. The similarity score is calculated as the union similarity between all frames in the cluster and the top c0 cluster.
- Returns:
A dictionary containing the average similarity between each pair of clusters. weighted_by_frames=True will return the frame-weighted similarity values. weighted_by_frames=False will return the unweighted similarity values.
- Return type:
dict
- mdance.prime.sim_calc.weight_dict(file_path=None, summary_file=None, dict=None, n_clusters=None)[source]
Calculates frame-weighted similarity values by the number of frames in each cluster.
- Parameters:
file_path (str) – The path to the json file containing the unweighted similarity values.
summary_file (str) – The path to the summary file containing the number of frames for each cluster.
dict (dict) – A dictionary containing the unweighted similarity values.
n_clusters (int) – The number of clusters to analyze.
- Returns:
A dictionary containing the frame-weighted similarity values between each pair of clusters.
- Return type:
dict
mdance.prime.rep_frames
- mdance.prime.rep_frames.calculate_max_key(dict)[source]
Find the key with the max value
- Parameters:
dict (dict) – Dictionary with keys as strings and values as lists of floats
- Returns:
Key with the max value
- Return type:
int
- mdance.prime.rep_frames.gen_all_methods_max(sim_folder='nw', norm_folder='v3_norm', weighted_by_frames=True, trim_frac=0.1, n_ary='RR', weight='nw', output_name='rep')[source]
Generate the representative frame for each method.
- Parameters:
sim_folder (str) – Name of the folder containing the similarity matrices
norm_folder (str) – Name of the folder containing the normalized data
weighted_by_frames (bool) – Similarity is weighted by frames
trim_frac (float) – Fraction of outliers to trim
n_ary ({'RR', 'SM'}) – The n_ary similarity metric to use.
weight ({'nw', 'w'}) – The weight to use.
output_name (str) – Name of the output file
- Returns:
File containing the frame number with max values by method: medoid_all, medoid_c0, medoid_c0(trimmed), pairwise, union, medoid, outlier
- Return type:
file
- mdance.prime.rep_frames.gen_one_method_max(method, sim_folder='nw', norm_folder='v3_norm', weighted_by_frames=True, trim_frac=0.1, n_ary='RR', weight='nw', output_name='rep')[source]
Generate the representative frame for the selected method.
- Parameters:
method ({'medoid_all', 'medoid_c0', 'medoid_c0(trimmed)', 'pairwise', 'union', 'medoid', 'outlier'}) – Method to use
sim_folder (str) – Name of the folder containing the similarity matrices
norm_folder (str) – Name of the folder containing the normalized data
weighted_by_frames (bool) – Similarity is weighted by frames
trim_frac (float) – Fraction of outliers to trim
n_ary ({'RR', 'SM'}) – The n_ary similarity metric to use.
weight ({'nw', 'w'}) – The weight to use.
output_name (str) – Name of the output file
- Raises:
ValueError – Invalid method. Choose from medoid_all, medoid_c0, medoid_c0(trimmed), pairwise, union, medoid, outlier.
- Returns:
File containing the frame number with max values by method: medoid_all, medoid_c0, medoid_c0(trimmed), pairwise, union, medoid, outlier.
- Return type:
file
mdance.tools
mdance.tools.bts
- mdance.tools.bts.align_traj(data, N_atoms, align_method=None)[source]
Aligns a trajectory using uniform or Kronecker alignment.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
N_atoms (int) – Number of atoms in the system.
align_method ({'uni', 'kron'}, default=None) – Alignment method. uni or uniform: Uniform alignment. kron or kronecker: Kronecker alignment.
- Raises:
ValueError – If align_method is not uni, kron, or None.
- Returns:
matrix of aligned data.
- Return type:
numpy.ndarray
References
Klem, H., Hocky, G. M., and McCullagh M., “Size-and-Shape Space Gaussian Mixture Models for Structural Clustering of Molecular Dynamics Trajectories”. Journal of Chemical Theory and Computation 2022 18 (5), 3218-3230
- mdance.tools.bts.calculate_comp_sim(matrix, metric, N_atoms=1)[source]
O(N) Complementary similarity calculation for n-ary objects.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
- Returns:
Array of complementary similarities for each object.
- Return type:
numpy.ndarray
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.calculate_comp_sim(X, metric='MSD', N_atoms=1)
array([31, 34.375, 36.75, 27.75, 23.875])
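The O(N) cost comes from working with running sums. The sketch below shows the idea for the MSD metric only, assuming that the complementary similarity of object i is the MSD of the other N-1 objects. The MSD expression used is one that reproduces the documented example values. It is not copied from the library source, and the function name comp_sim_msd is hypothetical.

```python
import numpy as np

def comp_sim_msd(matrix, n_atoms=1):
    """Leave-one-out MSD of the remaining N-1 rows, for every row at once."""
    matrix = np.asarray(matrix, dtype=float)
    n = len(matrix)
    c_sum = matrix.sum(axis=0)           # column-wise sum
    sq_sum = (matrix ** 2).sum(axis=0)   # column-wise sum of squares
    # Removing row i only subtracts its contribution from both running sums.
    comp_c = c_sum - matrix
    comp_sq = sq_sum - matrix ** 2
    m = n - 1
    return np.sum(2 * (m * comp_sq - comp_c ** 2), axis=1) / m ** 2 / n_atoms

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
values = comp_sim_msd(X)
# values are 31.0, 34.375, 36.75, 27.75, 23.875, as in the example above
```

No pairwise distance matrix is ever formed, so the work grows linearly with the number of objects.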
- mdance.tools.bts.calculate_medoid(matrix, metric, N_atoms=1)[source]
O(N) medoid calculation for n-ary objects.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
- Returns:
The index of the medoid in the dataset.
- Return type:
int
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.calculate_medoid(X, metric='MSD', N_atoms=1)
2
- mdance.tools.bts.calculate_outlier(matrix, metric, N_atoms=1)[source]
O(N) Outlier calculation for n-ary objects.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. N_atoms=1 for non-MD systems.
- Returns:
The index of the outlier in the dataset.
- Return type:
int
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.calculate_outlier(X, metric='MSD', N_atoms=1)
4
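For squared Euclidean distances, the medoid and outlier indices in the two examples can be checked against a brute-force O(N^2) computation. The sketch also shows why an O(N) route exists: the total squared distance from a point to all others expands into terms that only need the column sums. This is an independent check with NumPy, not the library's algorithm.

```python
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]], dtype=float)

# Brute force, O(N^2): total squared distance from each point to all others.
pairwise = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
total_dist = pairwise.sum(axis=1)
medoid = int(total_dist.argmin())    # most central point
outlier = int(total_dist.argmax())   # least central point

# O(N) shortcut: sum_j |x_i - x_j|^2 = N*|x_i|^2 - 2*x_i.c_sum + sum_j |x_j|^2
n = len(X)
c_sum = X.sum(axis=0)
sq_total = (X ** 2).sum()
total_fast = n * (X ** 2).sum(axis=1) - 2 * X @ c_sum + sq_total

print(medoid, outlier)  # 2 4
```

Both routes agree on this dataset and give index 2 for the medoid and index 4 for the outlier, as documented.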
- mdance.tools.bts.diversity_selection(matrix, percentage: int, metric, N_atoms=1, method='strat', start='medoid')[source]
O(N) method of selecting the most diverse subset of a data matrix using the complementary similarity.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
percentage (int) – If method='strat', percentage indicates how many bins of stratified data will be generated. If method='comp_sim', percentage indicates the percentage of data to be selected.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the system used for normalization. N_atoms=1 for non-Molecular Dynamics datasets.
method ({'strat', 'comp_sim'}, default='strat') – The method to use for diversity selection. strat: stratified sampling. comp_sim: maximizing the MSD between the selected objects and the rest of the data.
start ({'medoid', 'outlier', 'random', list}, default='medoid') – The initial seed for initiating diversity selection. Either one of the named options or a list of indices is a valid input.
- Raises:
ValueError – If start is not medoid, outlier, random, or a list.
ValueError – If percentage is too high.
- Returns:
List of indices of the diversity selected data.
- Return type:
list
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [2, 9], [1, 8], [2, 7]])
>>> bts.diversity_selection(X, percentage=30, metric='MSD', N_atoms=1)
[7 4]
- mdance.tools.bts.equil_align(indices, sieve, input_top, input_traj, mdana_atomsel, cpptraj_atomsel, ref_index)[source]
Aligns the frames in the trajectory to the reference frame.
- Parameters:
indices (list) – List of indices of the data points in the cluster.
input_top (str) – Path to the input topology file.
input_traj (str) – Path to the input trajectory file.
mdana_atomsel (str) – Atom selection string for MDAnalysis.
cpptraj_atomsel (str) – Atom selection string for cpptraj.
ref_index (int) – Index of the reference frame.
- Returns:
aligned_traj_numpy – Numpy array of the aligned trajectory.
- Return type:
numpy.ndarray
- mdance.tools.bts.extended_comparison(matrix, data_type='full', metric='MSD', N=None, N_atoms=1, **kwargs)[source]
O(N) Extended comparison function for n-ary objects.
Valid values for metric are:
MSD: Mean Square Deviation.
Extended or Instant Similarity Metrics:
AC: Austin-Colwell, BUB: Baroni-Urbani-Buser, CTn: Consonni-Todeschini n, Fai: Faith, Gle: Gleason, Ja: Jaccard, Ja0: Jaccard 0-variant, JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russell-Rao, SM: Sokal-Michener, SSn: Sokal-Sneath n.
- Parameters:
matrix (array-like of shape (n_samples, n_features) or tuple/list of length 1 or 2) – A feature array of shape (n_samples, n_features) if data_type='full'. Otherwise, a tuple or list of length 1 (c_sum) or 2 (c_sum, sq_sum) if data_type='condensed'.
data_type ({'full', 'condensed'}, default='full') – Type of data inputted.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be one of the valid values listed above.
N (int, optional, default=None) – Number of data points.
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. Use N_atoms=1 for non-MD systems.
c_threshold (int, default=None) – Coincidence threshold for calculating extended similarity. It must be an option allowed by mdance.tools.esim.calculate_counters().
w_factor ({'fraction', 'power_n'}, default='fraction') – The type of weight function for calculating extended similarity. It must be an option allowed by mdance.tools.esim.calculate_counters().
- Raises:
TypeError – If data is not a numpy.ndarray or tuple/list of length 2.
- Returns:
Extended comparison value.
- Return type:
float
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.extended_comparison(X, data_type='full', metric='MSD', N_atoms=1)
32.8
- mdance.tools.bts.get_new_index_n(matrix, metric, selected_condensed, n, select_from_n, **kwargs)[source]
Extract the new index to add to the list of selected indices.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
selected_condensed (array-like of shape (n_features,)) – Condensed sum of the selected fingerprints.
n (int) – Number of selected objects.
select_from_n (array-like of shape (n_samples,)) – Array of indices to select from.
sq_selected_condensed (array-like of shape (n_features,), optional) – Condensed sum of the squared selected fingerprints. (**kwargs)
N_atoms (int, optional) – Number of atoms in the system, used for normalization. Use N_atoms=1 for non-Molecular Dynamics datasets. (**kwargs)
- Returns:
index of the new fingerprint to add to the selected indices.
- Return type:
int
- mdance.tools.bts.mean_sq_dev(matrix, N_atoms)[source]
O(N) Mean square deviation (MSD) calculation for n-ary objects.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
N_atoms (int) – Number of atoms in the Molecular Dynamics (MD) system. Use N_atoms=1 for non-MD systems.
- Returns:
normalized MSD value.
- Return type:
float
See also
msd_condensed: Condensed version of MSD calculation for n-ary objects.
extended_comparison: n-ary similarity calculation for all indices/metrics.
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
>>> bts.mean_sq_dev(X, N_atoms=1)
32.8
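The value 32.8 in the example above matches the mean squared Euclidean distance taken over all N² ordered pairs of rows (self-pairs included), divided by N_atoms. The brute-force sketch below illustrates that reading in pure Python. It is an assumption about the normalization inferred from the documented output, not mdance's implementation, and the helper name `msd_pairwise` is hypothetical.

```python
from itertools import product


def msd_pairwise(points, n_atoms=1):
    """Mean squared distance over all ordered pairs (i, j), self-pairs included.

    Brute-force O(N^2) reference; the documented function is O(N).
    """
    n = len(points)
    total = 0.0
    for p, q in product(points, repeat=2):
        # Squared Euclidean distance between rows p and q.
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    # Divide by the number of ordered pairs, then normalize by atom count.
    return total / (n ** 2) / n_atoms


X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]]
print(msd_pairwise(X))  # 32.8
```

The quadratic loop is only a cross-check; it becomes impractical for trajectory-sized inputs, which is the motivation for the O(N) formulation.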
- mdance.tools.bts.msd_condensed(c_sum, sq_sum, N, N_atoms)[source]
Condensed version of Mean square deviation (MSD) calculation for n-ary objects.
- Parameters:
c_sum (array-like of shape (n_features,)) – A feature array of the column-wise sum of the data.
sq_sum (array-like of shape (n_features,)) – A feature array of the column-wise sum of the squared data.
N (int) – Number of data points.
N_atoms (int) – Number of atoms in the Molecular Dynamics (MD) system. Use N_atoms=1 for non-MD systems.
- Returns:
normalized MSD value.
- Return type:
float
See also
mean_sq_dev: Full version of MSD calculation for n-ary objects.
extended_comparison: n-ary similarity calculation for all indices/metrics.
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> c_sum = np.array([21, 22])
>>> sq_sum = np.array([137, 130])
>>> bts.msd_condensed(c_sum, sq_sum, N=5, N_atoms=1)
32.8
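The condensed inputs in the example above are the column-wise sums of the five-point array used in the mean_sq_dev example. The pure-Python sketch below shows how they can be derived and how an expression of the form 2 * sum(N * sq_sum - c_sum**2) / N**2 reproduces 32.8. That expression is an assumption consistent with the documented output rather than a quote of the library code, and both helper names are hypothetical.

```python
def condensed_sums(points):
    """Return (c_sum, sq_sum): column-wise sums of the data and of the squared data."""
    columns = list(zip(*points))
    c_sum = [sum(col) for col in columns]
    sq_sum = [sum(v * v for v in col) for col in columns]
    return c_sum, sq_sum


def msd_from_sums(c_sum, sq_sum, n, n_atoms=1):
    """MSD from condensed sums only (assumed formula, see lead-in)."""
    msd = 2 * sum(n * s - c * c for c, s in zip(c_sum, sq_sum)) / n ** 2
    return msd / n_atoms


X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]]
c_sum, sq_sum = condensed_sums(X)
print(c_sum, sq_sum)                      # [21, 22] [137, 130]
print(msd_from_sums(c_sum, sq_sum, n=5))  # 32.8
```

Because only the two sum vectors and N are needed, adding or removing a row is a matter of adding or subtracting that row (and its square) from the sums, with no second pass over the data.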
- mdance.tools.bts.quota_sampling(data, metric, percentage=10, n_bins=10, hard_cap=True, N_atoms=1, comp_sim=None)[source]
Quota sampling according to complementary similarity values.
Divides the range of comp_sim values into n_bins bins and then uniformly selects the sampled frames, consecutively taking one from each bin.
- Parameters:
data (array-like of shape (n_samples, n_features)) – A feature array.
metric (str) – The metric to use when calculating the distance between n objects in an array.
percentage (int, default=10) – Percentage of objects to sample.
n_bins (int, default=10) – Number of bins to divide the comp_sim range into.
hard_cap (bool, default=True) – Whether to strictly enforce the number of samples.
N_atoms (int, default=1) – Number of atoms in the MD system.
comp_sim (array-like, optional) – Pre-computed complementary similarity values.
- Returns:
Indices of the sampled objects.
- Return type:
numpy.ndarray
- mdance.tools.bts.refine_dis_matrix(matrix)[source]
Refine a distance matrix by setting the diagonal to zero and symmetrizing the matrix.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
- Returns:
A refined 2D matrix.
- Return type:
numpy.ndarray
- mdance.tools.bts.rep_sample(data, metric='MSD', N_atoms=1, n_bins=10, n_samples=100, hard_cap=True)[source]
Representative sampling according to comp_sim values.
Divides the range of comp_sim values into n_bins bins and then uniformly selects n_samples molecules, consecutively taking one from each bin.
- Parameters:
data (array-like of shape (n_samples, n_features)) – The data to be sampled.
metric (str, default='MSD') – The metric to be used for the comparison.
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. Use N_atoms=1 for non-MD systems.
n_bins (int, default=10) – Number of bins to divide the comp_sim values.
n_samples (int, default=100) – Number of samples to be selected.
hard_cap (bool, default=True) – If True, the number of samples will be exactly n_samples. If False, the number of samples may not be exactly n_samples.
- Returns:
sampled_mols – List of indices of the sampled objects in the original data
- Return type:
list
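The binning step described for rep_sample can be illustrated with a small standalone sketch: values are placed into equal-width bins over their range, and indices are then taken one per bin in rotation until the requested count is reached. This is a hypothetical pure-Python illustration of the idea with a strict sample cap. The function name, the equal-width binning, and the lowest-value-first order within each bin are assumptions, not the library's code.

```python
def bin_round_robin(values, n_bins=10, n_samples=100):
    """Pick indices by cycling over equal-width bins of `values`."""
    lo, hi = min(values), max(values)
    # Fall back to a unit width when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    # Visit indices in ascending order of value so each bin is sorted.
    for idx in sorted(range(len(values)), key=lambda i: values[i]):
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        b = min(int((values[idx] - lo) / width), n_bins - 1)
        bins[b].append(idx)
    target = min(n_samples, len(values))
    picked = []
    while len(picked) < target:
        for b in bins:
            if b and len(picked) < target:
                picked.append(b.pop(0))  # take the lowest remaining value in this bin
    return picked


comp_sim = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
print(bin_round_robin(comp_sim, n_bins=2, n_samples=4))  # [0, 3, 1, 4]
```

Cycling over bins rather than sampling uniformly from the sorted values keeps sparsely populated regions of the comp_sim range represented in the sample.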
- mdance.tools.bts.trim_outliers(matrix, n_trimmed, metric, N_atoms, criterion='comp_sim')[source]
O(N) method of trimming a desired percentage of outliers (most dissimilar) from a data matrix through complementary similarity.
- Parameters:
matrix (array-like of shape (n_samples, n_features)) – A feature array.
n_trimmed (float or int) – The desired fraction of outliers to be removed or the number of outliers to be removed. float : Fraction of outliers to be removed. int : Number of outliers to be removed.
metric (str, default='MSD') – The metric to use when calculating the distance between n objects in an array. It must be an option allowed by mdance.tools.bts.extended_comparison().
N_atoms (int, default=1) – Number of atoms in the Molecular Dynamics (MD) system. Use N_atoms=1 for non-MD systems.
criterion ({'comp_sim', 'sim_to_medoid'}, default='comp_sim') – Criterion to use for data trimming. The comp_sim criterion removes the most dissimilar objects based on the complementary similarity. The sim_to_medoid criterion removes the most dissimilar objects based on the similarity to the medoid.
- Returns:
An ndarray with the desired fraction of outliers removed.
- Return type:
numpy.ndarray
Examples
>>> from mdance.tools import bts
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
>>> output = bts.trim_outliers(X, n_trimmed=0.6, metric='MSD', N_atoms=1)
>>> output
array([[2, 3],
       [8, 7],
       [8, 8]])
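One way to read the comp_sim criterion is sketched below in pure Python. For each row, the MSD of the remaining rows is computed from condensed sums, and the rows whose removal leaves the lowest MSD are dropped. The sketch reproduces the output of the example above, but the MSD expression, the rounding down of the trimmed count, and the helper names are assumptions made for illustration, not the library's implementation.

```python
import math


def msd_from_sums(c_sum, sq_sum, n):
    """MSD from column sums and squared-column sums (assumed formula)."""
    return 2 * sum(n * s - c * c for c, s in zip(c_sum, sq_sum)) / n ** 2


def trim_by_comp_sim(points, fraction):
    """Drop the floor(N * fraction) rows with the lowest complementary MSD."""
    n = len(points)
    n_trim = math.floor(n * fraction)
    columns = list(zip(*points))
    c_sum = [sum(col) for col in columns]
    sq_sum = [sum(v * v for v in col) for col in columns]
    comp = []
    for row in points:
        # Condensed sums of the set with this row removed: O(n_features) per row.
        c_rest = [c - v for c, v in zip(c_sum, row)]
        sq_rest = [s - v * v for s, v in zip(sq_sum, row)]
        comp.append(msd_from_sums(c_rest, sq_rest, n - 1))
    # A low complementary MSD means the rest of the set is tight without this row.
    drop = set(sorted(range(n), key=lambda i: comp[i])[:n_trim])
    return [row for i, row in enumerate(points) if i not in drop]


X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
print(trim_by_comp_sim(X, 0.6))  # [[2, 3], [8, 7], [8, 8]]
```

Each complementary value is obtained by subtracting one row from the precomputed sums, so the whole pass stays linear in the number of rows.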
mdance.tools.isim
Miranda Quintana Group - University of Florida
iSIM: instant similarity
Please cite the original paper on iSIM:
López-Pérez, K., Kim, T.D. & Miranda-Quintana, R.A. Digital Discovery 3, 1160–1171 (2024). https://doi.org/10.1039/D4DD00041B
- mdance.tools.isim.calculate_comp_sim(data, n_ary='RR')[source]
Calculate the complementary similarity for RR, JT, or SM
- Parameters:
data (np.ndarray) – Array of arrays, each sub-array contains the binary object
n_objects (int) – Number of objects, only necessary if the column-wise sum is the input data.
n_ary (str) – String with the initials of the desired similarity index to calculate the iSIM from. Only RR, JT, or SM are available. For other indexes use gen_sim_dict.
- Returns:
comp_sims – 1D array with the complementary similarities of all the molecules in the set.
- Return type:
numpy.ndarray
- mdance.tools.isim.calculate_counters(data, n_objects=None, k=1)[source]
Calculate 1-similarity, 0-similarity, and dissimilarity counters
- Parameters:
data (np.ndarray) – Array of arrays, each sub-array contains the binary object OR Array with the column-wise sum, if so specify n_objects.
n_objects (int) – Number of objects, only necessary if the column-wise sum is the input data.
k (int) – Integer indicating the 1/k power used to approximate the average of the similarity values elevated to 1/k.
- Returns:
counters – Dictionary with the weighted and non-weighted counters.
- Return type:
dict
- mdance.tools.isim.calculate_isim(data, n_objects=None, n_ary='RR')[source]
Calculate the iSIM index for RR, JT, or SM
- Parameters:
data (np.ndarray) – Array of arrays, each sub-array contains the binary object OR Array with the columnwise sum, if so specify n_objects
n_objects (int) – Number of objects, only necessary if the columnwise sum is the input data.
n_ary (str) – String with the initials of the desired similarity index to calculate the iSIM from. Only RR, JT, or SM are available. For other indexes use gen_sim_dict.
- Returns:
isim – iSIM index for the specified similarity index.
- Return type:
float
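The idea behind computing iSIM from column-wise sums can be sketched without the library. For a column with k ones among N objects, k(k-1)/2 pairs share a 1, (N-k)(N-k-1)/2 pairs share a 0, and k(N-k) pairs mismatch. The sketch below assumes the RR, JT, and SM definitions built from those pair counts, as in the cited iSIM paper. The function names are hypothetical and this is not the mdance code. For RR the result equals the average of all pairwise RR values, which the second helper checks by brute force.

```python
from itertools import combinations
from math import comb, isclose


def isim_from_column_sums(col_sums, n_objects, n_ary="RR"):
    """iSIM for RR, JT, or SM from column-wise sums of binary fingerprints."""
    a = sum(comb(k, 2) for k in col_sums)              # pairs sharing a 1
    d = sum(comb(n_objects - k, 2) for k in col_sums)  # pairs sharing a 0
    dis = sum(k * (n_objects - k) for k in col_sums)   # mismatched pairs
    total = a + d + dis                                # n_features * C(N, 2)
    if n_ary == "RR":
        return a / total
    if n_ary == "JT":
        return a / (a + dis)
    if n_ary == "SM":
        return (a + d) / total
    raise ValueError("n_ary must be 'RR', 'JT', or 'SM'")


def mean_pairwise_rr(data):
    """Brute-force average of pairwise Russell-Rao similarities, O(N^2)."""
    m = len(data[0])
    sims = [sum(x & y for x, y in zip(p, q)) / m for p, q in combinations(data, 2)]
    return sum(sims) / len(sims)


data = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
col_sums = [sum(col) for col in zip(*data)]  # [3, 2, 2, 0]
assert isclose(isim_from_column_sums(col_sums, 3, "RR"), mean_pairwise_rr(data))
```

For this toy set the pair counts are a = 5, d = 3, and dis = 4, giving RR = 5/12, SM = 8/12, and JT = 5/9.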
mdance.tools.esim
Miranda Quintana Group - University of Florida
eSIM: extended similarity indices
Please cite the original papers on the n-ary indices:
Miranda-Quintana, R.A., Bajusz, D., Rácz, A. & Héberger, K. J Cheminform 13, 32 (2021). https://doi.org/10.1186/s13321-021-00505-3
Miranda-Quintana, R.A., Rácz, A., Bajusz, D. & Héberger, K. J Cheminform 13, 33 (2021). https://doi.org/10.1186/s13321-021-00504-4
- class mdance.tools.esim.SimilarityIndex(data, n_objects=None, c_threshold=None, n_ary='RR', w_factor='fraction', weight='nw', return_dict=False)[source]
Bases: object
O(N) similarity index calculation for a set.
- Parameters:
data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}) – Coincidence threshold.
n_ary (str) – string with the initials of the desired similarity index to calculate the medoid from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
return_dict (bool) – If True, returns a dictionary with all the similarity indices.
- Returns:
The similarity index.
- Return type:
float or dict
- mdance.tools.esim.calc_comp_sim(data, c_threshold=None, n_ary='RR', w_factor='fraction', weight='nw', c_total=None)[source]
O(N) complementary similarity calculation for a set.
- Parameters:
data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}) – Coincidence threshold.
n_ary (str) – string with the initials of the desired similarity index to calculate the medoid from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
- Raises:
ValueError – If the dimensions of the objects and columnwise sum differ.
- Returns:
A list with the complementary similarities of all the molecules in the set.
- Return type:
list
- mdance.tools.esim.calc_medoid(data, n_ary='RR', w_factor='fraction', weight='nw', c_total=None)[source]
O(N) medoid calculation for a set.
- Parameters:
data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_ary (str) – string with the initials of the desired similarity index to calculate the medoid from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
- Raises:
ValueError – If the dimensions of the objects and columnwise sum differ.
- Returns:
The index of the medoid.
- Return type:
int
- mdance.tools.esim.calc_outlier(data, n_ary='RR', w_factor='fraction', weight='nw', c_total=None)[source]
O(N) outlier calculation for a set.
- Parameters:
data (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_ary (str) – string with the initials of the desired similarity index to calculate the medoid from.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
weight (str) – Type of weight function that will be used.
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
- Raises:
ValueError – If the dimensions of the objects and columnwise sum differ.
- Returns:
The index of the outlier.
- Return type:
int
- mdance.tools.esim.calculate_counters(c_total, n_objects, c_threshold=None, w_factor='fraction')[source]
Calculate 1-similarity, 0-similarity, and dissimilarity counters.
- Parameters:
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}, default=None) – Coincidence threshold.
None: default, c_threshold = n_objects % 2.
dissimilar: c_threshold = ceil(n_objects / 2).
int: integer number < n_objects.
float: real number in the (0, 1) interval. Indicates the % of the total data that will serve as threshold.
w_factor ({'fraction', 'power_n'}, default='fraction') – Type of weight function that will be used.
fraction: similarity = d[k]/n, dissimilarity = 1 - (d[k] - n_objects % 2)/n_objects.
power_n: similarity = n**-(n_objects - d[k]), dissimilarity = n**-(d[k] - n_objects % 2).
other values: similarity = dissimilarity = 1.
- Returns:
counters – Dictionary with the weighted and non-weighted counters.
- Return type:
dict
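How a coincidence threshold separates columns into the three counter types can be sketched as follows. With k ones in a column of n_objects entries, the sketch assumes the column counts as 1-similar when 2k - n_objects exceeds the threshold, as 0-similar when n_objects - 2k exceeds it, and as dissimilar otherwise. Only the default threshold (n_objects % 2) and explicit integer thresholds are handled, and weights are left out. The function name and dictionary keys are hypothetical and do not reflect the keys returned by the library.

```python
def classify_columns(c_total, n_objects, c_threshold=None):
    """Count 1-similar, 0-similar, and dissimilar columns (non-weighted)."""
    if c_threshold is None:
        c_threshold = n_objects % 2  # documented default
    counters = {"a": 0, "d": 0, "dis": 0}
    for k in c_total:
        delta = 2 * k - n_objects  # (number of ones) minus (number of zeros)
        if delta > c_threshold:
            counters["a"] += 1     # ones dominate: 1-similarity
        elif -delta > c_threshold:
            counters["d"] += 1     # zeros dominate: 0-similarity
        else:
            counters["dis"] += 1   # neither side dominates enough
    return counters


print(classify_columns([3, 2, 2, 0], n_objects=3))
# {'a': 1, 'd': 1, 'dis': 2}
```

Raising the threshold moves columns from the two similarity counters into the dissimilarity counter, which is why the 'dissimilar' setting of ceil(n_objects / 2) is the stricter choice.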
- mdance.tools.esim.gen_sim_dict(c_total, n_objects, c_threshold=None, w_factor='fraction')[source]
Generate a dictionary with the similarity indices.
Valid values for metric are:
MSD: Mean Square Deviation.
Extended or Instant Similarity Metrics:
AC: Austin-Colwell, BUB: Baroni-Urbani-Buser, CTn: Consonni-Todeschini n, Fai: Faith, Gle: Gleason, Ja: Jaccard, Ja0: Jaccard 0-variant, JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russell-Rao, SM: Sokal-Michener, SSn: Sokal-Sneath n.
- Parameters:
c_total (array-like of shape (n_objects, n_features)) – Vector containing the sums of each column of the fingerprint matrix.
n_objects (int) – Number of objects to be compared.
c_threshold ({None, 'dissimilar', int}) – Coincidence threshold.
w_factor ({'fraction', 'power_n'}) – Type of weight function that will be used.
- Returns:
Dictionary with the similarity indices.
- Return type:
dict