kmeans

Run K Means Clustering and Principal Component Analysis

This module contains functions to run K Means Clustering on SSF results and visualize the clusters with barplots, silhouette analysis, and PCA scatterplots.

Functions

`create_kmeans_input`(data_arrays)	Blinds SSF Data (removes trajectory labels) for input to K Means
`plot_cluster_trj_data`(cluster_file, outfile)	Plots the output of run_kmeans() to a PNG file.
`plot_pca`(blinded_data, dataset_names[, ...])	Creates PCA Plot to compare systems in 2D
`plot_silhouette`(n_clusters, blind_data[, outdir])	Creates Silhouette plots to determine the best number of clusters
`read_and_preprocess_data`()	Reads and preprocesses SSF data for K Means analysis per dataset.
`run_kmeans`(data_arrays, n_clusters[, ...])	Performs KMeans clustering on blinded SSF data and saves the results.

read_and_preprocess_data((file1, file2, ...))

Reads and preprocesses SSF data for K Means analysis per dataset.

Reads SSF data from compressed text files for each dataset, decompresses the data, and attaches each trajectory to its frame-wise SSF results. Each frame's SSF matrix is flattened into a single row, so a 3200-frame trajectory of 127 x 127 residue matrices is stored as a 3200 x 16129 array (one column per residue-residue pair) rather than a 3200 x 127 x 127 array. For example, a 2-residue, 2-frame SSF of

[ [[1, 2], [3, 4]],
  [[5, 6], [7, 8]] ]

is flattened to:

[[1, 2, 3, 4], [5, 6, 7, 8]]
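
The flattening described above corresponds to a single NumPy reshape. The following is a minimal sketch of that idea, not the library's implementation; the variable names are illustrative.

```python
import numpy as np

# Two frames of a 2-residue SSF: shape (n_frames, n_res, n_res) = (2, 2, 2)
ssf = np.array([[[1, 2], [3, 4]],
                [[5, 6], [7, 8]]])

n_frames = ssf.shape[0]
# Collapse each n_res x n_res matrix into one row of n_res * n_res values
flattened = ssf.reshape(n_frames, -1)

print(flattened.tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With 3200 frames of 127 x 127 matrices, the same reshape would give the (3200, 16129) shape quoted above.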

Parameters:

file1, file2, … (list of str): List of filenames to read and preprocess. Output of stacker -s ssf -d output.txt.gz. Should be in the format {datapath}/{traj_name}.txt.gz

Returns:

data_arrays (dict): Dictionary where keys are dataset names and values are the processed data arrays.

See also

create_kmeans_input: Stacks SSF data into a single 2D Numpy array.

Examples

>>> import stacker as st
>>> dataset_names = ['testing/5JUP_N2_tGGG_aCCU_+1GCU.txt.gz', 'testing/5JUP_N2_tGGG_aCCU_+1CGU.txt.gz']  # 3200 frames, SSFs of 127 x 127 residues
>>> data_arrays = st.read_and_preprocess_data(dataset_names)
>>> print(data_arrays['dataset1'].shape)
(3200, 16129)

create_kmeans_input(data_arrays)

Blinds SSF Data (removes trajectory labels) for input to K Means

Stacks the SSF data from all frames of all trajectories into a single 2D NumPy array, dropping the per-frame trajectory labels. The result is used as input to KMeans clustering.

Parameters:

data_arrays (dict): Output of read_and_preprocess_data(). Dictionary where keys are dataset names and values are the processed data arrays.

Returns:

blinded_data (np.typing.ArrayLike): A 2D numpy array containing all frames stacked together.

See also

read_and_preprocess_data: Reads and preprocesses data for each dataset

Examples

>>> import numpy as np
>>> import stacker as st
>>> data_arrays = {
...     'dataset1': np.random.rand(3200, 16129),
...     'dataset2': np.random.rand(3200, 16129)
... }
>>> kmeans_input = st.create_kmeans_input(data_arrays)
>>> print(kmeans_input.shape)
(6400, 16129)
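
For intuition, the stacking step can be reproduced with plain NumPy. This is a sketch under the assumption that datasets are stacked row-wise in dictionary order; whether create_kmeans_input does exactly this internally is not confirmed here, and the small array sizes are stand-ins for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-ins for the real (3200, 16129) arrays
data_arrays = {
    'dataset1': rng.random((5, 4)),
    'dataset2': rng.random((3, 4)),
}

# Stack every dataset's frames row-wise; the dataset labels are dropped
blinded = np.vstack(list(data_arrays.values()))

print(blinded.shape)  # (8, 4)
```

The row count is the total number of frames across datasets, and the column count is unchanged.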

run_kmeans(data_arrays, n_clusters, max_iter=1000, n_init=20, random_state=1, outdir='')

Performs KMeans clustering on blinded SSF data and saves the results.

This function applies the KMeans clustering algorithm to the blinded SSF data built from data_arrays, assigns each frame to a cluster, and counts the number of frames in each cluster for each dataset. The results are printed and, if outdir is set, saved to a file.

Parameters:

data_arrays (dict): Output of read_and_preprocess_data(). Dictionary where keys are dataset names and values are the processed data arrays.
n_clusters (int): The number of clusters to form.
max_iter (int, default=1000): Maximum number of iterations of the k-means algorithm for a single run.
n_init (int, default=20): Number of times the k-means algorithm will be run with different centroid seeds.
random_state (int, default=1): Determines random number generation for centroid initialization.
outdir (str, default=''): Directory to save the clustering results. If empty, just prints to standard output.

Returns:

np.ndarray: The labels of the clusters for each frame.

See also

create_kmeans_input: blinds SSF Data for input to K Means
read_and_preprocess_data: reads and preprocesses SSF data for K Means analysis per dataset

Examples

>>> import numpy as np
>>> import stacker as st
>>> data_arrays = {
...     'dataset1': np.random.rand(3200, 16129),
...     'dataset2': np.random.rand(3200, 16129)
... }
>>> st.run_kmeans(data_arrays, n_clusters=4)
Reading data: dataset1
Reading data: dataset2
(6400, 16129)
{'dataset1': array([800, 800, 800, 800]), 'dataset2': array([800, 800, 800, 800])}
Dataset: dataset1
    Cluster 1: 800 matrices
    Cluster 2: 800 matrices
    Cluster 3: 800 matrices
    Cluster 4: 800 matrices
Dataset: dataset2
    Cluster 1: 800 matrices
    Cluster 2: 800 matrices
    Cluster 3: 800 matrices
    Cluster 4: 800 matrices
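
The per-dataset counts in this output can be reproduced outside the library. The sketch below assumes scikit-learn's KMeans (the parameter names above match its interface) and assumes frames are stacked in dictionary order; the array sizes are small placeholders, and this is not run_kmeans itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Small stand-ins for the real per-dataset SSF arrays
data_arrays = {
    'dataset1': rng.random((40, 6)),
    'dataset2': rng.random((60, 6)),
}
n_clusters = 4

# Blind the data, then assign every frame to a cluster
blinded = np.vstack(list(data_arrays.values()))
kmeans = KMeans(n_clusters=n_clusters, max_iter=1000, n_init=20, random_state=1)
labels = kmeans.fit_predict(blinded)

# Split the labels back per dataset by row offset and count frames per cluster
counts = {}
start = 0
for name, arr in data_arrays.items():
    stop = start + len(arr)
    counts[name] = np.bincount(labels[start:stop], minlength=n_clusters)
    start = stop

for name, per_cluster in counts.items():
    print(f"Dataset: {name}")
    for i, n in enumerate(per_cluster, start=1):
        print(f"    Cluster {i}: {n} matrices")
```

Because the labels are split by row offset, each dataset's counts sum to its own number of frames.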

plot_cluster_trj_data(cluster_file, outfile, x_labels_map=None)

Plots the output of run_kmeans() to a PNG file.

Creates a grouped bar plot of the number of frames from each trajectory in each cluster following KMeans clustering. Writes the plot output to a PNG file.

Parameters:

cluster_file (str): Path to clustering results written by run_kmeans()
outfile (str): Filepath where the plot PNG file will be saved.
x_labels_map (dict, optional): Dictionary to remap x labels. Keys are original labels and values are new labels.

Returns:

None

Examples

This will read the clustering results from 'clustering_results.txt', create a bar plot with the remapped dataset labels on the x axis, and save it to the path given as outfile ('../testing/kmeans_plot.png').

>>> import stacker as st
>>> st.plot_cluster_trj_data('clustering_results.txt', "../testing/kmeans_plot.png", {'5JUP_N2_tGGG_aCCU_+1CGU_data': 'tGGG_aCCU_+1CGU', '5JUP_N2_tGGG_aCCU_+1GCU_data': 'tGGG_aCCU_+1GCU'})
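
A grouped bar plot of this kind can be drawn directly with matplotlib. The sketch below is illustrative only: the counts are hypothetical, the on-disk format of the clustering results file is not reproduced, and the layout choices (one group per dataset, one bar per cluster) are assumptions rather than the library's exact styling.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical frames-per-cluster counts for two datasets
counts = {
    'tGGG_aCCU_+1CGU': [1200, 900, 700, 400],
    'tGGG_aCCU_+1GCU': [500, 1100, 600, 1000],
}
n_clusters = 4
x = np.arange(len(counts))   # one group per dataset
width = 0.8 / n_clusters     # width of each bar within a group

fig, ax = plt.subplots()
for c in range(n_clusters):
    heights = [counts[name][c] for name in counts]
    ax.bar(x + c * width, heights, width, label=f"Cluster {c + 1}")

# Center the tick under each group and label it with the dataset name
ax.set_xticks(x + width * (n_clusters - 1) / 2)
ax.set_xticklabels(list(counts))
ax.set_ylabel("Number of frames")
ax.legend()

outfile = os.path.join(tempfile.mkdtemp(), "kmeans_plot.png")
fig.savefig(outfile)
plt.close(fig)
```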

plot_silhouette(n_clusters, blind_data, outdir='')

Creates Silhouette plots to determine the best number of clusters

Parameters:

n_clusters (int): The number of clusters to form.
blind_data (np.typing.ArrayLike): A 2D numpy array containing all frames stacked together. Output of create_kmeans_input()
outdir (str, default=''): Directory where the plot PNG file will be saved.
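
As a rough illustration of how silhouette analysis helps choose the number of clusters, the sketch below compares mean silhouette scores across several candidate values using scikit-learn. It uses synthetic, well-separated blobs in place of real SSF data and only computes scores; it does not reproduce the plots that plot_silhouette draws.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs standing in for blinded SSF data
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
blind_data = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

# Mean silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(blind_data)
    scores[k] = silhouette_score(blind_data, labels)

# The candidate with the highest mean score is the best-supported choice
best_k = max(scores, key=scores.get)
print(best_k)
```

A higher mean silhouette score indicates tighter, better-separated clusters, so the candidate with the highest score is a reasonable starting point for n_clusters.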

plot_pca(blinded_data, dataset_names, coloring='dataset', outdir='', cluster_labels=None, new_dataset_names=None)

Creates PCA Plot to compare systems in 2D

Creates a PCA plot that can be colored by the KMeans clustering result or by dataset. Compares SSFs similarly to K Means.

Parameters:

blinded_data (np.ndarray): A 2D numpy array containing all frames stacked together. Output of create_kmeans_input()
dataset_names (list of str): List of filenames to read and preprocess. Output of stacker -s ssf -d output.txt.gz. Should be in the format {datapath}/{traj_name}.txt.gz
coloring ({'dataset', 'kmeans', 'facet'}): Method to color the points on the scatterplot. Options:
- dataset: Plot all points on the same scatterplot and color by dataset of origin.
- kmeans: Plot all points on the same scatterplot and color by KMeans cluster with n_clusters.
- facet: Same as dataset, but plot each dataset on a different coordinate grid.
outdir (str, default=''): Directory to save the clustering results.
cluster_labels (np.ndarray, optional): The labels of the clusters for each frame, output from run_kmeans. Used if coloring = 'kmeans' to color points by cluster.
new_dataset_names (dict, optional): Dictionary to remap dataset names. Keys are original filenames in dataset_names and values are shortened names.

Returns:

None

See also

create_kmeans_input: blinds SSF Data for input to K Means
read_and_preprocess_data: reads and preprocesses SSF data for K Means analysis per dataset
sklearn.decomposition.PCA: Runs PCA
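
The 2D projection behind such a plot can be sketched with sklearn.decomposition.PCA. The example below uses random placeholder data and a hypothetical frames_per_dataset list to recover each frame's dataset of origin for coloring; it shows the projection step only and is not plot_pca's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder for blinded SSF data: 20 frames, 9 flattened features
blinded_data = rng.random((20, 9))

# Hypothetical split: first 12 frames from one dataset, last 8 from another
frames_per_dataset = [12, 8]
origin = np.repeat(np.arange(len(frames_per_dataset)), frames_per_dataset)

# Project every frame onto the first two principal components
pca = PCA(n_components=2)
projected = pca.fit_transform(blinded_data)

print(projected.shape)  # (20, 2)
```

Each row of projected gives the x and y coordinates of one frame, and origin (or the cluster labels from run_kmeans) can serve as the color key.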
