kmeans

Run K Means Clustering and Principal Component Analysis

This module contains functions to run K Means Clustering on SSF results and visualize the clusters with barplots, silhouette analysis, and PCA scatterplots.

Functions

create_kmeans_input(data_arrays)

Blinds SSF Data (removes trajectory labels) for input to K Means

plot_cluster_trj_data(cluster_file, outfile)

Plots the output of run_kmeans() to a PNG file.

plot_pca(blinded_data, dataset_names[, ...])

Creates PCA Plot to compare systems in 2D

plot_silhouette(n_clusters, blind_data[, outdir])

Creates Silhouette plots to determine the best number of clusters

read_and_preprocess_data()

Reads and preprocesses SSF data for K Means analysis per dataset.

run_kmeans(data_arrays, n_clusters[, ...])

Performs KMeans clustering on blinded SSF data and saves the results.

read_and_preprocess_data((file1, file2, ...))

Reads and preprocesses SSF data for K Means analysis per dataset.

Reads SSF data from text files for each dataset, decompresses the data, and attaches each trajectory to its frame-wise SSF results. Each frame's SSF matrix is flattened, so a trajectory of 3200 frames over 127 residues is stored as a 3200 x 16129 array (frames x residue-residue pairs) rather than a 3200 x 127 x 127 array. For example, a 2-residue, 2-frame SSF of

[ [[1, 2], [3, 4]],

[[5, 6], [7, 8]] ]

is flattened to:

[[1, 2, 3, 4], [5, 6, 7, 8]]
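The flattening step can be reproduced with a plain NumPy reshape. This is a minimal sketch of the idea using the toy values above, not the library's own code; the variable names are illustrative only.

```python
import numpy as np

# Toy 2-frame SSF with 2 residues per side (same values as the example above).
ssf = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
])

n_frames = ssf.shape[0]
# Collapse each res x res matrix into one row of res*res pair values.
flattened = ssf.reshape(n_frames, -1)

print(flattened.tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(flattened.shape)     # (2, 4)
```

The same reshape turns a 3200 x 127 x 127 array into a 3200 x 16129 array.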

Parameters:
file1, file2, … : list of str

List of filenames to read and preprocess, as output by -s ssf -d output.txt. Each should be in the format {datapath}/{traj_name}.txt.gz

Returns:
data_arrays : dict

Dictionary where keys are dataset names and values are the processed data arrays.

See also

create_kmeans_input

Stacks SSF data into a single 2D Numpy array.

Examples

>>> import stacker as st
>>> dataset_names = ['testing/5JUP_N2_tGGG_aCCU_+1GCU.txt.gz', 'testing/5JUP_N2_tGGG_aCCU_+1CGU.txt.gz']  # 3200 frames, SSFs of 127 x 127 residues
>>> data_arrays = st.read_and_preprocess_data(dataset_names)
>>> print(data_arrays['dataset1'].shape)
(3200, 16129)
create_kmeans_input(data_arrays)

Blinds SSF Data (removes trajectory labels) for input to K Means

Stacks the SSF data from all frames of all trajectories into a single 2D NumPy array, without a trajectory label on any frame. The result is used as input to KMeans clustering.

Parameters:
data_arrays : dict

Output of read_and_preprocess_data(). Dictionary where keys are dataset names and values are the processed data arrays.

Returns:
blinded_data : np.typing.ArrayLike

A 2D numpy array containing all frames stacked together.

See also

read_and_preprocess_data

Reads and preprocesses data for each dataset

Examples

>>> import numpy as np
>>> import stacker as st
>>> data_arrays = {
...     'dataset1': np.random.rand(3200, 16129),
...     'dataset2': np.random.rand(3200, 16129)
... }
>>> kmeans_input = st.create_kmeans_input(data_arrays)
>>> print(kmeans_input.shape)
(6400, 16129)
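Conceptually, blinding amounts to stacking the per-dataset arrays row-wise and dropping the dictionary keys. The sketch below assumes the frames are stacked in dictionary insertion order; it illustrates the idea with small stand-in arrays and is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-in arrays: 5 and 3 frames of 4 residue-pair values each.
data_arrays = {
    "dataset1": rng.random((5, 4)),
    "dataset2": rng.random((3, 4)),
}

# Stack all frames row-wise in dict insertion order; the dataset keys are dropped.
blinded = np.vstack(list(data_arrays.values()))
print(blinded.shape)  # (8, 4)
```

Because the row order follows the dictionary order, the dataset of origin for each row can still be recovered later from the per-dataset frame counts.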
run_kmeans(data_arrays, n_clusters, max_iter=1000, n_init=20, random_state=1, outdir='')

Performs KMeans clustering on blinded SSF data and saves the results.

This function applies the KMeans clustering algorithm to the blinded SSF data, assigns each frame to a cluster, and counts the number of frames in each cluster for each dataset. The results are printed and, if outdir is set, saved to a file.

Parameters:
data_arrays : dict

Output of read_and_preprocess_data(). Dictionary where keys are dataset names and values are the processed data arrays.

n_clusters : int

The number of clusters to form

max_iter : int, default=1000

Maximum number of iterations of the k-means algorithm for a single run.

n_init : int, default=20

Number of times the k-means algorithm will be run with different centroid seeds.

random_state : int, default=1

Determines random number generation for centroid initialization.

outdir : str, default=''

Directory to save the clustering results. If empty, just prints to standard output.

Returns:
np.ndarray

The labels of the clusters for each frame.

See also

create_kmeans_input

Blinds SSF data for input to K Means

read_and_preprocess_data

Reads and preprocesses SSF data for K Means analysis per dataset

Examples

>>> import numpy as np
>>> import stacker as st
>>> data_arrays = {
...     'dataset1': np.random.rand(3200, 16129),
...     'dataset2': np.random.rand(3200, 16129)
... }
>>> st.run_kmeans(data_arrays, n_clusters=4)
Reading data: dataset1
Reading data: dataset2
(6400, 16129)
{'dataset1': array([800, 800, 800, 800]), 'dataset2': array([800, 800, 800, 800])}
Dataset: dataset1
    Cluster 1: 800 matrices
    Cluster 2: 800 matrices
    Cluster 3: 800 matrices
    Cluster 4: 800 matrices
Dataset: dataset2
    Cluster 1: 800 matrices
    Cluster 2: 800 matrices
    Cluster 3: 800 matrices
    Cluster 4: 800 matrices
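The clustering and per-dataset counting described above can be sketched directly with scikit-learn. This is an illustration under stated assumptions, not the body of run_kmeans(): it uses small synthetic arrays, assumes frames are stacked in dictionary order, and prints counts in the same layout as the output shown above. The KMeans arguments mirror the documented defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic datasets centered at different points so clusters are well separated.
data_arrays = {
    "dataset1": rng.normal(loc=0.0, scale=0.1, size=(30, 4)),
    "dataset2": rng.normal(loc=5.0, scale=0.1, size=(20, 4)),
}
n_clusters = 2

blinded = np.vstack(list(data_arrays.values()))
kmeans = KMeans(n_clusters=n_clusters, max_iter=1000, n_init=20, random_state=1)
labels = kmeans.fit_predict(blinded)  # one cluster label per frame

# Walk through the label vector dataset by dataset and count frames per cluster.
counts = {}
start = 0
for name, arr in data_arrays.items():
    stop = start + len(arr)
    counts[name] = np.bincount(labels[start:stop], minlength=n_clusters)
    start = stop

for name, per_cluster in counts.items():
    print(f"Dataset: {name}")
    for i, n in enumerate(per_cluster, start=1):
        print(f"    Cluster {i}: {n} matrices")
```

With datasets this well separated, every frame of a dataset lands in a single cluster. Real SSF data will usually split each trajectory across several clusters, which is what the per-dataset counts are meant to reveal.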
plot_cluster_trj_data(cluster_file, outfile, x_labels_map=None)

Plots the output of run_kmeans() to a PNG file.

Creates a grouped bar plot of the number of frames from each trajectory in each cluster following KMeans clustering. Writes the plot output to a PNG file.

Parameters:
cluster_file : str

Path to clustering results written by run_kmeans()

outfile : str

Filepath where the plot PNG file will be saved.

x_labels_map : dict, optional

Dictionary to remap x labels. Keys are original labels and values are new labels.

Returns:
None

Examples

This reads the clustering results from 'clustering_results.txt', creates a grouped bar plot, and saves it as '../testing/kmeans_plot.png'.

>>> import stacker as st
>>> st.plot_cluster_trj_data('clustering_results.txt', "../testing/kmeans_plot.png", {'5JUP_N2_tGGG_aCCU_+1CGU_data': 'tGGG_aCCU_+1CGU', '5JUP_N2_tGGG_aCCU_+1GCU_data': 'tGGG_aCCU_+1GCU'})
plot_silhouette(n_clusters, blind_data, outdir='')

Creates Silhouette plots to determine the best number of clusters

Parameters:
n_clusters : int

The number of clusters to form.

blind_data : np.typing.ArrayLike

A 2D numpy array containing all frames stacked together. Output of create_kmeans_input()

outdir : str, default=''

Directory where the plot PNG file will be saved.
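Silhouette analysis compares candidate cluster counts by how tightly each frame sits in its own cluster relative to the nearest other cluster. The sketch below computes only the mean silhouette score per candidate, using scikit-learn's silhouette_score on synthetic data; it does not reproduce the plots this function draws, and the data and the range of candidates are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for blinded SSF data: three well-separated groups of frames.
blind_data = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(20, 4))
    for center in (0.0, 5.0, 10.0)
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=20, random_state=1).fit_predict(blind_data)
    # Mean silhouette coefficient over all frames; values near 1 mean tight, separated clusters.
    scores[k] = silhouette_score(blind_data, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"n_clusters={k}: mean silhouette = {score:.3f}")
print("best:", best_k)
```

The candidate with the highest mean score is a reasonable starting value for n_clusters, though the per-cluster silhouette shapes in the plots are worth inspecting as well.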

plot_pca(blinded_data, dataset_names, coloring='dataset', outdir='', cluster_labels=None, new_dataset_names=None)

Creates PCA Plot to compare systems in 2D

Creates a PCA plot that can be colored by the KMeans clustering result or by dataset. Compares SSFs similarly to K Means.

Parameters:
blinded_data : np.ndarray

A 2D numpy array containing all frames stacked together. Output of create_kmeans_input()

dataset_names : list of str

List of dataset filenames, as output by stacker -s ssf -d output.txt.gz. Each should be in the format {datapath}/{traj_name}.txt.gz

coloring : {'dataset', 'kmeans', 'facet'}

Method to color the points on the scatterplot. Options:

- dataset: Plot all points on the same scatterplot and color by dataset of origin.
- kmeans: Plot all points on the same scatterplot and color by KMeans cluster with n_clusters.
- facet: Same as dataset, but plot each dataset on a different coordinate grid.

outdir : str, default=''

Directory to save the PCA plot.

cluster_labels : np.ndarray, optional

The cluster label for each frame, as returned by run_kmeans(). Used when coloring='kmeans' to color points by cluster.

new_dataset_names : dict, optional

Dictionary to remap dataset names. Keys are original filenames in dataset_names and values are shortened names.

Returns:
None

See also

create_kmeans_input

Blinds SSF data for input to K Means

read_and_preprocess_data

Reads and preprocesses SSF data for K Means analysis per dataset

sklearn.decomposition.PCA

Runs PCA
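The projection behind a 2D PCA plot can be sketched with sklearn.decomposition.PCA. This example only computes the coordinates and a per-row dataset label, which is the information needed to color points by dataset; it draws nothing and is not the body of plot_pca(). The synthetic arrays and the assumption that rows are stacked in dictionary order are for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-ins for two flattened SSF datasets (frames x residue pairs).
data_arrays = {
    "dataset1": rng.normal(loc=0.0, scale=1.0, size=(30, 6)),
    "dataset2": rng.normal(loc=3.0, scale=1.0, size=(20, 6)),
}
blinded_data = np.vstack(list(data_arrays.values()))

# Project every frame onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(blinded_data)

# Recover the dataset of origin for each row so points can be colored by dataset.
origin = np.concatenate([
    np.full(len(arr), name) for name, arr in data_arrays.items()
])

for name in data_arrays:
    pc1_mean = coords[origin == name, 0].mean()
    print(f"{name}: mean PC1 = {pc1_mean:.2f}")
print(pca.explained_variance_ratio_)
```

Passing coords[:, 0] and coords[:, 1] to a scatter plot, grouped by origin or by the KMeans labels, gives the 'dataset' and 'kmeans' views respectively.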