kmeans
Run K Means Clustering and Principal Component Analysis
This module contains functions to run K Means Clustering on SSF results and visualize the clusters with barplots, silhouette analysis, and PCA scatterplots.
Functions
create_kmeans_input | Blinds SSF Data (removes trajectory labels) for input to K Means
plot_cluster_trj_data | Plots the output of run_kmeans() to a PNG file.
plot_pca | Creates PCA Plot to compare systems in 2D
plot_silhouette | Creates Silhouette plots to determine the best number of clusters
read_and_preprocess_data | Reads and preprocesses SSF data for K Means analysis per dataset.
run_kmeans | Performs KMeans clustering on blinded SSF data and saves the results.
- read_and_preprocess_data((file1, file2, ...))
Reads and preprocesses SSF data for K Means analysis per dataset.
Reads SSF data from txt files for each dataset, decompresses the data, and attaches each Trajectory to its frame-wise SSF results. The values are flattened SSF lists, so a trajectory with 3200 frames and 127 residues is stored as 3200 frames x 16129 residue-residue pairs (127 x 127 = 16129) rather than 3200 frames x 127 residues x 127 residues. For example, a 2-residue, 2-frame SSF of
[ [[1, 2], [3, 4]],
[[5, 6], [7, 8]] ]
is flattened to:
[[1, 2, 3, 4], [5, 6, 7, 8]]
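The flattening described above can be reproduced with a plain NumPy reshape. This is only an illustrative sketch of the layout; whether the function itself uses `reshape` internally is an assumption, and the variable names `ssf` and `flattened` are hypothetical.

```python
import numpy as np

# Toy SSF: 2 frames, each a 2 x 2 residue-by-residue matrix
ssf = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
])

n_frames = ssf.shape[0]
# Collapse the two residue axes into a single axis of residue-residue pairs
flattened = ssf.reshape(n_frames, -1)

print(flattened.shape)     # (2, 4)
print(flattened.tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Each row of `flattened` is one frame, which is the layout the clustering functions below expect.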
- Parameters:
- file1, file2, … : list of str
List of filenames to read and preprocess. Outputted from stacker -s ssf -d output.txt. Should be in the format {datapath}/{traj_name}.txt.gz
- Returns:
- data_arrays : dict
Dictionary where keys are dataset names and values are the processed data arrays.
See also
create_kmeans_input
Stacks SSF data into a single 2D Numpy array.
Examples
>>> import stacker as st
>>> dataset_names = ['testing/5JUP_N2_tGGG_aCCU_+1GCU.txt.gz', 'testing/5JUP_N2_tGGG_aCCU_+1CGU.txt.gz']  # 3200 frames, SSFs of 127 x 127 residues
>>> data_arrays = st.read_and_preprocess_data(dataset_names)
>>> print(data_arrays['dataset1'].shape)
(3200, 16129)
- create_kmeans_input(data_arrays)
Blinds SSF Data (removes trajectory labels) for input to K Means
Stacks the SSF data from all frames of all trajectories into a single 2D numpy array, without a trajectory label on any frame. The result is used as input to KMeans clustering.
- Parameters:
- data_arrays : dict
Output of read_and_preprocess_data(). Dictionary where keys are dataset names and values are the processed data arrays.
- Returns:
- blinded_data : np.typing.ArrayLike
A 2D numpy array containing all frames stacked together.
See also
read_and_preprocess_data
Reads and preprocesses data for each dataset
Examples
>>> import numpy as np
>>> import stacker as st
>>> data_arrays = {
...     'dataset1': np.random.rand(3200, 16129),
...     'dataset2': np.random.rand(3200, 16129)
... }
>>> kmeans_input = st.create_kmeans_input(data_arrays)
>>> print(kmeans_input.shape)
(6400, 16129)
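As a rough illustration of what "blinding" means here, the sketch below stacks small per-dataset arrays row-wise with NumPy so that only the frames remain. It assumes `np.vstack` as the stacking step, which the documentation does not confirm, and uses tiny arrays in place of real SSF data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-ins for per-dataset arrays: (frames, residue-residue pairs)
data_arrays = {
    "dataset1": rng.random((5, 9)),
    "dataset2": rng.random((7, 9)),
}

# Stack the frames of every dataset row-wise; the dataset labels are dropped
blinded = np.vstack(list(data_arrays.values()))
print(blinded.shape)  # (12, 9)
```

Because dictionaries preserve insertion order, the first 5 rows of `blinded` come from `dataset1` and the next 7 from `dataset2`, which is what allows cluster labels to be mapped back to datasets later.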
- run_kmeans(data_arrays, n_clusters, max_iter=1000, n_init=20, random_state=1, outdir='')
Performs KMeans clustering on blinded SSF data and saves the results.
This function applies the KMeans clustering algorithm to the provided blinded SSF data, assigns each frame to a cluster, and counts the number of frames in each cluster for each dataset. The results are printed and saved to a file.
- Parameters:
- data_arrays : dict
Output of read_and_preprocess_data(). Dictionary where keys are dataset names and values are the processed data arrays.
- n_clusters : int
The number of clusters to form.
- max_iter : int, default=1000
Maximum number of iterations of the k-means algorithm for a single run.
- n_init : int, default=20
Number of times the k-means algorithm will be run with different centroid seeds.
- random_state : int, default=1
Determines random number generation for centroid initialization.
- outdir : str, default=''
Directory to save the clustering results. If empty, just prints to standard output.
- Returns:
- np.ndarray
The labels of the clusters for each frame.
See also
create_kmeans_input
blinds SSF Data for input to K Means
read_and_preprocess_data
reads and preprocesses SSF data for K Means analysis per dataset
Examples
>>> import numpy as np
>>> import stacker as st
>>> data_arrays = {
...     'dataset1': np.random.rand(3200, 16129),
...     'dataset2': np.random.rand(3200, 16129)
... }
>>> st.run_kmeans(data_arrays, n_clusters=4)
Reading data: dataset1
Reading data: dataset2
(6400, 16129)
{'dataset1': array([800, 800, 800, 800]), 'dataset2': array([800, 800, 800, 800])}
Dataset: dataset1
Cluster 1: 800 matrices
Cluster 2: 800 matrices
Cluster 3: 800 matrices
Cluster 4: 800 matrices
Dataset: dataset2
Cluster 1: 800 matrices
Cluster 2: 800 matrices
Cluster 3: 800 matrices
Cluster 4: 800 matrices
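The clustering-and-counting step described above could look roughly like the following. This is a sketch under assumptions, not the function's actual implementation: it assumes scikit-learn's `KMeans` as the clustering backend (the parameter names `max_iter`, `n_init`, and `random_state` match that class) and recovers per-dataset counts by slicing the labels in stacking order.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Small stand-ins for per-dataset arrays: (frames, residue-residue pairs)
data_arrays = {
    "dataset1": rng.random((40, 9)),
    "dataset2": rng.random((60, 9)),
}
blinded = np.vstack(list(data_arrays.values()))

n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, max_iter=1000, n_init=20, random_state=1)
labels = kmeans.fit_predict(blinded)  # one cluster label per frame

# Map labels back to datasets by slicing them in the order frames were stacked
counts = {}
start = 0
for name, arr in data_arrays.items():
    stop = start + len(arr)
    counts[name] = np.bincount(labels[start:stop], minlength=n_clusters)
    start = stop

for name, per_cluster in counts.items():
    print(name, per_cluster)
```

Clustering on blinded data and only afterwards re-attaching dataset names keeps the cluster assignment independent of which trajectory a frame came from.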
- plot_cluster_trj_data(cluster_file, outfile, x_labels_map=None)
Plots the output of run_kmeans() to a PNG file.
Creates a grouped bar plot of the number of frames from each trajectory in each cluster following KMeans clustering. Writes the plot output to a PNG file.
- Parameters:
- cluster_file : str
Path to clustering results written by run_kmeans()
- outfile : str
Filepath where the plot PNG file will be saved.
- x_labels_map : dict, optional
Dictionary to remap x labels. Keys are original labels and values are new labels.
- Returns:
- None
Examples
This will read the clustering results from ‘clustering_results.txt’, create a bar plot, and save it as ‘kmeans_plot.cluster_4.png’ in the specified output directory.
>>> import stacker as st
>>> st.plot_cluster_trj_data(
...     'clustering_results.txt',
...     "../testing/kmeans_plot.png",
...     {'5JUP_N2_tGGG_aCCU_+1CGU_data': 'tGGG_aCCU_+1CGU',
...      '5JUP_N2_tGGG_aCCU_+1GCU_data': 'tGGG_aCCU_+1GCU'}
... )
- plot_silhouette(n_clusters, blind_data, outdir='')
Creates Silhouette plots to determine the best number of clusters
- Parameters:
- n_clusters : int
The number of clusters to form.
- blind_data : np.typing.ArrayLike
A 2D numpy array containing all frames stacked together. Output of create_kmeans_input()
- outdir : str, default=''
Directory where the plot PNG file will be saved.
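To show how silhouette analysis helps choose the number of clusters, the sketch below scores several candidate values of k on synthetic data. It assumes scikit-learn's `KMeans` and `silhouette_score`, which may differ from what plot_silhouette() does internally, and it computes the scores only, without drawing the plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for blinded data: three well-separated groups of frames
blind_data = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(30, 5))
    for center in (0.0, 5.0, 10.0)
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(blind_data)
    # Mean silhouette coefficient over all frames; higher means better separation
    scores[k] = silhouette_score(blind_data, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The candidate with the highest mean silhouette coefficient is a reasonable starting choice for `n_clusters`, though the per-sample silhouette plot gives a fuller picture than the mean alone.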
- plot_pca(blinded_data, dataset_names, coloring='dataset', outdir='', cluster_labels=None, new_dataset_names=None)
Creates PCA Plot to compare systems in 2D
Creates a PCA scatterplot whose points can be colored by the KMeans clustering result or by dataset. Like K Means, it compares the SSFs of all frames across datasets.
- Parameters:
- blinded_data : np.ndarray
A 2D numpy array containing all frames stacked together. Output of create_kmeans_input()
- dataset_names : list of str
List of filenames to read and preprocess. Outputted from stacker -s ssf -d output.txt.gz. Should be in the format {datapath}/{traj_name}.txt.gz
- coloring : {'dataset', 'kmeans', 'facet'}
Method to color the points on the scatterplot. Options:
  - dataset: Plot all points on the same scatterplot and color by dataset of origin.
  - kmeans: Plot all points on the same scatterplot and color by KMeans Cluster with n_clusters
  - facet: Same as dataset but plot each dataset on a different coordinate grid.
- outdir : str, default=''
Directory to save the clustering results.
- cluster_labels : np.ndarray, optional
The labels of the clusters for each frame, output from run_kmeans. Used if coloring = "kmeans" to color points by cluster
- new_dataset_names : dict, optional
Dictionary to remap dataset names. Keys are original filenames in dataset_names and values are shortened names.
- Returns:
- None
See also
create_kmeans_input
blinds SSF Data for input to K Means
read_and_preprocess_data
reads and preprocesses SSF data for K Means analysis per dataset
sklearn.decomposition.PCA
Runs PCA
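For orientation, the projection behind such a plot might be computed as sketched below, using `sklearn.decomposition.PCA` as referenced above. The synthetic data, the variable names, and the way per-frame dataset labels are rebuilt from stacking order are all assumptions for illustration; the plotting itself is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Small stand-ins for per-dataset arrays: (frames, residue-residue pairs)
data_arrays = {
    "dataset1": rng.random((40, 9)),
    "dataset2": rng.random((60, 9)) + 0.5,
}
blinded_data = np.vstack(list(data_arrays.values()))

# Project every frame onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(blinded_data)  # shape: (n_frames, 2)

# Rebuild a per-frame dataset label from the stacking order, for coloring
dataset_of_frame = np.concatenate([
    np.full(len(arr), name) for name, arr in data_arrays.items()
])

for name in data_arrays:
    mask = dataset_of_frame == name
    print(name, coords[mask].mean(axis=0))
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The two columns of `coords` are the x and y values for a scatterplot. Coloring by `dataset_of_frame` corresponds to the 'dataset' option, and coloring by cluster labels from run_kmeans corresponds to 'kmeans'.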