Making System Stacking Fingerprints (SSFs)

Making System Stacking Fingerprints (SSFs)#

This pipeline creates System Stacking Fingerprints (SSFs) like the one below which can be used to analyze system-wide pi-stacking in a molecular dynamics trajectory.

Load Trajectory#

To analyze pi-stacking between two nucleotides, we compare their center of geometry (COG) distance, the distance between the centers of their 6-membered (pyrimidine) rings (see r below). For this, the only atoms we need are C2, C4, and C6 to triangulate the COG of each ring.

For reference, the Carbon Numbering of each nucleotide:

In this example pipeline, we use a trajectory of the Ribosome CAR-mRNA Interaction Surface found in the StACKER GitHub Repository. We filter this trajectory to the pi-stacking residue pair.

MD Files are provided for testing convenience in the testing folder:

first10_5JUP_N2_tUAG_aCUA_+1GCU_nowat.mdcrd: A 10-frame trajectory file with all atoms/residues.
5JUP_N2_tUAG_aCUA_+1GCU_nowat.prmtop: The associated Topology File with the above trajectory.

We filter to atoms C2, C4, and C6 using filter_traj():

>>> import stacker as st
>>> filtered_traj = st.filter_traj("first10_5JUP_N2_tUAG_aCUA_+1GCU_nowat.mdcrd",
...                 top_file = "5JUP_N2_tUAG_aCUA_+1GCU_nowat.prmtop",
...                 atoms = {"C2", "C4", "C6"})
WARNING: Residue Indices are expected to be 1-indexed
Reading trajectory...
Reading topology...
Filtering trajectory...
WARNING: Output filtered traj atom, residue, and chain indices are zero-indexed
>>> filtered_traj
<mdtraj.Trajectory with 10 frames, 756 atoms, 252 residues, without unitcells at 0x1164eab10>

Now the Python variable filtered_traj contains 10 frames of C2, C4, C6 information for all nucleotides.

Calculate Distance Between Residues#

Immediately, we can check if a residue pair is pi-stacking in a given frame. We calculate the distance between the COG of two residues using calculate_residue_distance(). COG distance close to 3.5 Å indicates pi-stacking. For instance, the A-site mRNA codons (residues 422, 423, and 424) are likely pi-stacking:

>>> distance_vec = st.calculate_residue_distance(
...     filtered_traj,
...     res1 = 422,
...     res2 = 423,
...     frame = 3
... )
>>> distance_vec.magnitude()
3.509042

Create a System Stacking Fingerprint (SSF)#

An SSF calculates the COG distance for all nucleotide pairs in a trajectory frame. The result is a square matrix where position (i, j) represents the distance from residue i to residue j. This is done with get_residue_distance_for_frame()

The filtered_traj object has 252 residues, so we create a 252 x 252 SSF:

>>> ssf = st.get_residue_distance_for_frame(filtered_traj, frame = 2, write_output = True)
Loading: [####################################################################################################] Current Residue: 252/252 (100.0%)
Frame 2 done.
>>> ssf.shape
(252, 252)

We can calculate the SSF for multiple frames of a trajectory using system_stacking_fingerprints(), which accepts smart indexing of frames (eg. 1-3,15-17 = 1,2,3,15,16,17):

>>> ssfs = st.system_stacking_fingerprints(filtered_traj, frames = '1-3')
>>> ssfs.shape
(3, 252, 252)

ssfs is now a list, where ssfs[i] is the SSF for frame i (0-indexed frame). If frames is empty, the SSF will be calculated for all frames. For multi-frame trajectories, it is recommended to use the threads option to parallelize, calculating the SSF for multiple frames at once. When parralelizing, turn off standard output with write_output:

>>> ssfs = st.system_stacking_fingerprints(
...     filtered_traj,
...     frames = '1-10',
...     threads = 10,
...     write_output = False
... )
>>> ssfs.shape
(10, 252, 252)

Get the Average SSF for a Trajectory#

Single-frame SSFs are rarely as illuminating as the average SSF for all frames of a trajectory. Users can create this using get_frame_average(), using the output from system_stacking_fingerprints() in the step above:

>>> ssfs = st.system_stacking_fingerprints(
...     filtered_traj,
...     frames = '1-10',
...     threads = 10,
...     write_output = False
... )
>>> avg_ssf = st.get_frame_average(ssfs)
>>> avg_ssf.shape
(252, 252)

avg_ssf contains averaged stacking information throughout the trajectory. The next step shows how to analyze these results.

How to Visualize an SSF#

display_ssfs() will visualize SSFs in Python output, with each SSF frame as a frame of the video:

>>> import stacker as st
>>> filtered_traj = st.filter_traj("first10_5JUP_N2_tUAG_aCUA_+1GCU_nowat.mdcrd",
...                 top_file = "5JUP_N2_tUAG_aCUA_+1GCU_nowat.prmtop",
...                 atoms = {"C2", "C4", "C6"})
WARNING: Residue Indices are expected to be 1-indexed
Reading trajectory...
Reading topology...
Filtering trajectory...
WARNING: Output filtered traj atom, residue, and chain indices are zero-indexed
>>> ssfs = st.system_stacking_fingerprints(
...     filtered_traj,
...     frames = '1-10',
...     threads = 10,
...     write_output = False
... )
>>> resSeqs = [res.resSeq for res in filtered_traj.topology.residues]
>>> st.display_ssfs(
...     ssfs,
...     res_indicies = resSeqs,
...     seconds_per_frame = 2
... )

It is recommended to run get_frame_average() and then save the output as a .png for comparison with other trajectories:

>>> import stacker as st
>>> filtered_traj = st.filter_traj("first10_5JUP_N2_tUAG_aCUA_+1GCU_nowat.mdcrd",
...                 top_file = "5JUP_N2_tUAG_aCUA_+1GCU_nowat.prmtop",
...                 atoms = {"C2", "C4", "C6"})
WARNING: Residue Indices are expected to be 1-indexed
Reading trajectory...
Reading topology...
Filtering trajectory...
WARNING: Output filtered traj atom, residue, and chain indices are zero-indexed
>>> ssfs = st.system_stacking_fingerprints(
...     filtered_traj,
...     frames = '1-10',
...     threads = 10,
...     write_output = False
... )
>>> avg_ssf = st.get_frame_average(ssfs)
>>> ssfs = [avg_ssf]
>>> resSeqs = [res.resSeq for res in filtered_traj.topology.residues]
>>> st.display_ssfs(
...     ssfs,
...     resSeqs,
...     tick_distance = 20,
...     outfile = "SSF_test.png",
...     scale_limits = (2,5),
...     xy_line = False
... )

Command Line Options offers many future analyses, including combining SSF images along the x=y line for visual comparison with stacker -s ssf -B and stacker -s compare.

How to Analyze an SSF#

get_top_stacking() will give the stacking pairs with the most pi-stacking (ie. closest to 3.5Å):

>>> st.get_top_stacking(
...     filtered_traj,
...     avg_ssf
...     )
Res1    Res2    Avg_Dist
117     108     3.43
153     54      3.35
56      151     3.34
94      127     3.67
93      130     3.68

It is recommended to save the output to a .csv and use the other parameters for get_top_stacking() to prepare for future analyses using the Command Line Options:

>>> st.get_top_stacking(
...     filtered_traj,
...     avg_ssf,
...     csv = 'top_stacking.csv',
...     n_events = -1,
...     include_adjacent = True
... )

In the command line, [user]$ stacker -s compare can give the residue pairs that changed the most between trajectories, by inputting two csv outputs from get_top_stacking(). Use [user]$ stacker -s compare --help for more information.

Using K Means#

You can compare multiple System Stacking Fingerprints via K-Means. The algorithm takes multiple SSFs with no knowledge of their trajectory and groups similar SSFs. Thus, the algorithm groups trajectories via similar system-wide stacking.

Note: the SSFs must be comparable, meaning an equal amount of residues.

With StACKER’s Command Line Options, the .txt files below were created:

testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz : numpy array with 3200 SSFs
testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz : numpy array with 3200 SSFs

These files are too large to provide, but can be made with stacker -s ssf -d.

They can be read as Python objects and reshaped:

>>> import numpy as np
>>> import math
>>> GCU_3200_flattened_ssfs = np.loadtxt('../testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz')
>>> GCU_3200_flattened_ssfs.shape
(3200, 16129)
>>> GCU_3200_ssfs = GCU_3200_flattened_ssfs.reshape(GCU_3200_flattened_ssfs.shape[0], math.isqrt(GCU_3200_flattened_ssfs.shape[1]), math.isqrt(GCU_3200_flattened_ssfs.shape[1]))
>>> GCU_3200_ssfs.shape
(3200, 127, 127)

Or the files can be read and prepared for KMeans data

with read_and_preprocess_data() and run_kmeans():

>>> import stacker as st
>>> data_arrays  = st.read_and_preprocess_data(
...     ["testing/5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz",
...     "testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz"]
... )
>>> st.run_kmeans(data_arrays, n_clusters = 2)
(6400, 16129)
For n_clusters = 2 The average silhouette_score is : 0.10815289849518733
Dataset: 5JUP_N2_tGGG_aCCU_+1CGU_data
        Cluster 1: 0 matrices
        Cluster 2: 3200 matrices
Dataset: 5JUP_N2_tGGG_aCCU_+1GCU_data
        Cluster 1: 3120 matrices
        Cluster 2: 80 matrices

So the SSFs from the two datasets are distinct enough to separate into two clusters. We can visualize this with plot_cluster_trj_data():

>>> import stacker as st
>>> data_arrays  = st.read_and_preprocess_data(
...     ["testing/5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz",
...     "testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz"]
... )
>>> st.run_kmeans(data_arrays, n_clusters = 7, outdir = 'testing/')
(6400, 16129)
For n_clusters = 7 The average silhouette_score is : 0.09968523895626617
Dataset: 5JUP_N2_tGGG_aCCU_+1CGU_data
        Cluster 1: 560 matrices
        Cluster 2: 0 matrices
        Cluster 3: 240 matrices
        Cluster 4: 0 matrices
        Cluster 5: 1138 matrices
        Cluster 6: 2 matrices
        Cluster 7: 1260 matrices
Dataset: 5JUP_N2_tGGG_aCCU_+1GCU_data
        Cluster 1: 80 matrices
        Cluster 2: 1057 matrices
        Cluster 3: 0 matrices
        Cluster 4: 738 matrices
        Cluster 5: 0 matrices
        Cluster 6: 1245 matrices
        Cluster 7: 80 matrices
Results written to: testing/clustering_results_7.txt
>>> st.plot_cluster_trj_data("testing/clustering_results_7.txt", "testing/kmeans_plot.png", {'5JUP_N2_tGGG_aCCU_+1CGU_data': 'tGGG_aCCU_+1CGU', '5JUP_N2_tGGG_aCCU_+1GCU_data': 'tGGG_aCCU_+1GCU'})

The SSFs still split into distinct clusters (there is no cluster with a significant contribution from both trajectories), meaning the structures have different pi-stacking.

Picking the Right Number of Clusters#

StACKER is equipped to run Silhouette Analysis in order to select the correct number of clusters. plot_silhouette() can create Silhouette Plots:

>>> import stacker as st
>>> data_arrays  = st.read_and_preprocess_data(
...     ["testing/5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz",
...     "testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz"]
... )
Reading data: 5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz
Reading data: 5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz
>>> blinded_data = st.create_kmeans_input(data_arrays)
(6400, 16129)
>>> for n_cluster in range(2,8):
...     st.run_kmeans(data_arrays, n_clusters = n_cluster, outdir = 'testing/')
...     st.plot_silhouette(n_cluster, blinded_data, outdir = 'testing/')
(6400, 16129)
For n_clusters = 2 The average silhouette_score is : 0.10815289849518733
Dataset: 5JUP_N2_tGGG_aCCU_+1CGU_data
        Cluster 1: 0 matrices
        Cluster 2: 3200 matrices
Dataset: 5JUP_N2_tGGG_aCCU_+1GCU_data
        Cluster 1: 3120 matrices
        Cluster 2: 80 matrices
Results written to: ../testing/clustering_results_2.txt
File saved to: ../testing/silhouette2.png
(6400, 16129)
For n_clusters = 3 The average silhouette_score is : 0.11018525011212786
Dataset: 5JUP_N2_tGGG_aCCU_+1CGU_data
        Cluster 1: 720 matrices
        Cluster 2: 0 matrices
        Cluster 3: 2480 matrices
Dataset: 5JUP_N2_tGGG_aCCU_+1GCU_data
        Cluster 1: 0 matrices
        Cluster 2: 3194 matrices
        Cluster 3: 6 matrices
Results written to: ../testing/clustering_results_3.txt
File saved to: ../testing/silhouette3.png
...

Silhouette Plots show the distance of all points in a cluster from the cluster’s centroid. We want plots with clusters of similar size along the y-axis and with most points around the same distance from 0 on the x-axis. A Cluster choice of 6 is appropriate:

testing/silhouette6.png

While a cluster choice of 7 is not:

testing/silhouette7.png

Principal Component Analysis#

KMeans on SSF gathers Euclidean Distance on multi-dimensional vectors that are impossible to plot. We can, however, do a heuristic plotting of these SSFs using Principal Component Analysis (PCA).

First, we can plot the un-blinded SSFs and color them by their trajectory of origin. We have 3200 frames worth of SSFs for each trajectory, and we plot them using plot_pca():

>>> import stacker as st
>>> dataset_names = ["testing/5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz",
...     "testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz"]
>>> data_arrays  = st.read_and_preprocess_data(
...     dataset_names
... )
Reading data: 5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz
Reading data: 5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz
>>> blinded_data = st.create_kmeans_input(data_arrays)
(6400, 16129)
>>> st.plot_pca(blinded_data, dataset_names, 'dataset')

This creates a PCA plot where each point is a frame of a trajectory, plotted using the information solely from the SSF. The points are colored by the trajectory dataset of origin. This outputs to a .png:

Here the SSFs from the two datasets separate cleanly, indicating significantly different system-wide stacking. The PCA Plots can also be colored by kmeans result. Here, the frames are colored by their KMeans result with 6 clusters:

>>> import stacker as st
>>> data_arrays  = st.read_and_preprocess_data(
...     ["testing/5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz",
...     "testing/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz"]
...     )
Reading data: 5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz
Reading data: 5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz
>>> blinded_data = st.create_kmeans_input(data_arrays)
(6400, 16129)
>>> kmeans_results = st.run_kmeans(data_arrays, n_clusters=6)
(6400, 16129)
For n_clusters = 6 The average silhouette_score is : 0.09343055568735036
Dataset: 5JUP_N2_tGGG_aCCU_+1CGU_data
        Cluster 1: 0 matrices
        Cluster 2: 0 matrices
        Cluster 3: 1459 matrices
        Cluster 4: 4 matrices
        Cluster 5: 1017 matrices
        Cluster 6: 720 matrices
Dataset: 5JUP_N2_tGGG_aCCU_+1GCU_data
        Cluster 1: 1179 matrices
        Cluster 2: 824 matrices
        Cluster 3: 0 matrices
        Cluster 4: 1197 matrices
        Cluster 5: 0 matrices
        Cluster 6: 0 matrices
>>> st.plot_pca(blinded_data, dataset_names, coloring = 'kmeans', cluster_labels=kmeans_results)

Each trajectory has three clusters, with no cluster capturing significant frames from both trajectories.

Finally, they can be colored by dataset but use facet to show each dataset’s points on a separate PCA Grid. This helps with viewing multiple datasets:

>>> import stacker as st
>>> dataset_names = ["/home66/esakkas/STACKER/DATA/5JUP_N2_tGGG_aCCU_+1CGU_data.txt.gz",
...     "/home66/esakkas/STACKER/DATA/5JUP_N2_tGGG_aCCU_+1GCU_data.txt.gz",
...     "/home66/esakkas/STACKER/DATA/5JUP_N2_tGGG_aCCC_+1CGU_data.txt.gz",
...     "/home66/esakkas/STACKER/DATA/5JUP_N2_tGGG_aCCC_+1GCU_data.txt.gz"
...     ]
>>> data_arrays = st.read_and_preprocess_data(dataset_names)
>>> blinded_data = st.create_kmeans_input(data_arrays)
(12800, 16129)
>>> st.plot_pca(blinded_data, dataset_names, 'facet')

Here we see that the tGGG_aCCU_+1GCU trajectory occupies its own space. We can combine this result with the KMeans and PCA plots below to strongly support the conclusion that tGGG_aCCU_+1GCU has a different system-wide stacking, and therefore a different structure, than the other three trajectories. The K-Means results with the best silhouette plot is shown here: