Estimating cell cycle phases#

In this vignette we will demonstrate how to identify cell cycle phases using single-cell RNA-Seq data

%load_ext autotime

import scarf
scarf.__version__
'0.26.3'
time: 1.03 s (started: 2023-05-24 15:25:35 +00:00)

1) Fetch pre-analyzed data#

Here we use the data from Bastidas-Ponce et al., 2019 Development for E15.5 stage of differentiation of endocrine cells from a pool of endocrine progenitors-precursors.

We have stored this data on Scarf’s online repository for quick access. We processed the data to identify the highly variable genes (top 2000) and create a neighbourhood graph of cells. A UMAP embedding was calculated for the cells.

scarf.fetch_dataset(
    dataset_name='bastidas-ponce_4K_pancreas-d15_rnaseq',
    save_path='./scarf_datasets',
    as_zarr=True,
)
time: 19.2 s (started: 2023-05-24 15:25:36 +00:00)
ds = scarf.DataStore(
    f"scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.zarr",
    nthreads=4, 
    default_assay='RNA'
)
time: 30.4 ms (started: 2023-05-24 15:25:55 +00:00)
ds.plot_layout(
    layout_key='RNA_UMAP', 
    color_by='clusters'
)
../_images/a4e8f3a7a200a870e1cac464108688a7090742b098e2798f230584f148733a3d.png
time: 1.26 s (started: 2023-05-24 15:25:55 +00:00)

2) Run cell cycle scoring#

The cell cycle scoring function in Scarf is highly inspired by the equivalent function in Scanpy. The cell cycle phase of each individual cell is identified following steps below:

  • A list of S and G2M phase is provided to the function (Scarf already has a generic list of genes that works both for human and mouse data)

  • Average expression of all the genes (separately for S and G2M lists) in across cell_key cells is calculated

  • The log average expression is divided in n_bins bins

  • A control set of genes is identified by sampling genes from same expression bins where phase’s genes are present.

  • The average expression of phase genes (Ep) and control genes (Ec) is calculated per cell.

  • A phase score is calculated as: Ep-Ec Cell cycle phase is assigned to each cell based on following rule set:

  • G1 phase: S score < -1 > G2M sore

  • S phase: S score > G2M score

  • G2M phase: G2M score > S score

ds.run_cell_cycle_scoring()
WARNING: 1 values were not found in the table column names
time: 1.49 s (started: 2023-05-24 15:25:56 +00:00)

3) Visualize cell cycle phases#

By default the cell cycle phase information in stored under cell attribute table under column/key RNA_cell_cycle_phase. We can color the UMAP plot based on these values.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    cmap='tab20',
)
../_images/99f07c6e9b41cd55e61a8c38ab2c202e42ce201c953dc20f4c3e254145eed0bb.png
time: 368 ms (started: 2023-05-24 15:25:58 +00:00)

We can clearly see that cycling group of cells in the ‘ductal’ group. You can provide your own custom color mappings like below:

color_key = {
    'G1': 'grey',
    'S': 'orange',
    'G2M': 'crimson',
}

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    color_key=color_key,
)
../_images/1cde85194ff89e4fa89e126612e41e067fd5b9181e972b84020a0fb64a785aa8.png
time: 313 ms (started: 2023-05-24 15:25:58 +00:00)

4) Visualize phase specific scores#

The individual and S and G2M scores for each cell are stored under columns RNA_S_score and RNA_G2M_score. We can visualize the distribution of these scores on the UMAP plots

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by=['RNA_S_score', 'RNA_G2M_score'],
    width=4.5, height=4.5,
)
../_images/e2f0cb1a7da735e6af697601c5e1fefc29862c10d7d59d59b09329915728bc56.png
time: 1.26 s (started: 2023-05-24 15:25:58 +00:00)

5) Compare with scores calculated using Scanpy#

The dataset we downloaded, already had cell cycle scores calculated using Scanpy. For example, the S phase scores are stored under the column S_score. We can plot these scores on the UMAP.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='S_score',
)
../_images/4b58319eb5727a446c868d4b8b952048835a1ac8582296b825ee4e6de2b976fc.png
time: 573 ms (started: 2023-05-24 15:26:00 +00:00)

Unsurprisingly, these scores look very similar to those obtained through Scarf. Let’s quantify the concordance below

import matplotlib.pyplot as plt
from scipy.stats import linregress

fig, axis  = plt.subplots(1, 2, figsize=(6,3))
for n,i in enumerate(['S_score', 'G2M_score']):
    x = ds.cells.fetch(f"RNA_{i}")
    y = ds.cells.fetch(i)
    res = linregress(x, y)
    
    ax = axis[n]
    ax.scatter(x, y, color=color_key[i.split('_')[0]])
    ax.plot(x, res.intercept + res.slope*x, label='fitted line', c='k')
    ax.set_xlabel(f"{i} (Scarf)")
    ax.set_ylabel(f"{i} (Scanpy)")
    ax.set_title(f"Corr. coef.: {round(res.rvalue, 2)} (pvalue: {res.pvalue})")

plt.tight_layout()
plt.show()
../_images/1c8de9d0b8ad9f788d8943a99c528e20fa618956209bd2f515dcfe7771553227.png
time: 457 ms (started: 2023-05-24 15:26:00 +00:00)

High correlation coefficients indicate a large degree of concordance between the scores obtained using Scanpy and Scarf


That is all for this vignette.