Estimating cell cycle phases#

In this vignette we will demonstrate how to identify cell cycle phases using single-cell RNA-Seq data

%load_ext autotime

import scarf
scarf.__version__
'0.28.9'
time: 1.1 s (started: 2024-01-24 15:32:40 +00:00)

1) Fetch pre-analyzed data#

Here we use the data from Bastidas-Ponce et al., 2019 Development for E15.5 stage of differentiation of endocrine cells from a pool of endocrine progenitors-precursors.

We have stored this data on Scarf’s online repository for quick access. We processed the data to identify the highly variable genes (top 2000) and create a neighbourhood graph of cells. A UMAP embedding was calculated for the cells.

scarf.fetch_dataset(
    dataset_name='bastidas-ponce_4K_pancreas-d15_rnaseq',
    save_path='./scarf_datasets',
    as_zarr=True,
)
time: 22.6 s (started: 2024-01-24 15:32:41 +00:00)
ds = scarf.DataStore(
    f"scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.zarr",
    nthreads=4, 
    default_assay='RNA'
)
time: 36.9 ms (started: 2024-01-24 15:33:04 +00:00)
ds.plot_layout(
    layout_key='RNA_UMAP', 
    color_by='clusters'
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/1eb13b28095333c87038aa1e82048ed6779636ba5a122ca39af32f40bb6b7c4c.png
time: 1.31 s (started: 2024-01-24 15:33:04 +00:00)

2) Run cell cycle scoring#

The cell cycle scoring function in Scarf is highly inspired by the equivalent function in Scanpy. The cell cycle phase of each individual cell is identified following steps below:

  • A list of S and G2M phase is provided to the function (Scarf already has a generic list of genes that works both for human and mouse data)

  • Average expression of all the genes (separately for S and G2M lists) in across cell_key cells is calculated

  • The log average expression is divided in n_bins bins

  • A control set of genes is identified by sampling genes from same expression bins where phase’s genes are present.

  • The average expression of phase genes (Ep) and control genes (Ec) is calculated per cell.

  • A phase score is calculated as: Ep-Ec Cell cycle phase is assigned to each cell based on following rule set:

  • G1 phase: S score < -1 > G2M sore

  • S phase: S score > G2M score

  • G2M phase: G2M score > S score

ds.run_cell_cycle_scoring()
WARNING: 1 values were not found in the table column names
time: 1.45 s (started: 2024-01-24 15:33:05 +00:00)

3) Visualize cell cycle phases#

By default the cell cycle phase information in stored under cell attribute table under column/key RNA_cell_cycle_phase. We can color the UMAP plot based on these values.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    cmap='tab20',
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/36df5dae39209da92a1d1099469ed834fd96c2ee77906f0f1a9c7fb167ce6e49.png
time: 372 ms (started: 2024-01-24 15:33:07 +00:00)

We can clearly see that cycling group of cells in the ‘ductal’ group. You can provide your own custom color mappings like below:

color_key = {
    'G1': 'grey',
    'S': 'orange',
    'G2M': 'crimson',
}

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    color_key=color_key,
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/7ae9555b9ffb0234ae39ac4afe78036e1988529a078e6a6cca493e18e47598b9.png
time: 276 ms (started: 2024-01-24 15:33:07 +00:00)

4) Visualize phase specific scores#

The individual and S and G2M scores for each cell are stored under columns RNA_S_score and RNA_G2M_score. We can visualize the distribution of these scores on the UMAP plots

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by=['RNA_S_score', 'RNA_G2M_score'],
    width=4.5, height=4.5,
)
../_images/d6cf7410e51a24e3df5667ef0e9da64b41df582e023ba536880954706b048179.png
time: 936 ms (started: 2024-01-24 15:33:07 +00:00)

5) Compare with scores calculated using Scanpy#

The dataset we downloaded, already had cell cycle scores calculated using Scanpy. For example, the S phase scores are stored under the column S_score. We can plot these scores on the UMAP.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='S_score',
)
../_images/ea639ee53c88a08e92b6107e8e61fd8debfe960a8a2255f3fc31eafba5938ae8.png
time: 431 ms (started: 2024-01-24 15:33:08 +00:00)

Unsurprisingly, these scores look very similar to those obtained through Scarf. Let’s quantify the concordance below

import matplotlib.pyplot as plt
from scipy.stats import linregress

fig, axis  = plt.subplots(1, 2, figsize=(6,3))
for n,i in enumerate(['S_score', 'G2M_score']):
    x = ds.cells.fetch(f"RNA_{i}")
    y = ds.cells.fetch(i)
    res = linregress(x, y)
    
    ax = axis[n]
    ax.scatter(x, y, color=color_key[i.split('_')[0]])
    ax.plot(x, res.intercept + res.slope*x, label='fitted line', c='k')
    ax.set_xlabel(f"{i} (Scarf)")
    ax.set_ylabel(f"{i} (Scanpy)")
    ax.set_title(f"Corr. coef.: {round(res.rvalue, 2)} (pvalue: {res.pvalue})")

plt.tight_layout()
plt.show()
../_images/f641daaabe65f8e01058f6f02ad71d02c88c8169db10b3e65af2e4da3f2e9bb4.png
time: 413 ms (started: 2024-01-24 15:33:09 +00:00)

High correlation coefficients indicate a large degree of concordance between the scores obtained using Scanpy and Scarf


That is all for this vignette.