Estimating cell cycle phases#

In this vignette we will demonstrate how to identify cell cycle phases using single-cell RNA-Seq data

%load_ext autotime

import scarf
scarf.__version__
'0.19.8'
time: 964 ms (started: 2022-08-15 17:11:22 +00:00)

1) Fetch pre-analyzed data#

Here we use the data from Bastidas-Ponce et al., 2019 Development for E15.5 stage of differentiation of endocrine cells from a pool of endocrine progenitors-precursors.

We have stored this data on Scarf’s online repository for quick access. We processed the data to identify the highly variable genes (top 2000) and create a neighbourhood graph of cells. A UMAP embedding was calculated for the cells.

scarf.fetch_dataset(
    dataset_name='bastidas-ponce_4K_pancreas-d15_rnaseq',
    save_path='./scarf_datasets',
    as_zarr=True,
)
time: 9.75 s (started: 2022-08-15 17:11:23 +00:00)
ds = scarf.DataStore(
    f"scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.zarr",
    nthreads=4, 
    default_assay='RNA'
)
time: 21.9 ms (started: 2022-08-15 17:11:33 +00:00)
ds.plot_layout(
    layout_key='RNA_UMAP', 
    color_by='clusters'
)
../_images/299b49d0896ed02e3a356c88aee0b2ff9a019013f0d717c2fe05f2363e992d50.png
time: 1.41 s (started: 2022-08-15 17:11:33 +00:00)

2) Run cell cycle scoring#

The cell cycle scoring function in Scarf is highly inspired by the equivalent function in Scanpy. The cell cycle phase of each individual cell is identified following steps below:

  • A list of S and G2M phase is provided to the function (Scarf already has a generic list of genes that works both for human and mouse data)

  • Average expression of all the genes (separately for S and G2M lists) in across cell_key cells is calculated

  • The log average expression is divided in n_bins bins

  • A control set of genes is identified by sampling genes from same expression bins where phase’s genes are present.

  • The average expression of phase genes (Ep) and control genes (Ec) is calculated per cell.

  • A phase score is calculated as: Ep-Ec Cell cycle phase is assigned to each cell based on following rule set:

  • G1 phase: S score < -1 > G2M sore

  • S phase: S score > G2M score

  • G2M phase: G2M score > S score

ds.run_cell_cycle_scoring()
WARNING: 1 values were not found in the table column names
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.19.8/lib/python3.8/site-packages/scarf/feat_utils.py:85: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  for cut in np.unique(obs_cut[feature_list]):
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.19.8/lib/python3.8/site-packages/scarf/feat_utils.py:85: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  for cut in np.unique(obs_cut[feature_list]):
time: 1.61 s (started: 2022-08-15 17:11:34 +00:00)

3) Visualize cell cycle phases#

By default the cell cycle phase information in stored under cell attribute table under column/key RNA_cell_cycle_phase. We can color the UMAP plot based on these values.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    cmap='tab20',
)
../_images/f0e3b3b5081102b571737166b767cef837c0567fcc8768155b7853c27eab74a4.png
time: 244 ms (started: 2022-08-15 17:11:36 +00:00)

We can clearly see that cycling group of cells in the ‘ductal’ group. You can provide your own custom color mappings like below:

color_key = {
    'G1': 'grey',
    'S': 'orange',
    'G2M': 'crimson',
}

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    color_key=color_key,
)
../_images/2f17104fb62af5ef2150f5e1dc94dfc58121786fc8ac677fb90600846e3eb911.png
time: 251 ms (started: 2022-08-15 17:11:36 +00:00)

4) Visualize phase specific scores#

The individual and S and G2M scores for each cell are stored under columns RNA_S_score and RNA_G2M_score. We can visualize the distribution of these scores on the UMAP plots

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by=['RNA_S_score', 'RNA_G2M_score'],
    width=4.5, height=4.5,
)
../_images/7645c0eab372b741f737d888159c8f8ca8ae97f3530324f0e796f8a62559e76e.png
time: 1.03 s (started: 2022-08-15 17:11:36 +00:00)

5) Compare with scores calculated using Scanpy#

The dataset we downloaded, already had cell cycle scores calculated using Scanpy. For example, the S phase scores are stored under the column S_score. We can plot these scores on the UMAP.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='S_score',
)
../_images/211c6a8b94ec03aed492da61d0f6e9db15143a4c2a6151e350628adc953abfa9.png
time: 510 ms (started: 2022-08-15 17:11:37 +00:00)

Unsurprisingly, these scores look very similar to those obtained through Scarf. Let’s quantify the concordance below

import matplotlib.pyplot as plt
from scipy.stats import linregress

fig, axis  = plt.subplots(1, 2, figsize=(6,3))
for n,i in enumerate(['S_score', 'G2M_score']):
    x = ds.cells.fetch(f"RNA_{i}")
    y = ds.cells.fetch(i)
    res = linregress(x, y)
    
    ax = axis[n]
    ax.scatter(x, y, color=color_key[i.split('_')[0]])
    ax.plot(x, res.intercept + res.slope*x, label='fitted line', c='k')
    ax.set_xlabel(f"{i} (Scarf)")
    ax.set_ylabel(f"{i} (Scanpy)")
    ax.set_title(f"Corr. coef.: {round(res.rvalue, 2)} (pvalue: {res.pvalue})")

plt.tight_layout()
plt.show()
../_images/4a3712c4fda22e6b8660c9ff0f1b23e330fee4c9058921e49153f6ef87027d99.png
time: 309 ms (started: 2022-08-15 17:11:38 +00:00)

High correlation coefficients indicate a large degree of concordance between the scores obtained using Scanpy and Scarf


That is all for this vignette.