Estimating cell cycle phases#

In this vignette we will demonstrate how to identify cell cycle phases using single-cell RNA-Seq data

%load_ext autotime

import scarf
scarf.__version__

'0.23.5'

time: 1.67 s (started: 2023-05-24 14:28:11 +00:00)

1) Fetch pre-analyzed data#

Here we use the data from Bastidas-Ponce et al., 2019 Development for E15.5 stage of differentiation of endocrine cells from a pool of endocrine progenitors-precursors.

We have stored this data on Scarf’s online repository for quick access. We processed the data to identify the highly variable genes (top 2000) and create a neighbourhood graph of cells. A UMAP embedding was calculated for the cells.

scarf.fetch_dataset(
    dataset_name='bastidas-ponce_4K_pancreas-d15_rnaseq',
    save_path='./scarf_datasets',
    as_zarr=True,
)

time: 16.8 s (started: 2023-05-24 14:28:13 +00:00)

ds = scarf.DataStore(
    f"scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.zarr",
    nthreads=4, 
    default_assay='RNA'
)

time: 51.4 ms (started: 2023-05-24 14:28:30 +00:00)

ds.plot_layout(
    layout_key='RNA_UMAP', 
    color_by='clusters'
)

../_images/a4e8f3a7a200a870e1cac464108688a7090742b098e2798f230584f148733a3d.png

time: 2.07 s (started: 2023-05-24 14:28:30 +00:00)

2) Run cell cycle scoring#

The cell cycle scoring function in Scarf is highly inspired by the equivalent function in Scanpy. The cell cycle phase of each individual cell is identified following steps below:

A list of S and G2M phase is provided to the function (Scarf already has a generic list of genes that works both for human and mouse data)
Average expression of all the genes (separately for S and G2M lists) in across cell_key cells is calculated
The log average expression is divided in n_bins bins
A control set of genes is identified by sampling genes from same expression bins where phase’s genes are present.
The average expression of phase genes (Ep) and control genes (Ec) is calculated per cell.
A phase score is calculated as: Ep-Ec Cell cycle phase is assigned to each cell based on following rule set:
G1 phase: S score < -1 > G2M sore
S phase: S score > G2M score
G2M phase: G2M score > S score

ds.run_cell_cycle_scoring()

WARNING: 1 values were not found in the table column names

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 ds.run_cell_cycle_scoring()

File ~/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/datastore/datastore.py:725, in DataStore.run_cell_cycle_scoring(self, from_assay, cell_key, s_genes, g2m_genes, n_bins, rand_seed, s_score_label, g2m_score_label, phase_label)
   g2m_genes = list(g2m_phase_genes)
control_size = min(len(s_genes), len(g2m_genes))
--> 725 s_score = assay.score_features(
   s_genes, cell_key, control_size, n_bins, rand_seed
)
s_score_label = self._col_renamer(from_assay, cell_key, s_score_label)
self.cells.insert(s_score_label, s_score, key=cell_key, overwrite=True)

File ~/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/assay.py:701, in Assay.score_features(self, feature_names, cell_key, ctrl_size, n_bins, rand_seed)
identifier = self._load_stats_loc(cell_key)
obs_avg = pd.Series(self.feats.fetch_all(f"{identifier}_avg"))
--> 701 control_idx = binned_sampling(
   obs_avg, list(feature_idx), ctrl_size, n_bins, rand_seed
)
cell_idx, _ = self._get_cell_feat_idx(cell_key, "I")
return _calc_mean(feature_idx) - _calc_mean(control_idx)

File ~/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/feat_utils.py:80, in binned_sampling(values, feature_list, ctrl_size, n_bins, rand_seed)
obs_cut: pd.Series = values.fillna(0).rank(method="min").divide(n_items).astype(int)
control_genes = set()
---> 80 for cut in np.unique(obs_cut[feature_list]):
   # Replaced np.random.shuffle with pandas' sample method
   r_genes = (
       obs_cut[obs_cut == cut].sample(n=ctrl_size, random_state=rand_seed).index
   )
   control_genes.update(set(r_genes))

File ~/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/pandas/core/series.py:993, in Series.__getitem__(self, key)
def __getitem__(self, key):
--> 993     check_dict_or_set_indexers(key)
   key = com.apply_if_callable(key, self)
   if key is Ellipsis:

File ~/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/pandas/core/indexing.py:2611, in check_dict_or_set_indexers(key)
"""
Check if the indexer is or contains a dict or set, which is no longer allowed.
"""
if (
   isinstance(key, set)
   or isinstance(key, tuple)
   and any(isinstance(x, set) for x in key)
):
-> 2611     raise TypeError(
       "Passing a set as an indexer is not supported. Use a list instead."
   )
if (
   isinstance(key, dict)
   or isinstance(key, tuple)
   and any(isinstance(x, dict) for x in key)
):
   raise TypeError(
       "Passing a dict as an indexer is not supported. Use a list instead."
   )

TypeError: Passing a set as an indexer is not supported. Use a list instead.

time: 2.13 s (started: 2023-05-24 14:28:32 +00:00)

3) Visualize cell cycle phases#

By default the cell cycle phase information in stored under cell attribute table under column/key RNA_cell_cycle_phase. We can color the UMAP plot based on these values.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    cmap='tab20',
)

../_images/99f07c6e9b41cd55e61a8c38ab2c202e42ce201c953dc20f4c3e254145eed0bb.png

time: 487 ms (started: 2023-05-24 14:28:34 +00:00)

We can clearly see that cycling group of cells in the ‘ductal’ group. You can provide your own custom color mappings like below:

color_key = {
    'G1': 'grey',
    'S': 'orange',
    'G2M': 'crimson',
}

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_cell_cycle_phase',
    color_key=color_key,
)

../_images/1cde85194ff89e4fa89e126612e41e067fd5b9181e972b84020a0fb64a785aa8.png

time: 489 ms (started: 2023-05-24 14:28:35 +00:00)

4) Visualize phase specific scores#

The individual and S and G2M scores for each cell are stored under columns RNA_S_score and RNA_G2M_score. We can visualize the distribution of these scores on the UMAP plots

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by=['RNA_S_score', 'RNA_G2M_score'],
    width=4.5, height=4.5,
)

../_images/e2f0cb1a7da735e6af697601c5e1fefc29862c10d7d59d59b09329915728bc56.png

time: 2.14 s (started: 2023-05-24 14:28:35 +00:00)

5) Compare with scores calculated using Scanpy#

The dataset we downloaded, already had cell cycle scores calculated using Scanpy. For example, the S phase scores are stored under the column S_score. We can plot these scores on the UMAP.

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='S_score',
)

../_images/4b58319eb5727a446c868d4b8b952048835a1ac8582296b825ee4e6de2b976fc.png

time: 990 ms (started: 2023-05-24 14:28:37 +00:00)

Unsurprisingly, these scores look very similar to those obtained through Scarf. Let’s quantify the concordance below

import matplotlib.pyplot as plt
from scipy.stats import linregress

fig, axis  = plt.subplots(1, 2, figsize=(6,3))
for n,i in enumerate(['S_score', 'G2M_score']):
    x = ds.cells.fetch(f"RNA_{i}")
    y = ds.cells.fetch(i)
    res = linregress(x, y)
    
    ax = axis[n]
    ax.scatter(x, y, color=color_key[i.split('_')[0]])
    ax.plot(x, res.intercept + res.slope*x, label='fitted line', c='k')
    ax.set_xlabel(f"{i} (Scarf)")
    ax.set_ylabel(f"{i} (Scanpy)")
    ax.set_title(f"Corr. coef.: {round(res.rvalue, 2)} (pvalue: {res.pvalue})")

plt.tight_layout()
plt.show()

../_images/1c8de9d0b8ad9f788d8943a99c528e20fa618956209bd2f515dcfe7771553227.png

time: 912 ms (started: 2023-05-24 14:28:38 +00:00)

High correlation coefficients indicate a large degree of concordance between the scores obtained using Scanpy and Scarf

That is all for this vignette.

Estimating cell cycle phases

Sections