--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.11.4 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- ## Estimating cell cycle phases In this vignette we will demonstrate how to identify cell cycle phases using single-cell RNA-Seq data ```{code-cell} ipython3 %load_ext autotime import scarf scarf.__version__ ``` --- ### 1) Fetch pre-analyzed data Here we use the data from [Bastidas-Ponce et al., 2019 Development](https://journals.biologists.com/dev/article/146/12/dev173849/19483/) for E15.5 stage of differentiation of endocrine cells from a pool of endocrine progenitors-precursors. We have stored this data on Scarf's online repository for quick access. We processed the data to identify the highly variable genes (top 2000) and create a neighbourhood graph of cells. A UMAP embedding was calculated for the cells. ```{code-cell} ipython3 scarf.fetch_dataset( dataset_name='bastidas-ponce_4K_pancreas-d15_rnaseq', save_path='./scarf_datasets', as_zarr=True, ) ``` ```{code-cell} ipython3 ds = scarf.DataStore( f"scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.zarr", nthreads=4, default_assay='RNA' ) ``` ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='clusters' ) ``` --- ### 2) Run cell cycle scoring The cell cycle scoring function in Scarf is highly inspired by the [equivalent function](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html) in Scanpy. The cell cycle phase of each individual cell is identified following steps below: - A list of S and G2M phase is provided to the function (Scarf already has a generic list of genes that works both for human and mouse data) - Average expression of all the genes (separately for S and G2M lists) in across `cell_key` cells is calculated - The log average expression is divided in `n_bins` bins - A control set of genes is identified by sampling genes from same expression bins where phase's genes are present. - The average expression of phase genes (Ep) and control genes (Ec) is calculated per cell. - A phase score is calculated as: Ep-Ec Cell cycle phase is assigned to each cell based on following rule set: - G1 phase: S score < -1 > G2M sore - S phase: S score > G2M score - G2M phase: G2M score > S score ```{code-cell} ipython3 ds.run_cell_cycle_scoring() ``` --- ### 3) Visualize cell cycle phases By default the cell cycle phase information in stored under cell attribute table under column/key `RNA_cell_cycle_phase`. We can color the UMAP plot based on these values. ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_cell_cycle_phase', cmap='tab20', ) ``` We can clearly see that cycling group of cells in the 'ductal' group. You can provide your own custom color mappings like below: ```{code-cell} ipython3 color_key = { 'G1': 'grey', 'S': 'orange', 'G2M': 'crimson', } ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_cell_cycle_phase', color_key=color_key, ) ``` --- ### 4) Visualize phase specific scores The individual and S and G2M scores for each cell are stored under columns `RNA_S_score` and `RNA_G2M_score`. We can visualize the distribution of these scores on the UMAP plots ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by=['RNA_S_score', 'RNA_G2M_score'], width=4.5, height=4.5, ) ``` --- ### 5) Compare with scores calculated using Scanpy The dataset we downloaded, already had cell cycle scores calculated using Scanpy. For example, the S phase scores are stored under the column `S_score`. We can plot these scores on the UMAP. ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='S_score', ) ``` Unsurprisingly, these scores look very similar to those obtained through Scarf. Let's quantify the concordance below ```{code-cell} ipython3 import matplotlib.pyplot as plt from scipy.stats import linregress fig, axis = plt.subplots(1, 2, figsize=(6,3)) for n,i in enumerate(['S_score', 'G2M_score']): x = ds.cells.fetch(f"RNA_{i}") y = ds.cells.fetch(i) res = linregress(x, y) ax = axis[n] ax.scatter(x, y, color=color_key[i.split('_')[0]]) ax.plot(x, res.intercept + res.slope*x, label='fitted line', c='k') ax.set_xlabel(f"{i} (Scarf)") ax.set_ylabel(f"{i} (Scanpy)") ax.set_title(f"Corr. coef.: {round(res.rvalue, 2)} (pvalue: {res.pvalue})") plt.tight_layout() plt.show() ``` High correlation coefficients indicate a large degree of concordance between the scores obtained using Scanpy and Scarf +++ --- That is all for this vignette.