--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.11.4 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- ## Cell subsampling using TopACeDo ```{code-cell} ipython3 %load_ext autotime import scarf scarf.__version__ ``` --- ### 1) Installing dependencies +++ We need to install the TopACeDo algorithm to perform subsampling: ```{code-cell} ipython3 !pip install git+https://github.com/fraenkel-lab/pcst_fast.git@deb3236cc26ee9fee77d5af40fac3f12bb753850 !pip install -U topacedo ``` --- ### 2) Fetching pre-processed data ```{code-cell} ipython3 # Loading preanalyzed dataset that was processed in the `basic_tutorial` vignette scarf.fetch_dataset( dataset_name='tenx_5K_pbmc_rnaseq', as_zarr=True, save_path='scarf_datasets' ) ``` ```{code-cell} ipython3 ds = scarf.DataStore('scarf_datasets/tenx_5K_pbmc_rnaseq/data.zarr') ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_cluster' ) ``` --- ### 3) Run TopACeDo downsampler +++ UMAP, clustering and marker identification together allow a good understanding of cellular diversity. However, one can still choose from a plethora of other analysis on the data. For example, identification of cell differentiation trajectories. One of the major challenges to run these analysis could be the size of the data. Scarf performs a topology conserving downsampling of the data based on the cell neighbourhood graph. This downsampling aims to maximize the heterogeneity while sampling cells from the data. Here we run the TopACeDo downsampling algorithm that leverages Scarf's KNN graph to perform a manifold preserving subsampling of cells. The subsampler can be invoked directly from Scarf's DataStore object. ```{code-cell} ipython3 ds.run_topacedo_sampler( cluster_key='RNA_cluster', max_sampling_rate=0.1 ) ``` As a result of subsampling the subsampled cells are marked True under the cell metadata column `RNA_sketched`. We can visualize these cells using `plot_layout` ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_cluster', subselection_key='RNA_sketched' ) ``` It may also be interesting to visualize the cells that were marked as `seed cells` used when PCST was run. These cells are marked under the column `RNA_sketch_seeds`. ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_cluster', subselection_key='RNA_sketch_seeds' ) ``` --- ### 4) Intermediate parameters of downsampling +++ To identify the seed cells, the subsampling algorithm calculates cell densities based on neighbourhood degrees. Regions of higher cell density get a sampling penalty. The neighbourhood degree of individual cells are stored under the column `RNA_cell_density`. ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_cell_density' ) ``` The dowsampling algorithm also identifies regions of the graph where cells form tightly connected groups by calculating mean shared nearest neighbours of each cell's nieghbours. The tightly connected regions get a sampling award. These values can be accessed from under the cell metadata column `RNA_snn_value`. ```{code-cell} ipython3 ds.plot_layout( layout_key='RNA_UMAP', color_by='RNA_snn_value' ) ``` --- That is all for this vignette.