---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.11.3
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

## Projection of cells across datasets

Scarf allows projection (also called mapping) of cells from one dataset onto another. Such projection helps in understanding how the cells of the two datasets are related. Projection is a lightweight alternative to full-blown data integration, one that focuses on biological interpretation.

In this notebook we use data from [Kang et al.](https://www.nature.com/articles/nbt.4042). We have already preprocessed the raw count matrix to generate UMAPs and clustering of the data ([notebook here](https://github.com/parashardhapola/scarf_vignettes/blob/main/kang_et_al_processing.ipynb)). We will use two datasets from this study: control and IFN-B treated PBMCs.

```python
%config InlineBackend.figure_format = 'retina'
%load_ext autotime

import scarf
scarf.__version__
```

---
### 1) Fetch datasets in Zarr format

```python
scarf.fetch_dataset('kang_15K_pbmc_rnaseq', save_path='scarf_datasets', as_zarr=True)
scarf.fetch_dataset('kang_14K_ifnb-pbmc_rnaseq', save_path='scarf_datasets', as_zarr=True)
```

```python
# Control/untreated PBMC data
ds_ctrl = scarf.DataStore('scarf_datasets/kang_15K_pbmc_rnaseq/data.zarr', nthreads=4)
ds_ctrl.plot_layout(layout_key='RNA_UMAP', color_by='cluster_labels')
```

```python
# Interferon beta stimulated PBMC data
ds_stim = scarf.DataStore('scarf_datasets/kang_14K_ifnb-pbmc_rnaseq/data.zarr', nthreads=4)
ds_stim.plot_layout(layout_key='RNA_UMAP', color_by='cluster_labels')
```

---
### 2) K-Nearest Neighbours (KNN) mapping

The `run_mapping` method of the `DataStore` class performs KNN mapping/projection of target cells over reference cells. Reference cells are the ones present in the object on which `run_mapping` is called.
The `Assay` object of the target cells is provided as an argument to `run_mapping`. This step loads the latest graph of the reference cells and queries the Approximate Nearest Neighbour (ANN) index of the reference cells for the nearest neighbours of each target cell. Since the ANN index doesn't contain any target cells, the nearest neighbours of target cells will exclusively be reference cells.

Under the hood, `run_mapping` makes sure that the feature order in the target cells is the same as in the reference cells. By default, `run_mapping` imputes zero values for features that are missing in the target data, so that the feature order is preserved.

Here we have set the `run_coral` parameter to True, which activates CORAL normalization of the target data. CORAL aligns the feature distribution between reference and target cells, thus removing systematic differences between them. Read more about CORAL [here](https://arxiv.org/pdf/1612.01939.pdf).

Here we use control PBMCs as reference because we invoke `run_mapping` on the control PBMCs' DataStore object and provide the stimulated PBMCs' `RNA` assay as target.
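The idea behind the CORAL step mentioned above can be illustrated with a minimal NumPy sketch. This is not Scarf's implementation: the helper names `sym_matrix_power` and `coral_align`, the ridge term `eps` and the toy data are assumptions made for illustration. The sketch whitens the target data with its own covariance and then re-colours it with the covariance of the reference data.

```python
import numpy as np


def sym_matrix_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** power) @ eigvecs.T


def coral_align(target, reference, eps=1e-6):
    """Transform `target` so that its feature covariance matches that of `reference`.

    Hypothetical helper for illustration; `eps` is a small ridge term added
    to both covariance matrices for numerical stability.
    """
    n_feat = target.shape[1]
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(n_feat)
    cov_r = np.cov(reference, rowvar=False) + eps * np.eye(n_feat)
    whiten = sym_matrix_power(cov_t, -0.5)   # removes the target's own correlations
    recolor = sym_matrix_power(cov_r, 0.5)   # imposes the reference correlations
    return target @ whiten @ recolor


# Toy data: rows are cells, columns are features
rng = np.random.default_rng(0)
target = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 3.0])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
reference = rng.normal(size=(400, 3)) @ mix

aligned = coral_align(target, reference)
# The covariance of the aligned target is now close to the reference covariance
print(np.round(np.cov(aligned, rowvar=False) - np.cov(reference, rowvar=False), 4))
```

Because the two matrix powers are computed on feature-by-feature covariance matrices, the cost grows quickly with the number of features.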

Reference cells: The cells from the dataset that forms the basis of mapping. A KNN graph must already be calculated for this dataset.

Target cells: The cells to be projected onto reference cells. This dataset is not required to have a graph calculated.
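To make the roles of reference and target cells concrete, here is a small conceptual sketch of the two steps described above: aligning the target features to the reference feature order (with zeros for missing features) and then looking up the nearest reference cells for each target cell. It uses a brute-force search on toy data rather than an ANN index, and the functions `align_features` and `knn_to_reference` are hypothetical names, not part of Scarf.

```python
import numpy as np


def align_features(target, target_feats, ref_feats):
    """Reorder target columns to follow `ref_feats`; features absent in the target become zeros."""
    col_of = {feat: i for i, feat in enumerate(target_feats)}
    aligned = np.zeros((target.shape[0], len(ref_feats)), dtype=float)
    for j, feat in enumerate(ref_feats):
        if feat in col_of:
            aligned[:, j] = target[:, col_of[feat]]
    return aligned


def knn_to_reference(ref, target, k):
    """Brute-force search for the k nearest reference cells (Euclidean) of each target cell."""
    # Squared distances, shape (n_target_cells, n_reference_cells)
    d2 = ((target[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]
    dist = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    return idx, dist


# Toy reference: 3 cells x 3 features
ref_feats = ["g1", "g2", "g3"]
ref = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 5.0, 5.0]])

# Toy target: 2 cells, feature g2 is missing and the column order differs
target_feats = ["g3", "g1"]
target = np.array([[0.0, 0.9],
                   [6.0, 0.0]])

aligned = align_features(target, target_feats, ref_feats)
idx, dist = knn_to_reference(ref, aligned, k=2)
print(idx)  # every neighbour index points at a reference cell
```

Only reference cells can appear among the neighbours, because the target cells are never part of the searched set.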

```python
# CORAL can be very slow with a large number of features (> 5000).
# Hence it is recommended here only for scRNA-Seq datasets.
ds_ctrl.run_mapping(target_assay=ds_stim.RNA, target_name='stim',
                    target_feat_key='hvgs_ctrl', save_k=5, run_coral=True)
```

---
### 3) Mapping scores

We can use mapping scores to inspect cluster similarity across datasets. A mapping score is assigned to each reference cell based on how frequently it was identified as one of the nearest neighbours of the target cells. The `get_mapping_score` method generates these scores.

We use an optional parameter of `get_mapping_score`, `target_groups`. It takes grouping information for the target cells so that mapping scores are calculated for one group at a time. Here we provide the cluster information of the stimulated cells as the grouping, so mapping scores are obtained for each target cluster independently.

The UMAPs below show how much mapping score each control cell received when cells from one of the IFN-B stimulated cell clusters were mapped.

```python
# Here we will generate plots for IFN-B stimulated cells from NK and CD14 monocyte clusters.
for g, ms in ds_ctrl.get_mapping_score(target_name='stim',
                                       target_groups=ds_stim.cells.fetch('cluster_labels'),
                                       log_transform=True):
    if g in ['NK', 'CD 14 Mono']:
        print(f"Target cluster {g}")
        ds_ctrl.plot_layout(layout_key='RNA_UMAP', color_by='cluster_labels',
                            size_vals=ms*10, height=4, width=4, legend_onside=False)
```

---
### 4) Label transfer

Using the nearest neighbours of the target cells in the reference data, we can transfer labels from reference cells to target cells based on majority voting. This means that if a target cell has 'most' of its total edge weight shared with cells from one cell type, then that cell type label is transferred to the target cell. The default threshold for 'most' is 0.5, i.e. half of all edge weight.
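The weighted majority-voting rule described above can be sketched in plain Python. This is an illustrative sketch, not Scarf's implementation: the function `transfer_labels`, its arguments, the strict `>` comparison against the threshold and the `'NA'` placeholder for cells without a winning label are all assumptions made for the example.

```python
def transfer_labels(nn_idx, nn_weights, ref_labels, threshold=0.5, na_val="NA"):
    """Assign each target cell the reference label holding most of its edge weight.

    nn_idx[i]     -> indices of the reference neighbours of target cell i
    nn_weights[i] -> edge weights to those neighbours
    """
    transferred = []
    for idx, weights in zip(nn_idx, nn_weights):
        # Sum the edge weight per reference label
        totals = {}
        for i, w in zip(idx, weights):
            totals[ref_labels[i]] = totals.get(ref_labels[i], 0.0) + w
        best = max(totals, key=totals.get)
        fraction = totals[best] / sum(weights)
        # Keep the label only if it holds more than `threshold` of the total weight
        transferred.append(best if fraction > threshold else na_val)
    return transferred


ref_labels = ["NK", "NK", "B", "T"]
nn_idx = [[0, 1, 2], [1, 2, 3]]
nn_weights = [[0.5, 0.3, 0.2], [0.4, 0.3, 0.3]]

labels = transfer_labels(nn_idx, nn_weights, ref_labels)
print(labels)
```

In this toy example the first target cell shares 80% of its edge weight with NK cells and receives the NK label, while the second cell has no label above the threshold and is left unassigned.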
The `get_target_classes` method returns the transferred labels for each cell of a given mapped target dataset. The `reference_class_group` parameter decides which labels to transfer. This can be any column of the cell attribute table that has categorical values; users would generally use `RNA_leiden_cluster` or `RNA_cluster`, but other labels can be used as well. Here, for example, we use the custom labels stored under the `cluster_labels` column.

```python
transferred_labels = ds_ctrl.get_target_classes(target_name='stim',
                                                reference_class_group='cluster_labels')
transferred_labels
```

We can now save these transferred labels into the stimulated cell dataset and visualize them on its UMAP.

```python
ds_stim.cells.insert('transferred_labels', transferred_labels.values, overwrite=True)
```

```python
ds_stim.plot_layout(layout_key='RNA_UMAP', color_by='transferred_labels')
```

It can be quite interesting to check how the predicted/transferred labels compare to the actual labels of the target cells:

```python
import pandas as pd

df = pd.crosstab(ds_stim.cells.fetch('cluster_labels'),
                 ds_stim.cells.fetch('transferred_labels'))
df
```

This cross-tabulation can be presented as percentage accuracy, where each column is normalized to 100 so that the values indicate the percentage of the transferred labels that were correct.

```python
(100 * df / df.sum(axis=0)).round(1)
```

---
### 5) Unified UMAPs

Scarf introduces Unified UMAPs, a strategy to embed target cells onto the reference manifold. To do so, we take the results of the KNN projection and spike the graph of reference cells with target cells. We can control both the weight of target-reference edges and the number of edges to retain per target cell. We then rerun UMAP on this 'unified graph' to obtain a unified embedding. The following code shows how to call the `run_unified_umap` method.
```python
ds_ctrl.run_unified_umap(target_names=['stim'], ini_embed_with='RNA_UMAP',
                         target_weight=1, use_k=5,
                         fit_n_epochs=100, tx_n_epochs=10)
```

Since the results of the unified embedding contain 'foreign' cells, the `plot_layout` function cannot be used to visualize all the cells. A specialized method, `plot_unified_layout`, takes care of this issue. The following example shows co-embedded control (reference) and stimulated (target) PBMCs.

```python
ds_ctrl.plot_unified_layout(layout_key='unified_UMAP', show_target_only=False, ref_name='ctrl')
```

We can also visualize only the target cells, i.e. the IFN-B stimulated cells, in the unified embedding. The target cells can be colored based on their original cluster identity. Target cells of similar types lie close together on the unified embedding and overlap with the corresponding cell types of the reference data.

```python
ds_ctrl.plot_unified_layout(layout_key='unified_UMAP', show_target_only=True, legend_ondata=True,
                            target_groups=ds_stim.cells.fetch('cluster_labels'))
```

---
That is all for this vignette.