Projection of cells across datasets#

Scarf allows projections (aka mapping) of cells from one dataset to another. Such projection can help in understanding how cells are related between the two datasets. Projection/mapping is a lightweight alternative to full-blown data integration which focuses on biological interpretation. In this notebook we use data from Kang et. al.. We have already preprocessed the raw count matrix to generate UMAPs and clustering of the data (notebook here). We will use two datasets from this study: control and IFN-B treated PBMCs.

%load_ext autotime

import scarf
scarf.__version__
'0.28.9'
time: 1.09 s (started: 2024-01-24 15:33:48 +00:00)

1) Fetch datasets in Zarr format#

scarf.fetch_dataset(
    dataset_name='kang_15K_pbmc_rnaseq',
    save_path='scarf_datasets',
    as_zarr=True
)

scarf.fetch_dataset(
    dataset_name='kang_14K_ifnb-pbmc_rnaseq',
    save_path='scarf_datasets',
    as_zarr=True
)
time: 30.5 s (started: 2024-01-24 15:33:49 +00:00)
# Control/untreated PBMC data
ds_ctrl = scarf.DataStore(
    'scarf_datasets/kang_15K_pbmc_rnaseq/data.zarr',
    nthreads=4
)

ds_ctrl.plot_layout(
    layout_key='RNA_UMAP',
    color_by='cluster_labels'
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/b60b532576069693b13dfb2fb1d90b43cd1805f40cb1d2380a3a53618ff6e82a.png
time: 1.51 s (started: 2024-01-24 15:34:20 +00:00)
# Interferon beta stimulated PBMC data
ds_stim = scarf.DataStore(
    'scarf_datasets/kang_14K_ifnb-pbmc_rnaseq/data.zarr',
    nthreads=4
)

ds_stim.plot_layout(
    layout_key='RNA_UMAP',
    color_by='cluster_labels'
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/edbceb66d2767a460b31107fc5b5187766596ac30461fe95a38c373feccc3e0d.png
time: 608 ms (started: 2024-01-24 15:34:22 +00:00)

2) K-Nearest Neighbours (KNN) mapping#

The run_mapping method of DataStore class performs KNN mapping/projection of target cells over reference cells. Reference cells are the ones that are present in the object where run_mapping is being called. The Assay object of target cells is provided as an argument to run_mapping. This step will load the latest graph of the reference cells and query the Approximate Nearest Neighbour (ANN) index of the reference cells for nearest neighbours of each target cell. Since the ANN index doesn’t contain any target cells, nearest neighbours of target cells will exclusively be reference cells. Under the hood, run_mapping method makes sure that the feature order in the target cells is same as that in the reference cells. By default, run_mapping will impute zero values for missing features in the target order to preserve the feature order. Here we have set run_coral parameter to True which activates CORAL normalization of target data. CORAL aligns the the feature distribution between reference and target cells thus removing systemic difference between reference and target cells. Read more about CORAL here. Here we use control PMBCs as reference because we invoke run_mapping on control PBMCs’ DataStore object and provide stimulated PBMC’s RNA assay as target.

Reference cells: The cells from the dataset that forms the basis of mapping. A KNN graph must already be calculated for this dataset.

Target cells: The cells to be projected onto reference cells. This dataset is not required to have a graph calculated.

# CORAL algorithm can be very slow with large number of features (> 5000).
# Hence here it is recommended for only scRNA-Seq datasets.

ds_ctrl.run_mapping(
    target_assay=ds_stim.RNA,
    target_name='stim',
    target_feat_key='hvgs_ctrl',
    save_k=5, 
    run_coral=True
)
INFO: 0 features missing in target data
INFO: Using existing normalized data with cell key I and feat key I__hvgs
INFO: Using existing loadings for pca with 25 dims
INFO: Using existing ANN index
INFO: using existing kmeans cluster centers
INFO: KNN graph already exists will not recompute.
INFO: Calculating fractional power of covariance matrices. This might take a while... 
INFO: Fractional power calculation complete
time: 59.1 s (started: 2024-01-24 15:34:22 +00:00)

3) Mapping scores#

We can use mapping scores to perform cross-dataset cluster similarity inspection. mapping scores are scores assigned to each reference cell based on how frequently it was identified as one of the nearest neighbour of the target cells. get_mapping_score method allows generating these scores. We use an optional parameter of get_mapping_score, target_groups. target_groups takes grouping information for target cells such that mapping scores are calculated for one group at a time. Here we provide the cluster information of stimulated cells as group information and mapping scores will be obtained for each target cluster independently. The UMAPs below show how much mapping score each control cell received upon mapping from cells from one of the IFN-B stimulated cell clusters.

# Here we will generate plots for IFB-B stimulated cells from NK  and CD14 monocyte clusters.

for g, ms in ds_ctrl.get_mapping_score(
    target_name='stim',
    target_groups=ds_stim.cells.fetch('cluster_labels'),
    log_transform=True
):
    
    if g in ['NK', 'CD 14 Mono']:
        print (f"Target cluster {g}")
        ds_ctrl.plot_layout(
            layout_key='RNA_UMAP',
            color_by='cluster_labels',
            size_vals=ms*10,
            height=4, 
            width=4,
            legend_onside=False
        )
Target cluster CD 14 Mono
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/3ef4db66314232942e41aa5018d9e091fe054740d61502d8306d2121c05e4f8e.png
Target cluster NK
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/94570e82efb4ddefe9915daebbccc68e290b250e89ed71f4474258c6f07d03bb.png
time: 588 ms (started: 2024-01-24 15:35:21 +00:00)

4) Label transfer#

Using the nearest neighbours of the target cells in the reference data, we can transfer labels from reference cells to target cells based on majority voting. This means that if a target cell has ‘most’ of its total edge weight shared with cells from one cell type, then that cell type label is tranferred to the target cell. The default threshold for ‘most’ is 0.5, i.e. half of all edge weight. get_target_classes method returns the transferred labels for each cell from a given mapped target dataset.

The reference_class_group parameter decides which labels to transfer. This can be any column from the cell attribute table that has categorical values, generally users would use RNA_leiden_cluster or RNA_cluster but they can also use other labels. Here, for example, we use the custom labels stored under cluster_labels column.

transferred_labels = ds_ctrl.get_target_classes(
    target_name='stim',
    reference_class_group='cluster_labels'
)

transferred_labels
0                  NK
1                 pDC
2         CD4 naive T
3                   B
4         CD4 naive T
             ...     
10106      CD 14 Mono
10107    CD4 Memory T
10108               B
10109      CD 14 Mono
10110     CD4 naive T
Length: 10111, dtype: object
time: 96.6 ms (started: 2024-01-24 15:35:22 +00:00)

We can now save these transferred labels into the stimulated cell dataset and visualize them of its UMAP.

ds_stim.cells.insert(
    'transferred_labels',
    transferred_labels.values,
    overwrite=True
)
time: 14.5 ms (started: 2024-01-24 15:35:22 +00:00)
ds_stim.plot_layout(
    layout_key='RNA_UMAP',
    color_by='transferred_labels'
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/b370971dc9db2d523f4c3d0f18ba4e3f11beb97453c63b67927dfd806830568f.png
time: 609 ms (started: 2024-01-24 15:35:22 +00:00)

It can be quite interesting to check how the predicted/transferred labels compare to the actual labels of the target cells:

import pandas as pd

df = pd.crosstab(
    ds_stim.cells.fetch('cluster_labels'),
    ds_stim.cells.fetch('transferred_labels')
)
df
col_0 B B activated CD 14 Mono CD16 Mono CD4 Memory T CD4 naive T CD8 T DC Eryth Mk NA NK T activated pDC
row_0
B 649 26 19 0 15 20 0 1 0 4 3 1 6 2
B activated 24 171 3 0 1 2 1 0 0 0 2 1 0 0
CD 14 Mono 0 0 2381 132 3 1 0 7 0 6 2 0 0 0
CD16 Mono 1 0 230 717 0 0 0 1 0 7 0 0 0 0
CD4 Memory T 5 0 45 1 923 217 72 0 1 5 18 0 10 0
CD4 naive T 14 4 36 0 356 1896 16 2 1 16 19 2 28 0
CD8 T 10 0 23 0 59 14 461 0 0 2 5 76 5 0
DC 0 0 73 4 0 0 0 137 0 0 0 0 0 0
Eryth 0 0 11 0 1 23 1 0 46 0 1 0 0 0
Mk 1 0 46 1 17 25 1 2 2 149 0 0 3 0
NK 1 0 9 0 3 0 37 0 0 2 1 376 1 0
T activated 2 1 4 0 34 24 1 1 0 1 0 0 214 0
pDC 0 0 1 0 0 1 0 3 0 0 0 1 0 68
time: 36.4 ms (started: 2024-01-24 15:35:23 +00:00)

This cross-tabulation can be presented as percentage accuracy, where the values indicate the percentage of the transferred values that were correct.

(100 * df / df.sum(axis=0)).round(1)
col_0 B B activated CD 14 Mono CD16 Mono CD4 Memory T CD4 naive T CD8 T DC Eryth Mk NA NK T activated pDC
row_0
B 91.8 12.9 0.7 0.0 1.1 0.9 0.0 0.6 0.0 2.1 5.9 0.2 2.2 2.9
B activated 3.4 84.7 0.1 0.0 0.1 0.1 0.2 0.0 0.0 0.0 3.9 0.2 0.0 0.0
CD 14 Mono 0.0 0.0 82.6 15.4 0.2 0.0 0.0 4.5 0.0 3.1 3.9 0.0 0.0 0.0
CD16 Mono 0.1 0.0 8.0 83.9 0.0 0.0 0.0 0.6 0.0 3.6 0.0 0.0 0.0 0.0
CD4 Memory T 0.7 0.0 1.6 0.1 65.4 9.8 12.2 0.0 2.0 2.6 35.3 0.0 3.7 0.0
CD4 naive T 2.0 2.0 1.2 0.0 25.2 85.3 2.7 1.3 2.0 8.3 37.3 0.4 10.5 0.0
CD8 T 1.4 0.0 0.8 0.0 4.2 0.6 78.1 0.0 0.0 1.0 9.8 16.6 1.9 0.0
DC 0.0 0.0 2.5 0.5 0.0 0.0 0.0 89.0 0.0 0.0 0.0 0.0 0.0 0.0
Eryth 0.0 0.0 0.4 0.0 0.1 1.0 0.2 0.0 92.0 0.0 2.0 0.0 0.0 0.0
Mk 0.1 0.0 1.6 0.1 1.2 1.1 0.2 1.3 4.0 77.6 0.0 0.0 1.1 0.0
NK 0.1 0.0 0.3 0.0 0.2 0.0 6.3 0.0 0.0 1.0 2.0 82.3 0.4 0.0
T activated 0.3 0.5 0.1 0.0 2.4 1.1 0.2 0.6 0.0 0.5 0.0 0.0 80.1 0.0
pDC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.9 0.0 0.0 0.0 0.2 0.0 97.1
time: 26 ms (started: 2024-01-24 15:35:23 +00:00)

5) Unified UMAPs#

Scarf introduces Unified UMAPs, a strategy to embed target cells onto the reference manifold. To do so, we take the results of KNN projection and spike the graph of reference cells with target cells. We can control the weight of target-reference edges also, as well as the number of edges per target to retain. We rerun UMAP on this ‘unified graph’ to obtain a unified embdding. Following code shows how to call run_unified_umap method.

ds_ctrl.run_unified_umap(
    target_names=['stim'],
    ini_embed_with='RNA_UMAP',
    target_weight=1,
    use_k=5,
    n_epochs=100
)
	completed  0  /  100 epochs
	completed  10  /  100 epochs
	completed  20  /  100 epochs
	completed  30  /  100 epochs
	completed  40  /  100 epochs
	completed  50  /  100 epochs
	completed  60  /  100 epochs
	completed  70  /  100 epochs
	completed  80  /  100 epochs
	completed  90  /  100 epochs
time: 15.5 s (started: 2024-01-24 15:35:23 +00:00)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/metadata.py:327: RuntimeWarning: overflow encountered in cast
  a = np.empty(self.N).astype(values.dtype)

Since the results of unified embedding contain ‘foreign’ cells, plot_layout function cannot be used to visualize all the cells. A specialized method, plot_unified_layout takes care of this issue. The following example shows co-embedded control (reference) and stimulated (target) PBMCs.

ds_ctrl.plot_unified_layout(
    layout_key='unified_UMAP',
    show_target_only=False,
    ref_name='ctrl'
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/002a41a88ea074afdc0a7d897abcef4c8165ad871cc835d152aa34d8dec0d99f.png
time: 472 ms (started: 2024-01-24 15:35:38 +00:00)

We can visualize only the target cells, i.e IFN-B stimulated cells, in the unified embedding. The target cells can be colored based on their original cluster identity. Target cells of similar types are close together on the unified embedding and overlap with the cell types of the reference data

ds_ctrl.plot_unified_layout(
    layout_key='unified_UMAP',
    show_target_only=True, 
    legend_ondata=True,
    target_groups=ds_stim.cells.fetch('cluster_labels')
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/c41c57915f5744855cc405097ac69a985b4ea79cb2dfde2d15825e1ef05df406.png
time: 610 ms (started: 2024-01-24 15:35:39 +00:00)

That is all for this vignette.