Merging datasets and partial training#

This vignette demonstrates how to merge datasets, which are present in different zarr files. The vignette will also demonstrate the steps for performing partial training. Partial PCA training is a lightweight alternative to perform batch effect correction, that often helps obtain a well-integrated embedding and clustering.

%load_ext autotime

import scarf
scarf.__version__
'0.23.5'
time: 1.66 s (started: 2023-05-24 14:33:35 +00:00)

1) Fetch datasets in Zarr format#

Here we will use the same datasets are we use in the ‘data projection’ vignette. We download the files in zarr format.

scarf.fetch_dataset(
    dataset_name='kang_15K_pbmc_rnaseq',
    save_path='scarf_datasets',
    as_zarr=True
)

scarf.fetch_dataset(
    dataset_name='kang_14K_ifnb-pbmc_rnaseq', 
    save_path='scarf_datasets',
    as_zarr=True
)
time: 25.2 s (started: 2023-05-24 14:33:37 +00:00)

The Zarr files need to be loaded as a DataStore before they can be merged:

ds_ctrl = scarf.DataStore(
    'scarf_datasets/kang_15K_pbmc_rnaseq/data.zarr/',
    nthreads=4
)

ds_ctrl
DataStore has 8487 (14619) cells with 1 assays: RNA
   Cell metadata:
            'I', 'ids', 'names', 'RNA_UMAP1', 'RNA_UMAP2', 
            'RNA_leiden_cluster', 'RNA_nCounts', 'RNA_nFeatures', 'RNA_percentMito', 'RNA_percentRibo', 
            'RNA_unified_UMAP1', 'RNA_unified_UMAP2', 'cluster_labels'
   RNA assay has 11352 (35635) features and following metadata:
            'I', 'ids', 'names', 'I__hvgs', 'dropOuts', 
            'nCells'
      Projected samples:
            'stim'
      Co-embeddings:
            'unified_UMAP'
time: 67.3 ms (started: 2023-05-24 14:34:02 +00:00)
ds_stim = scarf.DataStore(
    'scarf_datasets/kang_14K_ifnb-pbmc_rnaseq/data.zarr',
    nthreads=4
)

ds_stim
DataStore has 10111 (14446) cells with 1 assays: RNA
   Cell metadata:
            'I', 'ids', 'names', 'RNA_UMAP1', 'RNA_UMAP2', 
            'RNA_leiden_cluster', 'RNA_nCounts', 'RNA_nFeatures', 'RNA_percentMito', 'RNA_percentRibo', 
            'cluster_labels', 'transferred_labels'
   RNA assay has 11051 (35635) features and following metadata:
            'I', 'ids', 'names', 'I__hvgs', 'dropOuts', 
            'nCells'
time: 40.8 ms (started: 2023-05-24 14:34:02 +00:00)

2) Merging datasets#

The merging step will make sure that the features are in the same order as in the merged file. The merged data will be dumped into a new Zarr file. ZarrMerge class allows merging multiple samples at the same time. Though only one kind of assays can be added at a time, other modalities for the same cells can be added at a later point.

#Can be used to merge multiple assays
scarf.ZarrMerge(
    # Path where merged Zarr files will be saved
    zarr_path='scarf_datasets/kang_merged_pbmc_rnaseq.zarr',  
    
    # assays to be merged
    assays=[ds_ctrl.RNA, ds_stim.RNA],
    
    # these names will be preprended to the cell ids with '__' delimiter
    names=['ctrl', 'stim'],
    
    # Name of the merged assay. `overwrite` will remove an existing Zarr file.
    merge_assay_name='RNA',
    overwrite=True
).dump()
time: 37.1 s (started: 2023-05-24 14:34:02 +00:00)

Load the merged Zarr file as a DataStore:

ds = scarf.DataStore(
    'scarf_datasets/kang_merged_pbmc_rnaseq.zarr',
    nthreads=4
)
WARNING: Minimum cell count (562) is lower than size factor multiplier (1000)
time: 31.7 s (started: 2023-05-24 14:34:39 +00:00)

So now we print the merged datastore. The merging removed all the precalculated data. Even the information on which cells were filtered out is lost in the process. This is done deliberately, to allow users to start fresh with the merged dataset.

ds
DataStore has 29065 (29065) cells with 1 assays: RNA
   Cell metadata:
            'I', 'ids', 'names', 'RNA_nCounts', 'RNA_nFeatures', 
            'RNA_percentMito', 'RNA_percentRibo', 'orig_RNA_UMAP1', 'orig_RNA_UMAP2', 'orig_RNA_leiden_cluster', 
            'orig_RNA_nCounts', 'orig_RNA_nFeatures', 'orig_RNA_percentMito', 'orig_RNA_percentRibo', 'orig_RNA_unified_UMAP1', 
            'orig_RNA_unified_UMAP2', 'orig_cluster_labels', 'orig_transferred_labels'
   RNA assay has 12450 (35635) features and following metadata:
            'I', 'ids', 'names', 'dropOuts', 'nCells', 
          
time: 12.9 ms (started: 2023-05-24 14:35:11 +00:00)

If we have a look at the cell attributes table, we can clearly see the that the sample identity is shown in the ids column, prepended to the barcode.

ds.cells.head()
I ids names RNA_nCounts RNA_nFeatures RNA_percentMito RNA_percentRibo orig_RNA_UMAP1 orig_RNA_UMAP2 orig_RNA_leiden_cluster orig_RNA_nCounts orig_RNA_nFeatures orig_RNA_percentMito orig_RNA_percentRibo orig_RNA_unified_UMAP1 orig_RNA_unified_UMAP2 orig_cluster_labels orig_transferred_labels
0 True ctrl__AAACATACAATGCC-1 AAACATACAATGCC-1 2191.0 852.0 0.273848 32.131447 2.224133 -7.265071 2 2191.0 852.0 0.273848 32.131447 2.506444 -7.018074 CD4 naive T nan
1 True ctrl__AAACATACATTTCC-1 AAACATACATTTCC-1 3018.0 878.0 0.099404 15.440689 -2.979369 14.908970 10 3018.0 878.0 0.099404 15.440689 -3.043844 14.048541 CD 14 Mono nan
2 True ctrl__AAACATACCAGAAA-1 AAACATACCAGAAA-1 2481.0 713.0 0.241838 4.957678 4.714710 9.148883 4 2481.0 713.0 0.241838 4.957678 7.736292 10.976259 CD 14 Mono nan
3 True ctrl__AAACATACCAGCTA-1 AAACATACCAGCTA-1 3157.0 950.0 0.031676 11.308204 3.747393 13.955492 4 3157.0 950.0 0.031676 11.308204 5.030070 14.112005 CD 14 Mono nan
4 True ctrl__AAACATACCATGCA-1 AAACATACCATGCA-1 703.0 337.0 0.426743 10.953058 NaN NaN -1 703.0 337.0 0.426743 10.953058 NaN NaN nan nan
time: 79.7 ms (started: 2023-05-24 14:35:11 +00:00)

It can be a good idea to keep track of the cells from different samples, we can fetch out the dataset id from cell-barcodes and add them separately in a new column (this step might get automated in the future).

ds.cells.insert(
    column_name='sample_id',
    values=[x.split('__')[0] for x in ds.cells.fetch_all('ids')],
    overwrite=True
)
WARNING: 'values' parameter is of `list` type and not `np.ndarray` as expected. The correct dtype may not be assigned to the column
time: 54.5 ms (started: 2023-05-24 14:35:11 +00:00)

Rather than performing a fresh round of annotation, we will also import the cluster labels from the unmerged datasets. This help us at later steps to evaluate our results.

ctrl_labels = list(ds_ctrl.cells.fetch_all('cluster_labels'))
stim_labels = list(ds_stim.cells.fetch_all('cluster_labels'))

ds.cells.insert(
    column_name='imported_labels',
    values=ctrl_labels + stim_labels,
    overwrite=True
)
WARNING: 'values' parameter is of `list` type and not `np.ndarray` as expected. The correct dtype may not be assigned to the column
time: 45.5 ms (started: 2023-05-24 14:35:11 +00:00)

As well as re-using annotations, we import the information about which cells where kept and which ones where filtered out.

ctrl_valid_cells = list(ds_ctrl.cells.fetch_all('I'))
stim_valid_cells = list(ds_stim.cells.fetch_all('I'))

ds.cells.update_key(
    values=ctrl_valid_cells + stim_valid_cells,
    key='I'
)
time: 10.2 ms (started: 2023-05-24 14:35:11 +00:00)

Now we can check the number of cells from each of the samples:

ds.cells.to_pandas_dataframe(
    ['sample_id'],
    key='I'
)['sample_id'].value_counts()
sample_id
stim    10111
ctrl     8487
Name: count, dtype: int64
time: 24.9 ms (started: 2023-05-24 14:35:11 +00:00)

3) Naive analysis of merged datasets#

By naive, we mean that we make no attempt to remove/account for the latent factors that might contribute to batch effect or treatment-specific effect. It is usually a good idea to perform a ‘naive’ pipeline to get an idea about the degree of batch effects.

We start with detecting the highly variable genes:

ds.mark_hvgs(
    min_cells=10,
    top_n=2000,
    min_mean=-3, 
    max_mean=2,
    max_var=6
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/metadata.py:304: RuntimeWarning: invalid value encountered in cast
  a = np.empty(self.N).astype(type(values[0]))
INFO: 2000 genes marked as HVGs
../_images/0e3c9d8483b2da2581c1e2b0d0eae87b9f6fdd42f86b19f8b72d65c1ef10362b.png
time: 30.9 s (started: 2023-05-24 14:35:11 +00:00)

Next, we create a graph of cells in a standard way.

ds.make_graph(
    feat_key='hvgs',
    k=21, 
    dims=25,
    n_centroids=100
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
INFO: ANN recall: 99.80%
time: 1min 39s (started: 2023-05-24 14:35:42 +00:00)

Calculating UMAP embedding of cells:

ds.run_umap(
    n_epochs=250, 
    spread=5,
    min_dist=1,
    parallel=True
)
time: 1min 13s (started: 2023-05-24 14:37:22 +00:00)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/metadata.py:304: RuntimeWarning: overflow encountered in cast
  a = np.empty(self.N).astype(type(values[0]))
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/metadata.py:304: RuntimeWarning: invalid value encountered in cast
  a = np.empty(self.N).astype(type(values[0]))
ds.cells.head()
I ids names RNA_UMAP1 RNA_UMAP2 RNA_nCounts RNA_nFeatures RNA_percentMito RNA_percentRibo imported_labels ... orig_RNA_leiden_cluster orig_RNA_nCounts orig_RNA_nFeatures orig_RNA_percentMito orig_RNA_percentRibo orig_RNA_unified_UMAP1 orig_RNA_unified_UMAP2 orig_cluster_labels orig_transferred_labels sample_id
0 True ctrl__AAACATACAATGCC-1 AAACATACAATGCC-1 23.413788 9.931429 2191.0 852.0 0.273848 32.131447 CD4 naive T ... 2 2191.0 852.0 0.273848 32.131447 2.506444 -7.018074 CD4 naive T nan ctrl
1 True ctrl__AAACATACATTTCC-1 AAACATACATTTCC-1 -16.339067 24.781307 3018.0 878.0 0.099404 15.440689 CD 14 Mono ... 10 3018.0 878.0 0.099404 15.440689 -3.043844 14.048541 CD 14 Mono nan ctrl
2 True ctrl__AAACATACCAGAAA-1 AAACATACCAGAAA-1 -7.055210 31.433378 2481.0 713.0 0.241838 4.957678 CD 14 Mono ... 4 2481.0 713.0 0.241838 4.957678 7.736292 10.976259 CD 14 Mono nan ctrl
3 True ctrl__AAACATACCAGCTA-1 AAACATACCAGCTA-1 -8.790221 27.328379 3157.0 950.0 0.031676 11.308204 CD 14 Mono ... 4 3157.0 950.0 0.031676 11.308204 5.030070 14.112005 CD 14 Mono nan ctrl
4 False ctrl__AAACATACCATGCA-1 AAACATACCATGCA-1 NaN NaN 703.0 337.0 0.426743 10.953058 nan ... -1 703.0 337.0 0.426743 10.953058 NaN NaN nan nan ctrl

5 rows × 22 columns

time: 94.1 ms (started: 2023-05-24 14:38:36 +00:00)

Visualization of cells from the two samples in the 2D UMAP space:

ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='sample_id',
    cmap='RdBu', 
    legend_ondata=False
)
../_images/7b1c95f940fd98d9605a9049516f97ae1f2222d612f2eb9cff58476e2bb4ae6b.png
time: 817 ms (started: 2023-05-24 14:38:36 +00:00)

Visualization of cluster labels in the 2D UMAP space:

ds.plot_layout(
    layout_key='RNA_UMAP', 
    color_by='imported_labels',
    legend_ondata=False
)
../_images/b5a0ed56fa2e6c5bfae4a518b08f8d9778adea0a6180bef91d8a0e23a43132b0.png
time: 1.26 s (started: 2023-05-24 14:38:37 +00:00)

4) Partial PCA training to reduce batch effects#

The plots above clearly show that the cells from the two samples are distinct on the UMAP space and have not integrated. This clearly indicates a treatment-specific or simply a batch effect between the cells from the two samples. Another interesting pattern in the UMAP plot above is the ‘mirror effect’, i.e. the equivalent clusters from the two samples look like mirror images. This is often seen in the datasets where the heterogenity/cell population composition is not strongly affected by the treatment.

We will now attempt to integrate the cells from the two samples so that we obtain same cell types that do not form separate clusters. One can do this by training the PCA on cells from only one of the samples. Training PCA on cells from only one of the samples will diminish the contribution of genes differentially expressed between the two samples.

First, we need to create a boolean column in the cell attribute table. This column will indicate whether a cell belongs to one of the samples. Here we will create a new column is_ctrl and mark the values as True when a cell belongs to the ctrl sample.

ds.cells.insert(
    column_name=f'is_ctrl',
    values=(ds.cells.fetch_all('sample_id') == 'ctrl'),
    overwrite=True
)
time: 9.41 ms (started: 2023-05-24 14:38:38 +00:00)

The next step is to perform the partial PCA training. PCA is trained during the graph creation step. We will now use pca_cell_key parameter and set it to is_ctrl so that only ‘ctrl’ cells are used for PCA training.

ds.make_graph(
    feat_key='hvgs',
    k=21, 
    dims=25,
    n_centroids=100,
    pca_cell_key='is_ctrl'
)
INFO: Using existing normalized data with cell key I and feat key I__hvgs
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
INFO: ANN recall: 99.77%
time: 39.8 s (started: 2023-05-24 14:38:38 +00:00)

We run UMAP as usual, but the UMAP embeddings are saved in a new cell attribute column so as to not overwrite the previous UMAP values. The new column will be called RNA_pUMAP; ‘RNA’ is automatically prepend because the assay name is RNA

ds.run_umap(
    n_epochs=250, 
    spread=5,
    min_dist=1,
    parallel=True,
    label='pUMAP'
)
time: 1min 15s (started: 2023-05-24 14:39:18 +00:00)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/metadata.py:304: RuntimeWarning: overflow encountered in cast
  a = np.empty(self.N).astype(type(values[0]))
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/0.23.5/lib/python3.8/site-packages/scarf/metadata.py:304: RuntimeWarning: invalid value encountered in cast
  a = np.empty(self.N).astype(type(values[0]))

Visualize the new UMAP

ds.plot_layout(
    layout_key='RNA_pUMAP',
    color_by='sample_id',
    cmap='RdBu',
    legend_ondata=False
)
../_images/3289d09bd07897df1398459bda33c34130502175ac52d616f1c840d83010c02f.png
time: 829 ms (started: 2023-05-24 14:40:33 +00:00)

Visualization of cluster labels in the new UMAP space shows that the cells from the same cell-type do not split into separate clusters like they did before.

ds.plot_layout(
    layout_key='RNA_pUMAP',
    color_by='imported_labels',
    legend_ondata=False
)
../_images/0ed2573216adf665195658939e50925bd7c8b55834ade5bff193a29feb2f6461.png
time: 1.28 s (started: 2023-05-24 14:40:34 +00:00)

That is all for this vignette.