---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

## Handling datasets with multiple modalities

```{code-cell} ipython3
%load_ext autotime

import scarf
scarf.__version__
```

---
### 1) Fetch and convert data

For this tutorial we will use CITE-Seq data from 10x genomics. This dataset contains two modalities: gene expression and surface protein abundance. Throughout this tutorial we will refer to gene expression modality as `RNA` and surface protein as `ADT`. We start by downloading the data and converting it into Zarr format:

```{code-cell} ipython3
scarf.fetch_dataset(
    'tenx_8K_pbmc_citeseq',
    save_path='scarf_datasets'
)
```

```{code-cell} ipython3
reader = scarf.CrH5Reader(
    'scarf_datasets/tenx_8K_pbmc_citeseq/data.h5'
)
```

We can also quickly check the different kinds of assays present in the file and the number of features from each of them.

```{code-cell} ipython3
reader.assayFeats
```

The **nFeatures** column shows the number of features present in each assay. `CrH5Reader` will automatically pull this information from H5 file and rename the 'Gene Expression' assay to **RNA**. Here it also found another assay: 'Antibody Capture' and named it to **assay2**. We will rename this to **ADT**.

```{code-cell} ipython3
reader.rename_assays({'assay2': 'ADT'})
reader.assayFeats
```

Now the data is converted into Zarr format. Like single assay datasets, all the data is saved under one Zarr file.

```{code-cell} ipython3
writer = scarf.CrToZarr(
    reader,
    zarr_fn='scarf_datasets/tenx_8K_pbmc_citeseq/data.zarr',
    chunk_size=(2000, 1000),
)
writer.dump(batch_size=1000)
```

---
### 2) Create a multimodal DataStore

The next step is to create a Scarf `DataStore` object. This object will be the primary way to interact with the data and all its constituent assays. The first time a Zarr file is loaded, we need to set the default assay. Here we set the 'RNA' assay as the default assay. When a Zarr file is loaded, Scarf checks if some per-cell statistics have been calculated. If not, then **nFeatures** (number of features per cell) and **nCounts** (total sum of feature counts per cell) are calculated. Scarf will also attempt to calculate the percent of mitochondrial and ribosomal content per cell.

```{code-cell} ipython3
ds = scarf.DataStore(
    'scarf_datasets/tenx_8K_pbmc_citeseq/data.zarr',
    default_assay='RNA',
    nthreads=4
)
```

We can print out the DataStore object to get an overview of all the assays stored.

```{code-cell} ipython3
ds
```

Feature attribute tables for each of the assays can be accessed like this:

```{code-cell} ipython3
ds.RNA.feats.head()
```

```{code-cell} ipython3
ds.ADT.feats.head()
```

Cell filtering is performed based on the default assay. Here we use the `auto_filter_cells` method of the `DataStore` to filter low quality cells.

```{code-cell} ipython3
ds.auto_filter_cells()
```

---
### 3) Process gene expression modality

Now we process the RNA assay to perform feature selection, create KNN graph, run UMAP reduction and clustering. These steps are same as shown in the basic workflow for scRNA-Seq data.

```{code-cell} ipython3
ds.mark_hvgs(
    min_cells=20,
    top_n=1000,
    min_mean=-3,
    max_mean=2,
    max_var=6
)

ds.make_graph(
    feat_key='hvgs',
    k=21,
    dims=15,
    n_centroids=100
)

ds.run_umap(
    n_epochs=250,
    spread=5,
    min_dist=1,
    parallel=True
)

ds.run_leiden_clustering(resolution=1)
```

```{code-cell} ipython3
ds.plot_layout(
    layout_key='RNA_UMAP',
    color_by='RNA_leiden_cluster',
    cmap='tab20'
)
```

---
### 4) Process protein surface abundance modality

+++

We will now perform similar steps as RNA for the ADT data. Since ADT panels are often custom designed, we will not perform any feature selection step. This particular data contains some control antibodies which we should filter out before downstream analysis. 

```{code-cell} ipython3
ds.ADT.feats.head(n=ds.ADT.feats.N)
```

We can manually filter out the control antibodies by updating **I** to be False for those features. To do so we first extract the names of all the ADT features like below:

```{code-cell} ipython3
adt_names = ds.ADT.feats.to_pandas_dataframe(['names'])['names']
adt_names
```

The ADT features with 'control' in name are designated as control antibodies. You can have your own selection criteria here. The aim here is to create a boolean array that has `True` value for features to be removed.

```{code-cell} ipython3
is_control = adt_names.str.contains('control').values
is_control
```

Now we update `I` to remove the control features. `update_key` method takes a boolean array and disables the features that have `False` value. So we invert the above created array (using `~`) before providing it to `update_key`. The second parameter for `update_key` denotes which feature table boolean column to modify, `I` in this case.

```{code-cell} ipython3
ds.ADT.feats.update_key(~is_control, 'I')
ds.ADT.feats.head(n=ds.ADT.feats.N)
```

Assays named ADT are automatically created as objects of the `ADTassay` class, which uses CLR (centred log ratio) normalization as the default normalization method.

```{code-cell} ipython3
print (ds.ADT)
print (ds.ADT.normMethod.__name__)
```

Now we are ready to create a KNN graph of cells using only ADT data. Here we will use all the features (except those that were filtered out) and that is why we use `I` as value for `feat_key`. It is important to note the value for `from_assay` parameter which has now been set to `ADT`. If no value is provided for `from_assay` then it is automatically set to the default assay. By setting `dims` to 0 we disable dimension reduction.

```{code-cell} ipython3
ds.make_graph(
    from_assay='ADT',
    feat_key='I', 
    k=21,
    dims=0,
    n_centroids=100
)
```

UMAP and clustering can be run on ADT assay by simply setting `from_assay` parameter value to 'ADT':

```{code-cell} ipython3
ds.run_umap(
    from_assay='ADT',
    n_epochs=250,
    spread=5,
    min_dist=1,
    parallel=True
)

ds.run_leiden_clustering(
    from_assay='ADT',
    resolution=1
)
```

If we now check the cell attribute table, we will find the UMAP coordinates and clusters calculated using `ADT` assay:

```{code-cell} ipython3
ds.cells.head()
```

Visualizing the UMAP and clustering calcualted using `ADT` only:

```{code-cell} ipython3
ds.plot_layout(
    layout_key='ADT_UMAP',
    color_by='ADT_leiden_cluster',
    cmap='tab20'
)
```

---
### 5) Cross modality comparison

It is generally of interest to see how different modalities corroborate each other.

```{code-cell} ipython3
# UMAP on RNA and coloured with clusters calculated on ADT
ds.plot_layout(
    layout_key=['RNA_UMAP', 'ADT_UMAP'],
    color_by=['ADT_leiden_cluster', 'RNA_leiden_cluster'],
    cmap='tab20',
    width=4,
    height=4, 
    n_columns=2,
    point_size=5,
    legend_onside=False
)
```

We can quantify the overlap of cells between RNA and ADT clusters. The following table has ADT clusters on columns and RNA clusters on rows. This table shows a cross tabulation of cells across the clustering from the two modalities.

```{code-cell} ipython3
import pandas as pd

df = pd.crosstab(
    ds.cells.fetch('RNA_leiden_cluster'),
    ds.cells.fetch('ADT_leiden_cluster')
)
df
```

There are possibly many interesting strategies to analyze this further. One simple way to summarize the above table can be quantify the transcriptomics 'purity' of ADT clusters:

```{code-cell} ipython3
(100 * df.max()/df.sum()).sort_values(ascending=False)
```

Individual ADT expression can be visualized in both UMAPs easily.

```{code-cell} ipython3
ds.plot_layout(
    layout_key=['RNA_UMAP', 'ADT_UMAP'],
    color_by='CD16_TotalSeqB',
    from_assay='ADT',
    width=4,
    height=4,
    n_columns=2,
    point_size=5
)
```

We can also query gene expression and visualize it on both RNA and ADT UMAPs. Here we query gene FCGR3A which codes for CD16:

```{code-cell} ipython3
ds.plot_layout(
    layout_key=['RNA_UMAP', 'ADT_UMAP'],
    color_by='FCGR3A',
    from_assay='RNA',
    width=4,
    height=4,
    n_columns=2,
    point_size=5
)
```

---
### 6) Integration of modalities

The KNN graphs created individually for each of the modalities can be merged together to provide an integrated mutimodal view of the data. Scarf takes the latest KNN graphs (continous form edge weight) generated for each of the user chosen modality and merges the edges from each modality. After first round of merging, Scarf performs edge pruning by penalizing those edges more that have lower number of shared nearest neighbors between the connected cells. For each cells edges are pruned until the same number of edges as in individual modalities' KNN graphs are left.

Here we will integrate the *RNA* and *ADT* assays and run UMAP and leiden clustering on the integrated graph.

```{code-cell} ipython3
ds.integrate_assays(
    assays=['RNA', 'ADT'],
    label='RNA+ADT'
)
```

`integrated_graph` parameter in `run_umap` and `run_leiden_clustering` allows running these steps on the integrated graph.

```{code-cell} ipython3
ds.run_umap(
    integrated_graph='RNA+ADT',
    n_epochs=500,
    spread=5,
    min_dist=0.5,
    parallel=True
)

ds.run_leiden_clustering(
    integrated_graph='RNA+ADT',
    resolution=1.75
)
```

Lets visualize the UMAPs created using the integrated manifolds from the two modalities. Here we label the cells based on their modality specific cluster identity as well as integrated manifold cluster identity

```{code-cell} ipython3
ds.plot_layout(
    layout_key=['RNA+ADT_UMAP'],
    color_by=['RNA_leiden_cluster',
              'ADT_leiden_cluster',
              'RNA+ADT_leiden_cluster'],
    cmap='tab20',
    legend_onside=False,
    point_size=5,
    width=4,
    height=4,
    n_columns=3
)
```

```{code-cell} ipython3
ds.cells.columns
```

The UMAP and clustering calculated on the integrated graph are here saved under cell attribute table with prefix *RNA+ADT*

+++

---
That is all for this vignette.