---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.11.3
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

## Data import and export in Scarf

```python
%load_ext autotime

import scarf
scarf.__version__
```

---
### 1) Fetch datasets from cloud repository

Scarf stores many single-cell datasets online on [OSF](https://osf.io/zeupv/). The datasets are stored in several formats, including MTX, 10x HDF5 and H5ad (anndata). These files can be downloaded using Scarf's `fetch_dataset` function.


To check which datasets are available to download, use the `show_available_datasets` function:

```python
scarf.show_available_datasets()
```

**Naming format**: Datasets are named using this rule: \<author\>\_\<number of cells\>\_\<cell/tissue type or species\>\_\<single-cell method\>
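
As a minimal sketch of the naming rule, a dataset name can be split into its four fields. The helper `parse_dataset_name` and the `DatasetName` tuple are hypothetical and not part of Scarf. The sketch assumes every name contains exactly four underscore-separated fields, which holds for the names used in this vignette.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    # Field names follow the naming rule above; this class is illustrative only
    author: str
    n_cells: str
    sample: str
    method: str


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its four underscore-separated fields."""
    parts = name.split('_')
    if len(parts) != 4:
        raise ValueError(f"Expected 4 fields separated by '_', got {len(parts)}: {name!r}")
    return DatasetName(*parts)


parse_dataset_name('tenx_10K_pbmc_atacseq')
```

Hyphens inside a field, as in `bastidas-ponce_4K_pancreas-d15_rnaseq`, are left untouched because only underscores act as separators.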

Now using any of these dataset names we can download the dataset of our choice:

```python
# This dataset is in Cellranger (10x) HDF5 format.
scarf.fetch_dataset('tenx_10K_pbmc_atacseq', save_path='./scarf_datasets')
```

The above dataset is saved in the `scarf_datasets` directory under our current working directory. You can change the `save_path` parameter to save the data in a location of your choice. The dataset above was downloaded in 10x's HDF5 format. Let's download a few more datasets that are in different file formats.

```python
# This dataset is in MTX format along with barcodes and features TSV files.
scarf.fetch_dataset('xin_1K_pancreas_rnaseq', save_path='./scarf_datasets')
```

```python
# This dataset is in H5ad (anndata) format.
scarf.fetch_dataset('bastidas-ponce_4K_pancreas-d15_rnaseq', save_path='./scarf_datasets')
```
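
To check what was downloaded, the files under the save directory can be listed with the standard library. This is a small sketch: `list_downloaded_files` is a hypothetical helper, not a Scarf function, and it assumes the `./scarf_datasets` path used above.

```python
from pathlib import Path


def list_downloaded_files(save_path):
    """Return sorted relative paths of all files under save_path (empty list if it is missing)."""
    root = Path(save_path)
    if not root.is_dir():
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob('*') if p.is_file())


# Prints one line per downloaded file, grouped by dataset directory
for fn in list_downloaded_files('./scarf_datasets'):
    print(fn)
```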

---
### 2) Conversion to Scarf's Zarr format file

Scarf stores data as dense, compressed chunks in the Zarr file format. The `scarf.readers` and `scarf.writers` modules contain classes that read many different file formats and convert them to Zarr. Reader and writer classes often come in complementary pairs. Let's explore them below.


#### From 10x's HDF5 file format

```python
# Change file_type to 'rna' in case of sc-RNA-seq or CITE-Seq
reader = scarf.CrH5Reader('scarf_datasets/tenx_10K_pbmc_atacseq/data.h5', file_type='atac')

writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/pbmc_atac.zarr')  # change value of `zarr_fn` to your choice of filename and path
writer.dump()
```

#### From 10x's (Cellranger) MTX file format

The `scarf.CrDirReader` class reads MTX files generated by the Cellranger pipeline. `CrDirReader` stands for 'Cellranger directory reader'. Once read in, the data can be written to Zarr format using the `scarf.CrToZarr` class. The following example shows this conversion:

```python
# Note that here we only give the name of the directory containing the MTX file (along with the barcodes and features files)
reader = scarf.CrDirReader('scarf_datasets/xin_1K_pancreas_rnaseq', file_type='rna')

writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/xin_1K.zarr')  # change value of `zarr_fn` to your choice of filename and path
writer.dump()
```

#### From Anndata H5ad file format

```python
# Here we give the path to the H5ad file itself; the keys below tell the reader where ids and names are stored
reader = scarf.H5adReader('scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad', 
                          cell_ids_key = 'index',               # Where Cell/barcode ids are saved under 'obs' slot
                          feature_ids_key = 'index',            # Where gene ids are saved under 'var' slot
                          feature_name_key = 'gene_short_name')  # Where gene names are saved under 'var' slot

writer = scarf.H5adToZarr(reader, zarr_fn='scarf_datasets/differentiating_pancreatic_cells.zarr') # change value of `zarr_fn` to your choice of filename and path
writer.dump()
```

Conversion from the [Loom](https://loompy.org/) file format is also supported using `scarf.LoomReader` and `scarf.LoomToZarr`, which can be used in a similar fashion to the other readers and writers.
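
A minimal sketch of a Loom conversion, following the same reader/writer pattern as the examples above. It assumes that `scarf.LoomReader` accepts the Loom file path as its first argument and that `scarf.LoomToZarr` accepts `zarr_fn` like the other writers. Depending on how your Loom file stores cell and feature names, the reader may need additional arguments. The wrapper `loom_to_zarr` and the file paths are hypothetical.

```python
def loom_to_zarr(loom_fn, zarr_fn):
    """Convert a Loom file to Scarf's Zarr format (assumed to mirror the other converters)."""
    import scarf  # imported here so the helper can be defined before Scarf is loaded

    reader = scarf.LoomReader(loom_fn)                  # assumption: default keys fit the file
    writer = scarf.LoomToZarr(reader, zarr_fn=zarr_fn)  # assumption: same `zarr_fn` keyword
    writer.dump()


# Hypothetical paths; replace them with your own Loom file and output location
# loom_to_zarr('path/to/data.loom', 'scarf_datasets/data_from_loom.zarr')
```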


---
### 3) Exporting data from Zarr file format


#### To Cellranger (10x) MTX format

```python
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')
```

```python
scarf.writers.to_mtx(ds.RNA, mtx_directory='scarf_datasets/diff_pancreas')
```

#### To H5ad format

Conversion to H5ad is the preferred mode as it runs much faster and produces smaller files. Updates are underway to include all the data from the Zarr file, such as UMAP, PCA and the graph, in the anndata output.

```python
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')
```

```python
scarf.writers.to_h5ad(ds.RNA, h5ad_filename='scarf_datasets/diff_pancreas.h5ad')
```
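
To see the difference in footprint between the two exports, their sizes on disk can be compared. This is a sketch using only the standard library: `disk_usage_bytes` is a hypothetical helper, and the paths assume the export commands above were run unchanged.

```python
from pathlib import Path


def disk_usage_bytes(path):
    """Total size in bytes of a file, or of all files under a directory (0 if path is missing)."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    if p.is_dir():
        return sum(f.stat().st_size for f in p.rglob('*') if f.is_file())
    return 0


# The MTX export is a directory, while the H5ad export is a single file
for label, path in [('MTX', 'scarf_datasets/diff_pancreas'),
                    ('H5ad', 'scarf_datasets/diff_pancreas.h5ad')]:
    print(f"{label}: {disk_usage_bytes(path) / 1e6:.2f} MB")
```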

---
That is all for this vignette.