---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.11.3
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

## Data import and export in Scarf

```python
%load_ext autotime

import scarf
scarf.__version__
```

---
### 1) Fetch datasets from cloud repository

Scarf stores many single-cell datasets online on [OSF](https://osf.io/zeupv/). The datasets there are stored in several formats, including MTX, 10x HDF5 and H5ad (anndata). These files can be downloaded using Scarf's `fetch_dataset` function. To check which datasets are available for download, use the `show_available_datasets` function:

```python
scarf.show_available_datasets()
```

**Naming format**: Datasets are named using this rule: \<source\>\_\<cell count\>\_\<sample\>\_\<assay\> (for example, `tenx_10K_pbmc_atacseq`).

Using any of these dataset names, we can download the dataset of our choice:

```python
# This dataset is in Cellranger (10x) HDF5 format.
scarf.fetch_dataset('tenx_10K_pbmc_atacseq', save_path='./scarf_datasets')
```

The above dataset is saved under the directory `scarf_datasets` in the current working directory. You can change the `save_path` parameter to save the data in a location of your choice.

The dataset above was downloaded in 10x's HDF5 format. Let's download a few more datasets that are in different file formats.

```python
# This dataset is in MTX format along with barcodes and features TSV files.
scarf.fetch_dataset('xin_1K_pancreas_rnaseq', save_path='./scarf_datasets')
```

```python
# This dataset is in H5ad (anndata) format.
scarf.fetch_dataset('bastidas-ponce_4K_pancreas-d15_rnaseq', save_path='./scarf_datasets')
```

---
### 2) Conversion to Scarf's Zarr format file

Scarf stores data as dense, compressed chunks in the Zarr file format. The `scarf.readers` and `scarf.writers` modules contain classes that read many different file formats and convert them to Zarr. Reader and writer classes often come in complementary pairs. Let's explore them below.

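Before running any reader, it can help to confirm what the downloads actually placed under `save_path`. The following is a minimal, standard-library-only sketch; `list_downloaded_files` is a hypothetical helper written for this vignette, not part of Scarf, and it assumes each dataset is saved in its own sub-directory of `save_path`.

```python
from pathlib import Path


def list_downloaded_files(save_path):
    """Map each dataset directory under `save_path` to its sorted file names.

    Hypothetical helper (not a Scarf function). Returns an empty dict if
    `save_path` does not exist or is not a directory.
    """
    root = Path(save_path)
    if not root.is_dir():
        return {}
    return {
        d.name: sorted(f.name for f in d.iterdir() if f.is_file())
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


# Print one line per dataset directory found under the download location.
for name, files in list_downloaded_files('./scarf_datasets').items():
    print(name, files)
```

The file names printed for each dataset are the ones to pass to the matching reader class.
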
#### From 10x's HDF5 file format

```python
# Change file_type to 'rna' in case of sc-RNA-seq or CITE-Seq
reader = scarf.CrH5Reader('scarf_datasets/tenx_10K_pbmc_atacseq/data.h5', file_type='atac')
# Change the value of `zarr_fn` to your choice of filename and path
writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/pbmc_atac.zarr')
writer.dump()
```

#### From 10x's (Cellranger) MTX file format

The `scarf.CrDirReader` class reads MTX files generated by the Cellranger pipeline. `CrDirReader` stands for 'Cellranger directory reader'. Once read in, the data can be dumped into Zarr format using the `scarf.CrToZarr` class. The following example shows how to do this conversion:

```python
# Note that here we only give the name of the directory containing the MTX file
# (along with the barcodes and features files)
reader = scarf.CrDirReader('scarf_datasets/xin_1K_pancreas_rnaseq', file_type='rna')
# Change the value of `zarr_fn` to your choice of filename and path
writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/xin_1K.zarr')
writer.dump()
```

#### From Anndata H5ad file format

```python
# Here we give the path to the H5ad file itself
reader = scarf.H5adReader('scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad',
                          cell_ids_key='index',                # Where cell/barcode ids are saved under 'obs' slot
                          feature_ids_key='index',             # Where gene ids are saved under 'var' slot
                          feature_name_key='gene_short_name')  # Where gene names are saved under 'var' slot
# Change the value of `zarr_fn` to your choice of filename and path
writer = scarf.H5adToZarr(reader, zarr_fn='scarf_datasets/differentiating_pancreatic_cells.zarr')
writer.dump()
```

Conversion from the [Loom](https://loompy.org/) file format is also supported through `scarf.LoomReader` and `scarf.LoomToZarr`, which can be used in the same fashion as the other readers and writers.

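The reader and writer pairings named in this section can be summarised as a small lookup. The sketch below is only an illustration: `guess_converter` is a hypothetical helper, not part of Scarf, and the extension rules it applies (`.h5` for 10x HDF5, `.h5ad` for anndata, `.loom` for Loom, no extension for a Cellranger MTX directory) are assumptions based on the file names used in this vignette. It inspects only the path string and never touches the disk.

```python
from pathlib import Path

# (reader class name, writer class name) pairs as named in this vignette.
CONVERTERS = {
    'cr_h5': ('CrH5Reader', 'CrToZarr'),
    'cr_dir': ('CrDirReader', 'CrToZarr'),
    'h5ad': ('H5adReader', 'H5adToZarr'),
    'loom': ('LoomReader', 'LoomToZarr'),
}


def guess_converter(path):
    """Return (reader name, writer name) guessed from the extension of `path`.

    Hypothetical helper: the extension-to-format mapping is an assumption.
    """
    suffix = Path(path).suffix.lower()
    if suffix == '.h5':
        return CONVERTERS['cr_h5']
    if suffix == '.h5ad':
        return CONVERTERS['h5ad']
    if suffix == '.loom':
        return CONVERTERS['loom']
    if suffix == '':
        # Assumption: a path without an extension is a directory of MTX files.
        return CONVERTERS['cr_dir']
    raise ValueError(f'Cannot guess a reader/writer pair for: {path}')


print(guess_converter('scarf_datasets/xin_1K_pancreas_rnaseq'))
```

The returned names still have to be matched with the right keyword arguments (for example `file_type` for the Cellranger readers, or the `*_key` arguments for `H5adReader`), so this lookup does not replace the explicit calls shown above.
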
---
### 3) Exporting data from Zarr file format

#### To Cellranger (10x) MTX format

```python
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')
```

```python
scarf.writers.to_mtx(ds.RNA, mtx_directory='scarf_datasets/diff_pancreas')
```

#### To H5ad format

Conversion to H5ad is the preferred mode, as it runs much faster than MTX export and produces smaller files. Work is underway to also export the other data stored in the Zarr file, such as UMAP, PCA and graph, into anndata.

```python
ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')
```

```python
scarf.writers.to_h5ad(ds.RNA, h5ad_filename='scarf_datasets/diff_pancreas.h5ad')
```

---
That is all for this vignette.