Data import and export in Scarf#

%load_ext autotime

import scarf
scarf.__version__
'0.28.9'
time: 1.1 s (started: 2024-01-24 15:35:41 +00:00)

1) Fetch datasets from cloud repository#

Scarf hosts many single-cell datasets online on OSF. These datasets are stored in several formats, including MTX, 10x HDF5 and H5ad (anndata). The files can be downloaded with Scarf’s fetch_dataset function.

To check which datasets are available to download, use the show_available_datasets function:

scarf.show_available_datasets()
annotations
baron_8K_pancreas_rnaseq
bastidas-ponce_4K_pancreas-d15_rnaseq
cao_2.1M_moca_rnaseq
cusanovich_81K_mouse_atacseq
hca_783K_blood_rnaseq
kang_14K_ifnb-pbmc_rnaseq
kang_15K_pbmc_rnaseq
lecun_60K_mnist_images
motifs
muraro_2K_pancreas_rnaseq
saunders_110K_brain_rnaseq
segerstolpe_2K_pancreas_rnaseq
tenx_1.3M_brain_rnaseq
tenx_10K_pbmc-v1_atacseq
tenx_3K_pbmc_multiome-gex-atac
tenx_5K_pbmc_rnaseq
tenx_8K_pbmc_citeseq
xin_1K_pancreas_rnaseq
zalando_60K_fmnist_images
zeisel_161K_nervous_rnaseq
zheng_69K_pbmc_rnaseq
time: 16 s (started: 2024-01-24 15:35:42 +00:00)

Naming format: Datasets are named using this rule: <author>_<number of cells>_<cell/tissue type or species>_<single-cell method>
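
The naming rule above can be applied programmatically, for example to pick out all datasets of one single-cell method. The following is a minimal sketch in plain Python; `DatasetName`, `parse_dataset_name` and `names_with_method` are hypothetical helpers, not part of Scarf. Entries that do not have four underscore-separated fields (such as annotations and motifs in the list above) are skipped.

```python
from typing import List, NamedTuple, Optional


class DatasetName(NamedTuple):
    author: str
    n_cells: str  # kept as text, e.g. '10K' or '2.1M'
    sample: str   # cell/tissue type or species
    method: str   # single-cell method, e.g. 'rnaseq'


def parse_dataset_name(name: str) -> Optional[DatasetName]:
    """Split a dataset name on underscores; return None unless it has four fields."""
    parts = name.split("_")
    if len(parts) != 4:
        return None
    return DatasetName(*parts)


def names_with_method(names: List[str], method: str) -> List[str]:
    """Keep only the names whose last field equals `method`."""
    selected = []
    for name in names:
        parsed = parse_dataset_name(name)
        if parsed is not None and parsed.method == method:
            selected.append(name)
    return selected


# A few names copied from the listing above
example_names = [
    "annotations",
    "tenx_10K_pbmc-v1_atacseq",
    "xin_1K_pancreas_rnaseq",
    "cusanovich_81K_mouse_atacseq",
]
atac_names = names_with_method(example_names, "atacseq")
print(atac_names)
```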

Using any of these dataset names, we can now download the dataset of our choice:

# This dataset is in Cellranger (10x) HDF5 format.
scarf.fetch_dataset(
    dataset_name='tenx_10K_pbmc-v1_atacseq',
    save_path='./scarf_datasets'
)
time: 6.1 s (started: 2024-01-24 15:35:58 +00:00)

The dataset above is saved under the directory scarf_datasets in the current working directory. Change the save_path parameter to save the data in a location of your choice. This dataset was downloaded in 10x’s HDF5 format. Let’s download a few more datasets that are in different file formats.

# This dataset is in MTX format along with barcodes and features TSV files.
scarf.fetch_dataset(
    dataset_name='xin_1K_pancreas_rnaseq',
    save_path='./scarf_datasets'
)
time: 13.6 s (started: 2024-01-24 15:36:04 +00:00)

# This dataset is in H5ad (anndata) format.
scarf.fetch_dataset(
    dataset_name='bastidas-ponce_4K_pancreas-d15_rnaseq',
    save_path='./scarf_datasets'
)
time: 7.32 s (started: 2024-01-24 15:36:18 +00:00)
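
The three downloads above follow the same pattern, so they can be wrapped in a loop. The sketch below is one possible way to do this while skipping datasets that already have a directory under save_path. `fetch_missing` is a hypothetical helper, and treating an existing `<save_path>/<dataset_name>` directory as a finished download is an assumption. The fetch function is passed in as an argument, so the demonstration runs offline with a stand-in.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_missing(
    names: Iterable[str],
    save_path: str,
    fetch: Callable[..., object],
) -> List[str]:
    """Call `fetch` for every dataset without a directory under `save_path`.

    Assumption: an existing <save_path>/<dataset_name> directory means the
    download already finished. Returns the names that were fetched.
    """
    fetched = []
    for name in names:
        if (Path(save_path) / name).is_dir():
            continue
        fetch(dataset_name=name, save_path=save_path)
        fetched.append(name)
    return fetched


# Offline demonstration: this stand-in only creates the per-dataset
# directory instead of downloading anything.
def fake_fetch(dataset_name: str, save_path: str) -> None:
    (Path(save_path) / dataset_name).mkdir(parents=True)


with tempfile.TemporaryDirectory() as tmp:
    first_run = fetch_missing(["xin_1K_pancreas_rnaseq"], tmp, fake_fetch)
    second_run = fetch_missing(
        ["xin_1K_pancreas_rnaseq", "tenx_5K_pbmc_rnaseq"], tmp, fake_fetch
    )

print(first_run)   # datasets fetched on the first call
print(second_run)  # only the dataset that was still missing
```

For real downloads, pass scarf.fetch_dataset as the `fetch` argument and './scarf_datasets' as `save_path`.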

2) Conversion to Scarf’s Zarr format file#

Scarf stores data as dense, compressed chunks in the Zarr file format. The scarf.readers and scarf.writers modules contain classes that read many different file formats and convert them to Zarr. Reader and writer classes often come in complementary pairs. Let’s explore them below.

From 10x’s HDF5 file format#

# Change file_type to 'rna' in case of sc-RNA-seq or CITE-Seq
reader = scarf.CrH5Reader(
    'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.h5'
)

# change value of `zarr_loc` to your choice of filename and path
writer = scarf.CrToZarr(
    reader,
    zarr_loc='scarf_datasets/pbmc_atac.zarr'  
)  
writer.dump()
time: 10.3 s (started: 2024-01-24 15:36:25 +00:00)
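
To see how the converted data is chunked and compressed, the array metadata can be read directly from disk. This is a minimal sketch that assumes zarr_loc is a Zarr v2 directory store, in which every array directory holds a `.zarray` JSON file with its shape, chunk shape, dtype and compressor. `summarize_zarr_arrays` is a hypothetical helper, and the `RNA/counts` layout in the demonstration is made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def summarize_zarr_arrays(zarr_dir: str) -> Dict[str, dict]:
    """Collect shape, chunks, dtype and compressor id of each array.

    Assumes a Zarr v2 directory store (one `.zarray` JSON file per array).
    """
    root = Path(zarr_dir)
    summary = {}
    for meta_file in sorted(root.rglob(".zarray")):
        meta = json.loads(meta_file.read_text())
        compressor = meta.get("compressor") or {}  # null means uncompressed
        array_path = meta_file.parent.relative_to(root).as_posix()
        summary[array_path] = {
            "shape": meta.get("shape"),
            "chunks": meta.get("chunks"),
            "dtype": meta.get("dtype"),
            "compressor": compressor.get("id"),
        }
    return summary


# Offline demonstration on a hand-written metadata file
with tempfile.TemporaryDirectory() as tmp:
    array_dir = Path(tmp) / "demo.zarr" / "RNA" / "counts"
    array_dir.mkdir(parents=True)
    (array_dir / ".zarray").write_text(json.dumps({
        "zarr_format": 2, "shape": [100, 50], "chunks": [10, 50],
        "dtype": "<u4", "compressor": {"id": "blosc"},
        "fill_value": 0, "order": "C", "filters": None,
    }))
    demo_summary = summarize_zarr_arrays(str(Path(tmp) / "demo.zarr"))

print(demo_summary)
```

Calling `summarize_zarr_arrays('scarf_datasets/pbmc_atac.zarr')` would list the arrays written by the conversion above, provided the store layout matches the assumption.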

From 10x’s (Cellranger) MTX file format#

The scarf.CrDirReader class reads MTX files generated by the Cellranger pipeline. CrDirReader stands for ‘Cellranger directory reader’. Once read in, the data can be dumped into Zarr format using the scarf.CrToZarr class. The following example shows this conversion:

# Note: here we give only the name of the directory containing the MTX file (along with the barcodes and features files)
reader = scarf.CrDirReader(
    'scarf_datasets/xin_1K_pancreas_rnaseq'
)

# change value of `zarr_loc` to your choice of filename and path
writer = scarf.CrToZarr(
    reader, 
    zarr_loc='scarf_datasets/xin_1K.zarr'
)
writer.dump()
WARNING: feature_types extraction failed from features.tsv.gz in column 2
time: 4.43 s (started: 2024-01-24 15:36:36 +00:00)
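
The warning above refers to column 2 of features.tsv.gz. A quick way to investigate is to look at the first few rows of that file and count their columns. The sketch below uses only the standard library; `peek_tsv_gz` and `column_counts` are hypothetical helpers, and the guess that the file simply has fewer columns than the reader expects is an assumption, not something the warning confirms.

```python
import csv
import gzip
import os
import tempfile
from itertools import islice
from typing import List


def peek_tsv_gz(path: str, n_rows: int = 5) -> List[List[str]]:
    """Return the first `n_rows` rows of a gzip-compressed, tab-separated file."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(islice(csv.reader(handle, delimiter="\t"), n_rows))


def column_counts(rows: List[List[str]]) -> List[int]:
    """Return the distinct row widths, sorted."""
    return sorted({len(row) for row in rows})


# Offline demonstration on a small two-column file (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "features.tsv.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write("ENSG0001\tGENE1\nENSG0002\tGENE2\n")
    demo_rows = peek_tsv_gz(demo_path)
    demo_widths = column_counts(demo_rows)

print(demo_rows)
print(demo_widths)
```

For the dataset above, the file would presumably be at 'scarf_datasets/xin_1K_pancreas_rnaseq/features.tsv.gz'. If every row has only two columns, there is no third column (index 2) from which feature types could be read.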

From Anndata H5ad file format#

# Note: here we give the path to the H5ad file itself
reader = scarf.H5adReader(
    'scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad',
    cell_ids_key='index',               # Where cell/barcode ids are saved under 'obs' slot
    feature_ids_key='index',            # Where gene ids are saved under 'var' slot
    feature_name_key='gene_short_name'  # Where gene names are saved under 'var' slot
)

# change value of `zarr_loc` to your choice of filename and path
writer = scarf.H5adToZarr(
    reader,
    zarr_loc='scarf_datasets/differentiating_pancreatic_cells.zarr'
)
writer.dump()
INFO: No value provided for assay names. Will use default value: 'RNA'
WARNING: Could not find feature names key: gene_short_name in self.featureAttrsKey.
time: 2.21 s (started: 2024-01-24 15:36:40 +00:00)

Conversion from the Loom file format is also supported through scarf.LoomReader and scarf.LoomToZarr, which are used in the same way as the other readers and writers.
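
The reader and writer pairs used in this section can be summarised in a small lookup. The sketch below maps an input path to the names of the matching classes. The class names come from this vignette; `reader_class_name` and `WRITER_FOR_READER` are hypothetical helpers, and choosing the reader by file extension ('.h5', '.h5ad', '.loom') is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

# Writer class used with each reader class in this vignette
WRITER_FOR_READER = {
    "CrH5Reader": "CrToZarr",
    "CrDirReader": "CrToZarr",
    "H5adReader": "H5adToZarr",
    "LoomReader": "LoomToZarr",
}


def reader_class_name(path: str) -> str:
    """Guess the reader class name from a path (directory or file extension)."""
    p = Path(path)
    if p.is_dir():
        return "CrDirReader"  # directory with MTX, barcodes and features files
    suffix = p.suffix.lower()
    if suffix == ".h5ad":
        return "H5adReader"
    if suffix == ".h5":
        return "CrH5Reader"
    if suffix == ".loom":
        return "LoomReader"
    raise ValueError(f"No reader known for: {path}")


h5_reader = reader_class_name("scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.h5")
h5ad_reader = reader_class_name(
    "scarf_datasets/bastidas-ponce_4K_pancreas-d15_rnaseq/data.h5ad"
)
with tempfile.TemporaryDirectory() as tmp:
    dir_reader = reader_class_name(tmp)

print(h5_reader, WRITER_FOR_READER[h5_reader])
print(h5ad_reader, WRITER_FOR_READER[h5ad_reader])
print(dir_reader, WRITER_FOR_READER[dir_reader])
```

The returned names can be resolved with `getattr(scarf, name)`. Note that H5adReader also needs the key arguments shown in the H5ad example above, so a fully generic converter would have to pass those through.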


3) Exporting data from Zarr file format#

To Cellranger (10x) MTX format#

ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')
time: 2.18 s (started: 2024-01-24 15:36:42 +00:00)

scarf.writers.to_mtx(
    assay=ds.RNA,
    mtx_directory='scarf_datasets/diff_pancreas'
)
time: 12.6 s (started: 2024-01-24 15:36:44 +00:00)
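
A simple sanity check on the exported matrix is to read the size line of the MTX file, which in the Matrix Market coordinate format gives the number of rows, columns and stored entries after the `%` comment lines. The sketch below uses only the standard library. `read_mtx_dimensions` is a hypothetical helper; the name of the matrix file inside mtx_directory is not stated in this vignette, so the path must be supplied by the reader.

```python
import gzip
import os
import tempfile
from typing import Tuple


def read_mtx_dimensions(path: str) -> Tuple[int, int, int]:
    """Return (n_rows, n_cols, n_entries) from a Matrix Market file.

    Works on plain or gzip-compressed files; only the header is read.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip banner, comments and blank lines
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError(f"No size line found in {path}")


# Offline demonstration on a tiny hand-written matrix
with tempfile.TemporaryDirectory() as tmp:
    demo_mtx = os.path.join(tmp, "matrix.mtx")
    with open(demo_mtx, "w") as handle:
        handle.write(
            "%%MatrixMarket matrix coordinate integer general\n"
            "% demo file\n"
            "3 2 4\n"
            "1 1 5\n2 1 1\n3 2 7\n1 2 2\n"
        )
    demo_dims = read_mtx_dimensions(demo_mtx)

print(demo_dims)
```

In Cellranger-style MTX output, rows are features and columns are barcodes, so the first two numbers would be expected to match the feature and cell counts of the assay, assuming the export follows that orientation.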

To H5ad format#

Conversion to H5ad is the preferred mode: it runs much faster than the MTX export (3.37 s versus 12.6 s in this run) and produces files with a smaller footprint. Updates are underway to include all the data from the Zarr file, such as UMAP, PCA and graph, in the anndata output.

ds = scarf.DataStore('scarf_datasets/differentiating_pancreatic_cells.zarr')
time: 27.4 ms (started: 2024-01-24 15:36:57 +00:00)

scarf.writers.to_h5ad(
    assay=ds.RNA,
    h5ad_filename='scarf_datasets/diff_pancreas.h5ad'
)
time: 3.37 s (started: 2024-01-24 15:36:57 +00:00)
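
The footprint of the two exports can be compared by measuring their size on disk. The following sketch uses only the standard library; `size_on_disk` is a hypothetical helper that returns the size of a single file, or the summed size of all files under a directory.

```python
import tempfile
from pathlib import Path


def size_on_disk(path: str) -> int:
    """Size in bytes of a file, or of all files under a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


# Offline demonstration with files of known size
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "export_dir"
    demo_dir.mkdir()
    (demo_dir / "a.bin").write_bytes(b"x" * 10)
    (demo_dir / "b.bin").write_bytes(b"x" * 20)
    demo_file = Path(tmp) / "export.bin"
    demo_file.write_bytes(b"x" * 5)
    dir_size = size_on_disk(str(demo_dir))
    file_size = size_on_disk(str(demo_file))

print(dir_size, file_size)
```

After running both exports, `size_on_disk('scarf_datasets/diff_pancreas')` and `size_on_disk('scarf_datasets/diff_pancreas.h5ad')` give the numbers to compare.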

That is all for this vignette.