Workflow for scATAC-Seq data

%load_ext autotime

import scarf
scarf.__version__
'0.16.3'
time: 1.01 s (started: 2021-08-22 17:24:45 +00:00)

1) Fetch and convert data

scarf.fetch_dataset('tenx_10K_pbmc_atacseq', save_path='scarf_datasets')
reader = scarf.CrH5Reader('scarf_datasets/tenx_10K_pbmc_atacseq/data.h5', 'atac')
reader.assayFeats
INFO: Download finished! File saved here: /home/docs/checkouts/readthedocs.org/user_builds/scarf/checkouts/0.16.3/docs/source/vignettes/scarf_datasets/tenx_10K_pbmc_atacseq/data.h5
ATAC
type Peaks
start 0
end 90686
nFeatures 90686
time: 9.49 s (started: 2021-08-22 17:24:46 +00:00)
writer = scarf.CrToZarr(reader, zarr_fn='scarf_datasets/tenx_10K_pbmc_atacseq/data.zarr', chunk_size=(1000, 2000))
writer.dump(batch_size=1000)
time: 21.8 s (started: 2021-08-22 17:24:56 +00:00)

2) Create DataStore and filter cells

ds = scarf.DataStore('scarf_datasets/tenx_10K_pbmc_atacseq/data.zarr', nthreads=4)
time: 8.61 s (started: 2021-08-22 17:25:18 +00:00)
ds.auto_filter_cells()
INFO: 296 cells flagged for filtering out using attribute ATAC_nCounts
INFO: 260 cells flagged for filtering out using attribute ATAC_nFeatures
time: 3.83 s (started: 2021-08-22 17:25:26 +00:00)
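
The log above names two per-cell attributes, ATAC_nCounts and ATAC_nFeatures. As a rough illustration only, the sketch below computes the quantities those names suggest (total counts per cell and number of peaks detected per cell) and flags cells outside a percentile range. The percentile rule, the function name flag_outlier_cells and the toy matrix are assumptions made for this example; they are not taken from Scarf and may differ from what auto_filter_cells actually does.

```python
import numpy as np


def flag_outlier_cells(counts, low_pct=1.0, high_pct=99.0):
    """Hypothetical helper: flag cells whose totals fall outside a percentile range.

    counts is a cells x peaks matrix of non-negative counts.
    """
    counts = np.asarray(counts)
    # Total counts per cell (what a name like nCounts suggests).
    n_counts = counts.sum(axis=1)
    # Number of peaks with at least one count per cell (nFeatures).
    n_features = (counts > 0).sum(axis=1)

    keep = np.ones(counts.shape[0], dtype=bool)
    for values in (n_counts, n_features):
        # Assumed rule: keep cells between the low and high percentiles.
        low, high = np.percentile(values, [low_pct, high_pct])
        keep &= (values >= low) & (values <= high)
    return n_counts, n_features, keep


# Toy data standing in for a real cells x peaks matrix.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.2, size=(500, 300))
toy_counts[0] = 0  # an empty "cell" that should be flagged

n_counts, n_features, keep = flag_outlier_cells(toy_counts)
print(f"{(~keep).sum()} cells flagged for filtering out")
```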

3) Feature selection

For scATAC-Seq data, the features (peaks) are ranked by their TF-IDF normalized values, summed across all cells. The top n features (n is set with the top_n parameter; 20,000 here) are marked as prevalent_peaks in the feature metadata and are used for downstream steps.

ds.mark_prevalent_peaks(top_n=20000)
time: 9.21 s (started: 2021-08-22 17:25:30 +00:00)
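
The ranking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Scarf's implementation: it assumes one common TF-IDF formulation (per-cell frequency multiplied by log(1 + n_cells / cells containing the peak)), and the function name rank_peaks_by_tfidf and the toy matrix are made up for this example.

```python
import numpy as np


def rank_peaks_by_tfidf(counts, top_n):
    """Hypothetical helper: sum TF-IDF values per peak and mark the top_n peaks.

    counts is a cells x peaks matrix of non-negative counts.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    # Term frequency: each cell's counts divided by that cell's total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    cell_totals[cell_totals == 0] = 1.0  # avoid division by zero for empty cells
    tf = counts / cell_totals

    # Inverse document frequency (assumed formulation, see lead-in).
    cells_with_peak = np.maximum((counts > 0).sum(axis=0), 1)
    idf = np.log1p(n_cells / cells_with_peak)

    # Prevalence: TF-IDF values summed across all cells.
    prevalence = (tf * idf).sum(axis=0)

    # Boolean mask marking the top_n peaks by prevalence.
    order = np.argsort(-prevalence, kind="stable")
    mask = np.zeros(counts.shape[1], dtype=bool)
    mask[order[:top_n]] = True
    return prevalence, mask


# Toy matrix: 4 cells x 3 peaks; the first peak is open in every cell.
counts = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
])
prevalence, mask = rank_peaks_by_tfidf(counts, top_n=1)
print(mask)
```

In this toy case the peak present in every cell receives the highest summed score, which matches the idea of "prevalent" peaks.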

4) KNN graph creation

For scATAC-Seq datasets, Scarf uses TF-IDF normalization, which is performed automatically during the graph building step. The selected features, marked as prevalent_peaks in the feature metadata, are used for graph creation. For the dimension reduction step, LSI (latent semantic indexing) is used rather than PCA. The remaining steps are the same as for scRNA-Seq data.

ds.make_graph(feat_key='prevalent_peaks', k=11, dims=21, n_centroids=1000)
INFO: ANN recall: 99.99%
time: 2min 29s (started: 2021-08-22 17:25:39 +00:00)
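
To make the TF-IDF plus LSI idea concrete, here is a minimal sketch that normalizes a toy cells x peaks matrix, reduces it with a truncated SVD (the core of LSI), and then finds each cell's nearest neighbours by brute force. It is an illustration under stated assumptions, not what make_graph does internally: the TF-IDF formula, the exact (non-approximate) neighbour search, and the names tfidf_normalize, lsi_embed and knn_indices are all choices made for this example. The values k=11 and a small dims only mirror the style of the call above.

```python
import numpy as np


def tfidf_normalize(counts):
    """Assumed TF-IDF formulation: per-cell frequency times log(1 + N / df)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    tf = counts / totals
    df = np.maximum((counts > 0).sum(axis=0), 1)
    return tf * np.log1p(counts.shape[0] / df)


def lsi_embed(counts, dims):
    """LSI sketch: truncated SVD of the TF-IDF matrix, keeping `dims` components."""
    u, s, _ = np.linalg.svd(tfidf_normalize(counts), full_matrices=False)
    # Cell coordinates are the left singular vectors scaled by singular values.
    return u[:, :dims] * s[:dims]


def knn_indices(embedding, k):
    """Exact k nearest neighbours (Euclidean), excluding each cell itself."""
    sq = (embedding ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbour here
    return np.argsort(d2, axis=1)[:, :k]


# Toy data standing in for the prevalent peaks of a real dataset.
rng = np.random.default_rng(0)
toy = rng.poisson(0.3, size=(50, 200))

embedding = lsi_embed(toy, dims=5)
neighbors = knn_indices(embedding, k=11)
print(embedding.shape, neighbors.shape)
```

A brute-force search like this scales quadratically with the number of cells, which is why an approximate nearest neighbour (ANN) index, whose recall is reported in the log above, is the practical choice for datasets of this size.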

5) UMAP reduction and clustering

Non-linear dimension reduction using UMAP or tSNE is performed in the same way as for scRNA-Seq data, and so is the clustering step.

ds.run_umap(n_epochs=250, min_dist=0.5, parallel=True)
	completed  0  /  250 epochs
	completed  25  /  250 epochs
	completed  50  /  250 epochs
	completed  75  /  250 epochs
	completed  100  /  250 epochs
	completed  125  /  250 epochs
	completed  150  /  250 epochs
	completed  175  /  250 epochs
	completed  200  /  250 epochs
	completed  225  /  250 epochs
time: 8.86 s (started: 2021-08-22 17:28:08 +00:00)
ds.run_leiden_clustering(resolution=1)
time: 222 ms (started: 2021-08-22 17:28:17 +00:00)
ds.plot_layout(layout_key='ATAC_UMAP', color_by='ATAC_leiden_cluster')
time: 556 ms (started: 2021-08-22 17:28:18 +00:00)

6) Calculating gene scores

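This feature is coming soon. Until then, the peak-level metadata, including the prevalent_peaks flag set in step 3, can be inspected: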

ds.ATAC.feats.head()
      I                 ids               names  I__prevalent_peaks  dropOuts  nCells  stats_I_prevalence
0  True  chr1:565163-565491  chr1:565163-565491               False      9619      49            0.123722
1  True  chr1:569190-569620  chr1:569190-569620               False      9545     123            0.299672
2  True  chr1:713551-714783  chr1:713551-714783                True      6403    3265            2.214712
3  True  chr1:752418-753020  chr1:752418-753020               False      9102     566            0.557693
4  True  chr1:762249-763345  chr1:762249-763345                True      7433    2235            1.673136
time: 23.8 ms (started: 2021-08-22 17:28:18 +00:00)

That is all for this vignette.