Workflow for scATAC-Seq data#

If you haven’t already then we highly recommend that you please go the Workflow for scRNA-Seq analysis and also go through this vignette describing data organization and design principles of Scarf.

Here we will go through some of the basic steps of analysis of single-cell ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data.

%load_ext autotime

import scarf
scarf.__version__
'0.28.9'
time: 1.07 s (started: 2024-01-24 15:24:11 +00:00)

1) Fetch and convert data#

We will use 10x Genomics’s singel-cell ATAC-Seq data from peripheral blood mononuclear cells. Like single-cell RNA-Seq, Scarf only needs a count matrix to start the analysis of scATAC-Seq data. We can use fetch_dataset to download the data in 10x’s HDF5 format.

scarf.fetch_dataset(
    dataset_name='tenx_10K_pbmc-v1_atacseq',
    save_path='scarf_datasets'
)
time: 19.1 s (started: 2024-01-24 15:24:12 +00:00)

The CrH5Reader class provides access to the HDF5 file. We can load the file and quickly check the number of features, and also verify that Scarf identified the assay as an ATAC assay.

reader = scarf.CrH5Reader(
    'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.h5'
)
reader.assayFeats
ATAC
type Peaks
start 0
end 89796
nFeatures 89796
time: 78 ms (started: 2024-01-24 15:24:31 +00:00)
writer = scarf.CrToZarr(
    reader,
    zarr_loc=f'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.zarr',
    chunk_size=(1000, 2000)
)
writer.dump(batch_size=1000)
time: 9.96 s (started: 2024-01-24 15:24:31 +00:00)

2) Create DataStore and filter cells#

We load the Zarr file on using DataStore class. The obtained DataStore object will be our single point on interaction for rest of this analysis. When loaded, Scarf will automatically calculate the number of cells where each peak is present and number of peaks that are accessible in cell (nFeatures). Scarf will also calculate the total number of fragments/cut sites within each cell.

ds = scarf.DataStore(
    'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.zarr', 
    nthreads=4
)
time: 7.45 s (started: 2024-01-24 15:24:41 +00:00)

We will use auto_filter_cells method which automatically remove outliers from the data. To identify outliers we generate a normal distribution using sample mean and variance. Using this normal distribution Scarf estimates the values with probability less 0.01 (default value) on both ends of the distribution and flags them for removal.

ds.auto_filter_cells()
INFO: 247 cells flagged for filtering out using attribute ATAC_nCounts
INFO: 294 cells flagged for filtering out using attribute ATAC_nFeatures
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
../_images/6048f1270182a2b4ea966a3545a750ced5c1d2d108c5a1de900e8cfc2fe7d9fa.png
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
../_images/7f00bea84472604cef0f1a0d2476633f899dfbd96f7852c6e27425e9e09b6f09.png
time: 5.44 s (started: 2024-01-24 15:24:48 +00:00)

3) Feature selection#

For scATAC-Seq data, the features are ranked by their TF-IDF normalized values, summed across all cells. The top n features are marked as prevalent_peaks and are used for downstream steps. Here we used top 25000 peaks, which makes for more than quarter of all the peaks which inline with the what has been suggested in other scATAC-Seq analysis protocols.

ds.mark_prevalent_peaks(top_n=25000)
time: 8.61 s (started: 2024-01-24 15:24:54 +00:00)
ds.ATAC.feats.head()
I ids names I__prevalent_peaks dropOuts nCells stats_I_prevalence
0 True chr1:565107-565550 chr1:565107-565550 False 8645 83 0.206181
1 True chr1:569174-569639 chr1:569174-569639 False 8529 199 0.350725
2 True chr1:713460-714823 chr1:713460-714823 True 6242 2486 1.993573
3 True chr1:752422-753038 chr1:752422-753038 True 8296 432 0.680701
4 True chr1:762106-763359 chr1:762106-763359 True 6850 1878 1.570790
time: 30.6 ms (started: 2024-01-24 15:25:02 +00:00)

4) KNN graph creation#

For scATAC-Seq datasets, Scarf uses TF-IDF normalization. The normalization is automatically performed during the graph building step. The selected features, marked as prevalent_peaks in feature metadata, are used for graph creation. For the dimension reduction step, LSI (latent semantic indexing) is used rather than PCA. The rest of the steps are same as for scRNA-Seq data.

LSI reduction of scATAC-Seq is known to capture the sequencing depth of cells in the first LSI dimension. Hence, by default, the lsi_skip_first parameter is True but users can override it.

ds.make_graph(
    feat_key='prevalent_peaks',
    k=21,
    dims=50,
    n_centroids=100,
    lsi_skip_first=True,
)
INFO: ANN recall: 99.42%
time: 2min 30s (started: 2024-01-24 15:25:02 +00:00)

5) UMAP reduction and clustering#

Non-linear dimension reduction using UMAP and tSNE are performed in the same way as for scRNA-Seq data. Because, in Scarf the core UMAP step are run directly on the neighbourhood graph, the scATAC-Seq data is handled similar to any other data.

ds.run_umap(
    n_epochs=500,
    min_dist=0.1, 
    spread=1, 
    parallel=True
)
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
time: 21.1 s (started: 2024-01-24 15:27:33 +00:00)

Same goes for clustering as well. The leiden clustering acts on the neighbourhood graph directly.

ds.run_leiden_clustering(resolution=0.6)
time: 1.3 s (started: 2024-01-24 15:27:54 +00:00)

Results from both UMAP and Leiden clustering are stored in cell attribute table. Here, because the UMAP and leiden clustering columns have ‘ATAC’ prefixes because they were executed on the default ‘ATAC’ assay. You can also see that the cells that were filtered out (marked by ‘False’ value in column ‘I’) have NaN value for UMAP axes and -1 for clustering

ds.cells.head()
I ids names ATAC_UMAP1 ATAC_UMAP2 ATAC_leiden_cluster ATAC_nCounts ATAC_nFeatures
0 True AAACGAAAGAGCGAAA-1 AAACGAAAGAGCGAAA-1 -6.240285 -12.517072 1 13187.0 5546.0
1 True AAACGAAAGAGTTTGA-1 AAACGAAAGAGTTTGA-1 -6.152341 -15.275393 5 16074.0 6586.0
2 True AAACGAAAGCGAGCTA-1 AAACGAAAGCGAGCTA-1 0.595009 8.065815 4 28009.0 9214.0
3 False AAACGAAAGGCTTCGC-1 AAACGAAAGGCTTCGC-1 NaN NaN -1 221747.0 30325.0
4 True AAACGAAAGTGCTGAG-1 AAACGAAAGTGCTGAG-1 -5.925762 -12.338728 1 11440.0 4922.0
time: 22.6 ms (started: 2024-01-24 15:27:55 +00:00)

We can visualize the UMAP embedding and the clusters of cells on the embedding

ds.plot_layout(
    layout_key='ATAC_UMAP', 
    color_by='ATAC_leiden_cluster'
)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T
../_images/a268e255d9f9b0cff3d66e6c8179b4902e5f266095095ad9cec545d3972b72c0.png
time: 515 ms (started: 2024-01-24 15:27:55 +00:00)

Those familiar with PBMC datasets might already be able to identify different cell types in the UMAP plot.


6) Calculating gene scores#

The features in snATAC-Seq in form of peak coordinates are hard to interpret by themselves. Hence, distilling the open chromatin regions in terms of accessible genes can help in identification of cell types. GeneScores are simply the summation of all the peak fragments that are present within gene bodies and their corresponding promoter region. Since, marker genes are often better understood than the non-coding regions the GeneScores can be used to annotate cell types.

In Scarf, the users can provide a BED file containing gene annotations. This BED should have no header, should be tab separated and must have the columns in following order:

  1. chromosome identifier

  2. start coordinate

  3. end coordinate

  4. Gene ID

  5. Gene Name

  6. Strand (Optional)

The start/end coordinate can extend through transcription start site (TSS) to include a portion of promoter.

For convenience we have generated such BED files for human and mouse assemblies using the annotation information from GENCODE project. We downloaded the GFF3 format primary chromosome annotations and used Scarf’s GFFReader to convert the files into BED and add promoter offset of 2KB. These BED files containing gene annotations can be downloaded using fetch_dataset command and passing ‘annotations’ parameter.

scarf.fetch_dataset(
    dataset_name='annotations', 
    save_path='scarf_datasets'
)
time: 19.2 s (started: 2024-01-24 15:27:56 +00:00)

Now we have the annotations and are ready to calculate the ‘GeneScores’. The add_melded_assay is the DataStore method that will be used for this purpose. The add_melded_assay method is actually designed to map any arbitrary genomic coordinate information (not just gene annotations) to the ATAC peaks. Here, we use the term ‘melding’ for the process wherein for a given loci the values from all the overlapping peaks are merged/melded into that feature. The peaks values are TF-IDF normalized before melding.

Now we explain the parameters that are usually passed to the add_melded_assay.

  • from_assay: Name of the assay to be acted on. You can generally skip if you have only one assay. We use this parameter here only for demonstration puspose

  • external_bed_fn: This is the annotation file. Here we pass the annotation for human GRCh37/hg19 assembly based GENCODE v38 annotations.

  • peak_col: This is the column in ds.ATAC.feats table that contains the peak coordinates. In this case it is ‘ids’, but could be anyt other column. Please note that the coordinate information in this column should be in format ‘chr:start-end’ (please note the positions of colon and hyphen)

  • renormalization: By overriding the default value of True to False here, we turned off ‘renormalization’ step. The renormalization step makes sure that all the feature values for each cells in the melded assay sum up to the same value. Here we turned this off because we will have ‘GeneScores’ as an ‘RNAassay’ which uses library size normalization that has the same effect as ‘renormalization’.

  • assay_label: The label/name of the melded/output assay. Because we are using gene bodies as our input, ‘GeneScores’ is sensible choice.

  • assay_type: Here we set the type of assay as ‘RNA’ which means that the new melded assay will be treated as if it was an scRNA-Seq assay. Alternatively, we could have set it as a generic ‘Assay’

ds.add_melded_assay(
    from_assay='ATAC',
    external_bed_fn='scarf_datasets/annotations/human_GRCh37_gencode_v38_gene_body.bed.gz',
    peaks_col='ids',
    renormalization=False,
    assay_label='GeneScores',
    assay_type='RNA'
)
INFO: 33898/62409 features did not overlap with any peak
WARNING: Minimum cell count (6.539499236477921) is lower than size factor multiplier (1000)
time: 3min 2s (started: 2024-01-24 15:28:15 +00:00)

We can now print out the DataStore and see that the ‘GeneScores’ assay has indeed been added. The ‘add_melded_assay’ also printed a useful bit of information that almost half of features(genes bodies) did not overlap with even a single peak. The melded assay anyway contains all the genes but sets these ‘empty’ genes as invalid.

ds
DataStore has 8426 (8728) cells with 2 assays: ATAC GeneScores
   Cell metadata:
            'I', 'ids', 'names', 'ATAC_UMAP1', 'ATAC_UMAP2', 
            'ATAC_leiden_cluster', 'ATAC_nCounts', 'ATAC_nFeatures', 'GeneScores_nCounts', 'GeneScores_nFeatures', 
            'GeneScores_percentMito', 'GeneScores_percentRibo'
   ATAC assay has 86049 (89796) features and following metadata:
            'I', 'ids', 'names', 'I__prevalent_peaks', 'dropOuts', 
            'nCells'
   GeneScores assay has 28357 (62409) features and following metadata:
            'I', 'ids', 'names', 'dropOuts', 'nCells', 
          
time: 11.6 ms (started: 2024-01-24 15:31:17 +00:00)

Let’s now visualize some ‘GeneScores’ for some of the known marker genes for PBMCs on the UMAP plot calculated on the ATAC assay

ds.plot_layout(
    layout_key='ATAC_UMAP', from_assay='GeneScores', 
    color_by=['CD3D', 'MS4A1', 'LEF1', 'NKG7', 'TREM1', 'LYZ'],
    clip_fraction=0.01, 
    n_columns=3,
    width=3,
    height=3,
    point_size=5,
    scatter_kwargs={'lw': 0.01},
)
../_images/fb958f93fcb120ea8a2ac1fcb0057e9a42328ad7f14c2f55e53ab73a037fa372.png
time: 5.72 s (started: 2024-01-24 15:31:17 +00:00)

We stop here in this vignette. But will soon add other vignettes that show how we can other kinds of melded assays like motifs, enhancers, etc. We will also explore how we can use the ‘GeneScores’ to integrate scATAC-Seq datasets with scRNA-Seq datasets from same population of cells (but not the exact same set of cells)


That is all for this vignette.