Workflow for scATAC-Seq data#

If you haven’t already then we highly recommend that you please go the Workflow for scRNA-Seq analysis and also go through this vignette describing data organization and design principles of Scarf.

Here we will go through some of the basic steps of analysis of single-cell ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data.

%load_ext autotime

import scarf
scarf.__version__

'0.28.9'

time: 1.07 s (started: 2024-01-24 15:24:11 +00:00)

1) Fetch and convert data#

We will use 10x Genomics’s singel-cell ATAC-Seq data from peripheral blood mononuclear cells. Like single-cell RNA-Seq, Scarf only needs a count matrix to start the analysis of scATAC-Seq data. We can use fetch_dataset to download the data in 10x’s HDF5 format.

scarf.fetch_dataset(
    dataset_name='tenx_10K_pbmc-v1_atacseq',
    save_path='scarf_datasets'
)

time: 19.1 s (started: 2024-01-24 15:24:12 +00:00)

The CrH5Reader class provides access to the HDF5 file. We can load the file and quickly check the number of features, and also verify that Scarf identified the assay as an ATAC assay.

reader = scarf.CrH5Reader(
    'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.h5'
)
reader.assayFeats

	ATAC
type	Peaks
start	0
end	89796
nFeatures	89796

time: 78 ms (started: 2024-01-24 15:24:31 +00:00)

writer = scarf.CrToZarr(
    reader,
    zarr_loc=f'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.zarr',
    chunk_size=(1000, 2000)
)
writer.dump(batch_size=1000)

time: 9.96 s (started: 2024-01-24 15:24:31 +00:00)

2) Create DataStore and filter cells#

We load the Zarr file on using DataStore class. The obtained DataStore object will be our single point on interaction for rest of this analysis. When loaded, Scarf will automatically calculate the number of cells where each peak is present and number of peaks that are accessible in cell (nFeatures). Scarf will also calculate the total number of fragments/cut sites within each cell.

ds = scarf.DataStore(
    'scarf_datasets/tenx_10K_pbmc-v1_atacseq/data.zarr', 
    nthreads=4
)

time: 7.45 s (started: 2024-01-24 15:24:41 +00:00)

We will use auto_filter_cells method which automatically remove outliers from the data. To identify outliers we generate a normal distribution using sample mean and variance. Using this normal distribution Scarf estimates the values with probability less 0.01 (default value) on both ends of the distribution and flags them for removal.

ds.auto_filter_cells()

INFO: 247 cells flagged for filtering out using attribute ATAC_nCounts

INFO: 294 cells flagged for filtering out using attribute ATAC_nFeatures

/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

../_images/6048f1270182a2b4ea966a3545a750ced5c1d2d108c5a1de900e8cfc2fe7d9fa.png

/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

../_images/7f00bea84472604cef0f1a0d2476633f899dfbd96f7852c6e27425e9e09b6f09.png

time: 5.44 s (started: 2024-01-24 15:24:48 +00:00)

3) Feature selection#

For scATAC-Seq data, the features are ranked by their TF-IDF normalized values, summed across all cells. The top n features are marked as prevalent_peaks and are used for downstream steps. Here we used top 25000 peaks, which makes for more than quarter of all the peaks which inline with the what has been suggested in other scATAC-Seq analysis protocols.

ds.mark_prevalent_peaks(top_n=25000)

time: 8.61 s (started: 2024-01-24 15:24:54 +00:00)

ds.ATAC.feats.head()

	I	ids	names	I__prevalent_peaks	dropOuts	nCells	stats_I_prevalence
0	True	chr1:565107-565550	chr1:565107-565550	False	8645	83	0.206181
1	True	chr1:569174-569639	chr1:569174-569639	False	8529	199	0.350725
2	True	chr1:713460-714823	chr1:713460-714823	True	6242	2486	1.993573
3	True	chr1:752422-753038	chr1:752422-753038	True	8296	432	0.680701
4	True	chr1:762106-763359	chr1:762106-763359	True	6850	1878	1.570790

time: 30.6 ms (started: 2024-01-24 15:25:02 +00:00)

4) KNN graph creation#

For scATAC-Seq datasets, Scarf uses TF-IDF normalization. The normalization is automatically performed during the graph building step. The selected features, marked as prevalent_peaks in feature metadata, are used for graph creation. For the dimension reduction step, LSI (latent semantic indexing) is used rather than PCA. The rest of the steps are same as for scRNA-Seq data.

LSI reduction of scATAC-Seq is known to capture the sequencing depth of cells in the first LSI dimension. Hence, by default, the lsi_skip_first parameter is True but users can override it.

ds.make_graph(
    feat_key='prevalent_peaks',
    k=21,
    dims=50,
    n_centroids=100,
    lsi_skip_first=True,
)

INFO: ANN recall: 99.42%

time: 2min 30s (started: 2024-01-24 15:25:02 +00:00)

5) UMAP reduction and clustering#

Non-linear dimension reduction using UMAP and tSNE are performed in the same way as for scRNA-Seq data. Because, in Scarf the core UMAP step are run directly on the neighbourhood graph, the scATAC-Seq data is handled similar to any other data.

ds.run_umap(
    n_epochs=500,
    min_dist=0.1, 
    spread=1, 
    parallel=True
)

	completed  0  /  500 epochs

	completed  50  /  500 epochs

	completed  100  /  500 epochs

	completed  150  /  500 epochs

	completed  200  /  500 epochs

	completed  250  /  500 epochs

	completed  300  /  500 epochs

	completed  350  /  500 epochs

	completed  400  /  500 epochs

	completed  450  /  500 epochs

time: 21.1 s (started: 2024-01-24 15:27:33 +00:00)

Same goes for clustering as well. The leiden clustering acts on the neighbourhood graph directly.

ds.run_leiden_clustering(resolution=0.6)

time: 1.3 s (started: 2024-01-24 15:27:54 +00:00)

Results from both UMAP and Leiden clustering are stored in cell attribute table. Here, because the UMAP and leiden clustering columns have ‘ATAC’ prefixes because they were executed on the default ‘ATAC’ assay. You can also see that the cells that were filtered out (marked by ‘False’ value in column ‘I’) have NaN value for UMAP axes and -1 for clustering

ds.cells.head()

	I	ids	names	ATAC_UMAP1	ATAC_UMAP2	ATAC_leiden_cluster	ATAC_nCounts	ATAC_nFeatures
0	True	AAACGAAAGAGCGAAA-1	AAACGAAAGAGCGAAA-1	-6.240285	-12.517072	1	13187.0	5546.0
1	True	AAACGAAAGAGTTTGA-1	AAACGAAAGAGTTTGA-1	-6.152341	-15.275393	5	16074.0	6586.0
2	True	AAACGAAAGCGAGCTA-1	AAACGAAAGCGAGCTA-1	0.595009	8.065815	4	28009.0	9214.0
3	False	AAACGAAAGGCTTCGC-1	AAACGAAAGGCTTCGC-1	NaN	NaN	-1	221747.0	30325.0
4	True	AAACGAAAGTGCTGAG-1	AAACGAAAGTGCTGAG-1	-5.925762	-12.338728	1	11440.0	4922.0

time: 22.6 ms (started: 2024-01-24 15:27:55 +00:00)

We can visualize the UMAP embedding and the clusters of cells on the embedding

ds.plot_layout(
    layout_key='ATAC_UMAP', 
    color_by='ATAC_leiden_cluster'
)

/home/docs/checkouts/readthedocs.org/user_builds/scarf/envs/latest/lib/python3.10/site-packages/scarf/plots.py:594: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  centers = df[[x, y, vc]].groupby(vc).median().T

../_images/a268e255d9f9b0cff3d66e6c8179b4902e5f266095095ad9cec545d3972b72c0.png

time: 515 ms (started: 2024-01-24 15:27:55 +00:00)

Those familiar with PBMC datasets might already be able to identify different cell types in the UMAP plot.

6) Calculating gene scores#

The features in snATAC-Seq in form of peak coordinates are hard to interpret by themselves. Hence, distilling the open chromatin regions in terms of accessible genes can help in identification of cell types. GeneScores are simply the summation of all the peak fragments that are present within gene bodies and their corresponding promoter region. Since, marker genes are often better understood than the non-coding regions the GeneScores can be used to annotate cell types.

In Scarf, the users can provide a BED file containing gene annotations. This BED should have no header, should be tab separated and must have the columns in following order:

chromosome identifier
start coordinate
end coordinate
Gene ID
Gene Name
Strand (Optional)

The start/end coordinate can extend through transcription start site (TSS) to include a portion of promoter.

For convenience we have generated such BED files for human and mouse assemblies using the annotation information from GENCODE project. We downloaded the GFF3 format primary chromosome annotations and used Scarf’s GFFReader to convert the files into BED and add promoter offset of 2KB. These BED files containing gene annotations can be downloaded using fetch_dataset command and passing ‘annotations’ parameter.

scarf.fetch_dataset(
    dataset_name='annotations', 
    save_path='scarf_datasets'
)

time: 19.2 s (started: 2024-01-24 15:27:56 +00:00)

Now we have the annotations and are ready to calculate the ‘GeneScores’. The add_melded_assay is the DataStore method that will be used for this purpose. The add_melded_assay method is actually designed to map any arbitrary genomic coordinate information (not just gene annotations) to the ATAC peaks. Here, we use the term ‘melding’ for the process wherein for a given loci the values from all the overlapping peaks are merged/melded into that feature. The peaks values are TF-IDF normalized before melding.

Now we explain the parameters that are usually passed to the add_melded_assay.

from_assay: Name of the assay to be acted on. You can generally skip if you have only one assay. We use this parameter here only for demonstration puspose
external_bed_fn: This is the annotation file. Here we pass the annotation for human GRCh37/hg19 assembly based GENCODE v38 annotations.
peak_col: This is the column in ds.ATAC.feats table that contains the peak coordinates. In this case it is ‘ids’, but could be anyt other column. Please note that the coordinate information in this column should be in format ‘chr:start-end’ (please note the positions of colon and hyphen)
renormalization: By overriding the default value of True to False here, we turned off ‘renormalization’ step. The renormalization step makes sure that all the feature values for each cells in the melded assay sum up to the same value. Here we turned this off because we will have ‘GeneScores’ as an ‘RNAassay’ which uses library size normalization that has the same effect as ‘renormalization’.
assay_label: The label/name of the melded/output assay. Because we are using gene bodies as our input, ‘GeneScores’ is sensible choice.
assay_type: Here we set the type of assay as ‘RNA’ which means that the new melded assay will be treated as if it was an scRNA-Seq assay. Alternatively, we could have set it as a generic ‘Assay’

ds.add_melded_assay(
    from_assay='ATAC',
    external_bed_fn='scarf_datasets/annotations/human_GRCh37_gencode_v38_gene_body.bed.gz',
    peaks_col='ids',
    renormalization=False,
    assay_label='GeneScores',
    assay_type='RNA'
)

INFO: 33898/62409 features did not overlap with any peak

WARNING: Minimum cell count (6.539499236477921) is lower than size factor multiplier (1000)

time: 3min 2s (started: 2024-01-24 15:28:15 +00:00)

We can now print out the DataStore and see that the ‘GeneScores’ assay has indeed been added. The ‘add_melded_assay’ also printed a useful bit of information that almost half of features(genes bodies) did not overlap with even a single peak. The melded assay anyway contains all the genes but sets these ‘empty’ genes as invalid.

ds

DataStore has 8426 (8728) cells with 2 assays: ATAC GeneScores
   Cell metadata:
            'I', 'ids', 'names', 'ATAC_UMAP1', 'ATAC_UMAP2', 
            'ATAC_leiden_cluster', 'ATAC_nCounts', 'ATAC_nFeatures', 'GeneScores_nCounts', 'GeneScores_nFeatures', 
            'GeneScores_percentMito', 'GeneScores_percentRibo'
   ATAC assay has 86049 (89796) features and following metadata:
            'I', 'ids', 'names', 'I__prevalent_peaks', 'dropOuts', 
            'nCells'
   GeneScores assay has 28357 (62409) features and following metadata:
            'I', 'ids', 'names', 'dropOuts', 'nCells', 
          

time: 11.6 ms (started: 2024-01-24 15:31:17 +00:00)

Let’s now visualize some ‘GeneScores’ for some of the known marker genes for PBMCs on the UMAP plot calculated on the ATAC assay

ds.plot_layout(
    layout_key='ATAC_UMAP', from_assay='GeneScores', 
    color_by=['CD3D', 'MS4A1', 'LEF1', 'NKG7', 'TREM1', 'LYZ'],
    clip_fraction=0.01, 
    n_columns=3,
    width=3,
    height=3,
    point_size=5,
    scatter_kwargs={'lw': 0.01},
)

../_images/fb958f93fcb120ea8a2ac1fcb0057e9a42328ad7f14c2f55e53ab73a037fa372.png

time: 5.72 s (started: 2024-01-24 15:31:17 +00:00)

We stop here in this vignette. But will soon add other vignettes that show how we can other kinds of melded assays like motifs, enhancers, etc. We will also explore how we can use the ‘GeneScores’ to integrate scATAC-Seq datasets with scRNA-Seq datasets from same population of cells (but not the exact same set of cells)

That is all for this vignette.

Workflow for scATAC-Seq data

Sections