API#

DataStore classes#

BaseDataStore#

GraphDataStore#

MappingDatastore#

DataStore#

Assay classes#

Assay#

class scarf.assay.Assay(z, name, cell_data, nthreads, min_cells_per_feature=10)#

A generic Assay class that contains methods to calculate feature-level statistics. It also provides a method for saving a normalized subset of the data for later KNN graph construction.

Parameters:
  • z (zarr.Group) – Zarr hierarchy where raw data is located

  • name (str) – A label/name for assay.

  • cell_data (MetaData) – Metadata class object for the cell attributes.

  • nthreads (int) – Number of threads to use for dask parallel computations

  • min_cells_per_feature (int) –

name#

A label for the assay instance

z#

Zarr group that contains the assay

cells#

A Metadata class object for cell attributes

nthreads#

number of threads to use for computations

rawData#

dask array containing the raw data

feats#

a MetaData class object for feature attributes

attrs#

Zarr attributes for the zarr group of the assay

normMethod#

normalization method to use.

sf#

scaling factor for doing library-size normalization

add_percent_feature(feat_pattern, name)#
Parameters:
  • feat_pattern (str) – A regular expression pattern to identify the features of interest

  • name (str) – This will be used as the name of column under which the percentages will be saved

Return type:

None
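
The percentage calculation can be sketched as follows. This is a minimal illustration rather than Scarf's implementation: `percent_feature` is a hypothetical helper, the counts are toy values, and it assumes the percentage is the share of each cell's total counts that comes from features whose names match the pattern.

```python
import re

def percent_feature(counts, feat_names, feat_pattern):
    """Return, per cell, the percentage of counts from features matching feat_pattern.

    counts: list of rows (one per cell), each a list of per-feature counts.
    """
    pattern = re.compile(feat_pattern)
    # Column indices of the features whose names match the regular expression
    hits = [i for i, name in enumerate(feat_names) if pattern.search(name)]
    result = []
    for row in counts:
        total = sum(row)
        matched = sum(row[i] for i in hits)
        # Guard against empty cells to avoid division by zero
        result.append(100.0 * matched / total if total else 0.0)
    return result

feat_names = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
counts = [[5, 5, 40, 50], [0, 0, 10, 10]]
pct_mito = percent_feature(counts, feat_names, r"^MT-")
```

In Scarf itself the resulting values would be stored in the cell attribute table under the column given by `name`.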

iter_normed_feature_wise(cell_key, feat_key, batch_size, msg, as_dataframe=True, **norm_params)#

This generator iterates over all the features marked by feat_key in batches.

Parameters:
  • cell_key (Optional[str]) – Name of the key (column) from cell attribute table. The data will be fetched for only those cells that have a True value in this column. If None then all the cells are used

  • feat_key (Optional[str]) – Name of the key (column) from feature attribute table. The data will be fetched for only those features that have a True value in this column. If None then all the features are used

  • batch_size (int) – Number of genes to be loaded in the memory at a time.

  • msg (Optional[str]) – Message to be displayed in the progress bar

  • as_dataframe (bool) – If true (default) then the yielded matrices are pandas dataframe

Return type:

Generator[Union[DataFrame, Tuple[ndarray, ndarray]], None, None]
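
The batching behaviour can be pictured with a small generator. This is only a sketch under stated assumptions: `iter_feature_batches` is a hypothetical helper, a plain list of booleans stands in for the feat_key column, and no normalization is performed.

```python
from typing import Iterator, List, Optional, Sequence

def iter_feature_batches(
    feat_mask: Optional[Sequence[bool]], n_feats: int, batch_size: int
) -> Iterator[List[int]]:
    """Yield indices of the selected features in batches of at most batch_size."""
    if feat_mask is None:
        # No key given: use all the features
        selected = list(range(n_feats))
    else:
        selected = [i for i, keep in enumerate(feat_mask) if keep]
    for start in range(0, len(selected), batch_size):
        yield selected[start:start + batch_size]

batches = list(iter_feature_batches([True, False, True, True, True], 5, batch_size=2))
```

Because only one batch of feature indices is materialized at a time, memory use is bounded by `batch_size` rather than by the total number of features.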

normed(cell_idx=None, feat_idx=None, **kwargs)#

This function normalizes the raw data and returns a delayed dask array of the normalized data.

Parameters:
  • cell_idx (Optional[ndarray]) – Indices of cells to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of cell attribute table)

  • feat_idx (Optional[ndarray]) – Indices of features to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of feature attribute table)

  • **kwargs

Return type:

dask.array.Array

Returns: A dask array (delayed matrix) containing normalized data.

save_aggregated_ordering(cell_key, feat_key, ordering_key, min_exp=10, window_size=200, chunk_size=50, smoothen=True, z_scale=True, batch_size=100, **norm_params)#
Parameters:
  • cell_key (str) –

  • feat_key (str) –

  • ordering_key (str) –

  • min_exp (float) –

  • window_size (int) –

  • chunk_size (int) –

  • smoothen (bool) –

  • z_scale (bool) –

  • batch_size (int) –

  • **norm_params

save_normalized_data(cell_key, feat_key, batch_size, location, log_transform, renormalize_subset, update_keys)#

Create a new zarr group and saves the normalized data in the group for the selected features only.

Parameters:
  • cell_key (str) – Name of the key (column) from cell attribute table. The data will be saved for only those cells that have a True value in this column.

  • feat_key (str) – Name of the key (column) from feature attribute table. The data will be saved for only those features that have a True value in this column

  • batch_size (int) – Number of cells to store in a single chunk. Higher values lead to larger memory consumption

  • location (str) – Zarr group wherein to save the normalized values

  • log_transform (bool) – Whether to log transform the values. It is only used if the ‘normed’ method takes this parameter, e.g. RNAassay

  • renormalize_subset (bool) – Only used if the ‘normed’ method takes this parameter. Please refer to the documentation of the ‘normed’ method of the RNAassay for further description of this parameter.

  • update_keys (bool) – Whether to update the keys. If True, the ‘latest_feat_key’ and ‘latest_cell_key’ attributes of the assay will be updated. Setting it to False can be useful when you only need to save the normalized data but don’t intend to use it directly, for example when mapping onto a different dataset and aligning features to that dataset.

Return type:

dask.array.Array

Returns: Dask array containing the normalized data

save_normed_for_query(feat_key, batch_size, overwrite=True)#

This method dumps normalized values for features (as marked by feat_key) onto disk in the ‘prenormed’ slot under the assay’s own slot.

Parameters:
  • feat_key (Optional[str]) – Name of the key (column) from feature attribute table. The data will be fetched for only those features that have a True value in this column. If None then all the features are used

  • batch_size (int) – Number of genes to be loaded in the memory at a time.

  • overwrite (bool) – If True (default value), then will overwrite the existing ‘prenormed’ slot in the assay hierarchy

Return type:

None

Returns:

None

score_features(feature_names, cell_key, ctrl_size, n_bins, rand_seed)#

Calculates the scores (mean values) of a selection of features relative to a randomly sampled set of reference features in the given cells (as marked by cell_key).

Parameters:
  • feature_names (List[str]) – Names (as in ‘names’ column of the feature attribute table) of features to be used for scoring

  • cell_key (str) – Name of the key (column) from cell attribute table.

  • ctrl_size (int) – Number of reference features to be sampled from each bin.

  • n_bins (int) – Number of bins for sampling.

  • rand_seed (int) – The seed to use for the random number generation.

Return type:

ndarray

Returns: Numpy array of the calculated scores
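
A rough sketch of this kind of scoring is given below. It is not Scarf's implementation: `score_features_sketch` is a hypothetical helper, and it assumes a common scheme in which features are binned by mean expression, up to ctrl_size reference features are sampled from the bin of each target feature, and the score of a cell is the mean of the target features minus the mean of the sampled reference features.

```python
import random
from statistics import mean

def score_features_sketch(data, feat_names, feature_names, ctrl_size, n_bins, rand_seed):
    """Score each cell as mean(target features) - mean(sampled reference features).

    data: cells x features list of lists of (normalized) values.
    """
    rng = random.Random(rand_seed)
    n_feats = len(feat_names)
    # Rank features by mean expression and split the ranking into n_bins bins
    feat_means = [mean(row[j] for row in data) for j in range(n_feats)]
    order = sorted(range(n_feats), key=lambda j: feat_means[j])
    bin_of = {j: rank * n_bins // n_feats for rank, j in enumerate(order)}
    targets = [feat_names.index(n) for n in feature_names]
    controls = set()
    for t in targets:
        # Candidate references share the target's bin but are not targets themselves
        pool = [j for j in range(n_feats) if bin_of[j] == bin_of[t] and j not in targets]
        controls.update(rng.sample(pool, min(ctrl_size, len(pool))))
    scores = []
    for row in data:
        ctrl_mean = mean(row[j] for j in controls) if controls else 0.0
        scores.append(mean(row[t] for t in targets) - ctrl_mean)
    return scores

feat_names = ["g1", "g2", "g3", "g4", "g5", "g6"]
data = [[1, 2, 3, 10, 11, 18], [1, 2, 3, 10, 11, 6]]
scores = score_features_sketch(data, feat_names, ["g6"], ctrl_size=2, n_bins=2, rand_seed=4)
```

Sampling references from the same expression bin keeps the comparison fair: a highly expressed target feature is compared against similarly expressed background features rather than against the whole feature set.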

to_raw_sparse(cell_key)#
Parameters:

cell_key – A column from the cell attribute table. This column must be of boolean type. The data will be exported for only those cells that have a True value in this column.

Return type:

csr_matrix

Returns: A sparse matrix containing raw data.

RNAassay#

class scarf.assay.RNAassay(z, name, cell_data, **kwargs)#

This subclass of Assay is designed for feature selection and normalization of scRNA-Seq data.

Parameters:
  • z (zarr.Group) – Zarr hierarchy where raw data is located

  • name (str) – A label/name for assay.

  • cell_data (MetaData) – Metadata class object for the cell attributes.

  • **kwargs – kwargs to be passed to the Assay class

normMethod#

A pointer to the function to be used for normalization of the raw data

sf#

scaling factor for doing library-size normalization

scalar#

This is used to cache the library size of the cells. It is set to None until normed method is called.

mark_hvgs(cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, hvg_key_name, keep_bounds, show_plot, **plot_kwargs)#

Identifies highly variable genes in the dataset.

The parameters govern the min/max variance (corrected) and mean expression thresholds for calling genes highly variable. The variance is corrected by first dividing genes into bins based on their mean expression values. Genes with minimum variance are selected from each bin and a Lowess curve is fitted to the mean-variance trend of these genes. mark_hvgs will by default run on the default assay. See utils.fit_lowess for further details.

Modifies the feats table: adds a column named <cell_key>__hvgs to the feature table, which contains a True value for genes marked HVGs. The prefix comes from the cell_key parameter, the naming rule in Scarf dictates that cells used to identify HVGs are prepended to the column name (with a double underscore delimiter).

Parameters:
  • cell_key (str) – Specify which cells to use to identify the HVGs. (Default value: ‘I’, which uses all cells that have not been filtered out).

  • min_cells (int) – Minimum number of cells where a gene should have non-zero expression values for it to be considered a candidate for HVG selection. Large values for this parameter might make it difficult to identify rare populations of cells. Very small values might lead to higher signal to noise ratio in the selected features.

  • top_n (int) – Number of top most variable genes to be set as HVGs. This value is ignored if a value is provided for min_var parameter.

  • min_var (float) – Minimum variance threshold for HVG selection.

  • max_var (float) – Maximum variance threshold for HVG selection.

  • min_mean (float) – Minimum mean value of expression threshold for HVG selection.

  • max_mean (float) – Maximum mean value of expression threshold for HVG selection.

  • n_bins (int) – Number of bins into which the mean expression is binned.

  • lowess_frac (float) – Between 0 and 1. The fraction of the data used when estimating the fit between mean and variance. This is the same as frac in statsmodels.nonparametric.smoothers_lowess.lowess

  • blacklist (str) – A regular expression string pattern. Gene names matching this pattern will be excluded from the final highly variable genes list

  • hvg_key_name (str) – The label for highly variable genes. This label will be used to mark the HVGs in the feature attribute table. The value for ‘cell_key’ parameter is prepended to this value.

  • keep_bounds (bool) – If True, then the boundary values are retained and not filtered out.

  • show_plot (bool) – If True, a plot is produced, that for each gene shows the corrected variance on the y-axis and the non-zero mean (means from cells where the gene had a non-zero value) on the x-axis. The genes are colored in two gradients which indicate the number of cells where the gene was expressed. The colors are yellow to dark red for HVGs, and blue to green for non-HVGs.

  • **plot_kwargs – Keyword arguments for matplotlib.pyplot.scatter function

Return type:

None
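
The column naming rule and the top_n selection can be sketched as follows. This is an illustration only: `mark_top_n` is a hypothetical helper, the corrected variances are toy values, and the `valid` mask stands in for genes that passed the min_cells and blacklist filters.

```python
def mark_top_n(corrected_var, valid, top_n, cell_key="I", hvg_key_name="hvgs"):
    """Return the column name and a boolean column flagging the top_n most variable valid genes."""
    # Naming rule: the cell key is prepended with a double underscore delimiter
    column = f"{cell_key}__{hvg_key_name}"
    candidates = [i for i, ok in enumerate(valid) if ok]
    # Rank candidate genes by corrected variance, highest first
    ranked = sorted(candidates, key=lambda i: corrected_var[i], reverse=True)
    chosen = set(ranked[:top_n])
    return column, [i in chosen for i in range(len(corrected_var))]

column, is_hvg = mark_top_n([0.2, 1.5, 0.9, 3.0], [True, True, True, False], top_n=2)
```

Here the last gene has the highest variance but is excluded because it is not a valid candidate, so the second and third genes are flagged instead.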

normed(cell_idx=None, feat_idx=None, renormalize_subset=False, log_transform=False, **kwargs)#

This function normalizes the raw data and returns a delayed dask array of the normalized data. Unlike the normed method in the generic Assay class, this method is optimized for scRNA-Seq data and takes additional parameters that will be used by norm_lib_size (the default normalization method for this class).

Parameters:
  • cell_idx (Optional[ndarray]) – Indices of cells to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of cell attribute table)

  • feat_idx (Optional[ndarray]) – Indices of features to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of feature attribute table)

  • renormalize_subset (bool) – If True, then the data is normalized using only those features that are True in feat_key column rather than using the total expression of all features in a cell (Default value: False)

  • log_transform (bool) – If True, then the normalized data is log-transformed (Default value: False).

  • **kwargs – kwargs have no effect here.

Return type:

dask.array.Array

Returns:

A dask array (delayed matrix) containing normalized data.
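
The effect of renormalize_subset and log_transform can be illustrated with a plain-Python sketch of library-size normalization. It is not Scarf's norm_lib_size: `norm_lib_size_sketch` is a hypothetical helper, the default scaling factor of 10000 and the use of log1p for the log transform are assumptions, and nested lists stand in for dask arrays.

```python
import math

def norm_lib_size_sketch(counts, sf=10000, feat_idx=None,
                         renormalize_subset=False, log_transform=False):
    """Library-size normalize a cells x features list of lists."""
    n_feats = len(counts[0])
    feat_idx = list(range(n_feats)) if feat_idx is None else list(feat_idx)
    normed = []
    for row in counts:
        subset = [row[i] for i in feat_idx]
        # Library size: total of the selected features, or of all features in the cell
        lib = sum(subset) if renormalize_subset else sum(row)
        scaled = [sf * v / lib if lib else 0.0 for v in subset]
        if log_transform:
            scaled = [math.log1p(v) for v in scaled]
        normed.append(scaled)
    return normed

counts = [[1, 3, 6], [2, 2, 0]]
all_feats = norm_lib_size_sketch(counts, sf=100)
subset = norm_lib_size_sketch(counts, sf=100, feat_idx=[0, 1], renormalize_subset=True)
```

With renormalize_subset enabled, the values of the first cell change from 10 and 30 to 25 and 75, because the library size is computed from the two selected features only.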

set_feature_stats(cell_key, min_cells)#

Calculates summary statistics for the features of the assay using only cells that are marked True by the ‘cell_key’ parameter.

Parameters:
  • cell_key (str) – Name of the key (column) from cell attribute table.

  • min_cells (int) – Minimum number of cells across which a given feature should be present. If a feature is present (has a non-zero un-normalized value) in fewer cells than this, it is ignored and summary statistics are not calculated for that feature. Such features will also be disabled: their I value in the feature attribute table will be set to False

Return type:

None

Returns: None

ATACassay#

class scarf.assay.ATACassay(z, name, cell_data, **kwargs)#

This subclass of Assay is designed for feature selection and normalization of scATAC-Seq data.

mark_prevalent_peaks(cell_key, top_n, prevalence_key_name)#

Marks top_n peaks with highest prevalence as prevalent peaks.

Parameters:
  • cell_key (str) – Cells to use for selection of most prevalent peaks. The provided value for cell_key should be a column in cell attributes table with boolean values.

  • top_n (int) – Number of top prevalent peaks to be selected.

  • prevalence_key_name (str) – Base label for marking prevalent peaks in the features attributes column. The value for ‘cell_key’ parameter is prepended to this value.

Return type:

None

Returns: None

normed(cell_idx=None, feat_idx=None, **kwargs)#

This function normalizes the raw data and returns a delayed dask array of the normalized data. Unlike the normed method in the generic Assay class, this method is optimized for scATAC-Seq data. It uses the normalization indicated by the attribute self.normMethod, which by default is set to norm_tf_idf. The TF-IDF normalization is performed using only the cells and features indicated by the ‘cell_idx’ and ‘feat_idx’ parameters.

Parameters:
  • cell_idx (Optional[ndarray]) – Indices of cells to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of cell attribute table)

  • feat_idx (Optional[ndarray]) – Indices of features to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of feature attribute table)

  • **kwargs

Return type:

dask.array.Array

Returns: A dask array (delayed matrix) containing normalized data.
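
For readers unfamiliar with TF-IDF, the sketch below shows one common formulation applied to a cells x peaks count matrix. It is an assumption-laden illustration, not Scarf's norm_tf_idf: `tf_idf_sketch` is a hypothetical helper and the exact term-frequency and inverse-document-frequency definitions used by Scarf may differ.

```python
import math

def tf_idf_sketch(counts):
    """TF-IDF normalize a cells x features list of lists of counts."""
    n_cells = len(counts)
    n_feats = len(counts[0])
    # Number of cells in which each feature is detected
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_feats)]
    # Inverse document frequency: rarer peaks get larger weights
    idf = [math.log(1 + n_cells / df) if df else 0.0 for df in doc_freq]
    normed = []
    for row in counts:
        total = sum(row)
        # Term frequency: the share of the cell's counts that falls in each peak
        normed.append([(v / total if total else 0.0) * w for v, w in zip(row, idf)])
    return normed

tfidf = tf_idf_sketch([[1, 1], [1, 0]])
```

In this toy matrix the second peak is detected in only one cell, so it receives a larger weight than the first peak within the first cell.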

set_feature_stats(cell_key)#

Calculates prevalence of each valid feature of the assay using only cells that are marked True by the ‘cell_key’ parameter. Prevalence of a feature is the sum of all its TF-IDF normalized values across cells.

Parameters:

cell_key (str) – Name of the key (column) from cell attribute table.

Return type:

None

Returns: None

ADTassay#

class scarf.assay.ADTassay(z, name, cell_data, **kwargs)#

This subclass of Assay is designed for normalization of ADT/HTO (feature-barcodes library) data from CITE-Seq experiments.

Parameters:
  • z (zarr.Group) – Zarr hierarchy where raw data is located

  • name (str) – A label/name for assay.

  • cell_data (MetaData) – Metadata class object for the cell attributes.

  • **kwargs

normMethod#

Pointer to the function to be used for normalization of the raw data

normed(cell_idx=None, feat_idx=None, **kwargs)#

This function normalizes the raw data and returns a delayed dask array of the normalized data. This method uses the normalization indicated by the attribute self.normMethod, which by default is set to norm_clr. The centered log-ratio normalization is performed using only the cells and features indicated by the ‘cell_idx’ and ‘feat_idx’ parameters.

Parameters:
  • cell_idx (Optional[ndarray]) – Indices of cells to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of cell attribute table)

  • feat_idx (Optional[ndarray]) – Indices of features to be included in the normalized matrix (Default value: All those marked True in ‘I’ column of feature attribute table)

  • **kwargs

Return type:

dask.array.Array

Returns: A dask array (delayed matrix) containing normalized data.
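
A centered log-ratio transform can be sketched for a single cell as below. This shows one common variant (log1p values centered on their mean) and is an assumption, not a reproduction of Scarf's norm_clr; `clr_sketch` is a hypothetical helper.

```python
import math

def clr_sketch(row):
    """Centered log-ratio transform of one cell's ADT counts."""
    # log1p adds a pseudocount of 1 so that zero counts are handled
    logs = [math.log1p(v) for v in row]
    # Subtracting the mean log value centers the cell's profile on zero
    center = sum(logs) / len(logs)
    return [v - center for v in logs]

clr = clr_sketch([0, 9, 99])
```

After the transform the values of a cell sum to zero, which removes differences in overall antibody signal between cells.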

MetaData#

class scarf.metadata.MetaData(zgrp)#

MetaData class for cells and features.

This class provides an interface to perform CRUD operations on metadata saved in the Zarr hierarchy. All changes to the metadata are synchronized to disk.

locations#

The locations for where the metadata is stored.

N#

The size of the primary data.

index#

A numpy array with the indices of the cells/features.

active_index(key)#
Parameters:

key (str) –

Return type:

ndarray

property columns: List[str]#

drop(column)#
Parameters:

column (str) –

Return type:

None

fetch(column, key='I')#

Get column values for only valid rows.

Parameters:
  • column (str) –

  • key (str) –

Return type:

ndarray

fetch_all(column)#
Parameters:

column (str) –

Return type:

ndarray

get_dtype(column)#

Returns the dtype for the given column.

Parameters:

column (str) – Column name of the table.

Return type:

type

get_index_by(value_targets, column, key=None)#
Parameters:
  • value_targets (List[Any]) –

  • column (str) –

  • key (Optional[str]) –

Return type:

ndarray

grep(pattern, only_valid=False)#
Parameters:
  • pattern (str) –

  • only_valid

Return type:

List[str]

head(n=5)#
Parameters:

n (int) –

Return type:

DataFrame

index_to_bool(idx, invert=False)#
Parameters:
  • idx (ndarray) –

  • invert (bool) –

Return type:

ndarray
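
Judging by its name and signature, this method converts positional indices into a boolean mask. The sketch below illustrates that idea under this assumption: `index_to_bool_sketch` is a hypothetical helper, and the table size `n` is passed explicitly here whereas the class would presumably use its own size (N).

```python
def index_to_bool_sketch(idx, n, invert=False):
    """Convert positional indices into a boolean mask of length n."""
    selected = set(idx)
    mask = [i in selected for i in range(n)]
    # With invert=True the mask flags every row that is NOT in idx
    return [not m for m in mask] if invert else mask

mask = index_to_bool_sketch([0, 3], n=5)
inverted = index_to_bool_sketch([0, 3], n=5, invert=True)
```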

insert(column_name, values, fill_value=nan, key='I', overwrite=False, location='primary', force=False)#

Insert a column into the table.

Parameters:
  • column_name (str) – Name of column to modify.

  • values (np.array) – Values the column should contain.

  • fill_value (Any = np.NaN) – Value to fill unassigned slots with.

  • key (str = 'I') –

  • overwrite (bool = False) – Should function overwrite column if it already exists?

  • location (str = 'primary') –

  • force (bool = False) – Enforce change to column, even if column is a protected column name (‘I’ or ‘ids’).

Return type:

None

Returns:

None
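
The interplay of values, key and fill_value can be pictured with the sketch below. It rests on an assumption that is not stated in this reference: that `values` holds one entry per row marked True in the `key` column and that all remaining rows receive `fill_value`. `insert_sketch` is a hypothetical helper operating on a plain dictionary.

```python
import math

def insert_sketch(table, column_name, values, active, fill_value=math.nan, overwrite=False):
    """Add a full-length column, placing values at the active rows."""
    if column_name in table and not overwrite:
        raise ValueError(f"column '{column_name}' already exists")
    if len(values) != sum(active):
        raise ValueError("need exactly one value per active row")
    it = iter(values)
    # Active rows take the next supplied value; all other rows get fill_value
    table[column_name] = [next(it) if keep else fill_value for keep in active]
    return table

table = {"I": [True, False, True]}
insert_sketch(table, "score", [0.5, 0.9], active=table["I"], fill_value=-1.0)
```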

mount_location(zgrp, identifier)#
Parameters:
  • zgrp (zarr.hierarchy.Group) –

  • identifier (str) –

Return type:

None

multi_sift(columns, lows, highs, keep_bounds=False)#
Parameters:
  • columns (List[str]) –

  • lows (Iterable) –

  • highs (Iterable) –

  • keep_bounds (bool) –

Return type:

ndarray

remove_trend(x, y, n_bins=200, lowess_frac=0.1)#
Parameters:
  • x (str) –

  • y (str) –

  • n_bins (int) –

  • lowess_frac (float) –

Return type:

ndarray

reset_key(key)#
Parameters:

key (str) –

Return type:

None

sift(column, min_v=-inf, max_v=inf, keep_bounds=False)#
Parameters:
  • column (str) –

  • min_v (float) –

  • max_v (float) –

  • keep_bounds (bool) –

Return type:

ndarray
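
The role of keep_bounds can be illustrated with a small sketch. It assumes that sift returns a boolean mask of the values lying between min_v and max_v, with keep_bounds deciding whether values equal to the bounds are retained; `sift_sketch` is a hypothetical helper working on a plain list instead of a metadata column.

```python
import math

def sift_sketch(values, min_v=-math.inf, max_v=math.inf, keep_bounds=False):
    """Return a boolean mask of the values lying between min_v and max_v."""
    if keep_bounds:
        # Inclusive comparison: values equal to a bound are retained
        return [min_v <= v <= max_v for v in values]
    # Strict comparison: values equal to a bound are filtered out
    return [min_v < v < max_v for v in values]

strict = sift_sketch([1, 2, 3, 4], min_v=2, max_v=4)
inclusive = sift_sketch([1, 2, 3, 4], min_v=2, max_v=4, keep_bounds=True)
```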

to_pandas_dataframe(columns, key=None)#

Returns the requested columns as a Pandas dataframe, sorted on key.

Return type:

DataFrame

unmount_location(identifier)#
Parameters:

identifier (str) –

Return type:

None

update_key(values, key)#

Modify a column in the metadata table, specified with key.

Parameters:
  • values (array) – The values to update the column with.

  • key – Which column in the metadata table to update.

Return type:

None

Returns:

None

Reader classes#

Cellranger H5 reader#

class scarf.readers.CrH5Reader(h5_fn, file_type=None)#

A class to read in CellRanger (Cr) data, in the form of an H5 file.

Subclass of CrReader.

Parameters:
  • h5_fn – File name for the h5 file.

  • file_type (str) – [DEPRECATED] Type of sequencing data (‘rna’ | ‘atac’)

autoNames#

Specifies if the data is from RNA or ATAC sequencing.

grpNames#

A dictionary that specifies where to find the matrix, features and barcodes.

nFeatures#

Number of features in dataset.

nCells#

Number of cells in dataset.

assayFeats#

A DataFrame with information about the features in the assay.

h5obj#

A File object from the h5py package.

grp#

Current active group in the hierarchy.

close()#

Closes file connection.

Return type:

None

consume(batch_size, lines_in_mem)#

Returns a generator that yields chunks of data.

Return type:

Generator[coo_matrix, None, None]

Cellranger directory (MTX) reader#

class scarf.readers.CrDirReader(loc, file_type=None, mtx_separator=' ', index_offset=-1)#

A class to read in CellRanger (Cr) data, in the form of a directory.

Subclass of CrReader.

Parameters:
  • loc (str) – Path for the directory containing the cellranger output.

  • file_type (str) – [DEPRECATED] Type of sequencing data (‘rna’ | ‘atac’)

  • mtx_separator (str) – Column delimiter in the MTX file (Default value: ‘ ‘)

  • index_offset (int) – This value is added to each feature index (Default value: -1)

loc#

Path for the directory containing the cellranger output.

matFn#

The file name for the matrix file.

mtx_separator#

Column delimiter in the MTX file (Default value: ‘ ‘)

Type:

str

index_offset#

This value is added to each feature index (Default value: -1)

Type:

int

consume(batch_size, lines_in_mem=100000)#

Returns a generator that yields chunks of data.

Return type:

Generator[coo_matrix, None, None]

to_sparse(a)#

Returns the input data as a sparse (COO) matrix.

Parameters:

a (ndarray) – Sparse matrix, contains a chunk of data from the MTX file.

Return type:

coo_matrix
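
The roles of mtx_separator and index_offset can be seen in a plain-Python sketch of MTX parsing. It is illustrative only: `parse_mtx` is a hypothetical helper, it assumes the usual Cellranger layout of feature index, cell index and count per line, and it converts the cell index to 0-based with a fixed -1 purely for this example.

```python
def parse_mtx(lines, mtx_separator=" ", index_offset=-1):
    """Convert MTX lines into (feature, cell, value) triplets."""
    triplets = []
    size_seen = False
    for line in lines:
        if line.startswith("%"):
            continue  # skip MatrixMarket header and comment lines
        if not size_seen:
            size_seen = True  # the first non-comment line holds the matrix dimensions
            continue
        feat, cell, value = line.strip().split(mtx_separator)
        # MTX indices are 1-based; index_offset shifts the feature index to 0-based
        triplets.append((int(feat) + index_offset, int(cell) - 1, int(value)))
    return triplets

lines = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 3",
    "1 1 4",
    "3 1 1",
    "2 2 7",
]
triplets = parse_mtx(lines)
```

Triplets of this kind are the natural input for a COO sparse matrix, which stores one (row, column, value) entry per non-zero count.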

H5ad (Anndata) reader#

class scarf.readers.H5adReader(h5ad_fn, cell_attrs_key='obs', cell_ids_key='_index', feature_attrs_key='var', feature_ids_key='_index', feature_name_key='gene_short_name', matrix_key='X', obsm_attrs_key='obsm', category_names_key='__categories', dtype=None)#

A class to read in data from a H5ad file (h5 file with AnnData information).

Parameters:
  • h5ad_fn (str) – Path to H5AD file

  • cell_attrs_key (str) – H5 group under which cell attributes are saved. (Default value: ‘obs’)

  • feature_attrs_key (str) – H5 group under which feature attributes are saved. (Default value: ‘var’)

  • cell_ids_key (str) – Key in obs group that contains unique cell IDs. By default the index will be used.

  • feature_ids_key (str) – Key in var group that contains unique feature IDs. By default the index will be used.

  • feature_name_key (str) – Key in var group that contains feature names. (Default: gene_short_name)

  • matrix_key (str) – Group wherein the sparse matrix resides (default: ‘X’)

  • category_names_key (str) – Looks up this group and replaces the values in var and ‘obs’ child datasets with the corresponding index value within this group.

  • dtype (Optional[str]) – Numpy dtype of the matrix data. This dtype is enforced when streaming the data through consume method. (Default value: Automatically determined)

h5#

A File object from the h5py package.

matrix_key#

Group wherein the sparse matrix resides (default: ‘X’)

cellAttrsKey#

Group wherein the cell attributes are present

featureAttrsKey#

Group wherein the feature attributes are present

groupCodes#

Used to ensure compatibility with different AnnData versions.

nFeatures#

Number of features in dataset.

nCells#

Number of cells in dataset.

cellIdsKey#

Key in obs group that contains unique cell IDs. By default the index will be used.

featIdsKey#

Key in var group that contains unique feature IDs. By default the index will be used.

featNamesKey#

Key in var group that contains feature names. (Default: gene_short_name)

catNamesKey#

Looks up this group and replaces the values in var and ‘obs’ child datasets with the corresponding index value within this group.

matrixDtype#

dtype of the matrix containing the data (as indicated by matrix_key)

cell_ids()#

Returns a list of cell IDs.

Return type:

ndarray

consume(batch_size)#

Returns a generator that yields chunks of data.

consume_dataset(batch_size=1000)#

Returns a generator that yields chunks of data.

Return type:

Generator[coo_matrix, None, None]

consume_group(batch_size)#

Returns a generator that yields chunks of data.

Return type:

Generator[coo_matrix, None, None]

feat_ids()#

Returns a list of feature IDs.

Return type:

ndarray

feat_names()#

Returns a list of feature names.

Return type:

ndarray

get_cell_columns()#

Creates a Generator that yields the cell columns.

Return type:

Generator[Tuple[str, ndarray], None, None]

get_feat_columns()#

Creates a Generator that yields the feature columns.

Return type:

Generator[Tuple[str, ndarray], None, None]

Loom reader#

class scarf.readers.LoomReader(loom_fn, matrix_key='matrix', cell_attrs_key='col_attrs', cell_names_key='obs_names', feature_attrs_key='row_attrs', feature_names_key='var_names', feature_ids_key=None, dtype=None)#

A class to read in data in the form of a Loom file.

Parameters:
  • loom_fn (str) – Path to loom format file.

  • matrix_key (str) – Child node under HDF5 file root wherein the chunked matrix is stored. (Default value: matrix). This matrix is expected to be of form (nFeatures x nCells)

  • cell_attrs_key – Child node under the HDF5 file wherein the cell attributes are stored. (Default value: col_attrs)

  • cell_names_key (str) – Child node under the cell_attrs_key wherein the cell names are stored. (Default value: obs_names)

  • feature_attrs_key (str) – Child node under the HDF5 file wherein the feature/gene attributes are stored. (Default value: row_attrs)

  • feature_names_key (str) – Child node under the feature_attrs_key wherein the feature/gene names are stored. (Default value: var_names)

  • feature_ids_key (Optional[str]) – Child node under the feature_attrs_key wherein the feature/gene ids are stored. (Default value: None)

  • dtype (Optional[str]) – Numpy dtype of the matrix data. This dtype is enforced when streaming the data through consume method. (Default value: Automatically determined)

h5#

A File object from the h5py package.

matrixKey#

Child node under HDF5 file root wherein the chunked matrix is stored.

cellAttrsKey#

Child node under the HDF5 file wherein the cell attributes are stored.

featureAttrsKey#

Child node under the HDF5 file wherein the feature/gene attributes are stored.

cellNamesKey#

Child node under the cell_attrs_key wherein the cell names are stored.

featureNamesKey#

Child node under the feature_attrs_key wherein the feature/gene names are stored.

featureIdsKey#

Child node under the feature_attrs_key wherein the feature/gene ids are stored.

matrixDtype#

Numpy dtype of the matrix data.

nFeatures#

Number of features in dataset.

nCells#

Number of cells in dataset.

cell_ids()#

Returns a list of cell IDs.

Return type:

List[str]

cell_names()#

Returns a list of names of the cells in the dataset.

Return type:

List[str]

consume(batch_size=1000)#

Returns a generator that yields chunks of data.

Return type:

Generator[ndarray, None, None]

feature_ids()#

Returns a list of feature IDs.

Return type:

List[str]

feature_names()#

Returns a list of feature names.

Return type:

List[str]

get_cell_attrs()#

Returns a Generator that yields the cells’ attributes.

Return type:

Generator[Tuple[str, ndarray], None, None]

get_feature_attrs()#

Returns a Generator that yields the features’ attributes.

Return type:

Generator[Tuple[str, ndarray], None, None]

Nabo H5 reader#

class scarf.readers.NaboH5Reader(h5_fn)#

A class to read in data in the form of a Nabo H5 file.

Parameters:

h5_fn (str) – Path to H5 file.

h5#

A File object from the h5py package.

nCells#

Number of cells in dataset.

nFeatures#

Number of features in dataset.

cell_ids()#

Returns a list of cell IDs.

Return type:

List[str]

consume(batch_size=100)#

Returns a generator that yields chunks of data.

Return type:

Generator[ndarray, None, None]

feat_ids()#

Returns a list of feature IDs.

Return type:

ndarray

feat_names()#

Returns a list of feature names.

Return type:

List[str]

Writer classes#

Cellranger to Zarr#

class scarf.writers.CrToZarr(cr, zarr_fn, chunk_size=(1000, 1000), dtype='uint32')#

A class for converting data in the Cellranger format to a Zarr hierarchy.

Parameters:
  • cr (CrReader) – A CrReader object, containing the Cellranger data.

  • zarr_fn (str) – The file name for the Zarr hierarchy.

  • chunk_size – The requested size of chunks to load into memory and process.

  • dtype (str) – the dtype of the data.

cr#

A CrReader object, containing the Cellranger data.

fn#

The file name for the Zarr hierarchy.

chunkSizes#

The requested size of chunks to load into memory and process.

z#

The Zarr hierarchy (array or group).

dump(batch_size=1000, lines_in_mem=100000)#

Writes the count values into the Zarr matrix.

Parameters:
  • batch_size (int) – Number of cells to save at a time. (Default value: 1000)

  • lines_in_mem (int) – Number of lines to read at a time from MTX file (only used for CrDirReader) (Default value: 100000)

Raises:

AssertionError – Guards against possible bugs in the class; raised if the number of cells does not match after transformation.

Return type:

None

Returns:

None
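
The batch-wise writing and the final consistency check can be sketched as below. This is a simplified stand-in, not the writer's actual code: `dump_sketch` is a hypothetical helper and a Python list replaces the Zarr matrix.

```python
def dump_sketch(batches, store, n_cells_expected):
    """Append batches of cell rows to store and verify the final cell count."""
    written = 0
    for batch in batches:
        store.extend(batch)  # stand-in for writing a slice of the Zarr matrix
        written += len(batch)
    # Mirrors the documented check: a mismatch signals a bug in the conversion
    assert written == n_cells_expected, "number of cells does not match"
    return written

store = []
n_written = dump_sketch([[[1, 0], [0, 2]], [[3, 3]]], store, n_cells_expected=3)
```

Writing in batches keeps only `batch_size` cells in memory at a time, which is why larger values trade memory for fewer write operations.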

H5ad (Anndata) to Zarr#

class scarf.writers.H5adToZarr(h5ad, zarr_fn, assay_name=None, chunk_size=(1000, 1000))#

A class for converting data in anndata’s H5ad format to Zarr hierarchy.

Parameters:
  • h5ad (H5adReader) – A H5adReader object, containing the H5ad (AnnData) data.

  • zarr_fn (str) – The file name for the Zarr hierarchy.

  • assay_name (Optional[str]) – The name of the assay (e.g. ‘RNA’)

  • chunk_size – The requested size of chunks to load into memory and process.

h5ad#

A h5ad object (h5 file with added AnnData structure).

fn#

The file name for the Zarr hierarchy.

chunkSizes#

The requested size of chunks to load into memory and process.

assayName#

The name of the assay.

z#

The Zarr hierarchy (array or group).

dump(batch_size=1000)#
Raises:

AssertionError – Guards against possible bugs in the class; raised if the number of cells does not match after transformation.

Return type:

None

Returns:

None

Nabo H5 to Zarr#

class scarf.writers.NaboH5ToZarr(h5, zarr_fn, assay_name=None, chunk_size=(1000, 1000), dtype='uint32')#

A class for converting data in a h5 file generated by Nabo, to a Zarr hierarchy.

Parameters:
  • h5 (NaboH5Reader) – A Nabo h5 object containing the data.

  • zarr_fn (str) – The file name for the Zarr hierarchy.

  • assay_name (Optional[str]) – The name of the assay (e.g. ‘RNA’)

  • chunk_size – The requested size of chunks to load into memory and process.

  • dtype (str) – the dtype of the data.

h5#

A Nabo h5 object.

fn#

The file name for the Zarr hierarchy.

chunkSizes#

The requested size of chunks to load into memory and process.

assayName#

The name of the assay.

z#

The Zarr hierarchy (array or group).

dump(batch_size=500)#
Raises:

AssertionError – Guards against possible bugs in the class; raised if the number of cells does not match after transformation.

Return type:

None

Returns:

None

Loom to Zarr#

class scarf.writers.LoomToZarr(loom, zarr_fn, assay_name=None, chunk_size=(1000, 1000))#

A class for converting data in a Loom file to a Zarr hierarchy. Converts a Loom file read using scarf.LoomReader into Scarf’s Zarr format.

Parameters:
  • loom (LoomReader) – LoomReader object used to open Loom format file

  • zarr_fn (str) – Output Zarr filename with path

  • assay_name (Optional[str]) – Name for the output assay. If not provided then automatically set to RNA

  • chunk_size – Chunk size for the count matrix saved in Zarr file.

loom#

A scarf.LoomReader object used to open Loom format file.

fn#

The file name for the Zarr hierarchy.

chunkSizes#

The requested size of chunks to load into memory and process.

assayName#

The name of the assay.

z#

The Zarr hierarchy (array or group).

dump(batch_size=1000)#
Raises:

AssertionError – Guards against possible bugs in the class; raised if the number of cells does not match after transformation.

Return type:

None

Returns:

None

Zarr Merge#

class scarf.writers.ZarrMerge(zarr_path, assays, names, merge_assay_name, chunk_size=(1000, 1000), dtype=None, overwrite=False, prepend_text='orig', reset_cell_filter=True)#

Merge multiple Zarr files into a single Zarr file.

Parameters:
  • zarr_path (str) – Name of the new, merged Zarr file with path.

  • assays (list) – List of assay objects to be merged. For example, [ds1.RNA, ds2.RNA].

  • names (List[str]) – Names of each of the assay objects in the assays parameter. They should be in the same order as in assays parameter.

  • merge_assay_name (str) – Name of assay in the merged Zarr file. For example, for scRNA-Seq it could be simply, ‘RNA’.

  • chunk_size – Tuple of cell and feature chunk size. (Default value: (1000, 1000)).

  • dtype (Optional[str]) – Dtype of the raw values in the assay. Dtype is automatically inferred from the provided assays. If assays have different dtypes then a float type is used.

  • overwrite (bool) – If True, then overwrites previously created assay in the Zarr file. (Default value: False).

  • prepend_text (str) – This text is prepended to each column name (Default value: ‘orig’).

  • reset_cell_filter (bool) – If True, then the cell filtering information is removed, i.e. even the filtered out cells are set to True in the ‘I’ column. To keep the filtering information set the value for this parameter to False. (Default value: True)

assays#

List of assay objects to be merged. For example, [ds1.RNA, ds2.RNA].

names#

Names of each of the assay objects in the assays parameter.

mergedCells#
nCells#

Number of cells in dataset.

featCollection#
mergedFeats#
nFeats#

Number of features in the dataset.

featOrder#
z#

The merged Zarr file.

assayGroup#
dump(nthreads=2)#

Copy the values from individual assays to the merged assay.

Parameters:

nthreads – Number of compute threads to use. (Default value: 2)
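
One way to picture how features from several assays are combined (cf. the mergedFeats and featOrder attributes) is sketched below. It assumes that the merged assay contains the union of all features and that each input assay is mapped onto columns of the merged matrix; this assumption is not confirmed by this reference, and `merge_features` is a hypothetical helper.

```python
def merge_features(feature_ids_per_assay):
    """Return the union of feature IDs (first-seen order) and per-assay column positions."""
    merged = []
    seen = set()
    for ids in feature_ids_per_assay:
        for fid in ids:
            if fid not in seen:
                seen.add(fid)
                merged.append(fid)
    position = {fid: i for i, fid in enumerate(merged)}
    # For each assay, the merged-matrix column that each of its features maps to
    order = [[position[fid] for fid in ids] for ids in feature_ids_per_assay]
    return merged, order

merged, order = merge_features([["g1", "g2"], ["g2", "g3"]])
```

With such a mapping, the values of each assay can be copied into the right columns of the merged matrix, leaving zeros where an assay lacks a feature.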

Subset Zarr#

class scarf.writers.SubsetZarr(in_zarr, out_zarr, cell_key=None, cell_idx=None, reset_cell_filter=True, overwrite_existing_file=False, overwrite_cell_data=False)#

Split Zarr file using a subset of cells.

Parameters:
  • in_zarr (str) – Path of input Zarr file to be subsetted.

  • out_zarr (str) – Path of output Zarr files containing only a subset of cells.

  • cell_key (Optional[str]) – Name of a boolean column in cell metadata. The cells with value True are included in the subset.

  • cell_idx (Optional[ndarray]) – Indices of the cells to be included in the subset. Only used when cell_key is None.

  • reset_cell_filter (bool) – If True, then the cell filtering information is removed, i.e. even the filtered out cells are set to True in the ‘I’ column. To keep the filtering information set the value for this parameter to False. (Default value: True)

  • overwrite_existing_file (bool) – If True, then overwrites the existing data. (Default value: False)

  • overwrite_cell_data (bool) – If True, then overwrites cell data (Default value: False)
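
The precedence between cell_key and cell_idx can be sketched as follows. `select_cells` is a hypothetical helper and a list of booleans stands in for the boolean column named by cell_key; the error raised when neither argument is given is a choice made for this sketch.

```python
def select_cells(cell_key_values=None, cell_idx=None):
    """Return the indices of the cells to keep in the subset."""
    if cell_key_values is not None:
        # A boolean column takes precedence: keep the cells marked True
        return [i for i, keep in enumerate(cell_key_values) if keep]
    if cell_idx is not None:
        # Explicit indices are only consulted when no cell key is given
        return list(cell_idx)
    raise ValueError("provide either cell_key_values or cell_idx")

by_key = select_cells(cell_key_values=[True, False, True], cell_idx=[1])
by_idx = select_cells(cell_idx=[1, 2])
```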