API

BaseDataStore

This is the base datastore class that deals with loading of assays from Zarr files and generating basic cell statistics like nCounts and nFeatures.

class scarf.datastore.BaseDataStore(zarr_loc, assay_types, default_assay, min_features_per_cell, min_cells_per_feature, mito_pattern, ribo_pattern, nthreads, zarr_mode, synchronizer)

This is the base datastore class that deals with loading of assays from Zarr files and generating basic cell statistics like nCounts and nFeatures. Superclass of the other DataStores.

cells

list of cell barcodes

assayNames

list of assay names in Zarr file, e.g. ‘RNA’ or ‘ATAC’

nthreads

number of threads to use for this datastore instance

z

the Zarr file (directory) used for this datastore instance

get_cell_vals()

fetches data from the Zarr file

set_default_assay()

override assigning of default assay

get_cell_vals(*, from_assay, cell_key, k, clip_fraction=0)

Fetches data from the Zarr file.

This convenience function allows fetching values for cells from either cell metadata table or values of a given feature from normalized matrix.

Args:

from_assay: Name of assay to be used. If no value is provided then the default assay will be used.

cell_key: One of the columns from cell metadata table that indicates the cells to be used. The values in the chosen column should be boolean. (Default value: ‘I’)

k: A cell metadata column or name of a feature.

clip_fraction: This value is multiplied by 100 and the percentiles are soft-clipped from either end. (Default value: 0)

Returns:

The requested values.
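The effect of clip_fraction can be illustrated with a minimal sketch. The helper name `clip_by_fraction` is hypothetical and not part of Scarf, and a plain hard clip at the percentile bounds is assumed here; the library's own soft-clipping may differ in detail.

```python
import numpy as np

def clip_by_fraction(values, clip_fraction=0.0):
    """Clip values to the [p, 100 - p] percentile range, where p = 100 * clip_fraction.

    Hypothetical helper for illustration only; not Scarf's implementation.
    """
    values = np.asarray(values, dtype=float)
    if clip_fraction <= 0:
        return values.copy()
    p = 100 * clip_fraction
    low, high = np.percentile(values, [p, 100 - p])
    return np.clip(values, low, high)

vals = np.arange(101, dtype=float)
clipped = clip_by_fraction(vals, clip_fraction=0.1)
print(clipped.min(), clipped.max())  # 10.0 90.0
```

With clip_fraction=0.1 the lowest and highest 10% of values are pulled in to the 10th and 90th percentiles, which limits the influence of outliers when the values are plotted.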

set_default_assay(assay_name)

Override default assay

Parameters

assay_name (str) – Name of the assay that should be set as default

Returns:

Raises

ValueError – if assay_name is not found in attribute assayNames

Return type

None

GraphDataStore

This class extends BaseDataStore by providing methods required to generate a cell-cell neighbourhood graph. It also contains all the methods that use the KNN graphs as primary input like UMAP/tSNE embedding calculation, clustering, down-sampling etc.

class scarf.datastore.GraphDataStore(**kwargs)

This class extends BaseDataStore by providing methods required to generate a cell-cell neighbourhood graph.

It also contains all the methods that use the KNN graphs as primary input like UMAP/tSNE embedding calculation, clustering, down-sampling etc.

cells

list of cell barcodes

assayNames

list of assay names in Zarr file, e.g. ‘RNA’ or ‘ATAC’

nthreads

number of threads to use for this datastore instance

z

the Zarr file (directory) used for this datastore instance

get_imputed()
load_graph()
make_graph()
run_clustering()
run_leiden_clustering()
run_pseudotime_scoring()
run_topacedo_sampler()
run_tsne()
run_umap()
get_imputed(*, from_assay=None, cell_key=None, feature_name=None, feat_key=None, t=2, cache_operator=True)
Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feature_name (Optional[str]) – Name of the feature to be imputed

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • t (int) – Same as the t parameter in MAGIC. Higher values lead to larger diffusion of values. Too large values can slow down the algorithm and cause over-smoothening. (Default value: 2)

  • cache_operator (bool) – Whether to keep the diffusion operator in memory after the method returns. Setting this to True can be useful if many features are to be imputed in a batch, but can lead to increased memory usage. (Default value: True)

Returns: An array of imputed values for the given feature

Return type

ndarray
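The role of t can be sketched as repeated application of a row-normalised diffusion operator built from the graph. This is a simplified illustration in the spirit of MAGIC, not Scarf's implementation; the function name `diffuse` and the toy graph are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def diffuse(graph, values, t=2):
    """Smooth `values` over `graph` by applying a row-normalised operator t times."""
    graph = csr_matrix(graph, dtype=float)
    row_sums = np.asarray(graph.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for isolated cells
    operator = diags(1.0 / row_sums) @ graph  # each non-empty row now sums to 1
    out = np.asarray(values, dtype=float)
    for _ in range(t):
        out = operator @ out
    return out

# Three cells on a chain (with self-loops); only the first expresses the feature.
graph = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
expr = np.array([3.0, 0.0, 0.0])
print(diffuse(graph, expr, t=1))  # signal reaches the direct neighbour only
print(diffuse(graph, expr, t=2))  # signal reaches the cell two steps away
```

Each extra step spreads values one neighbourhood further, which is why large t values lead to over-smoothing.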

load_graph(*, from_assay, cell_key, feat_key, symmetric, upper_only, use_k=None)

Load the cell neighbourhood as a scipy sparse matrix

Parameters
  • from_assay (str) – Name of the assay

  • cell_key (str) – Cell key used to create the graph

  • feat_key (str) – Feature key used to create the graph

  • symmetric (bool) – If True, makes the graph symmetric by adding it to its transpose.

  • upper_only (bool) – If True, then only the values from upper triangular of the matrix are returned. This is only used when symmetric is True

  • use_k (Optional[int]) – Number of top k-nearest neighbours to keep in the graph. This value must be greater than 0 and less than the parameter k used. By default all neighbours are used (Default value: None)

Return type

csr_matrix

Returns

A scipy sparse matrix representing cell neighbourhood graph.
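The symmetric and upper_only options can be pictured with plain SciPy operations. The sketch below assumes only what the parameter descriptions state (adding the graph to its transpose, then keeping the upper triangle); the helper name `symmetrize` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, triu

def symmetrize(graph, upper_only=False):
    """Add a sparse graph to its transpose; optionally keep the upper triangle only."""
    sym = graph + graph.T
    if upper_only:
        sym = triu(sym)
    return csr_matrix(sym)

# Directed KNN-style graph for three cells
knn = csr_matrix(np.array([[0.0, 0.8, 0.0],
                           [0.0, 0.0, 0.5],
                           [0.3, 0.0, 0.0]]))
full = symmetrize(knn)
upper = symmetrize(knn, upper_only=True)
print(full.toarray())
print(upper.toarray())
```

Keeping only the upper triangle stores each undirected edge once, which is convenient for algorithms that expect an edge list without duplicates.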

make_graph(*, from_assay=None, cell_key=None, feat_key=None, pca_cell_key=None, reduction_method='auto', dims=None, k=None, ann_metric=None, ann_efc=None, ann_ef=None, ann_m=None, ann_parallel=False, rand_state=None, n_centroids=None, batch_size=None, log_transform=None, renormalize_subset=None, local_connectivity=None, bandwidth=None, update_keys=True, return_ann_object=False, custom_loadings=None, feat_scaling=True, show_elbow_plot=False)

Creates a cell neighbourhood graph. Performs the following steps in the process:

  • normalizes the data by calling save_normalized_data for the assay

  • instantiates ANNStream class which performs dimension reduction, feature scaling (optional) and fits ANN index

  • queries ANN index for nearest neighbours and saves the distances and indices of the neighbours

  • recalculates the distances to bound them between 0 and 1 (check out knn_utils module for details)

  • saves the indices and distances in sparse graph friendly form

  • fits a MiniBatch kmeans on the data

The data for all the steps is saved in the Zarr in the following hierarchy which is organized based on data dependency. Parameter values for each step are incorporated into group names in the hierarchy:

RNA
├── normed__I__hvgs
│   ├── data (7648, 2000) float64                 # Normalized data
│   └── reduction__pca__31__I                     # Dimension reduction group
│       ├── mu (2000,) float64                    # Means of normalized feature values
│       ├── sigma (2000,) float64                 # Std dev. of normalized feature values
│       ├── reduction (2000, 31) float64          # PCA loadings matrix
│       ├── ann__l2__63__63__48__4466             # ANN group named with ANN parameters
│       │   └── knn__21                           # KNN group with value of k in name
│       │       ├── distances (7648, 21) float64  # Raw distance matrix for k neighbours
│       │       ├── indices (7648, 21) uint64     # Indices for k neighbours
│       │       └── graph__1.0__1.5               # sparse graph with continuous form distance values
│       │           ├── edges (160608, 2) uint64
│       │           └── weights (160608,) float64
│       └── kmeans__100__4466                     # Kmeans groups
│           ├── cluster_centers (100, 31) float64 # Centroid matrix
│           └── cluster_labels (7648,) float64    # Cluster labels for cells
...

The most recent child of each hierarchy node is noted for quick retrieval and in cases where multiple child nodes exist. Parameters starting with ann are forwarded to HNSWlib. More details about these parameters can be found here: https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
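In the example hierarchy, the edges array has 160608 rows, which equals 7648 cells × 21 neighbours, so it appears to hold one row per cell-neighbour pair. A minimal sketch of that "sparse graph friendly form", and of turning it back into a SciPy matrix, could look as follows. The helper names are hypothetical and the toy arrays are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def knn_to_edges(indices, weights):
    """Flatten (n_cells, k) neighbour arrays into an (n_cells * k, 2) edge list."""
    n_cells, k = indices.shape
    sources = np.repeat(np.arange(n_cells), k)
    edges = np.column_stack([sources, indices.ravel()])
    return edges, weights.ravel()

def edges_to_csr(edges, weights, n_cells):
    """Build a sparse cell-by-cell matrix from an edge list and edge weights."""
    return csr_matrix((weights, (edges[:, 0], edges[:, 1])), shape=(n_cells, n_cells))

# Toy data: 3 cells, 2 neighbours each
indices = np.array([[1, 2], [0, 2], [1, 0]])
weights = np.array([[1.0, 0.4], [1.0, 0.6], [1.0, 0.2]])
edges, flat_weights = knn_to_edges(indices, weights)
graph = edges_to_csr(edges, flat_weights, n_cells=3)
print(edges.shape, flat_weights.shape)  # (6, 2) (6,)
```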

Parameters
  • from_assay (Optional[str]) – Assay to use for graph creation. If no value is provided then defaultAssay will be used

  • cell_key (Optional[str]) – Cells to use for graph creation. By default all cells with True value in ‘I’ will be used. The provided value for cell_key should be a column in cell metadata table with boolean values.

  • feat_key (Optional[str]) – Features to use for graph creation. It is a required parameter. We have chosen not to set this to ‘I’ by default because this might lead to usage of too many features and may lead to poor results. The value for feat_key should be a column in feature metadata from the from_assay assay and should be boolean type.

  • pca_cell_key (Optional[str]) – Name of a column from cell metadata table. This column should be boolean type. If no value is provided then the value is set to same as cell_key, which means all the cells in the normalized data will be used for fitting the PCA. This parameter, hence, provides a mechanism to subset the normalized data only for the PCA fitting step. This can be useful when, for example, the data has cells from multiple replicates which won’t merge together, in which case pca_cell_key can be used to fit PCA on cells from only one of the replicates.

  • reduction_method (str) – Method to use for linear dimension reduction. Could be either ‘pca’, ‘lsi’ or ‘auto’. In case of ‘auto’ _choose_reduction_method will be used to determine best reduction type for the assay.

  • dims (Optional[int]) – Number of top reduced dimensions to use (Default value: 11)

  • k (Optional[int]) – Number of nearest neighbours to query for each cell (Default value: 11)

  • ann_metric (Optional[str]) – Refer to HNSWlib link above (Default value: ‘l2’)

  • ann_efc (Optional[int]) – Refer to HNSWlib link above (Default value: min(100, max(k * 3, 50)))

  • ann_ef (Optional[int]) – Refer to HNSWlib link above (Default value: min(100, max(k * 3, 50)))

  • ann_m (Optional[int]) – Refer to HNSWlib link above (Default value: min(max(48, int(dims * 1.5)), 64) )

  • ann_parallel (bool) – If True, then ANN graph is created in parallel mode using DataStore.nthreads number of threads. Results obtained in parallel mode will not be reproducible. (Default: False)

  • rand_state (Optional[int]) – Random seed number (Default value: 4466)

  • n_centroids (Optional[int]) – Number of centroids for Kmeans clustering. As a general indication, have a value of 1+ for every 100 cells. For small (<2000 cells) and very small (<500 cells) datasets, use a ballpark number for the maximum expected number of clusters. The results of kmeans clustering are only used to provide initial embedding for UMAP and tSNE. (Default value: 500)

  • batch_size (Optional[int]) – Number of cells in a batch. This number is guided by number of features being used and the amount of available free memory. Although the full data is already divided into chunks, if only a fraction of features are being used in the normalized dataset, then the chunk size can be increased to speed up the computation (i.e. PCA fitting and ANN index building). (Default value: 1000)

  • log_transform (Optional[bool]) – If True, then the normalized data is log-transformed (only affects RNAassay type assays). (Default value: True)

  • renormalize_subset (Optional[bool]) – If True, then the data is normalized using only those features that are True in feat_key column rather than using total expression of all features in a cell (only affects RNAassay type assays). (Default value: True)

  • local_connectivity (Optional[float]) – This parameter is forwarded to smooth_knn_dist function from UMAP package. Higher value will push distribution of edge weights towards terminal values (binary like). Lower values will accumulate edge weights around the mean produced by bandwidth parameter. (Default value: 1.0)

  • bandwidth (Optional[float]) – This parameter is forwarded to smooth_knn_dist function from UMAP package. Higher value will push the mean of distribution of graph edge weights towards right. (Default value: 1.5). Read more about smooth_knn_dist function here: https://umap-learn.readthedocs.io/en/latest/api.html#umap.umap_.smooth_knn_dist

  • update_keys (bool) – If True (default) then latest_feat_key zarr attribute of the assay will be updated. Choose False if you are experimenting with a feat_key and do not want to override the existing latest_feat_key and, by extension, latest_graph.

  • return_ann_object (bool) – If True then returns the ANNStream object. This allows one to directly interact with the PCA transformer and HNSWlib index. Check out ANNStream documentation to know more. (Default: False)

  • custom_loadings (Optional[array]) – Custom loadings/transformer for linear dimension reduction. If provided, should have a form (d x p) where d is same as the number of active features in feat_key and p is the number of reduced dimensions. dims parameter is ignored when this is provided. (Default value: None)

  • feat_scaling (bool) – If True (default) then the feature will be z-scaled otherwise not. It is highly recommended to keep this as True unless you know what you are doing. feat_scaling is internally turned off during cross sample mapping when CORAL normalized values are being used. Read more about this in run_mapping method.

  • show_elbow_plot (bool) – If True, then an elbow plot is shown when PCA is fitted to the data. Not shown when using existing PCA loadings or custom loadings. (Default value: False)

Returns

Either None or AnnStream object
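The default expressions listed for ann_efc, ann_ef and ann_m can be evaluated directly. The sketch below reproduces the ANN group name from the example hierarchy (k=21, dims=31, rand_state=4466). The helper names are hypothetical, and the ordering of fields in the group name (metric, efc, ef, m, random state) is inferred from that example rather than taken from the Scarf source.

```python
def default_ann_params(k, dims):
    """Evaluate the documented default expressions for the HNSWlib parameters."""
    ann_efc = min(100, max(k * 3, 50))
    ann_ef = min(100, max(k * 3, 50))
    ann_m = min(max(48, int(dims * 1.5)), 64)
    return ann_efc, ann_ef, ann_m

def ann_group_name(k, dims, ann_metric="l2", rand_state=4466):
    """Compose a group name in the style seen in the example hierarchy (assumed order)."""
    efc, ef, m = default_ann_params(k, dims)
    return f"ann__{ann_metric}__{efc}__{ef}__{m}__{rand_state}"

print(default_ann_params(k=11, dims=11))  # (50, 50, 48)
print(ann_group_name(k=21, dims=31))      # ann__l2__63__63__48__4466
```

Because parameter values are embedded in the group names, changing any of them creates a new branch in the Zarr hierarchy instead of overwriting earlier results.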

run_clustering(*, from_assay=None, cell_key=None, feat_key=None, n_clusters=None, symmetric_graph=False, graph_upper_only=False, balanced_cut=False, max_size=None, min_size=None, max_distance_fc=2, force_recalc=False, label='cluster')

Executes Paris clustering algorithm (https://arxiv.org/pdf/1806.01664.pdf) on the cell-neighbourhood graph. The algorithm captures the multiscale structure of the graph into an ordinary dendrogram structure. The distances in the dendrogram are based on the probability of sampling node (aka cell) pairs. This method creates this dendrogram if it doesn’t already exist for the graph and induces either a straight cut or balanced cut to obtain clusters of cells.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • n_clusters (Optional[int]) – Number of desired clusters (required if balanced_cut is False)

  • symmetric_graph (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • graph_upper_only (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • balanced_cut (bool) – If True, then uses the balanced cut algorithm as implemented in BalancedCut to obtain clusters (Default value: False)

  • max_size (Optional[int]) – Same as max_size in BalancedCut. The limit for a maximum number of cells in a cluster. This parameter value is required if balanced_cut is True.

  • min_size (Optional[int]) – Same as min_size in BalancedCut. The limit for a minimum number of cells in a cluster. This parameter value is required if balanced_cut is True.

  • max_distance_fc (float) – Same as max_distance_fc in BalancedCut. The threshold of ratio of distance between two clusters beyond which they will not be merged. (Default value: 2)

  • force_recalc (bool) – Forces recalculation of dendrogram even if one already exists for the graph

  • label (str) – Base label for cluster identity in the cell metadata column (Default value: ‘cluster’)

Returns:

Return type

None
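A "straight cut" means cutting the dendrogram at a single level so that exactly n_clusters groups remain. The sketch below illustrates the idea with SciPy's average-linkage dendrogram on toy 2D points. Note that this is a stand-in: SciPy does not implement the Paris algorithm, and Scarf builds its dendrogram from the cell-neighbourhood graph rather than from coordinates.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated groups of five points each (toy data)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])

# Build a dendrogram (average linkage used here as a stand-in for Paris)
dendrogram = linkage(points, method="average")

# Straight cut: ask for at most n_clusters flat clusters
n_clusters = 2
labels = fcluster(dendrogram, t=n_clusters, criterion="maxclust")
print(labels)
```

A balanced cut differs in that it chooses different cut heights in different branches, guided by size limits such as max_size and min_size.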

run_leiden_clustering(*, from_assay=None, cell_key=None, feat_key=None, resolution=1, symmetric_graph=False, graph_upper_only=False, label='leiden_cluster', random_seed=4444)

Executes Leiden graph clustering algorithm on the cell-neighbourhood graph and saves cluster identities in the cell metadata column.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • resolution (int) – Resolution parameter for RBConfigurationVertexPartition configuration

  • symmetric_graph (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • graph_upper_only (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • label (str) – base label for cluster identity in the cell metadata column (Default value: ‘leiden_cluster’)

  • random_seed (int) – (Default value: 4444)

Returns:

Return type

None

run_pseudotime_scoring(*, from_assay=None, cell_key=None, feat_key=None, k_singular=20, r_vec=None, label='pseudotime')

Calculate differentiation potential of cells. This function is a reimplementation of the population balance analysis (PBA) approach published in Weinreb et al. 2017, PNAS. This function computes the random walk normalized Laplacian matrix of the reference graph, L_rw = I - A/D, and then calculates a Moore-Penrose pseudoinverse of L_rw. The method takes an optional but recommended parameter r_vec, which represents the relative rates of proliferation and loss in different gene expression states (R). If not provided then a vector of ones is used. The differentiation potential is the dot product of the pseudoinverse of L_rw and R.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • k_singular (int) – Number of smallest singular values to save

  • r_vec (Optional[ndarray]) – Same as parameter R in the reference cited above.

  • label (str) –

Returns:

Return type

None
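The computation described above can be sketched on a small dense graph. This is an illustration of the formula only, assuming every cell has at least one edge; it is not Scarf's implementation, which works on the sparse KNN graph and exposes k_singular. The function name and toy graphs are hypothetical.

```python
import numpy as np

def differentiation_potential(adjacency, r_vec=None):
    """Return pinv(L_rw) @ R, where L_rw = I - D^-1 A (dense toy version)."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    degree = adjacency.sum(axis=1)  # assumes no isolated cells (degree > 0)
    l_rw = np.eye(n) - adjacency / degree[:, None]
    l_inv = np.linalg.pinv(l_rw, rcond=1e-10)  # Moore-Penrose pseudoinverse
    if r_vec is None:
        r_vec = np.ones(n)
    return l_inv @ np.asarray(r_vec, dtype=float)

# Chain of three cells with net proliferation at one end and net loss at the other
chain = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
potential = differentiation_potential(chain, r_vec=[1.0, 0.0, -1.0])
print(potential)

# On a fully symmetric graph a uniform R carries no information
triangle = np.ones((3, 3)) - np.eye(3)
flat = differentiation_potential(triangle)
print(flat)
```

The second example hints at why supplying r_vec is recommended: on a regular graph, a vector of ones yields a potential of zero for every cell.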

run_topacedo_sampler(*, from_assay=None, cell_key=None, feat_key=None, cluster_key=None, use_k=None, density_depth=2, density_bandwidth=5.0, max_sampling_rate=0.05, min_sampling_rate=0.01, min_cells_per_group=3, snn_bandwidth=5.0, seed_reward=3.0, non_seed_reward=0, edge_cost_multiplier=1.0, edge_cost_bandwidth=10.0, save_sampling_key='sketched', save_density_key='cell_density', save_mean_snn_key='snn_value', save_seeds_key='sketch_seeds', rand_state=4466, return_edges=False)

Perform sub-sampling (aka sketching) of cells using TopACeDo algorithm. Sub-sampling requires that cells are already partitioned into clusters. Since sub-sampling is dependent on cluster information, having a large number of homogeneous and even-sized clusters improves sub-sampling results.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • cluster_key (Optional[str]) – Name of the column in cell metadata table where cluster information is stored.

  • use_k (Optional[int]) – Number of top k-nearest neighbours to retain in the graph over which downsampling is performed. By default all neighbours are used. (Default value: None)

  • density_depth (int) – Same as ‘search_depth’ parameter in calc_neighbourhood_density. (Default value: 2)

  • density_bandwidth (float) – This value is used to scale the penalty affected by neighbourhood density. Higher values will lead to a larger penalty. (Default value: 5.0)

  • max_sampling_rate (float) – Maximum fraction of cells to sample from each group. The effective sampling rate is lower than this value depending on the neighbourhood degree and SNN density of cells. Should be greater than 0 and less than 1. (Default value: 0.05)

  • min_sampling_rate (float) – Minimum sampling rate. Effective sampling rate is not allowed to be lower than this value. Should be greater than 0 and less than 1. (Default value: 0.01)

  • min_cells_per_group (int) – Minimum number of cells to sample from each group. (Default value: 3)

  • snn_bandwidth (float) – Bandwidth for the shared nearest neighbour reward. Clusters with higher mean SNN values get lower sampling penalty. This value is raised to the mean SNN value of the cluster to obtain the sampling reward of the cluster. (Default value: 5.0)

  • seed_reward (float) – Reward/prize value for seed nodes. (Default value: 3.0)

  • non_seed_reward (float) – Reward/prize for non-seed nodes. (Default value: 0)

  • edge_cost_multiplier (float) – Each edge’s cost is multiplied by this value. Higher values will make graph traversal costly and might lead to removal of poorly connected nodes (Default value: 1.0)

  • edge_cost_bandwidth (float) – This value is raised to edge cost to get an adjusted edge cost (Default value: 10.0)

  • save_sampling_key (str) – base label for marking the cells that were sampled into a cell metadata column (Default value: ‘sketched’)

  • save_density_key (str) – base label for saving the cell neighbourhood densities into a cell metadata column (Default value: ‘cell_density’)

  • save_mean_snn_key (str) – base label for saving the SNN value for each cell (identified by topacedo sampler) into a cell metadata column (Default value: ‘snn_value’)

  • save_seeds_key (str) – base label for saving the seed cells (identified by topacedo sampler) into a cell metadata column (Default value: ‘sketch_seeds’)

  • rand_state (int) – A random value used to set the seed while sampling cells from a cluster randomly. (Default value: 4466)

  • return_edges (bool) – If True, then steiner nodes and edges are returned. (Default value: False)

Returns:

Return type

Optional[List]

run_tsne(*, from_assay=None, cell_key=None, feat_key=None, symmetric_graph=False, graph_upper_only=False, ini_embed=None, tsne_dims=2, lambda_scale=1.0, max_iter=500, early_iter=200, alpha=10, box_h=0.7, temp_file_loc='.', label='tSNE', verbose=True, parallel=False, nthreads=None)

Run SGtSNE-pi (Read more here: https://github.com/fcdimitr/sgtsnepi/tree/v1.0.1). This is an implementation of tSNE that runs directly on graph structures. We use the graphs generated by make_graph method to create a layout of cells using tSNE algorithm. This function makes a system call to sgtSNE binary. To get a better understanding of how the parameters affect the embedding, check this out: http://t-sne-pi.cs.duke.edu/

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • symmetric_graph (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • graph_upper_only (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • ini_embed (Optional[ndarray]) – Initial embedding coordinates for the cells in cell_key. Should have same number of columns as tsne_dims. If no value is provided then the initial embedding is obtained using get_ini_embed.

  • tsne_dims (int) – Number of tSNE dimensions to compute (Default value: 2)

  • lambda_scale (float) – λ rescaling parameter (Default value: 1.0)

  • max_iter (int) – Maximum number of iterations (Default value: 500)

  • early_iter (int) – Number of early exaggeration iterations (Default value: 200)

  • alpha (int) – Early exaggeration multiplier (Default value: 10)

  • box_h (float) – Grid side length (accuracy control). Lower values might drastically slow down the algorithm (Default value: 0.7)

  • temp_file_loc (str) – Location of temporary file. By default these files will be created in the current working directory. These files are deleted before the method returns.

  • label (str) – base label for tSNE dimensions in the cell metadata column (Default value: ‘tSNE’)

  • verbose (bool) – If True (default) then the full log from SGtSNEpi algorithm is shown.

  • parallel (bool) – Whether to run tSNE in parallel mode. Setting value to True will use nthreads threads. The results are not reproducible in parallel mode. (Default value: False)

  • nthreads (Optional[int]) – If parallel=True then this number of threads will be used to run tSNE. By default the nthreads attribute of the class is used. (Default value: None)

Returns:

Return type

None

run_umap(*, from_assay=None, cell_key=None, feat_key=None, symmetric_graph=False, graph_upper_only=False, ini_embed=None, umap_dims=2, spread=2.0, min_dist=1, fit_n_epochs=200, tx_n_epochs=100, set_op_mix_ratio=1.0, repulsion_strength=1.0, initial_alpha=1.0, negative_sample_rate=5, random_seed=4444, label='UMAP', parallel=False, nthreads=None)

Runs UMAP algorithm using the precomputed cell-neighbourhood graph. The calculated UMAP coordinates are saved in the cell metadata table

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • symmetric_graph (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • graph_upper_only (bool) – This parameter is forwarded to load_graph and is same as there. (Default value: False)

  • ini_embed (Optional[ndarray]) – Initial embedding coordinates for the cells in cell_key. Should have same number of columns as umap_dims. If no value is provided then the initial embedding is obtained using get_ini_embed.

  • umap_dims (int) – Number of dimensions of UMAP embedding (Default value: 2)

  • spread (float) – Same as spread in UMAP package. The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

  • min_dist (float) – Same as min_dist in UMAP package. The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. (Default value: 1)

  • fit_n_epochs (int) – Same as n_epochs in UMAP package. The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. (Default value: 200)

  • tx_n_epochs (int) – Number of epochs during transform (Default value: 100)

  • set_op_mix_ratio (float) – Same as set_op_mix_ratio in UMAP package. Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

  • repulsion_strength (float) – Same as repulsion_strength in UMAP package. Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. (Default value: 1.0)

  • initial_alpha (float) – Same as learning_rate in UMAP package. The initial learning rate for the embedding optimization. (Default value: 1.0)

  • negative_sample_rate (float) – Same as negative_sample_rate in UMAP package. The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy. (Default value: 5)

  • random_seed (int) – (Default value: 4444)

  • label – base label for UMAP dimensions in the cell metadata column (Default value: ‘UMAP’)

  • parallel (bool) – Whether to run UMAP in parallel mode. Setting value to True will use nthreads threads. The results are not reproducible in parallel mode. (Default value: False)

  • nthreads (Optional[int]) – If parallel=True then this number of threads will be used to run UMAP. By default the nthreads attribute of the class is used. (Default value: None)

Returns:

Return type

None
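To get a feel for how spread and min_dist interact, the sketch below fits the two parameters (a, b) of the low-dimensional similarity curve 1 / (1 + a·x^(2b)) to a target shape defined by spread and min_dist. To the best of our knowledge this mirrors the curve-fitting step used in the UMAP package; the function name `fit_ab_params` is our own and nothing here is taken from the Scarf source.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab_params(spread, min_dist):
    """Fit a, b so that 1 / (1 + a * x**(2b)) approximates the target membership curve."""
    def curve(x, a, b):
        return 1.0 / (1.0 + a * x ** (2 * b))

    xv = np.linspace(0, spread * 3, 300)
    yv = np.zeros(xv.shape)
    # Target: full similarity below min_dist, exponential decay (scaled by spread) beyond it
    yv[xv < min_dist] = 1.0
    yv[xv >= min_dist] = np.exp(-(xv[xv >= min_dist] - min_dist) / spread)
    (a, b), _ = curve_fit(curve, xv, yv)
    return a, b

# Values matching the run_umap signature defaults
a, b = fit_ab_params(spread=2.0, min_dist=1.0)
print(a, b)
```

Increasing min_dist widens the flat part of the target curve, so points are allowed to sit further apart before they are pulled together, giving a more evenly dispersed layout.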

MappingDatastore

This class extends GraphDataStore by providing methods for mapping/projection of cells from one DataStore onto another. It also contains the methods required for label transfer, mapping score generation and co-embedding.

class scarf.datastore.MappingDatastore(**kwargs)

This class extends GraphDataStore by providing methods for mapping/projection of cells from one DataStore onto another.

It also contains the methods required for label transfer, mapping score generation and co-embedding.

cells

list of cell barcodes

assayNames

list of assay names in Zarr file, e.g. ‘RNA’ or ‘ATAC’

nthreads

number of threads to use for this datastore instance

z

the Zarr file (directory) used for this datastore instance

get_mapping_score()
get_target_classes()
load_unified_graph()
plot_unified_layout()
run_mapping()
plot_unified_tsne()
plot_unified_umap()
get_mapping_score(*, target_name, target_groups=None, from_assay=None, cell_key='I', log_transform=True, multiplier=1000, weighted=True, fixed_weight=0.1)

Mapping scores are an indication of degree of similarity of reference cells in the graph to the target cells. The more often a reference cell is found in the nearest neighbour list of the target cells, the higher will be the mapping score for that cell.

Parameters
  • target_name (str) – Name of target data. This is used to keep track of projections in the Zarr hierarchy

  • target_groups (Optional[ndarray]) – Group/cluster identity of target cells. This will then be used to calculate mapping score for each group separately.

  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (str) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • log_transform (bool) – If True (default) then the mapping scores will be log transformed

  • multiplier (float) – A scaling factor for mapping scores. All scores are multiplied by this value. This is mostly intended for visualization of mapping scores (Default: 1000)

  • weighted (bool) – Use distance weights when calculating mapping scores (default: True). If False then the actual distances between the reference and target cells are ignored.

  • fixed_weight (float) – Used when weighted is False. This is the value that is added to mapping score of each reference cell for every projected target cell. Can be any value >0.

Yields

A tuple of group name and mapping score of reference cells for that target group.

Return type

Generator[Tuple[str, ndarray], None, None]
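The idea behind the mapping score can be sketched as a weighted count of how often each reference cell appears among the nearest neighbours of the target cells. The function below is a hypothetical illustration: the exact normalisation and the order in which the multiplier and log transform are applied in Scarf are assumptions, not taken from its source.

```python
import numpy as np

def mapping_score(nn_indices, nn_weights, n_ref_cells, weighted=True,
                  fixed_weight=0.1, multiplier=1000, log_transform=True):
    """Accumulate a score per reference cell from target-to-reference neighbour lists."""
    idx = np.asarray(nn_indices).ravel()
    if weighted:
        w = np.asarray(nn_weights, dtype=float).ravel()
    else:
        # Every occurrence of a reference cell adds the same fixed amount
        w = np.full(idx.shape, fixed_weight, dtype=float)
    score = np.bincount(idx, weights=w, minlength=n_ref_cells)
    score = multiplier * score  # assumed: scale before the optional log transform
    if log_transform:
        score = np.log1p(score)
    return score

# Three target cells, each with two nearest reference cells (four reference cells in total)
nn_indices = np.array([[0, 1], [0, 2], [0, 1]])
nn_weights = np.array([[0.9, 0.5], [0.8, 0.4], [0.7, 0.6]])
raw = mapping_score(nn_indices, nn_weights, 4, weighted=False,
                    multiplier=1, log_transform=False)
print(raw)  # [0.3 0.2 0.1 0. ]
scores = mapping_score(nn_indices, nn_weights, 4)
print(scores)
```

Reference cell 0 is a neighbour of all three target cells and therefore receives the highest score, while reference cell 3 is never hit and scores zero.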

get_target_classes(*, target_name, from_assay=None, cell_key='I', reference_class_group=None, threshold_fraction=0.5, target_subset=None, na_val='NA')

Perform classification of target cells using a reference group

Parameters
  • target_name (str) – Name of target data. This value should be the same as that used for run_mapping earlier.

  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (str) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • reference_class_group (Optional[str]) – Group/cluster identity of the reference cells. These are the target labels for the classifier. The value here should be a column from cell metadata table. For example, to use default clustering identity one could use RNA_cluster

  • threshold_fraction (int) – This value (Default value: 0.5)

  • target_subset (Optional[List[int]]) – Choose only a subset of target cells to be classified. The value should be a list of indices of the target cells (Default: None)

  • na_val – Value to be used if a cell is not classified to any of the reference_class_group (Default value: ‘NA’)

Returns: A pandas Series containing predicted class for each cell in the projected sample (target_name).

Return type

Series

load_unified_graph(*, from_assay, cell_key, feat_key, target_names, use_k, target_weight)

This is similar to load_graph but includes projected cells and their edges.

Parameters
  • from_assay (str) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (str) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (str) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • target_names (List[str]) – Name of target datasets to be included in the unified graph

  • use_k (int) – Number of nearest neighbour edges of each projected cell to be included. If this value is larger than the save_k parameter that was used when running mapping for the target_name target, then use_k is reset to save_k

  • target_weight (float) – A constant uniform weight to be ascribed to each target-reference edge.

Returns:

Return type

Tuple[List[int], csr_matrix]
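
How use_k and target_weight shape the target-reference edges can be sketched as follows. The helper unified_edges, its edge-triple output, and the convention that target cells are numbered after the reference cells are assumptions for illustration only; the actual method returns a csr_matrix.

```python
def unified_edges(n_ref, target_neighbours, use_k, save_k, target_weight):
    """Hypothetical helper: return (row, col, weight) triples for
    target-to-reference edges.

    Assumption: target cells get node ids that continue after the
    reference cells (n_ref, n_ref + 1, ...).
    """
    # use_k cannot exceed the number of neighbours saved during mapping
    k = min(use_k, save_k)
    edges = []
    for t, nbrs in enumerate(target_neighbours):
        for ref_idx in nbrs[:k]:
            # Every target-reference edge receives the same constant weight
            edges.append((n_ref + t, ref_idx, target_weight))
    return edges
```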

plot_unified_layout(*, from_assay=None, layout_key=None, show_target_only=False, ref_name='reference', target_groups=None, width=6, height=6, cmap=None, color_key=None, mask_color='k', point_size=10, ax_label_size=12, frame_offset=0.05, spine_width=0.5, spine_color='k', displayed_sides=('bottom', 'left'), legend_ondata=False, legend_onside=True, legend_size=12, legends_per_col=20, marker_scale=70, lspacing=0.1, cspacing=1, savename=None, save_dpi=300, ax=None, fig=None, force_ints_as_cats=True, scatter_kwargs=None, shuffle_zorder=True)

This function plots the reference and target cells using coordinates that were obtained from either run_unified_tsne or run_unified_umap. Since the coordinates are not saved in the cell metadata but rather in the projections slot of the Zarr hierarchy, this function is needed to correctly fetch the values for reference and target cells. Additionally, this function provides a way to colour target cells by bringing in external annotations for those cells.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • layout_key (Optional[str]) – Should be same as the parameter value for label in run_unified_umap or run_unified_tsne (Default value: ‘UMAP’)

  • show_target_only (bool) – If True then the reference cells are not shown (Default value: False)

  • ref_name (str) – A label for reference cells to be used in the legend. (Default value: ‘reference’)

  • target_groups (Optional[list]) – Categorical values to be used to colourmap target cells. (Default value: None)

  • width (float) – Figure width (Default value: 6)

  • height (float) – Figure height (Default value: 6)

  • cmap – A matplotlib colourmap to be used to colour categorical or continuous values plotted on the cells. (Default value: tab20 for categorical variables and cmocean.deep for continuous variables)

  • color_key (Optional[dict]) – A custom colour map for cells. These can be used for categorical variables only. The keys in this dictionary should be the category label as present in the color_by column and values should be valid matplotlib colour names or hex codes of colours. (Default value: None)

  • mask_color (str) – Color to be used for masked values. This should be a valid matplotlib named colour or a hexcode of a colour. (Default value: ‘k’)

  • point_size (float) – Size of each scatter point. This is overridden if size_vals is provided. Has no effect if do_shading is True. (Default value: 10)

  • ax_label_size (float) – Font size for the x and y axis labels. (Default value: 12)

  • frame_offset (float) – Extend the x and y axis limits by this fraction (Default value: 0.05)

  • spine_width (float) – Line width of the displayed spines (Default value: 0.5)

  • spine_color (str) – Colour of the displayed spines. (Default value: ‘k’)

  • displayed_sides (tuple) – Determines which figure spines are chosen. The spines to be shown can be supplied as a tuple. The options are: top, bottom, left and right. (Default value: (‘bottom’, ‘left’))

  • legend_ondata (bool) – Whether to show category labels on the data (scatter points). The position of the label is the centroid of the corresponding values. (Default value: False)

  • legend_onside (bool) – Whether to draw a legend table on the side of the figure. (Default value: True)

  • legend_size (float) – Font size of the legend text. (Default value: 12)

  • legends_per_col (int) – Number of legends to be used on each legend column. This value determines how many legend columns will be drawn (Default value: 20)

  • marker_scale (float) – The relative size of legend markers compared with the originally drawn ones. (Default value: 70)

  • lspacing (float) – The vertical space between the legend entries. Measured in font-size units. (Default value: 0.1)

  • cspacing (float) – The spacing between columns. Measured in font-size units. (Default value: 1)

  • savename (Optional[str]) – Path where the rendered figure is to be saved. The format of the saved image depends on the extension present in the parameter value. (Default value: None)

  • save_dpi (int) – DPI when saving figure (Default value: 300)

  • ax – An instance of Matplotlib’s Axes object. This can be used to plot the figure into an already created axes. (Default value: None)

  • fig – An instance of Matplotlib Figure. This is required to draw colorbar for continuous values. (Default value: None)

  • force_ints_as_cats (bool) – Force integer labels in color_by as categories. If False, then integers will be treated as continuous variables, otherwise as categories. This affects how colourmaps are chosen and how legends are rendered. Set this to False if you have a large number of unique integer entries (Default: True)

  • scatter_kwargs (Optional[dict]) – Keyword argument to be passed to matplotlib’s scatter command

  • shuffle_zorder (bool) – Whether to shuffle the plot order of data points in the figure. (Default value: True)

Returns:

run_mapping(*, target_assay, target_name, target_feat_key, from_assay=None, cell_key='I', feat_key=None, save_k=3, batch_size=1000, ref_mu=True, ref_sigma=True, run_coral=False, exclude_missing=False, filter_null=False, feat_scaling=True)

Projects cells from external assays into the cell-neighbourhood graph using existing PCA loadings and ANN index. For each external cell (target), nearest neighbours are identified and saved within the Zarr hierarchy group projections.

Parameters
  • target_assay (Assay) – Assay object of the target dataset

  • target_name (str) – Name of target data. This is used to keep track of projections in the Zarr hierarchy

  • target_feat_key (str) – This will be used to name the location where the normalized target data will be saved in its own Zarr hierarchy.

  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (str) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • save_k (int) – Number of nearest neighbours to identify for each target cell (Default value: 3)

  • batch_size (int) – Number of cells that will be projected as a batch. This is used to decide the chunk size when normalized data for the target cells is saved to disk.

  • ref_mu (bool) – If True (default), then mean values of features as in the reference are used, otherwise mean is calculated using target cells. Turning this to False is not recommended.

  • ref_sigma (bool) – If True (default), then standard deviation values of features as present in the reference are used, otherwise std. dev. is calculated using target cells. Turning this to False is not recommended.

  • run_coral (bool) – If True then the CORAL feature rescaling algorithm is used to correct for domain shift in target cells. Read more about the CORAL algorithm in function coral. This algorithm creates an m by m matrix where m is the number of features being used for mapping, so it is not advised to use this in a case where a large number of features are being used (>10k for example). (Default value: False)

  • exclude_missing (bool) – If set to True then only those features that are present in both reference and target are used. If not all reference features from feat_key are present in target data then a new graph will be created for reference and mapping will be done onto that graph. (Default value: False)

  • filter_null (bool) – If True then those features that have a total sum of 0 in the target cells are removed. This has an effect only when exclude_missing is True. (Default value: False)

  • feat_scaling (bool) – If False then features from target cells are not scaled. This is automatically set to False if run_coral is True (Default value: True). Setting this to False is not recommended.

Returns:

Return type

None
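
The effect of ref_mu and ref_sigma can be shown with a small sketch of feature scaling. The helper scale_feature is hypothetical, and both the z-score form and the use of the population standard deviation are assumptions; the library's own scaling may differ in detail.

```python
from statistics import mean, pstdev


def scale_feature(target_vals, ref_vals, ref_mu=True, ref_sigma=True):
    """Hypothetical z-scaling of one feature of the target cells.

    With ref_mu/ref_sigma set to True the reference statistics are used;
    otherwise they are computed from the target cells themselves.
    """
    mu = mean(ref_vals) if ref_mu else mean(target_vals)
    # Assumption: population standard deviation
    sigma = pstdev(ref_vals) if ref_sigma else pstdev(target_vals)
    return [(v - mu) / sigma for v in target_vals]
```

Scaling with the reference statistics keeps a systematic shift of the target values visible, whereas scaling with the target's own statistics centres the target cells on zero regardless of where they sit relative to the reference.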

run_unified_tsne(*, target_names, from_assay=None, cell_key='I', feat_key=None, use_k=3, target_weight=0.5, lambda_scale=1.0, max_iter=500, early_iter=200, alpha=10, box_h=0.7, temp_file_loc='.', verbose=True, ini_embed_with='kmeans', label='unified_tSNE')

Calculates the tSNE embedding for the graph obtained using load_unified_graph. The loaded graph is processed in the same way as the graph in run_tsne.

Parameters
  • target_names (List[str]) – Names of target datasets to be included in the unified tSNE.

  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (str) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • use_k (int) – Number of nearest neighbour edges of each projected cell to be included. If this value is larger than the save_k parameter that was used when running mapping for the target_name target, then use_k is reset to save_k

  • target_weight (float) – A constant uniform weight to be ascribed to each target-reference edge.

  • lambda_scale (float) – λ rescaling parameter (Default value: 1.0)

  • max_iter (int) – Maximum number of iterations (Default value: 500)

  • early_iter (int) – Number of early exaggeration iterations (Default value: 200)

  • alpha (int) – Early exaggeration multiplier (Default value: 10)

  • box_h (float) – Grid side length (accuracy control). Lower values might drastically slow down the algorithm (Default value: 0.7)

  • temp_file_loc (str) – Location of temporary file. By default these files will be created in the current working directory. These files are deleted before the method returns.

  • verbose (bool) – If True (default) then the full log from SGtSNEpi algorithm is shown.

  • ini_embed_with (str) – Initial embedding coordinates for the cells in cell_key. Should have same number of columns as tsne_dims. If no value is provided then the initial embedding is obtained using get_ini_embed

  • label (str) – base label for tSNE dimensions in the cell metadata column (Default value: ‘unified_tSNE’)

Returns:

Return type

None

run_unified_umap(*, target_names, from_assay=None, cell_key='I', feat_key=None, use_k=3, target_weight=0.1, spread=2.0, min_dist=1, fit_n_epochs=200, tx_n_epochs=100, set_op_mix_ratio=1.0, repulsion_strength=1.0, initial_alpha=1.0, negative_sample_rate=5, random_seed=4444, ini_embed_with='kmeans', label='unified_UMAP')

Calculates the UMAP embedding for the graph obtained using load_unified_graph. The loaded graph is processed in the same way as the graph in run_umap.

Parameters
  • target_names (List[str]) – Names of target datasets to be included in the unified UMAP.

  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (str) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in the desired graph. By default the latest used feature for the given assay will be used.

  • use_k (int) – Number of nearest neighbour edges of each projected cell to be included. If this value is larger than the save_k parameter that was used when running mapping for the target_name target, then use_k is reset to save_k

  • target_weight (float) – A constant uniform weight to be ascribed to each target-reference edge.

  • spread (float) – Same as spread in UMAP package. The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

  • min_dist (float) – Same as min_dist in UMAP package. The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. (Default value: 1)

  • fit_n_epochs (int) – Same as n_epochs in UMAP package. The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. (Default value: 200)

  • tx_n_epochs (int) – Number of epochs during transform (Default value: 100)

  • set_op_mix_ratio (float) – Same as set_op_mix_ratio in UMAP package. Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

  • repulsion_strength (float) – Same as repulsion_strength in UMAP package. Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. (Default value: 1.0)

  • initial_alpha (float) – Same as learning_rate in UMAP package. The initial learning rate for the embedding optimization. (Default value: 1.0)

  • negative_sample_rate (float) – Same as negative_sample_rate in UMAP package. The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy. (Default value: 5)

  • random_seed (int) – (Default value: 4444)

  • ini_embed_with (str) – either ‘kmeans’ or a column from cell metadata to be used as initial embedding coordinates

  • label (str) – base label for UMAP dimensions in the cell metadata column (Default value: ‘unified_UMAP’)

Returns:

Return type

None

DataStore

This class extends MappingDatastore and consequently inherits methods of all the above DataStore classes. This class is the main user facing class as it provides most of the plotting functions. It also contains methods for cell filtering, feature selection, marker features identification, subsetting and aggregating cells. This class also contains methods that perform in-memory data exports.

class scarf.DataStore(zarr_loc, assay_types=None, default_assay=None, min_features_per_cell=10, min_cells_per_feature=20, mito_pattern=None, ribo_pattern=None, nthreads=2, zarr_mode='r+', synchronizer=None)

This class extends MappingDatastore and consequently inherits methods of all the *DataStore classes.

This class is the main user facing class as it provides most of the plotting functions. It also contains methods for cell filtering, feature selection, marker features identification, subsetting and aggregating cells. This class also contains methods that perform in-memory data exports. In other words, DataStore objects provide the primary interface to interact with the data.

cells

list of cell barcodes

assayNames

list of assay names in Zarr file, e.g. ‘RNA’ or ‘ATAC’

nthreads

number of threads to use for this datastore instance

z

the Zarr file (directory) used for this datastore instance

auto_filter_cells()
filter_cells()
get_markers()
make_bulk()
mark_hvgs()
mark_prevalent_peaks()
plot_cells_dists()
plot_cluster_tree()
plot_layout()
plot_marker_heatmap()
run_cell_cycle_scoring()
show_zarr_tree()

prints the Zarr hierarchy of the DataStore

to_anndata()

writes an assay of the Zarr hierarchy to AnnData file format

auto_filter_cells(*, attrs=None, min_p=0.01, max_p=0.99, show_qc_plots=True)

Filter cells based on columns of the cell metadata table. This is a wrapper function for filter_cells and determines the threshold values to be used for each column. For each cell metadata column, the function models a normal distribution using the median value and std. dev. of the column and then determines the point estimates of values at min_p and max_p fraction of densities.

Parameters
  • attrs (Optional[Iterable[str]]) – column names to be used for filtering

  • min_p (float) – fractional density point to be used for calculating lower bounds of threshold

  • max_p (float) – fractional density point to be used for calculating upper bounds of threshold

  • show_qc_plots (bool) – If True then violin plots with per cell distribution of features will be shown. This does not have an effect if auto_filter is False

Returns:

Return type

None
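
The threshold estimation described above can be sketched with the standard library. The helper auto_thresholds is hypothetical, and the choice of the sample standard deviation is an assumption; the sketch only illustrates how min_p and max_p translate into lower and upper bounds.

```python
from statistics import NormalDist, median, stdev


def auto_thresholds(values, min_p=0.01, max_p=0.99):
    """Hypothetical sketch: model a normal distribution centred on the
    median of a metadata column and read off the values at min_p and max_p.
    """
    # Assumption: sample standard deviation of the column
    dist = NormalDist(mu=median(values), sigma=stdev(values))
    # inv_cdf gives the value below which the given fraction of density lies
    return dist.inv_cdf(min_p), dist.inv_cdf(max_p)
```

The two returned values would then play the role of the lows and highs entries passed on to filter_cells for that column.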

export_markers_to_csv(*, from_assay=None, cell_key=None, group_key=None, csv_filename=None)

Export markers of each cluster/group to a CSV file where each column contains the marker names sorted by score (descending order, highest first). This function does not export the scores of markers as they can be obtained using get_markers function

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – To run the test on a specific subset of cells, provide the name of a boolean column in the cell metadata table.

  • group_key (Optional[str]) – Required parameter. This has to be a column name from cell metadata table. Usually this would be a column denoting cell clusters. Please use the same value as was used when running run_marker_search

  • csv_filename (Optional[str]) – Required parameter. Name, with path, of CSV file where the marker table is to be saved.

Returns:

Return type

None

filter_cells(*, attrs, lows, highs, reset_previous=False)

Filter cells based on the cell metadata column values. Filtering triggers update method on ‘I’ column of cell metadata which uses ‘and’ operation. This means that cells that are not within the filtering thresholds will have value set as False in ‘I’ column of cell metadata table. When performing filtering repeatedly, the cells that were previously filtered out remain filtered out and ‘I’ column is updated only for those cells that are filtered out due to the latest filtering attempt.

Parameters
  • attrs (Iterable[str]) – Names of columns to be used for filtering

  • lows (Iterable[int]) – Lower bounds of thresholds for filtering. Should be in same order as the names in attrs parameter

  • highs (Iterable[int]) – Upper bounds of thresholds for filtering. Should be in same order as the names in attrs parameter

  • reset_previous (bool) – If True, then results of previous filtering will be undone completely. (Default value: False)

Returns:

Return type

None
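
The ‘and’ update of the ‘I’ column can be illustrated with plain Python lists. The helper filter_column is hypothetical, and treating the bounds as inclusive is an assumption made for this sketch.

```python
def filter_column(keep, values, low, high, reset_previous=False):
    """Hypothetical sketch of how repeated filtering updates the 'I' column.

    keep holds the current boolean 'I' values, values holds one metadata
    column, and low/high are the thresholds for that column.
    """
    if reset_previous:
        # Undo all earlier filtering before applying the new thresholds
        keep = [True] * len(values)
    # 'and' operation: a cell stays True only if it was True before
    # and lies within the new thresholds (assumed inclusive here)
    return [k and (low <= v <= high) for k, v in zip(keep, values)]
```

A cell that was filtered out earlier stays False even if it passes the new thresholds, unless reset_previous is True.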

get_markers(*, from_assay=None, cell_key=None, group_key=None, group_id=None)

Returns a table of marker features obtained through run_marker_search for a given group. The table contains names of marker features and feature ids are used as table index.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – To run the test on a specific subset of cells, provide the name of a boolean column in the cell metadata table.

  • group_key (Optional[str]) – Required parameter. This has to be a column name from cell metadata table. Usually this would be a column denoting cell clusters. Please use the same value as was used when running run_marker_search

  • group_id (Union[str, int, None]) – This is one of the values in the group_key column of cell metadata. Results are returned for this group

Return type

DataFrame

Returns

Pandas dataframe with marker feature names and scores

make_bulk(from_assay=None, group_key=None, pseudo_reps=3, null_vals=None, random_seed=4466)

Merge data from cells to create a bulk profile.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • group_key (Optional[str]) – Name of the column in cell metadata table to be used for grouping cells.

  • pseudo_reps (int) – Within each group, cells will randomly be split into pseudo_reps partitions. Each partition is considered a pseudo-replicate. (Default value: 3)

  • null_vals (Optional[list]) – Values to be considered as missing values in the group_key column. These values will be

  • random_seed (int) – A value used to seed the random number generator when randomly partitioning cells into pseudo_reps partitions.

Returns:

Return type

DataFrame
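
The random partitioning into pseudo-replicates can be sketched as below. The helper pseudo_replicates and its shuffle-then-stride strategy are assumptions for illustration; the library may partition cells differently.

```python
import random


def pseudo_replicates(cell_ids, pseudo_reps=3, random_seed=4466):
    """Hypothetical sketch: randomly split the cells of one group into
    pseudo_reps partitions of near-equal size.
    """
    rng = random.Random(random_seed)  # fixed seed gives reproducible partitions
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    # Take every pseudo_reps-th cell, starting at offsets 0..pseudo_reps-1
    return [shuffled[i::pseudo_reps] for i in range(pseudo_reps)]
```

Summing the values of the cells in each partition would then give one bulk profile per pseudo-replicate of the group.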

mark_hvgs(*, from_assay=None, cell_key=None, min_cells=None, top_n=500, min_var=- inf, max_var=inf, min_mean=- inf, max_mean=inf, n_bins=200, lowess_frac=0.1, blacklist='^MT-|^RPS|^RPL|^MRPS|^MRPL|^CCN|^HLA-|^H2-|^HIST', show_plot=True, hvg_key_name='hvgs', **plot_kwargs)

Identify and mark genes as highly variable genes (HVGs). This is a critical and required feature selection step and is only applicable to RNAassay type of assays.

Parameters
  • from_assay (Optional[str]) – Assay to use for graph creation. If no value is provided then defaultAssay will be used

  • cell_key (Optional[str]) – Cells to use for HVG selection. By default all cells with True value in ‘I’ will be used. The provided value for cell_key should be a column in cell metadata table with boolean values.

  • min_cells (Optional[int]) – Minimum number of cells where a gene should have non-zero expression values for it to be considered a candidate for HVG selection. Large values for this parameter might make it difficult to identify rare populations of cells. Very small values might lead to higher signal to noise ratio in the selected features. By default, a value is set assuming the smallest population has no less than 1% of all cells. So for example, if you have 1000 cells (as per cell_key parameter) then min_cells will be set to 10.

  • top_n (int) – Number of top most variable genes to be set as HVGs. This value is ignored if a value is provided for min_var parameter. (Default: 500)

  • min_var (float) – Minimum variance threshold for HVG selection. (Default: -Infinity)

  • max_var (float) – Maximum variance threshold for HVG selection. (Default: Infinity)

  • min_mean (float) – Minimum mean value of expression threshold for HVG selection. (Default: -Infinity)

  • max_mean (float) – Maximum mean value of expression threshold for HVG selection. (Default: Infinity)

  • n_bins (int) – Number of bins into which the mean expression is binned. (Default: 200)

  • lowess_frac (float) – Between 0 and 1. The fraction of the data used when estimating the fit between mean and variance. This is same as frac in statsmodels.nonparametric.smoothers_lowess.lowess (Default: 0.1)

  • blacklist (str) – This is a regular expression (regex) string that can be used to exclude genes from being marked as HVGs. By default we exclude mitochondrial, ribosomal, some cell-cycle related, histone and HLA genes. (Default: “^MT-|^RPS|^RPL|^MRPS|^MRPL|^CCN|^HLA-|^H2-|^HIST” )

  • show_plot (bool) – If True then a diagnostic scatter plot is shown with HVGs highlighted. (Default: True)

  • hvg_key_name (str) – Base label for HVGs in the features metadata column. The value for ‘cell_key’ parameter is prepended to this value. (Default value: ‘hvgs’)

  • plot_kwargs – These named parameters are passed to plotting.plot_mean_var

Returns:

Return type

None
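
How the blacklist pattern and the default min_cells behave can be checked with a short sketch. The helpers candidate_genes and default_min_cells are hypothetical; case-sensitive matching and integer rounding of the 1% rule are assumptions.

```python
import re

# Default blacklist pattern as given in the signature above
BLACKLIST = "^MT-|^RPS|^RPL|^MRPS|^MRPL|^CCN|^HLA-|^H2-|^HIST"


def candidate_genes(gene_names, blacklist=BLACKLIST):
    """Hypothetical helper: drop genes whose names match the blacklist regex."""
    pattern = re.compile(blacklist)  # assumption: case-sensitive matching
    return [g for g in gene_names if pattern.search(g) is None]


def default_min_cells(n_cells):
    """Hypothetical helper: 1% of the cells selected by cell_key."""
    return n_cells // 100
```

For example, from the gene names MT-CO1, RPS6, GAPDH and HLA-A only GAPDH survives the default pattern, and 1000 cells give a min_cells value of 10.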

mark_prevalent_peaks(*, from_assay=None, cell_key=None, top_n=10000, prevalence_key_name='prevalent_peaks')

Feature selection method for ATACassay type assays. This method first calculates prevalence of each peak by computing sum of TF-IDF normalized values for each peak and then marks top_n peaks with highest prevalence as prevalent peaks.

Parameters
  • from_assay (Optional[str]) – Assay to use for graph creation. If no value is provided then defaultAssay will be used

  • cell_key (Optional[str]) – Cells to use for selection of most prevalent peaks. By default all cells with True value in ‘I’ will be used. The provided value for cell_key should be a column in cell metadata table with boolean values.

  • top_n (int) – Number of top prevalent peaks to be selected. (Default: 10000)

  • prevalence_key_name (str) – Base label for marking prevalent peaks in the features metadata column. The value for ‘cell_key’ parameter is prepended to this value. (Default value: ‘prevalent_peaks’)

Returns:

Return type

None
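
The selection step can be sketched as follows, assuming the TF-IDF normalized values are already available as a small cells-by-peaks table. The helper prevalent_peaks and its list-of-rows input are hypothetical and only illustrate the sum-then-rank logic.

```python
def prevalent_peaks(tfidf_rows, top_n):
    """Hypothetical sketch: mark the top_n peaks with the highest summed
    TF-IDF values.

    tfidf_rows is a list of rows (one per cell) with one value per peak.
    Returns one boolean per peak.
    """
    n_peaks = len(tfidf_rows[0])
    # Prevalence of a peak is the sum of its normalized values over all cells
    prevalence = [sum(row[j] for row in tfidf_rows) for j in range(n_peaks)]
    ranked = sorted(range(n_peaks), key=lambda j: prevalence[j], reverse=True)
    selected = set(ranked[:top_n])
    return [j in selected for j in range(n_peaks)]
```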

plot_cells_dists(from_assay=None, cols=None, cell_key=None, group_key=None, color='steelblue', cmap='tab20', fig_size=None, label_size=10.0, title_size=10.0, sup_title=None, sup_title_size=12.0, scatter_size=1.0, max_points=10000, show_on_single_row=True)

Makes violin plots of the distribution of values present in cell metadata. This method is designed to show the distribution of nCounts, nFeatures, percentMito and percentRibo cell attributes.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cols (Optional[List[str]]) – Column names from cell metadata table to be used for plotting. By default, nCounts, nFeatures, percentMito and percentRibo columns are chosen.

  • cell_key (Optional[str]) – One of the columns from cell metadata table that indicates the cells to be used for plotting. The values in the chosen column should be boolean (Default value: ‘I’)

  • group_key (Optional[str]) – A column name from cell metadata table that indicates how cells should be grouped. This can be any column that has either boolean or categorical values. By default, no grouping will be performed (Default value: None)

  • color (str) – Face color of the violin plots. The value can be valid matplotlib named colour. This is used only when there is a single group. (Default value: ‘steelblue’)

  • cmap (str) – A matplotlib colormap to be used to color different groups. (Default value: ‘tab20’)

  • fig_size (Optional[tuple]) – A tuple of figure width and figure height (Default value: Automatically determined by plot_qc)

  • label_size (float) – The font size of y-axis labels (Default value: 10.0)

  • title_size (float) – The font size of title. Median value is printed as title of each violin plot (Default value: 10.0)

  • sup_title (Optional[str]) – The title for complete figure panel (Default value: None)

  • sup_title_size (float) – The font size of title for complete figure panel (Default value: 12.0 )

  • scatter_size (float) – Size of each point in the violin plot (Default value: 1.0)

  • max_points (int) – Maximum number of points to display over violin plot. Random uniform sampling will be performed to bring down the number of datapoints to this value. This does not affect the violin plot. (Default value: 10000)

  • show_on_single_row (bool) – Show all subplots in a single row. It might be useful to set this to False if you have too many groups within each subplot (Default value: True)

Return type

None

Returns

None

plot_cluster_tree(*, from_assay=None, cell_key=None, feat_key=None, cluster_key=None, fill_by_value=None, force_ints_as_cats=True, width=1, lvr_factor=0.5, vert_gap=0.2, min_node_size=10, node_size_multiplier=10000.0, node_power=1.2, root_size=100, non_leaf_size=10, show_labels=True, fontsize=10, root_color='#C0C0C0', non_leaf_color='k', cmap='tab20', color_key=None, edgecolors='k', edgewidth=1, alpha=0.7, figsize=(5, 5), ax=None, show_fig=True, savename=None, save_dpi=300)

Plots a hierarchical layout of the clusters detected using run_clustering in a binary tree form. This helps evaluate the relationships between the clusters. This figure can complement embeddings like tSNE where global distances are not preserved. The plot shows clusters as coloured nodes and the nodes are sized proportionally to the number of cells within the clusters. Root and branching nodes are shown to visually track the branching pattern of the tree. This figure is not scaled, i.e. the distances between the nodes are meaningless and only the branching pattern of the nodes should be evaluated.

https://epidemicsonnetworks.readthedocs.io/en/latest/functions/EoN.hierarchy_pos.html

Parameters
  • color_key (Optional[dict]) –

  • force_ints_as_cats (bool) –

  • fill_by_value (Optional[str]) –

  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – One of the columns from cell metadata table that indicates the cells to be used. Should be same as the one that was used in one of the run_clustering calls for the given assay. The values in the chosen column should be boolean (Default value: ‘I’)

  • feat_key (Optional[str]) – Feature key. Should be same as the one that was used in run_clustering calls for the given assay. By default the latest used feature for the given assay will be used.

  • cluster_key (Optional[str]) – Should be one of the columns from cell metadata table that contains the output of run_clustering method. For example if chosen assay is RNA and default value for label parameter was used in run_clustering then cluster_key can be ‘RNA_cluster’

  • width (float) – Horizontal space allocated for the branches. Larger values may disrupt the hierarchical layout of the cells (Default value: 1)

  • lvr_factor (float) – Leaf vs root factor. Controls the relative horizontal spacing between nodes as one moves up or down the tree. Higher values will cause terminal nodes to be more spread out at the cost of nodes closer to the root and vice versa. (Default value: 0.5)

  • vert_gap (float) – Gap between levels of hierarchy (Default value: 0.2)

  • min_node_size (float) – Minimum size of a node (Default value: 10 )

  • node_size_multiplier (float) – Size of each leaf node is increased by this factor (Default value: 1e4)

  • node_power (float) – The number of cells within each cluster is raised to this value to scale up the node size. (Default value: 1.2)

  • root_size (float) – Size of the root node (Default value: 100)

  • non_leaf_size (float) – Size of the nodes that represent branch points in the tree (Default value: 10)

  • show_labels (bool) – Whether to show the cluster labels on the cluster nodes (Default value: True)

  • fontsize (float) – Font size of cluster labels. Only used when show_labels is True (Default value: 10)

  • root_color (str) – Colour for root node. Acceptable values are Matplotlib named colours or hexcodes for colours. (Default value: ‘#C0C0C0’)

  • non_leaf_color (str) – Colour for branchpoint nodes. Acceptable values are Matplotlib named colours or hexcodes for colours. (Default value: ‘k’)

  • cmap – A colormap to be used to colour cluster nodes. Should be one of Matplotlib colourmaps. (Default value: ‘tab20’)

  • edgecolors (str) – Edge colour of circles representing nodes in the hierarchical tree (Default value: ‘k’)

  • edgewidth (float) – Line width of the edges of circles representing nodes in the hierarchical tree (Default value: 1)

  • alpha (float) – Alpha level (Opacity) of the displayed nodes in the figure. (Default value: 0.7)

  • figsize – A tuple describing figure width and height (Default value: (5, 5))

  • ax – An instance of Matplotlib’s Axes object. This can be used to plot the figure into an already created axes. (Default value: None)

  • show_fig (bool) – If False, then the axes object is returned rather than rendering the plot (Default value: True)

  • savename (Optional[str]) – Path where the rendered figure is to be saved. The format of the saved image depends on the extension present in the parameter value. (Default value: None)

  • save_dpi (int) – DPI when saving figure (Default value: 300)

Returns

None

plot_layout(*, from_assay=None, cell_key=None, layout_key=None, color_by=None, subselection_key=None, size_vals=None, clip_fraction=0.01, width=6, height=6, default_color='steelblue', cmap=None, color_key=None, mask_values=None, mask_name='NA', mask_color='k', point_size=10, do_shading=False, shade_npixels=1000, shade_sampling=0.1, shade_min_alpha=10, spread_pixels=1, spread_threshold=0.2, ax_label_size=12, frame_offset=0.05, spine_width=0.5, spine_color='k', displayed_sides=('bottom', 'left'), legend_ondata=True, legend_onside=True, legend_size=12, legends_per_col=20, marker_scale=70, lspacing=0.1, cspacing=1, shuffle_df=False, sort_values=False, savename=None, save_dpi=300, ax=None, fig=None, force_ints_as_cats=True, scatter_kwargs=None)

Create a scatter plot with a chosen layout. The method fetches the coordinates from the cell metadata columns with the layout_key prefix. The Datashader library is used to draw a fast rasterized image if do_shading is True. This can be useful when a large number of cells are present, to quickly render the plot and avoid over-plotting. The description of shading parameters has mostly been copied from the Datashader API that can be found here: https://holoviews.org/_modules/holoviews/operation/datashader.html

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – One of the columns from cell metadata table that indicates the cells to be used. The values in the chosen column should be boolean (Default value: ‘I’)

  • layout_key (Optional[str]) – A prefix to cell metadata columns that contains the coordinates for the 2D layout of the cells. For example, ‘RNA_UMAP’ or ‘RNA_tSNE’

  • color_by (Optional[str]) – One of the columns of the metadata table or a feature name (for example, the gene GATA2). (Default: None)

  • subselection_key (Optional[str]) – A column from cell metadata table to be used to show only a subselection of cells. This key can be used to hide certain cells from a 2D layout. (Default value: None)

  • size_vals – An array of values to be used to set sizes of each cell’s datapoint in the layout. By default all cells are of same size determined by point_size parameter. Has no effect if do_shading is True (Default value: None)

  • clip_fraction (float) – Same as clip_fraction parameter of ‘get_cell_vals’ method. This value is multiplied by 100 and the percentiles are soft-clipped from either end. (Default value: 0.01)

  • width (float) – Figure width (Default value: 6)

  • height (float) – Figure height (Default value: 6)

  • default_color (str) – A default color for the cells. (Default value: steelblue)

  • cmap – A matplotlib colourmap to be used to colour categorical or continuous values plotted on the cells. (Default value: tab20 for categorical variables and cmocean.deep for continuous variables)

  • color_key (Optional[dict]) – A custom colour map for cells. These can be used for categorical variables only. The keys in this dictionary should be the category labels as present in the color_by column and the values should be valid matplotlib colour names or hex codes of colours. (Default value: None)

  • mask_values (Optional[list]) – This can be a subset of the categorical values present in color_by which you would like to mask away. These values are combined under the same label (mask_name) and are given the same colour (mask_color)

  • mask_name (str) – Label to replace the masked value labels. (Default value: ‘NA’)

  • mask_color (str) – Color to be used for masked values. This should be a valid matplotlib named colour or a hexcode of a colour. (Default value: ‘k’)

  • point_size (float) – Size of each scatter point. This is overridden if size_vals is provided. Has no effect if do_shading is True. (Default value: 10)

  • do_shading (bool) – Sets shading mode on/off. If shading mode is off (default) then matplotlib’s scatter function is used, otherwise a rasterized image is generated using the datashader library. Turn this on if you have more than 100K cells to improve render time and also to avoid issues with overplotting. (Default value: False)

  • shade_npixels (int) – Number of pixels to rasterize (for both height and width). This controls the resolution of the figure. Adjust this according to the size of the image you want to generate. (Default value: 1000)

  • shade_sampling (float) – Specifies the smallest allowed sampling interval along the x and y axis. Larger values will lead to loss of resolution (Default value: 0.1)

  • shade_min_alpha (int) – The minimum alpha value to use for non-empty pixels when doing colormapping, in [0, 255]. Use a higher value to avoid undersaturation, i.e. poorly visible low-value datapoints, at the expense of the overall dynamic range. (Default value: 10)

  • spread_pixels (int) – Maximum number of pixels to spread on all sides (Default value: 1)

  • spread_threshold (float) – When spreading, determines how far to spread. Spreading starts at 1 pixel, and stops when the fraction of adjacent non-empty pixels reaches this threshold. Higher values give more spreading, up to the spread_pixels allowed. (Default value: 0.2)

  • ax_label_size (float) – Font size for the x and y axis labels. (Default value: 12)

  • frame_offset (float) – Extend the x and y axis limits by this fraction (Default value: 0.05)

  • spine_width (float) – Line width of the displayed spines (Default value: 0.5)

  • spine_color (str) – Colour of the displayed spines. (Default value: ‘k’)

  • displayed_sides (tuple) – Determines which figure spines are chosen. The spines to be shown can be supplied as a tuple. The options are: top, bottom, left and right. (Default value: (‘bottom’, ‘left’))

  • legend_ondata (bool) – Whether to show category labels on the data (scatter points). The position of the label is the centroid of the corresponding values. Has no effect if color_by has continuous values. (Default value: True)

  • legend_onside (bool) – Whether to draw a legend table on the side of the figure. (Default value: True)

  • legend_size (float) – Font size of the legend text. (Default value: 12)

  • legends_per_col (int) – Number of legends to be used on each legend column. This value determines how many legend columns will be drawn (Default value: 20)

  • marker_scale (float) – The relative size of legend markers compared with the originally drawn ones. (Default value: 70)

  • lspacing (float) – The vertical space between the legend entries. Measured in font-size units. (Default value: 0.1)

  • cspacing (float) – The spacing between columns. Measured in font-size units. (Default value: 1)

  • savename (Optional[str]) – Path where the rendered figure is to be saved. The format of the saved image depends on the extension present in the parameter value. (Default value: None)

  • save_dpi (int) – DPI when saving figure (Default value: 300)

  • shuffle_df (bool) – Shuffle the order of cells in the plot (Default value: False)

  • sort_values (bool) – Sort the values before plotting. Setting True will cause the datapoints (cells) with larger values to be plotted over the ones with lower values. (Default value: False)

  • ax – An instance of Matplotlib’s Axes object. This can be used to plot the figure into an already created axes. It is ignored if do_shading is set to True. (Default value: None)

  • fig – An instance of Matplotlib Figure. This is required to draw colorbar for continuous values. It is ignored if do_shading is set to True. (Default value: None)

  • force_ints_as_cats (bool) – Force integer labels in color_by as categories. If False, then integers will be treated as continuous variables, otherwise as categories. This affects how colourmaps are chosen and how legends are rendered. Set this to False if you have a large number of unique integer entries (Default: True)

  • scatter_kwargs (Optional[dict]) – Keyword argument to be passed to matplotlib’s scatter command
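
As a rough illustration of what clip_fraction does to the values being plotted, the sketch below clips a list of values at the lower and upper percentiles implied by the fraction. The helper name clip_by_fraction and the nearest-rank percentile rule are assumptions made for this example; Scarf's own implementation may compute the "soft" clipping differently.

```python
def clip_by_fraction(values, clip_fraction=0.01):
    """Clip values to the [p, 100 - p] percentile range, with p = 100 * clip_fraction.

    Hypothetical helper for illustration only; not part of Scarf.
    """
    if not 0 <= clip_fraction < 0.5:
        raise ValueError("clip_fraction must be in [0, 0.5)")
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank positions of the lower and upper percentiles (assumed rule).
    lo_idx = round(clip_fraction * (n - 1))
    hi_idx = round((1 - clip_fraction) * (n - 1))
    lo, hi = ordered[lo_idx], ordered[hi_idx]
    # Values outside [lo, hi] are pulled to the nearest bound.
    return [min(max(v, lo), hi) for v in values]


# 101 evenly spaced values, clipping 5% from either end.
clipped = clip_by_fraction(list(range(101)), clip_fraction=0.05)
```

With clip_fraction=0 the values are returned unchanged, which matches the idea that no percentile is trimmed.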

Returns

None
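
The interplay of color_key, mask_values, mask_name and mask_color can be sketched with plain Python containers. This is only an illustration of the described behaviour (masked categories collapse into one label with one colour); the function name apply_mask and the example labels are hypothetical and are not taken from Scarf.

```python
def apply_mask(labels, color_key, mask_values, mask_name="NA", mask_color="k"):
    """Collapse the categories in mask_values into a single masked category.

    Hypothetical helper for illustration only; not part of Scarf.
    """
    masked = set(mask_values)
    # Every masked label is replaced by the shared mask_name.
    new_labels = [mask_name if lab in masked else lab for lab in labels]
    # Masked categories lose their own colour and share mask_color instead.
    new_colors = {k: v for k, v in color_key.items() if k not in masked}
    new_colors[mask_name] = mask_color
    return new_labels, new_colors


labels = ["T cell", "B cell", "doublet", "debris", "T cell"]
color_key = {
    "T cell": "#1f77b4",
    "B cell": "#ff7f0e",
    "doublet": "red",
    "debris": "gray",
}
new_labels, new_colors = apply_mask(labels, color_key, ["doublet", "debris"])
```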

plot_marker_heatmap(*, from_assay=None, group_key=None, cell_key=None, topn=5, log_transform=True, vmin=-1, vmax=2, savename=None, save_dpi=300, **heatmap_kwargs)

Displays a heatmap of top marker gene expression for the chosen groups (usually cell clusters). Z-scores are calculated for each marker gene before plotting them. The groups are subjected to hierarchical clustering to bring groups with similar expression pattern in proximity.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • group_key (Optional[str]) – Required parameter. This has to be a column name from cell metadata table. This column dictates how the cells will be grouped. This value should be same as used for run_marker_search

  • cell_key (Optional[str]) – One of the columns from cell metadata table that indicates the cells to be used. Should be same as the one that was used in one of the run_marker_search calls for the given assay. The values in the chosen column should be boolean (Default value: ‘I’)

  • topn (int) – Number of markers to be displayed for each group in group_key column. The markers are sorted based on obtained scores by run_marker_search. (Default value: 5)

  • log_transform (bool) – Whether to log-transform the values before displaying them in the heatmap. (Default value: True)

  • vmin (float) – z-scores lower than this value are raised to this value. (Default value: -1)

  • vmax (float) – z-scores higher than this value are lowered to this value. (Default value: 2)

  • savename (Optional[str]) – Path where the rendered figure is to be saved. The format of the saved image depends on the extension present in the parameter value. (Default value: None)

  • save_dpi (int) – DPI when saving figure (Default value: 300)

  • **heatmap_kwargs – Keyword arguments to be forwarded to seaborn.clustermap

Returns

None
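
To make the z-scoring and the vmin/vmax limits concrete, here is a small sketch for a single marker gene across groups. It assumes a population standard deviation and a hypothetical helper name (zscore_and_clip); the source does not state which estimator Scarf or seaborn.clustermap uses, so treat the numbers as illustrative only.

```python
from statistics import mean, pstdev


def zscore_and_clip(values, vmin=-1.0, vmax=2.0):
    """Z-score one gene's per-group values, then limit them to [vmin, vmax].

    Hypothetical helper for illustration only; not part of Scarf.
    """
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (assumed choice)
    if sd == 0:
        # A gene with identical values everywhere carries no contrast.
        return [0.0 for _ in values]
    return [min(max((v - mu) / sd, vmin), vmax) for v in values]


# One gene that is expressed in only the last of ten groups.
group_means = [0.0] * 9 + [10.0]
z = zscore_and_clip(group_means)
```

Here the raw z-score of the last group is 3.0, which exceeds vmax and is therefore shown as 2.0, while the other groups stay at about -0.33.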

run_cell_cycle_scoring(*, from_assay=None, cell_key=None, s_genes=None, g2m_genes=None, n_bins=50, rand_seed=4466, s_score_label='S_score', g2m_score_label='G2M_score', phase_label='cell_cycle_phase')

Computes S and G2M phase scores by taking into account the average expression of S and G2M phase genes respectively. The following steps are taken for each phase:

  • Average expression of all the genes across cell_key cells is calculated

  • The log average expression is divided into n_bins bins

  • A control set of genes is identified by sampling genes from the same expression bins where the phase’s genes are present

  • The average expression of phase genes (Ep) and control genes (Ec) is calculated per cell

  • A phase score is calculated as: Ep - Ec

Cell cycle phase is assigned to each cell based on the following rule set:

  • G1 phase: S score < -1 > G2M score

  • S phase: S score > G2M score

  • G2M phase: G2M score > S score
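
The phase-assignment rule set can be sketched as a small function. This reads the G1 rule as "both the S score and the G2M score are below -1", and it sends ties between the two scores to G2M; both points are assumptions for this example, and the function name assign_phase is hypothetical.

```python
def assign_phase(s_score, g2m_score):
    """Assign a cell cycle phase from the two phase scores.

    Hypothetical helper for illustration only; not part of Scarf.
    """
    # G1: neither phase signature is active (both scores below -1).
    if s_score < -1 and g2m_score < -1:
        return "G1"
    # Otherwise the larger of the two scores decides the phase.
    if s_score > g2m_score:
        return "S"
    return "G2M"  # also covers exact ties (assumed behaviour)


phases = [assign_phase(s, g) for s, g in [(-2.0, -1.5), (0.5, 0.1), (0.1, 0.5)]]
```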

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Cell key. Should be same as the one that was used in the desired graph. (Default value: ‘I’)

  • s_genes (Optional[List[str]]) – A list of S phase genes. If not provided then Scarf loads pre-saved genes accessible at scarf.bio_data.s_phase_genes

  • g2m_genes (Optional[List[str]]) – A list of G2M phase genes. If not provided then Scarf loads pre-saved genes accessible at scarf.bio_data.g2m_phase_genes

  • n_bins (int) – Number of bins into which average expression of genes is divided.

  • rand_seed (int) – A random value to set the seed while sampling cells from a cluster randomly. (Default value: 4466)

  • s_score_label (str) – A base label for saving the S phase scores into a cell metadata column (Default value: ‘S_score’)

  • g2m_score_label (str) – A base label for saving the G2M phase scores into a cell metadata column (Default value: ‘G2M_score’)

  • phase_label (str) – A base label for saving the inferred cell cycle phase into a cell metadata column (Default value: ‘cell_cycle_phase’)
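
The scoring steps described for this method can be sketched in plain Python as follows. This is a toy version under several assumptions that the source does not confirm: expression is held in a dict of lists, the log is taken as log1p, bins are equal-width, and exactly one control gene is sampled per phase gene. The function name phase_scores is hypothetical.

```python
import math
import random
from statistics import mean


def phase_scores(expr, phase_genes, n_bins=50, rand_seed=4466):
    """Per-cell score Ep - Ec; expr maps gene name -> per-cell values.

    Hypothetical helper for illustration only; not part of Scarf.
    """
    rng = random.Random(rand_seed)
    genes = sorted(expr)
    n_cells = len(expr[genes[0]])
    # Step 1: log of the average expression of every gene across cells.
    log_avg = {g: math.log1p(mean(expr[g])) for g in genes}
    # Step 2: place each gene into one of n_bins equal-width bins.
    lo, hi = min(log_avg.values()), max(log_avg.values())
    width = ((hi - lo) / n_bins) or 1.0
    bin_of = {g: min(int((log_avg[g] - lo) / width), n_bins - 1) for g in genes}
    # Step 3: sample one control gene from the bin of each phase gene.
    phase_set = set(phase_genes)
    controls = []
    for g in phase_genes:
        pool = [c for c in genes if bin_of[c] == bin_of[g] and c not in phase_set]
        if pool:
            controls.append(rng.choice(pool))
    # Steps 4 and 5: average phase and control expression per cell, then Ep - Ec.
    scores = []
    for i in range(n_cells):
        ep = mean(expr[g][i] for g in phase_genes)
        ec = mean(expr[c][i] for c in controls) if controls else 0.0
        scores.append(ep - ec)
    return scores


# Two cells, two phase genes (S1, S2) and two expression-matched controls.
toy_expr = {
    "S1": [5.0, 1.0],
    "C1": [1.0, 5.0],
    "S2": [10.0, 2.0],
    "C2": [2.0, 10.0],
}
toy_scores = phase_scores(toy_expr, ["S1", "S2"], n_bins=2)
```

In this toy input each phase gene has exactly one candidate control in its bin, so the result does not depend on rand_seed: the first cell scores high and the second scores low.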

Returns:

run_marker_search(*, from_assay=None, group_key=None, cell_key=None, threshold=0.25, gene_batch_size=50)

Identifies group specific features for a given assay. Please check out the find_markers_by_rank function for further details of how marker features for groups are identified. The results are saved into the Zarr hierarchy under markers group.

Parameters
  • from_assay (Optional[str]) – Name of the assay to be used. If no value is provided then the default assay will be used.

  • group_key (Optional[str]) – Required parameter. This has to be a column name from cell metadata table. This column dictates how the cells will be grouped. Usually this would be a column denoting cell clusters.

  • cell_key (Optional[str]) – To run the test on specific subset of cells, provide the name of a boolean column in the cell metadata table. (Default value: ‘I’)

  • threshold (float) – This value dictates how specific the feature value has to be in a group before it is considered a marker for that group. The value has to be greater than 0 but less than or equal to 1 (Default value: 0.25)

  • gene_batch_size (int) – Number of genes to be loaded in memory at a time. All cells (from cell_key) are loaded for this number of genes at a time. (Default value: 50)

Returns:

Return type

None
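
To give a feel for how a specificity threshold of this kind can work, the sketch below scores one feature per group using ranks. The scoring rule (a group's mean rank divided by the sum of mean ranks over all groups) is an assumption made for illustration and is not confirmed by this page; refer to find_markers_by_rank for the actual method. The helper names are hypothetical.

```python
from collections import defaultdict


def dense_ranks(values):
    """Dense ranks starting at 1; equal values share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [order[v] for v in values]


def marker_groups(values, groups, threshold=0.25):
    """Return groups whose share of the summed mean rank exceeds threshold.

    Hypothetical helper for illustration only; not part of Scarf.
    """
    ranks_by_group = defaultdict(list)
    for rank, group in zip(dense_ranks(values), groups):
        ranks_by_group[group].append(rank)
    mean_rank = {g: sum(rs) / len(rs) for g, rs in ranks_by_group.items()}
    total = sum(mean_rank.values())
    # A feature is kept as a marker of a group only above the threshold.
    return {g: m / total for g, m in mean_rank.items() if m / total > threshold}


# One feature measured in six cells from three groups; only group "c" expresses it.
values = [0, 0, 0, 0, 5, 6]
groups = ["a", "a", "b", "b", "c", "c"]
markers = marker_groups(values, groups, threshold=0.25)
```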

show_zarr_tree(start='/', depth=None)
Parameters
  • start

  • depth

Return type

None

Returns

None

to_anndata(from_assay=None, cell_key=None, layers=None)

Writes an assay of the Zarr hierarchy to AnnData file format.

Parameters
  • from_assay (Optional[str]) – Name of assay to be used. If no value is provided then the default assay will be used.

  • cell_key (Optional[str]) – Name of column from cell metadata that has boolean values. This is used to subset cells

  • layers (Optional[dict]) – A mapping of layer names to assay names. Ex. {‘spliced’: ‘RNA’, ‘unspliced’: ‘URNA’}. The raw data from the assays will be stored as sparse arrays in the corresponding layer in anndata.

Returns: anndata object

Assay

A generic Assay class that contains methods to calculate feature level statistics. It also provides a method for saving normalized subset of data for later KNN graph construction.

class scarf.assay.Assay(z, name, cell_data, nthreads, min_cells_per_feature=10)

A generic Assay class that contains methods to calculate feature level statistics.

It also provides a method for saving normalized subset of data for later KNN graph construction.

name
z
cells
nthreads
rawData
feats
attrs
normMethod
sf
normed()
to_raw_sparse()
add_percent_feature()
save_normalized_data()
score_features()
add_percent_feature(feat_pattern, name)
Parameters
  • feat_pattern (str) –

  • name (str) –

Returns:

Return type

None

normed(cell_idx=None, feat_idx=None, **kwargs)
Parameters
  • cell_idx (Optional[ndarray]) –

  • feat_idx (Optional[ndarray]) –

  • **kwargs

Returns:

save_normalized_data(cell_key, feat_key, batch_size, location, log_transform, renormalize_subset, update_keys)
Parameters
  • cell_key (str) –

  • feat_key (str) –

  • batch_size (int) –

  • location (str) –

  • log_transform (bool) –

  • renormalize_subset (bool) –

  • update_keys (bool) –

Returns:

Return type

dask.array

score_features(feature_names, cell_key, ctrl_size, n_bins, rand_seed)
Parameters
  • feature_names (List[str]) –

  • cell_key (str) –

  • ctrl_size (int) –

  • n_bins (int) –

  • rand_seed (int) –

Returns:

Return type

ndarray

to_raw_sparse(cell_key)
Parameters

cell_key

Returns:

RNAassay

This assay is designed for feature selection and normalization of scRNA-Seq data

class scarf.assay.RNAassay(z, name, cell_data, **kwargs)

This assay is designed for feature selection and normalization of scRNA-Seq data.

Subclass of Assay.

mark_hvgs(cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, hvg_key_name, show_plot, **plot_kwargs)
Parameters
  • cell_key (str) –

  • min_cells (int) –

  • top_n (int) –

  • min_var (float) –

  • max_var (float) –

  • min_mean (float) –

  • max_mean (float) –

  • n_bins (int) –

  • lowess_frac (float) –

  • blacklist (str) –

  • hvg_key_name (str) –

  • show_plot (bool) –

  • **plot_kwargs

Return type

None

normed(cell_idx=None, feat_idx=None, renormalize_subset=False, log_transform=False, **kwargs)
Parameters
  • cell_idx (Optional[ndarray]) –

  • feat_idx (Optional[ndarray]) –

  • renormalize_subset (bool) –

  • log_transform (bool) –

  • **kwargs

Returns:

set_feature_stats(cell_key, min_cells)
Parameters
  • cell_key (str) –

  • min_cells (int) –

Returns:

Return type

None

ATACassay

This assay is designed for feature selection and normalization of scATAC-Seq data

class scarf.assay.ATACassay(z, name, cell_data, **kwargs)
mark_prevalent_peaks(cell_key, top_n, prevalence_key_name)
Parameters
  • cell_key (str) –

  • top_n (int) –

  • prevalence_key_name (str) –

Returns:

Return type

None

normed(cell_idx=None, feat_idx=None, **kwargs)
Parameters
  • cell_idx (Optional[ndarray]) –

  • feat_idx (Optional[ndarray]) –

  • **kwargs

Returns:

set_feature_stats(cell_key)
Parameters

cell_key (str) –

Returns:

Return type

None

ADTassay

This assay is designed for feature selection and normalization of ADTs from CITE-Seq data

class scarf.assay.ADTassay(z, name, cell_data, **kwargs)
normed(cell_idx=None, feat_idx=None, **kwargs)
Parameters
  • cell_idx (Optional[ndarray]) –

  • feat_idx (Optional[ndarray]) –

  • **kwargs

Returns:

MetaData

class scarf.metadata.MetaData(zgrp)

MetaData class for cells and features

active_index(key)
Parameters

key (str) –

Returns:

Return type

ndarray

property columns: List[str]

Returns:

Return type

List[str]

drop(column)
Parameters

column (str) –

Returns:

Return type

None

fetch(column, key='I')

Get column values for only valid rows

Parameters
  • column (str) –

  • key (str) –

Returns:

Return type

ndarray
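
The idea behind fetching a column "for only valid rows" can be sketched with a toy table. This assumes that the valid rows are those where the boolean key column (‘I’ by default) is True; the dict-based table and the helper name fetch_valid are hypothetical stand-ins and not the MetaData implementation.

```python
def fetch_valid(table, column, key="I"):
    """Return the values of column for rows where the boolean key column is True.

    Hypothetical helper for illustration only; not part of Scarf.
    """
    return [value for value, keep in zip(table[column], table[key]) if keep]


# Toy cell metadata: the second cell is marked as invalid in column 'I'.
toy_table = {
    "I": [True, False, True],
    "nCounts": [1200, 15, 3400],
}
valid_counts = fetch_valid(toy_table, "nCounts")
```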

fetch_all(column)
Parameters

column (str) –

Returns:

Return type

ndarray

get_dtype(column)
Parameters

column (str) – Column name of the table

Returns:

Return type

type

get_index_by(value_targets, column, key=None)
Parameters
  • value_targets (List[Any]) –

  • column (str) –

  • key (Optional[str]) –

Returns:

Return type

ndarray

grep(pattern, only_valid=False)
Parameters
  • pattern (str) –

  • only_valid

Returns:

Return type

List[str]

head(n=5)
Parameters

n (int) –

Returns:

Return type

DataFrame

index_to_bool(idx, invert=False)
Parameters
  • idx (ndarray) –

  • invert (bool) –

Returns:

Return type

ndarray

insert(column_name, values, fill_value=nan, key='I', overwrite=False, location='primary', force=False)

add

Parameters
  • column_name (str) –

  • values (array) –

  • fill_value (Any) –

  • key (str) –

  • overwrite (bool) –

  • location (str) –

  • force (bool) –

Returns:

Return type

None

mount_location(zgrp, identifier)
Parameters
  • zgrp (zarr.hierarchy) –

  • identifier (str) –

Returns:

Return type

None

multi_sift(columns, lows, highs)
Parameters
  • columns (List[str]) –

  • lows (Iterable) –

  • highs (Iterable) –

Returns:

Return type

ndarray

remove_trend(x, y, n_bins=200, lowess_frac=0.1)
Parameters
  • x (str) –

  • y (str) –

  • n_bins (int) –

  • lowess_frac (float) –

Returns:

Return type

ndarray

reset_key(key)
Parameters

key (str) –

Returns:

Return type

None

sift(column, min_v=-inf, max_v=inf)
Parameters
  • column (str) –

  • min_v (float) –

  • max_v (float) –

Returns:

Return type

ndarray

to_pandas_dataframe(columns, key=None)
Parameters
  • columns (List[str]) –

  • key (Optional[str]) –

Returns:

Return type

DataFrame

unmount_location(identifier)
Parameters

identifier (str) –

Returns:

Return type

None

update_key(values, key)
Parameters
  • values (array) –

  • key

Returns:

Return type

None

Readers

scarf.readers

alias of module ‘scarf.readers’

Writers

scarf.writers

alias of module ‘scarf.writers’