Core Functions

This module defines core functions used by eco_helper and its submodules.

eco_helper.core.CellTypeCollection

This class handles cell type sub-datasets from EcoTyper.

class eco_helper.core.cell_types.CellTypeCollection(directories: list)

Bases: object

This class assembles the cell types from multiple EcoTyper results directories. It will store for each cell type the corresponding data directory from the given EcoTyper results directories. This class is iterable over the cell types identified, and can be indexed by the cell type name.

Parameters

directories (list or str) – A single or a list of multiple EcoTyper results (output) directories to get cell types from.
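The iteration and indexing behaviour described above can be pictured with a small stand-in. This is a sketch only, not the eco_helper implementation: the class name `CellTypeIndex` is hypothetical, and so is the assumption that each subdirectory of a results directory corresponds to one cell type.

```python
import os
import tempfile


class CellTypeIndex:
    """Hypothetical stand-in: maps cell type names to their data directories."""

    def __init__(self, directories):
        # Accept a single directory or a list of directories.
        if isinstance(directories, str):
            directories = [directories]
        self._dirs = {}
        for results_dir in directories:
            # Assumption: each subdirectory is named after one cell type.
            for name in sorted(os.listdir(results_dir)):
                path = os.path.join(results_dir, name)
                if os.path.isdir(path):
                    self._dirs.setdefault(name, []).append(path)

    def __iter__(self):
        # Iterating yields the cell type names.
        return iter(self._dirs)

    def __getitem__(self, cell_type):
        # Indexing by name yields the directories found for that cell type.
        return self._dirs[cell_type]


def demo():
    # Build two fake results directories and index them.
    with tempfile.TemporaryDirectory() as root:
        layout = (("run1", ["B.cells", "T.cells"]), ("run2", ["T.cells"]))
        for run, cell_types in layout:
            for cell_type in cell_types:
                os.makedirs(os.path.join(root, run, cell_type))
        index = CellTypeIndex([os.path.join(root, "run1"), os.path.join(root, "run2")])
        return list(index), len(index["T.cells"])
```

Here `demo()` lists the cell types found across both runs and counts how many directories were collected for one of them.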

eco_helper.core.CellStateCollection

This class handles Cell State assignments for different EcoTyper runs.

class eco_helper.core.cell_states.CellStateCollection(directories: list)

Bases: CellTypeCollection

This class handles the state assignments between different EcoTyper runs. It will store for each cell type a dataframe with the cell type’s genes and their corresponding state assignments. This class is iterable over the cell types identified, and can be indexed by the cell type name.

Parameters

directories (list) – List of EcoTyper results (output) directories to get state assignments from.

compare_gene_overlaps(percent: bool = False) pandas.DataFrame

Compares the gene set overlaps between different EcoTyper runs for each cell type, across all cell states.

Note

A per-state comparison makes no sense because the state labelling is arbitrary for each clustering and therefore S01 from two different runs need not correspond to the same state.

Parameters

percent (bool) – If True, compute the overlap as a percentage of the total set of genes per state.

Returns

df – Dataframe with the gene set overlaps for each cell type between different runs.

Return type

pd.DataFrame
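Because state labels are not comparable between runs, one way to picture the comparison is to pool each run's genes over all states before intersecting. The sketch below uses plain dictionaries instead of dataframes; the function name `pooled_overlaps`, the input layout, and the choice to express percentages relative to the union of both runs are all assumptions made for illustration.

```python
from itertools import combinations


def pooled_overlaps(assignments, percent=False):
    """assignments: {run: {gene: state}} for one cell type (hypothetical layout)."""
    # Pool genes over all states, since S01 in one run need not match S01 in another.
    genes = {run: set(states) for run, states in assignments.items()}
    result = {}
    for a, b in combinations(sorted(genes), 2):
        shared = len(genes[a] & genes[b])
        if percent:
            # Assumption: percentages are taken relative to the union of both runs.
            total = len(genes[a] | genes[b])
            shared = 100.0 * shared / total if total else 0.0
        result[(a, b)] = shared
    return result


runs = {
    "run1": {"CD3E": "S01", "CD4": "S02", "IL7R": "S02"},
    "run2": {"CD3E": "S02", "CD4": "S01", "GZMB": "S03"},
}
counts = pooled_overlaps(runs)
percentages = pooled_overlaps(runs, percent=True)
```

In this toy input, CD3E and CD4 are shared between the runs even though their state labels differ.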

export_to_gseapy(directory: str, prerank: bool = False, enrichr: bool = True)

Export the gene sets for each cell type and state into separate files in a directory. If get_genes() has not been called yet, this method will call it automatically.

Note

If both prerank and enrichr are set to True, the output files will be placed in separate subdirectories.

Parameters
  • directory (str) – The directory to export the gene sets to.

  • prerank (bool) – Export the gene names alongside the max. fold change for gseapy prerank (default False).

  • enrichr (bool) – Export only the gene names as a simple text file for gseapy enrichr (default True).
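The two export flavours can be sketched as follows. This is an illustration under stated assumptions, not the actual export code: the function name `export_gene_sets`, the input layout, the `<cell type>_<state>` file naming, and the `.txt`/`.rnk` extensions are hypothetical. The subdirectory names are borrowed from the `enrichr_outdir` and `prerank_outdir` settings, but whether export_to_gseapy uses them in this way is also an assumption.

```python
import os
import tempfile

# Names borrowed from eco_helper.core.settings; their use here is an assumption.
ENRICHR_SUBDIR = "gseapy_enrichr"
PRERANK_SUBDIR = "gseapy_prerank"


def export_gene_sets(gene_sets, directory, prerank=False, enrichr=True):
    """gene_sets maps (cell_type, state) to a list of (gene, fold_change) tuples."""
    split = prerank and enrichr  # both formats requested: keep them apart
    written = []
    for (cell_type, state), genes in gene_sets.items():
        stem = f"{cell_type}_{state}"  # hypothetical naming scheme
        if enrichr:
            outdir = os.path.join(directory, ENRICHR_SUBDIR) if split else directory
            os.makedirs(outdir, exist_ok=True)
            path = os.path.join(outdir, stem + ".txt")
            with open(path, "w") as handle:
                # Gene names only, one per line.
                handle.writelines(f"{gene}\n" for gene, _ in genes)
            written.append(path)
        if prerank:
            outdir = os.path.join(directory, PRERANK_SUBDIR) if split else directory
            os.makedirs(outdir, exist_ok=True)
            path = os.path.join(outdir, stem + ".rnk")
            ranked = sorted(genes, key=lambda item: item[1], reverse=True)
            with open(path, "w") as handle:
                # Gene name and fold change, tab-separated, highest fold change first.
                handle.writelines(f"{gene}\t{fc}\n" for gene, fc in ranked)
            written.append(path)
    return written


def demo():
    sets = {("T.cells", "S01"): [("CD4", 1.5), ("CD3E", 2.5)]}
    contents = {}
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_gene_sets(sets, tmp, prerank=True, enrichr=True):
            with open(path) as handle:
                contents[os.path.relpath(path, tmp)] = handle.read()
    return contents
```

With both flags set, `demo()` returns the contents of one file per format, each located in its own subdirectory.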

get_genes()

Get the gene info with the fold-change data for each cell type and the assigned cell state.

Returns

genes – A GeneSetCollection with the gene info with the fold-change data for each cell type and the assigned cell state.

Return type

GeneSetCollection

save(directory: str)

Save the state assignments of each cell type to a directory (one file per cell type).

Note

The export_to_gseapy method allows streamlined export of gene sets destined for subsequent analysis with gseapy prerank or enrichr.

Parameters

directory (str) – The directory to save the state assignments to.

eco_helper.core.EcoTypeCollection

This class handles EcoType assignments between different EcoTyper runs.

class eco_helper.core.ecotypes.Ecotype(cell_types: Optional[list] = None, states: Optional[list] = None, genes: Optional[list] = None, label: Optional[str] = None)

Bases: object

The base class of an Ecotype, holding the cell types and states associated with the Ecotype.

Parameters
  • cell_types (list) – List of cell types associated with the Ecotype.

  • states (list) – List of cell states associated with the Ecotype.

  • genes (list) – List of pandas dataframes containing the genes associated with each cell type and state.

  • label (str) – An arbitrary identifier for the Ecotype.

add(cell_type: str, state: str, genes: Optional[pandas.DataFrame] = None)

Add a cell type and state to the Ecotype.

Parameters
  • cell_type (str) – The cell type to add.

  • state (str) – The cell type’s cell state.

  • genes (pd.DataFrame) – The genes associated with the cell state.

property cell_types

gene_set_filenames()

Assemble a list of gene set filenames (as created by the eco_helper.enrich.collect_gene_sets function) for all celltypes and states contributing to the Ecotype.

Returns

List of gene set filenames.

Return type

list

property genes

remove(cell_type: str, state: Optional[str] = None)

Remove a cell type and state from the Ecotype. If no state is given then all states associated with the cell-type are removed.
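The add and remove semantics, including removal of every state when none is given, can be illustrated with a reduced stand-in. The class `EcotypeSketch` is hypothetical and leaves out the gene dataframes entirely; it only tracks (cell type, state) pairs.

```python
class EcotypeSketch:
    """Hypothetical stand-in that tracks (cell type, state) pairs only."""

    def __init__(self, label=None):
        self.label = label
        self._members = []  # list of (cell_type, state) tuples

    def add(self, cell_type, state):
        self._members.append((cell_type, state))

    def remove(self, cell_type, state=None):
        # Without a state, every entry of the cell type is dropped.
        self._members = [
            (ct, st)
            for ct, st in self._members
            if not (ct == cell_type and (state is None or st == state))
        ]

    @property
    def cell_types(self):
        return [ct for ct, _ in self._members]

    @property
    def states(self):
        return [st for _, st in self._members]


eco = EcotypeSketch(label="E1")
eco.add("T.cells", "S01")
eco.add("T.cells", "S02")
eco.add("B.cells", "S01")

eco.remove("T.cells", "S02")       # removes one specific state
after_single = list(zip(eco.cell_types, eco.states))

eco.remove("T.cells")              # removes all remaining T.cells states
after_all = list(zip(eco.cell_types, eco.states))
```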

property states

to_df()

Convert the Ecotype to a pandas DataFrame with two columns, one for cell types and one for their states (as string identifiers/labels).

to_dict()

Convert the Ecotype to a dictionary with cell types and states as keys and their associated genes as values.

class eco_helper.core.ecotypes.EcotypeCollection(directories: list)

Bases: CellStateCollection

This class handles Ecotype assignments between separate EcoTyper runs.

Parameters

directories (list) – List of EcoTyper results (output) directories to get ecotypes from.

match_genes_to_states()

Get the gene sets associated with each cell type’s cell state. This will replace the simple string description of the cell state with the respective dataframe within the ecotype_assignments dictionary.

eco_helper.core.gene_sets module

Classes to handle gene sets.

class eco_helper.core.gene_sets.BaseOverlap(a: set, b: set)

Bases: object

This class handles a single overlap between two sets of genes.

Parameters
  • a (set) – The first set of genes.

  • b (set) – The second set of genes.

get(percent: bool = False)

Get a pandas dataframe of the overlaps between the two sets, either in percentages or in absolute counts (in which case a “total” column is added).

Parameters

percent (bool) – If True, return the overlap in percentages.
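The difference between the two output modes can be sketched with a plain dictionary in place of a dataframe. The function name `overlap_summary`, the reported quantities (`only_a`, `shared`, `only_b`) and the use of the union as the percentage base are assumptions made for illustration; only the extra “total” entry in count mode follows the description above.

```python
def overlap_summary(a, b, percent=False):
    """Summarise the overlap of two gene sets as counts or percentages."""
    union = a | b
    summary = {
        "only_a": len(a - b),
        "shared": len(a & b),
        "only_b": len(b - a),
    }
    if percent:
        # Assumption: percentages are relative to the union of both sets.
        return {
            key: (100.0 * value / len(union) if union else 0.0)
            for key, value in summary.items()
        }
    # Absolute counts additionally report the total number of genes.
    summary["total"] = len(union)
    return summary


first = {"A", "B", "C"}
second = {"B", "C", "D", "E"}
as_counts = overlap_summary(first, second)
as_percent = overlap_summary(first, second, percent=True)
```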

class eco_helper.core.gene_sets.GeneSetCollection(gene_sets: Optional[dict] = None)

Bases: object

This class handles a collection of gene sets for different cell types. It stores for each cell type a dataframe with the gene sets for each state. This class is iterable over cell types and can be indexed by cell type name and additionally by cell state.

Parameters

gene_sets (dict) – A dictionary with cell type labels as keys and, as values, dataframes of extracted genes with a “State” column describing their assigned state.

property cell_types

items()

keys()

save(file_or_directory: str)

Save the gene sets either to a single condensed file or as separate files (one per cell type) into a directory.

Parameters

file_or_directory (str) – The file or directory to save the gene sets to.

subsets(cell_type: Optional[str] = None)

Return a groupby object for the given cell_type dataframe.

Parameters

cell_type (str) – The cell type label. If none is provided a generator is returned with a groupby object for each cell type.

Returns

A groupby object for the given cell_type dataframe, or a generator with a groupby object for each cell type.

Return type

groupby

values()

class eco_helper.core.gene_sets.GeneSetOverlap(cell_type: str, state_assignments: pandas.DataFrame)

Bases: object

A class to handle the overlap between different cell states and separate EcoTyper runs for a single cell type.

Parameters
  • cell_type (str) – The cell type label.

  • state_assignments (pandas.DataFrame) – A pandas dataframe with genes as index, a “State” column specifying the state to which each gene was assigned, and a “run” column specifying which EcoTyper run the assignment is from.

compute_overlap(percent: bool = False)

Compute the overlap between separate EcoTyper runs for each cell state individually.

Parameters

percent (bool) – If True, compute the overlap as a percentage of the total set of genes per state.

eco_helper.core.EcoTyperConfig

Read an EcoTyper config yaml file.

class eco_helper.core.ecotyper_config.EcoTyperConfig(filename: str)

Bases: object

This class handles the EcoTyper configuration yaml data.

Parameters

filename (str) – The path to the config file.

property annotation_columns

The annotation columns used for plotting the heatmaps

property annotation_file

The annotation file used

property cophentic_cutoff

The cophenetic cutoff used

property dataset

The dataset name used

property expression_matrix

The expression matrix used

property output_dir

The output directory used

eco_helper.core.ecotyper_config.read_ecotyper_config(filename: str)

Reads the config file for an EcoTyper experiment.

Parameters

filename (str) – The path to the config file.

Returns

config – The config file as a dictionary.

Return type

dict

eco_helper.core.Dataset

The class to handle EcoTyper datasets as pairs of annotation-tables and expression-matrices.

class eco_helper.core.dataset.Dataset(annotation: str, expression: str)

Bases: object

This class handles an EcoTyper dataset.

Parameters
  • annotation (str) – The filename of the annotation file.

  • expression (str) – The filename of the expression matrix file.

read(annotation: Optional[str] = None, expression: Optional[str] = None)

Read in a (new) dataset from files.

Parameters
  • annotation (str) – The filename of the annotation file.

  • expression (str) – The filename of the expression matrix file.

write(annotation: Optional[str] = None, expression: Optional[str] = None)

Write the dataset to files.

Parameters
  • annotation (str) – The filename of the annotation file.

  • expression (str) – The filename of the expression matrix file.

eco_helper.core.dataset.read_anotation(filename: str)

Reads in an annotation file and returns a pandas DataFrame.

Parameters

filename (str) – The filename of the annotation file.

Returns

annotation – The annotation file as a pandas DataFrame.

Return type

pandas.DataFrame

eco_helper.core.dataset.read_expression(filename: str)

Reads in an expression matrix file and returns a pandas DataFrame.

Parameters

filename (str) – The filename of the expression matrix file.

Returns

expression – The expression matrix file as a pandas DataFrame.

Return type

pandas.DataFrame

eco_helper.core.dataset.write_annotation(dataset: Dataset, filename: str)

Writes the annotation file of a dataset to a file.

Parameters
  • dataset (Dataset) – The dataset to write the annotation file for.

  • filename (str) – The filename to write the annotation file to.

eco_helper.core.dataset.write_expression(dataset: Dataset, filename: str)

Writes the expression matrix of a dataset to a file.

Parameters
  • dataset (Dataset) – The dataset to write the expression matrix for.

  • filename (str) – The filename to write the expression matrix to.

eco_helper.core settings

Generic settings for eco_helper

eco_helper.core.settings.cell_type_col = 'CellType'

The data column handling the “cell type” assignment.

eco_helper.core.settings.ecotype_col = 'Ecotype'

The data column handling the ecotype assignment.

eco_helper.core.settings.ecotyper_experiment_col = 'run'

The data column handling the EcoTyper experiment name.

eco_helper.core.settings.ecotypes_assignment_file = 'ecotype_assignment.txt'

The file containing the Ecotypes assignment to samples from an EcoTyper experiment.

eco_helper.core.settings.ecotypes_composition_file = 'ecotypes.txt'

The file containing the composition data for Ecotypes from an EcoTyper experiment.

eco_helper.core.settings.ecotypes_folder = 'Ecotypes'

The folder containing the Ecotypes from an EcoTyper experiment.

eco_helper.core.settings.enrichr_outdir = 'gseapy_enrichr'

The output directory for the gseapy enrichr gene sets.

eco_helper.core.settings.enrichr_results_suffix = '.enrichr.txt'

The suffix for gseapy enrichr results files.

eco_helper.core.settings.gene_col = 'Gene'

The data column handling the gene names or identifiers.

eco_helper.core.settings.gene_info_file = 'gene_info.txt'

The file containing the gene info per celltype, including max fold change and state assignments.

eco_helper.core.settings.gene_sets_outdir = 'gene_sets'

The output directory for extracted gene sets files.

eco_helper.core.settings.gene_sets_suffix = '.genes.txt'

The suffix for a celltype gene sets file.

eco_helper.core.settings.gseapy_outdir = 'gseapy_results'

The output directory for the gseapy results.

eco_helper.core.settings.prerank_outdir = 'gseapy_prerank'

The output directory for the gseapy prerank gene sets.

eco_helper.core.settings.prerank_results_suffix = '.prerank.txt'

The suffix for gseapy prerank results files.

eco_helper.core.settings.rel_expr_col = 'MaxFC'

The data column handling the relative expression.

eco_helper.core.settings.state_assignments_suffix = '.state_assignment.txt'

The suffix for a celltype state assignments file.

eco_helper.core.settings.state_col = 'State'

The data column handling the “state assignment”.
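As a rough illustration of how such settings might be combined, the sketch below builds a gene sets file path from an output directory and a suffix. The values are copied from the settings listed above, but the helper `gene_sets_path` and the `<cell type><suffix>` naming scheme are assumptions and are not confirmed by this reference.

```python
import os

# Values copied from the settings documented above.
gene_sets_outdir = "gene_sets"
gene_sets_suffix = ".genes.txt"


def gene_sets_path(results_dir, cell_type):
    """Hypothetical helper: locate the gene sets file of one cell type."""
    # Assumption: files are named <cell type><suffix> inside the output directory.
    filename = cell_type + gene_sets_suffix
    return os.path.join(results_dir, gene_sets_outdir, filename)


example = gene_sets_path("results", "T.cells")
```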

eco_helper.core.terminal_funcs module

These are core functions of eco_helper that work with the terminal and run subprocesses. They are mostly wrappers for subprocess.run(...) that capture output directly, without the need to manually catch and decode it.

eco_helper.core.terminal_funcs.bash()

Get the current bash executable.

Returns

The path to the bash executable.

Return type

str

eco_helper.core.terminal_funcs.from_terminal(cmd: str) TerminalOutput

Run a command in the terminal and return the output.

Parameters

cmd (str) – The command to run.

Returns

The output of the command, which stores the stdout, stderr, and returncode.

Return type

TerminalOutput

eco_helper.core.terminal_funcs.returncode(cmd: str) int

Run a command in the terminal and return the returncode.

Parameters

cmd (str) – The command to run.

Returns

The returncode of the command.

Return type

int

eco_helper.core.terminal_funcs.run(cmd: str)

Run a command in the terminal without catching any outputs. Note that the command is run with shell=True.

Parameters

cmd (str) – The command to run.

eco_helper.core.terminal_funcs.stderr(cmd: str, file: Optional[str] = None) str

Run a command in the terminal and return the stderr.

Parameters

cmd (str) – The command to run.

Returns

The stderr of the command.

Return type

str

eco_helper.core.terminal_funcs.stdout(cmd: str, file: Optional[str] = None) str

Run a command in the terminal and return the stdout.

Parameters
  • cmd (str) – The command to run.

  • file (str) – A file to write the stdout to. Note this will overwrite any previously existing file of the same name!

Returns

The stdout of the command.

Return type

str
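A wrapper of this kind might look roughly like the sketch below, which runs a command string through the shell, returns the decoded stdout, and optionally writes it to a file. The function name `stdout_of` is hypothetical, and the use of `shell=True` here is an assumption carried over from the description of run(); the overwrite behaviour comes from opening the file in "w" mode.

```python
import os
import subprocess
import sys
import tempfile


def stdout_of(cmd, file=None):
    """Run cmd through the shell and return its decoded stdout."""
    # text=True makes subprocess decode the captured bytes for us.
    process = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if file is not None:
        # Mode "w" truncates, so an existing file of the same name is overwritten.
        with open(file, "w") as handle:
            handle.write(process.stdout)
    return process.stdout


# Use the current interpreter as a portable example command.
command = f'"{sys.executable}" -c "print(42)"'
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "out.txt")
    text = stdout_of(command, file=log)
    with open(log) as handle:
        saved = handle.read()
```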

eco_helper.core.TerminalOutput

This class handles the stdout and stderr of a subprocess run.

class eco_helper.core.TerminalOutput.TerminalOutput(process: CompletedProcess)

Bases: object

A class to capture the output of a subprocess run.

Parameters

process (subprocess.CompletedProcess) – The completed process object from which to read the output.

stdout

The stdout of the subprocess run.

Type

str

stderr

The stderr of the subprocess run.

Type

str

returncode

The return code of the subprocess run.

Type

int

read_output(process: CompletedProcess)

Read the output of a subprocess run.

Parameters

process (subprocess.CompletedProcess) – The completed process object from which to read the output.

success()

Check if the subprocess run was successful.
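The attributes and methods listed above suggest a thin wrapper around subprocess.CompletedProcess. The following is a minimal sketch of that idea, not the actual class: the name `OutputSketch` is hypothetical, and treating a return code of 0 as success is an assumption.

```python
import subprocess
import sys


class OutputSketch:
    """Hypothetical stand-in that stores decoded output of a finished process."""

    def __init__(self, process):
        self.read_output(process)

    def read_output(self, process):
        self.stdout = self._decode(process.stdout)
        self.stderr = self._decode(process.stderr)
        self.returncode = process.returncode

    @staticmethod
    def _decode(stream):
        # Streams are None when not captured and bytes unless text mode was used.
        if stream is None:
            return ""
        return stream.decode() if isinstance(stream, bytes) else stream

    def success(self):
        # Assumption: a return code of 0 counts as success.
        return self.returncode == 0


ok = OutputSketch(
    subprocess.run([sys.executable, "-c", "print('done')"], capture_output=True)
)
failed = OutputSketch(
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], capture_output=True)
)
```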

eco_helper.core.find module

Find data files within the EcoTyper output directories or the EcoTyper internal directories.

eco_helper.core.find.find_files(parent: str, pattern: str)

Find files within a directory using glob.

Parameters
  • parent (str) – The path to the parent directory.

  • pattern (str) – The pattern to use for finding files.

Returns

files – The files within the directory.

Return type

list or None

eco_helper.core.find.find_subdirs(parent: str, pattern: str)

Find subdirectories within a directory using glob.

Parameters
  • parent (str) – The path to the parent directory.

  • pattern (str) – The pattern to use for finding subdirectories.

Returns

subdirs – The subdirectories within the directory.

Return type

list or None
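Both finders can be pictured as a glob call followed by a filter on files or directories. The sketch below is an illustration only: the function name `find_matching` and its `directories` flag are hypothetical, and returning None for an empty result is an assumption inferred from the “list or None” return type.

```python
import glob
import os
import tempfile


def find_matching(parent, pattern, directories=False):
    """Return sorted matches of pattern inside parent, or None if nothing matches."""
    check = os.path.isdir if directories else os.path.isfile
    matches = sorted(
        path for path in glob.glob(os.path.join(parent, pattern)) if check(path)
    )
    # Assumption: an empty result is reported as None.
    return matches or None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "Ecotypes"))
        for name in ("a.txt", "b.txt"):
            open(os.path.join(tmp, name), "w").close()
        files = find_matching(tmp, "*.txt")
        subdirs = find_matching(tmp, "*", directories=True)
        missing = find_matching(tmp, "*.csv")
        return (
            [os.path.basename(path) for path in files],
            [os.path.basename(path) for path in subdirs],
            missing,
        )
```

Callers of such a function need to check for None before iterating over the result.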