Gene Set Enrichment Analysis¶

This module provides the functionality to perform gene set enrichment analysis on the results of an EcoTyper run.

eco_helper uses the gseapy package to perform gene set enrichment analysis. By default gseapy enrichr and gseapy prerank analyses are offered. The former is an API to the enrichr web service that requires only gene names as inputs. The latter requires an additional “rank” dataset associated with the respective gene names. eco_helper uses the associated maximum fold change from each gene within the respective cell state as rank data.
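To make the idea of the rank data more concrete, the following is a minimal sketch, assuming hypothetical (gene, fold change) records for a single cell state. The record layout and the function name `max_fold_change_ranking` are illustrative assumptions and not part of eco_helper.

```python
def max_fold_change_ranking(records):
    """Return (gene, max fold change) pairs sorted from highest to lowest."""
    best = {}
    for gene, fold_change in records:
        # keep only the largest fold change observed for each gene
        if gene not in best or fold_change > best[gene]:
            best[gene] = fold_change
    # sort descending by fold change; ties are broken by gene name
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# hypothetical records for one cell state (gene name, fold change)
records = [("APOE", 2.1), ("LPL", 3.4), ("APOE", 2.8), ("CD36", 1.2)]
ranking = max_fold_change_ranking(records)
print(ranking)
```

A ranked list of this shape (gene names plus one numeric rank value each) is the kind of input a prerank analysis expects, whereas enrichr only needs the gene names.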

Usage¶

>>> eco_helper enrich [--prerank] [--enrichr] [--assemble] [--gene_sets <gene sets>] [--output <output>] <input>

where <input> is the path to the EcoTyper results directory, and <output> is the path to the output directory. eco_helper offers the gseapy prerank method (--prerank option) and the enrichr method (--enrichr option) for gene set enrichment analysis; both can be passed at the same time. By default, each cell type produces a separate data file for each of its cell states. With the --assemble option, these individual files are merged into a single data file per cell type that contains the enrichment results for all of its cell states, and the individual files are removed. The --gene_sets option specifies the reference gene sets to query during the enrichment analysis. Any number of inputs in any format accepted by gseapy may be given, but at least one is required.
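When the command is launched from a Python script, the argument list can be assembled programmatically. The sketch below uses only the options named above; the helper `build_enrich_command`, the example paths, and the rule that a method must be selected explicitly are assumptions of this example, not eco_helper behaviour.

```python
import shlex


def build_enrich_command(input_dir, gene_sets, output=None,
                         prerank=False, enrichr=False, assemble=False):
    """Build the argument list for an `eco_helper enrich` call."""
    if not gene_sets:
        raise ValueError("at least one gene set is required")
    if not (prerank or enrichr):
        # assumption of this sketch: always select a method explicitly
        raise ValueError("select prerank and/or enrichr")
    command = ["eco_helper", "enrich", "--gene_sets", *gene_sets]
    # --gene_sets accepts several values, so the method flags come next to
    # keep the positional input from being read as one more gene set
    if prerank:
        command.append("--prerank")
    if enrichr:
        command.append("--enrichr")
    if assemble:
        command.append("--assemble")
    if output is not None:
        command += ["--output", output]
    command.append(input_dir)
    return command


command = build_enrich_command(
    "results/my_ecotyper_run",  # hypothetical EcoTyper results directory
    gene_sets=["KEGG_2021_Human", "Reactome_2016"],
    enrichr=True, assemble=True, output="enrichment_results",
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once eco_helper is installed.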

Full CLI¶

usage: eco_helper enrich [-h] [-o OUTPUT]
                     [-g GENE_SETS [GENE_SETS ...]] [-p] [-e]
                     [-a] [-E] [-n]
                     [--notebook_config NOTEBOOK_CONFIG]
                     [--pickle] [--organism ORGANISM]
                     [--size SIZE SIZE]
                     [--permutations PERMUTATIONS]
                     input

This command performs gene set enrichment analysis using `gseapy` on the results of an EcoTyper analysis.

positional arguments:
input                 The directory storing the EcoTyper results.

options:
-h, --help            show this help message and exit
-o OUTPUT, --output OUTPUT
                        Output directory. By default a '<input>_gseapy_results' directory within the same location as the input directory.
-g GENE_SETS [GENE_SETS ...], --gene_sets GENE_SETS [GENE_SETS ...]
                        The reference gene sets to use for enrichment analysis. This can be any number of accepted gene set inputs for gseapy enrichr or prerank.
-p, --prerank         Use this to perform gseapy prerank analysis.
-e, --enrichr         Use this to perform gseapy enrichr analysis.
-a, --assemble        By default each cell type will produce a separate file for each cell state enrichment analysis. Using the `--assemble` option, all cell-
                        state files from one cell type will be merged together to a single file. In this case the individual files are removed.
-E, --ecotypes        Use this to only analyse cell-types and states contributing to Ecotypes. In this case each Ecotype will receive a subdirectory with its
                        enrichment results files. Note, in this case the files will *not* be assembled, and any non-Ecotype-contributing cell-type and state will
                        not be analysed.
-n, --notebook        Generate a jupyter notebook to analyse the enrichment results. If this option is specified, then the <input> argument is interpreted as
                        the filename of the notebook to generate. By specifying '-' as filename a default filename with the dataset name is used.
--notebook_config NOTEBOOK_CONFIG
                        The configuration file for notebook
                        generation. This is required for the
                        notebook to be generated.
--pickle              Export a pickle file of the enrichment
                        results as an EnrichmentCollection. This
                        can be used in the web-viewer to further
                        inspect the enrichment results.
--organism ORGANISM   Set the reference organism. By default the
                        organism is set to 'human'.
--size SIZE SIZE      [prerank only] Set the minimum and maximum
                        number of gene matches for the reference
                        gene sets and the data. By default 5 and
                        500 are used. Note, this will require a two
                        number input for min and max.
--permutations PERMUTATIONS
                        [prerank only] Set the number of
                        permutations to use for the prerank
                        analysis. By default 1000 is used.
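The two-number form of --size can be illustrated with a small argparse sketch. This is not the parser used by eco_helper; it only mirrors the documented defaults (5 and 500 for --size, 1000 for --permutations) to show how such an option is typically declared and read.

```python
import argparse

# illustrative parser that mirrors two of the documented options
parser = argparse.ArgumentParser(prog="eco_helper enrich")
parser.add_argument("input")
parser.add_argument("--size", nargs=2, type=int, default=[5, 500], metavar="SIZE")
parser.add_argument("--permutations", type=int, default=1000)

# both numbers must be given: minimum first, maximum second
args = parser.parse_args(["--size", "10", "300", "my_results"])
min_size, max_size = args.size
print(min_size, max_size, args.permutations)

# without the options the documented defaults apply
defaults = parser.parse_args(["my_results"])
```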

Web-Viewer¶

The eco_helper web-viewer can be used to further inspect the enrichment results. It is a stand-alone streamlit web app that can be run locally from the command line or online. It requires a pickle file containing an EnrichmentCollection object. This can be generated by passing the --pickle option to the eco_helper enrich command.

To run the web-viewer locally, first install streamlit and clone the web-viewer repository from GitHub. Then run the following command from the root directory of the repository:

streamlit run src/main.py

The web-viewer will then be available by default at localhost:8501.

Alternatively, the web-viewer is hosted on streamlit and can be directly accessed here. It is possible that the app is dormant to save resources. In this case it will take a few seconds to load.

Jupyter Notebook¶

eco_helper can auto-generate a jupyter notebook to analyse the enrichment results. This notebook will contain a number of cells for plots to visualise the enrichment results. To generate a notebook, the --notebook option must be passed, and a notebook config file must be provided using the --notebook_config option.

>>> eco_helper enrich --notebook --notebook_config my_enrichment_config.yaml my_enrichment.ipynb

In this case the notebook will automatically call eco_helper enrich if no enrichment results are present yet for the desired dataset. The notebook also contains cells to call eco_helper enrich whenever desired, so each notebook is its own little analysis pipeline with input sections, automated processing, and output sections. However, certain options such as --size or --organism are missing from the notebook-internal call to eco_helper enrich. To retain full customization of the command, it is therefore recommended to first run eco_helper enrich manually to generate the desired enrichment results, and then run it again with the --notebook option. If a notebook encounters existing results, it simply loads them and performs its preset analysis without recomputing the enrichment itself (unless forced by the config).

At its core, the preset analysis consists of highlighting subsets of enriched terms, which can be supplied in the notebook config file. The notebook automatically pre-scans these subsets and, for each cell-state dataset, removes any subset whose number of hits among the topmost enriched terms falls below a cutoff value.
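The pre-scan can be pictured roughly as follows. This is a sketch under stated assumptions, not the notebook's actual code: the function `prescan` and its arguments are hypothetical, the terms are assumed to be sorted from most to least enriched, and patterns are matched case-insensitively as regular expressions.

```python
import re


def prescan(terms, references, top_most_fraction=0.3, cutoff=5):
    """Keep the categories with at least `cutoff` hits among the topmost terms.

    `terms` must already be sorted from most to least enriched.
    """
    n_top = max(1, int(len(terms) * top_most_fraction))
    top_terms = terms[:n_top]
    kept = {}
    for category, patterns in references.items():
        # any of the category's patterns counts as a hit
        regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
        hits = sum(1 for term in top_terms if regex.search(term))
        if hits >= cutoff:
            kept[category] = patterns
    return kept


# hypothetical enriched terms, most enriched first
terms = [
    "Lipid metabolism", "Triglyceride biosynthesis", "LDL clearance",
    "Cell cycle", "DNA repair", "Fatty acid oxidation", "mRNA splicing",
    "Translation", "Glycolysis", "Apoptosis",
]
references = {
    "lipid associated": ["lipid", "triglycer", "L( |_|-)?DL"],
    "sugar metabolism": ["glyco", "gluco"],
}
kept = prescan(terms, references, top_most_fraction=0.5, cutoff=2)
print(list(kept))
```

In this toy run the sugar category is dropped because its only match lies outside the topmost half of the terms.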

The notebook then automatically generates cells with plotly-based interactive scatterplots to quickly visualise the enrichment results, and places the retained subsets in dedicated cells for each cell state. These cells can later be modified manually by the user to generate streamlined figures. Cells for seaborn figures are also prepared but commented out by default.

Notebook config¶

In order to generate a notebook, a config file must be provided. This config file is a yaml file which contains the following keys:

# ----------------------------------------------------------------
#   Main directory settings for input and output data
# ----------------------------------------------------------------
directories :

    # available wildcards for filepaths are:
    # - {user}    | the current username
    # - {parent}  | the project parent directory
    # - {results} | the project's raw results directory
    # - {scripts} | the project's scripts directory ( is {parent}/scripts )

    # the ecotyper results directory for which to perform or load results of
    # enrichment analysis.
    ecotyper_dir : "{results}/your_ecotyper_results"

    # the directory where outputs (e.g. figures)
    # from within the notebook should be saved
    outdir : "{parent}/gsea_results/your_ecotyper_results"

    # the project directory
    parent : "/data/users/{user}/EcoTyper"

    # the directory of EcoTyper raw results
    results : "{parent}/results"

    # the directory where outputs from the notebook
    # should be saved (this is an optional
    # variable to facilitate working with the notebook)
    outdir : "{parent}/gsea_results"

# ----------------------------------------------------------------
#   Enrichment analysis settings
# ----------------------------------------------------------------
enrichment :

    # enrichment is automatically performed when no
    # enrichment results are found. If set to True then
    # re-computation of enrichment is forced even
    # when results are present already.
    perform_enrichment : False

    # if True only ecotype-contributing cell states are analysed
    # and results are stored in ecotype-specific subdirectories.
    # Otherwise all cell states are analysed and stored in cell-type
    # specific files.
    ecotype_resolution : True

    # perform GSEAPY enrichr
    enrichr : True

    # perform GSEAPY prerank
    prerank : False

    # the reference gene sets against which to query.
    # this can be any input type accepted by GSEAPY
    gene_sets :
        - "Reactome_2016"
        - "WikiPathway_2021_Human"
        - "Panther_2016"
        - "KEGG_2021_Human"
        - "GO_Biological_Process_2021"
        - "GO_Molecular_Function_2021"
        - "GO_Cellular_Component_2021"

# ----------------------------------------------------------------
#   Results analysis settings for automated gene set highlighting
# ----------------------------------------------------------------
analysis :

    # the topmost fraction of enriched terms to use for determining
    # if a category might be interesting (i.e. whether or not
    # to keep it for a specific cell-state).
    top_most_fraction :  0.3

    # the minimum number of hits of a category among the topmost enriched terms
    # required to keep a category for a specific cell-state.
    cutoff : 5

    # provide a dictionary of reference categories / super-terms
    # which to query within the enrichment datasets in each cell-state.
    # or set to NULL to disable.
    references :

        # for example highlighting lipid associated terms
        "lipid associated" :
            - "lipid"
            - "lipo(protein)?"
            - "triacyl"
            - "lipase"
            - "acylglycer"
            - "triglycer"
            - "chylomicron"
            - "fat"
            - "fatty ?-?_?acid"
            - "L( |_|-)?DL"
            - "H( |_|-)?DL"
            - "V( |_|-)?LDL"

        "another category" :
            - "another pattern1"
            - "another pattern2"
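How the path wildcards in the directories section expand can be sketched as below. The function `resolve_directories` is hypothetical and only illustrates the substitution order implied by the comments in the config ({user} first, then {parent}, then {scripts} and {results}); how eco_helper resolves them internally is not shown here.

```python
import getpass


def resolve_directories(directories, user=None):
    """Expand {user}, {parent}, {results} and {scripts} in the configured paths."""
    values = {"user": user or getpass.getuser()}
    # {parent} may itself contain {user}
    values["parent"] = directories["parent"].format(**values)
    # per the config comments, {scripts} is {parent}/scripts
    values["scripts"] = values["parent"] + "/scripts"
    # {results} may contain {parent}
    values["results"] = directories["results"].format(**values)
    return {key: path.format(**values) for key, path in directories.items()}


directories = {
    "parent": "/data/users/{user}/EcoTyper",
    "results": "{parent}/results",
    "ecotyper_dir": "{results}/your_ecotyper_results",
    "outdir": "{parent}/gsea_results",
}
resolved = resolve_directories(directories, user="alice")  # fixed user for the demo
print(resolved["ecotyper_dir"])
```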

eco_helper.enrich.funcs module¶

These are the main functions that are used by eco_helper enrich.

eco_helper.enrich.funcs.assemble_enrichr_results(directory: str, cell_types: CellTypeCollection, outdir: Optional[str] = None, remove_raw: bool = True)¶

Assemble the raw per cell-type and cell-state enrichr output text files into a single file per cell-type including all respective cell states.

Parameters

directory (str) – The directory storing the raw enrichr output files for each cell type and state.
cell_types (CellTypeCollection) – The cell types to use for assembling the results.
outdir (str) – The output directory to store the assembled results. If not specified, the results will be stored in the same directory as the raw enrichr results.
remove_raw (bool) – If True, the raw enrichr results will be removed after assembling.
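The assembling step can be pictured with a small standard-library sketch. It is not the implementation of assemble_enrichr_results: the file naming scheme `<cell_type>_<state>.txt`, the added "State" column, and the helper `assemble_states` are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def assemble_states(directory, cell_type, states, remove_raw=True):
    """Merge `<cell_type>_<state>.txt` tables into a single `<cell_type>.txt`."""
    directory = Path(directory)
    merged_rows, fieldnames = [], None
    for state in states:
        raw_file = directory / f"{cell_type}_{state}.txt"
        with raw_file.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            fieldnames = fieldnames or ["State", *reader.fieldnames]
            for row in reader:
                # remember which cell state each row came from
                merged_rows.append({"State": state, **row})
        if remove_raw:
            raw_file.unlink()
    outfile = directory / f"{cell_type}.txt"
    with outfile.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(merged_rows)
    return outfile


# demo with two tiny hypothetical per-state result files
with tempfile.TemporaryDirectory() as tmp:
    for state, term in [("S01", "Lipid metabolism"), ("S02", "Cell cycle")]:
        Path(tmp, f"Fibroblasts_{state}.txt").write_text(
            f"Term\tAdjusted P-value\n{term}\t0.01\n"
        )
    merged = assemble_states(tmp, "Fibroblasts", ["S01", "S02"])
    merged_lines = merged.read_text().splitlines()
    remaining_files = sorted(p.name for p in Path(tmp).iterdir())
print(merged_lines)
```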

eco_helper.enrich.funcs.assemble_prerank_results(directory: str, cell_types: CellTypeCollection, outdir: Optional[str] = None, remove_raw: bool = True)¶

Assemble the per-state prerank output text files into a single file per cell type.

Parameters

directory (str) – The directory storing the raw prerank output files for each cell type and state.
cell_types (CellTypeCollection) – The cell types to use for assembling the results.
outdir (str) – The output directory to store the assembled results. If not specified, the results will be stored in the same directory as the raw prerank results.
remove_raw (bool) – If True, the raw prerank results will be removed after assembling.

eco_helper.enrich.funcs.collect_gene_sets(directory: str, outdir: str, enrichr: bool = True, prerank: bool = False)¶

Collect gene sets from an EcoTyper output directory for subsequent gene set enrichment analysis.

Parameters

directory (str) – The path to the EcoTyper output directory.
outdir (str) – The path to the output directory.
enrichr (bool) – Set to True to export only gene names for subsequent gseapy enrichr.
prerank (bool) – Set to True to export gene names with max. Fold Change for gseapy prerank analysis.

eco_helper.enrich.funcs.enrichr(directory: str, outdir: str, gene_sets: list = 'KEGG_2021_Human', organism: str = 'human')¶

Perform gene set enrichment using gseapy enrichr for each cell type and each cell-state therein.

Parameters

directory (str) – The path to the directory where extracted gene sets are stored in separate text files. Note, if both prerank and enrichr sets were extracted, this will automatically adjust to the gseapy_enrichr subdirectory if necessary.
outdir (str) – The path to the output directory.
gene_sets (list or str) – The gene sets to use for enrichment analysis. By default the latest KEGG gene sets are used.
organism (str) – The organism to use for enrichment analysis. By default “human” is assumed.

eco_helper.enrich.funcs.enrichr_ecotypes(directory: str, outdir: str, ecotypes: EcotypeCollection, gene_sets: list = 'KEGG_2021_Human', organism: str = 'human')¶

Perform gene set enrichment analysis only on cell types and states contributing to Ecotypes. This will create dedicated subdirectories within outdir for each ecotype, containing the corresponding enrichment results.

Parameters

directory (str) – The directory storing the extracted gene sets from an EcoTyper results directory. Note, this requires that the gene sets have already been extracted from the results directory!
outdir (str) – The directory to store the enrichment results in.
ecotypes (EcotypeCollection) – The EcotypeCollection specifying the cell-type and state assignments to ecotypes.
gene_sets (list or str) – The gene sets to perform enrichment analysis on.
organism (str) – The reference organism.

eco_helper.enrich.funcs.prerank(directory: str, outdir: str, gene_sets: list = 'KEGG_2021_Human', organism: str = 'human', min_size: int = 5, max_size=500, permutations: int = 1000, **kwargs)¶

Perform gene set enrichment using gseapy prerank for each cell type and each cell-state therein.

Parameters

directory (str) – The path to the directory where extracted gene sets are stored in separate text files. Note, if both enrichr and prerank sets were extracted, this will automatically adjust to the gseapy_prerank subdirectory if necessary.
outdir (str) – The path to the output directory.
gene_sets (list or str) – The gene sets to use for enrichment analysis. By default the latest KEGG gene sets are used.
organism (str) – The organism to use for enrichment analysis. By default “human” is assumed.
min_size (int) – The minimum number of genes required to be found in a gene set.
max_size (int) – The maximum number of genes allowed to be found in a gene set.
permutations (int) – The number of permutations to use for the permutation test.
**kwargs – Any additional keyword arguments to pass to gseapy.prerank.
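The effect of min_size and max_size can be illustrated with a short sketch. gseapy applies this kind of filter internally, so the code below is only a conceptual picture; the helper `filter_gene_sets` and the inclusive bounds are assumptions of this example.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=5, max_size=500):
    """Keep gene sets whose overlap with the ranked genes is within the bounds."""
    ranked = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # count how many genes of the set are present in the data
        overlap = len(ranked.intersection(genes))
        if min_size <= overlap <= max_size:
            kept[name] = genes
    return kept


# hypothetical ranked genes of one cell state and three reference gene sets
ranked_genes = ["APOE", "LPL", "CD36", "TP53", "MYC"]
gene_sets = {
    "lipid": ["APOE", "LPL", "CD36", "FASN"],        # 3 matches
    "cell cycle": ["TP53", "CDK1"],                  # 1 match
    "broad": ["APOE", "LPL", "CD36", "TP53", "MYC"], # 5 matches
}
kept_sets = filter_gene_sets(gene_sets, ranked_genes, min_size=2, max_size=4)
print(list(kept_sets))
```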

eco_helper.enrich.funcs.prerank_ecotypes(directory: str, outdir: str, ecotypes: EcotypeCollection, gene_sets: list = 'KEGG_2021_Human', organism: str = 'human', **kwargs)¶

Perform gene set enrichment analysis using gseapy prerank only on cell types and states contributing to Ecotypes.

Parameters

directory (str) – The directory storing the extracted gene sets from an EcoTyper results directory. Note, this requires that the gene sets have already been extracted from the results directory!
outdir (str) – The directory to store the enrichment results in.
ecotypes (EcotypeCollection) – The EcotypeCollection specifying the cell-type and state assignments to ecotypes.
gene_sets (list or str) – The gene sets to perform enrichment analysis on.
organism (str) – The reference organism.

eco_helper.enrich.visualise module¶

Core functions for data visualization. These are primarily intended for manual use when analyzing enrichment results generated through eco_helper enrich and are not part of the CLI.

StateScatterplot¶

The StateScatterplot class is a wrapper for the scatterplot function that allows for the visualization of enrichment data and selective highlighting of subsets within the data. To do so, set up a StateScatterplot object with the data of interest and call the highlight method with the desired subsets. subsets may be a dictionary of subset keys (labels) and lists of regex patterns to match against a reference column within the data. Alternatively, a function can be supplied that accepts the dataframe as sole argument and returns an array-like object that can be used as subset column. The subsets are always applied through differential coloring.

Example

from eco_helper.enrich.visualise import StateScatterplot

# Create a StateScatterplot object
sc = StateScatterplot( df, x = "enrichment_score", y = "log10_pvalue", hue = None, style = "gene_set" )

By itself, the StateScatterplot object will not visualise anything. To generate a basic plot, use the plot method:

# Generate a basic plot
fig = sc.plot()
fig.show()

To highlight subsets of the data, use the highlight method. For instance, to highlight all enriched terms containing terms or fragments associated with sugar metabolism and G-protein coupled receptors, we could do:

subsets = {
            # anything that contains *ose*, or *glyco*, or *gluco* (these are always case insensitive)
            "sugar metabolism" : [ "ose", "glyco", "gluco" ],

            # anything that contains *gpcr* or G-protein-coupled-receptors (or any combination with space, dash, or underscore)
            "GPCR associated" : [ "gpcr", "g( |-|_)protein( |-|_)coupled( |-|_)receptor" ]
        }

# Highlight the subsets
fig = sc.highlight( subsets, ref_col = "Term" )
fig.show()
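As mentioned above, a function can be supplied instead of a dictionary. The sketch below shows what such a function might look like; the function name and the "other" label for unmatched rows are assumptions. To keep the block free of third-party dependencies it is demonstrated on a plain dict standing in for the dataframe, since indexing a pandas dataframe by column name and iterating over the values works the same way.

```python
import re


def label_sugar_terms(df):
    """Return one label per row, based on the "Term" column."""
    pattern = re.compile("ose|glyco|gluco", re.IGNORECASE)
    return [
        "sugar metabolism" if pattern.search(term) else "other"
        for term in df["Term"]
    ]


# stand-in for an enrichment dataframe with a "Term" column
df = {"Term": ["Glycolysis", "Cell cycle", "Glucose transport"]}
labels = label_sugar_terms(df)
print(labels)
```

With a real StateScatterplot, such a function would be passed in place of the dictionary, for example `sc.highlight(label_sugar_terms)`. According to the parameter description, ref_col is only needed when a dictionary is provided.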

Plotting backends¶

The visualisations can be rendered with two different backends: matplotlib and plotly. The matplotlib backend is the default and uses seaborn and matplotlib to generate figures. The plotly backend uses plotly and plotly.graph_objs to generate figures, which are interactive and support features such as hover information and zooming.

To switch the backend, set eco_helper.enrich.visualise.backend to either “matplotlib” or “plotly”.

import eco_helper.enrich.visualise as vis

vis.backend = "matplotlib" # to use matplotlib figures (default)

vis.backend = "plotly" # to use plotly figures
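Because the backend is a module-level setting, it stays in effect until it is changed again. When only a single figure should use a different backend, a small context manager can restore the previous value afterwards. The helper `use_backend` below is hypothetical and not part of eco_helper; it is demonstrated on a stand-in object so that the block runs without eco_helper installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def use_backend(module, name):
    """Temporarily set `module.backend` and restore the previous value."""
    if name not in ("matplotlib", "plotly"):
        raise ValueError(f"unknown backend: {name!r}")
    previous = module.backend
    module.backend = name
    try:
        yield module
    finally:
        # always restore, even if plotting raised an error
        module.backend = previous


# stand-in for the visualise module, which exposes a `backend` attribute
vis = SimpleNamespace(backend="matplotlib")
with use_backend(vis, "plotly"):
    inside = vis.backend   # figures created here would use plotly
after = vis.backend        # back to the previous setting
print(inside, after)
```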

class eco_helper.enrich.visualise.StateScatterplot(df: pandas.DataFrame, x: str, y: str, hue: Optional[str] = None, style: Optional[str] = None)¶

Bases: object

This class allows scatterplot visualisation of a single cell-type / state dataframe. In particular, it makes it easy to selectively highlight subsets within the data for quicker insight.

Parameters

df (pd.DataFrame) – The source dataframe.
x (str) – The column to use as x-axis.
y (str) – The column to use as y-axis.
hue (str) – The column to use as hue.
style (str) – The column to use as marker style.

count_highlights(subsets: dict, ref_col: Optional[str] = None, topmost: Optional[float] = None, cutoff: Optional[int] = None, dual: bool = True, absolute: bool = False, ax: Optional[matplotlib.pyplot.Axes] = None, **kwargs)¶

Highlight subsets within the dataframe based on a reference column and a dictionary of subsets to highlight, showing the counts of terms associated with each subset in a barplot.

Parameters

subsets (dict or function) – The dictionary of subsets to highlight. This requires strings as keys and lists of regex patterns of associated terms as values. In case a function is provided, this function must take exactly one argument (the dataframe) and must return an array-like object suitable as a new dataframe column on which to base the highlighting.
ref_col (str) – The reference column of the dataframe. This is only required if a dictionary of subsets is provided.
topmost (float (optional)) – Count term association in the fraction of topmost enriched terms. This can either be used to restrict counting in total or to add a second counting dataset.
cutoff (int (optional)) – Remove any subsets that do not have at least this number of counts associated with them. Note, this will affect both the full count and a topmost count in the same way!
dual (bool (optional)) – If True, the full counts and topmost counts are plotted as separate subsets. Otherwise if topmost is provided, only the topmost counts are shown.
absolute (bool (optional)) – If True the counts are plotted as absolute values. Otherwise the counts are converted to fractions of all terms within the reference dataset.
ax (plt.Axes) – The subplot in which to plot. By default a new figure is created. This is ignored in the plotly backend.
**kwargs – Additional keyword arguments.

Returns

fig – The figure object.

Return type

matplotlib.figure.Figure or plotly.graph_objs.Figure

highlight(subsets: dict, ref_col: Optional[str] = None, other_color: str = 'gray', other_alpha: float = 0.1, ax: Optional[matplotlib.pyplot.Axes] = None, **kwargs)¶

Highlight subsets within the dataframe based on a reference column and a dictionary of subsets to highlight, showing the highlighted terms in a colored scatterplot.

Note

The other subset fading is only available in the matplotlib backend. In plotly backend other is just another regular subset. However, in this case the subset can simply be turned off in the legend.

Parameters

subsets (dict or function) – The dictionary of subsets to highlight. This requires strings as keys and lists of regex patterns of associated terms as values. In case a function is provided, this function must take exactly one argument (the dataframe) and must return an array-like object suitable as a new dataframe column on which to base the highlighting.
ref_col (str) – The reference column of the dataframe. This is only required if a dictionary of subsets is provided.
other_color (str) – The color to use for any data points not belonging to any of the highlighted subsets.
other_alpha (float) – The factor by which to reduce the opacity of non-highlighted data points relative to the highlighted subsets.
ax (plt.Axes) – The subplot in which to plot. By default a new figure is created. This is ignored in the plotly backend.
**kwargs – Additional keyword arguments.

Returns

fig – The figure object.

Return type

matplotlib.figure.Figure or plotly.graph_objs.Figure

plot(**kwargs)¶

The base plotting function.

Returns

fig – The figure object.

Return type

matplotlib.figure.Figure or plotly.graph_objs.Figure

top_gene_sets(subsets: Optional[dict] = None, ref_col: Optional[str] = None, n: Optional[int] = None, x_threshold: Optional[float] = None, y_threshold: Optional[float] = None, x: Optional[str] = None, y: Optional[str] = None, size: Optional[str] = None, hue: Optional[str] = None, ax: Optional[matplotlib.pyplot.Axes] = None, **kwargs)¶

Highlight the topmost-enriched gene sets. This can be done either globally or per subset, based on a reference column and a dictionary of subsets to highlight. The topmost gene sets are shown in a colored scatterplot.

Note

The other subset fading is only available in the matplotlib backend. In plotly backend other is just another regular subset. However, in this case the subset can simply be turned off in the legend.

Parameters

subsets (dict or function) – The dictionary of subsets to highlight. This requires strings as keys and lists of regex patterns of associated terms as values. In case a function is provided, this function must take exactly one argument (the dataframe) and must return an array-like object suitable as a new dataframe column on which to base the highlighting.
ref_col (str) – The reference column of the dataframe. This is only required if a dictionary of subsets is provided.
n (int) – The number of top gene sets to highlight. If this is provided, then the threshold arguments are ignored.
x_threshold (float) – The threshold for the x-axis. Only gene sets with an x-axis value above this threshold will be highlighted.
y_threshold (float) – The threshold for the y-axis. Only gene sets with a y-axis value above this threshold will be highlighted.
ax (plt.Axes) – The subplot in which to plot. By default a new figure is created. This is ignored in the plotly backend.
**kwargs – Additional keyword arguments.

Returns

fig – The figure object.

Return type

matplotlib.figure.Figure or plotly.graph_objs.Figure
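The interplay of n and the two thresholds can be sketched as follows. This is an illustration of the documented rule (n takes precedence over the thresholds), not the method's implementation. In particular, ranking the "top" gene sets by their y value is an assumption of this example, as are the helper name `select_top` and the column names.

```python
def select_top(rows, x, y, n=None, x_threshold=None, y_threshold=None):
    """Pick the rows to highlight: the top n, or those above the given thresholds."""
    if n is not None:
        # thresholds are ignored when n is given
        # ASSUMPTION: "top" means the largest y values
        return sorted(rows, key=lambda row: row[y], reverse=True)[:n]
    selected = rows
    if x_threshold is not None:
        selected = [row for row in selected if row[x] > x_threshold]
    if y_threshold is not None:
        selected = [row for row in selected if row[y] > y_threshold]
    return selected


# hypothetical enrichment rows
rows = [
    {"Term": "A", "score": 0.8, "logp": 3.0},
    {"Term": "B", "score": 0.2, "logp": 5.0},
    {"Term": "C", "score": 0.9, "logp": 1.0},
]
by_n = [r["Term"] for r in select_top(rows, "score", "logp", n=2, x_threshold=0.5)]
by_threshold = [
    r["Term"]
    for r in select_top(rows, "score", "logp", x_threshold=0.5, y_threshold=2.0)
]
print(by_n, by_threshold)
```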

eco_helper.enrich.visualise.backend = 'matplotlib'¶

The plotting backend to use. This can be either matplotlib or plotly.

eco_helper.enrich.visualise.collection_scatterplot(collection: EnrichmentCollection, x: str, y: str, hue: Optional[str] = None, style: Optional[str] = None, **kwargs)¶

Visualize the enrichment data from the prerank or enrichr dictionaries by a scatterplot, one for each ecotype.

Parameters

collection (EnrichmentCollection) – The EnrichmentCollection to visualize.
x (str) – The column to use as x-axis.
y (str) – The column to use as y-axis.
hue (str) – The column to use as hue.
style (str) – The column to use as style.
**kwargs – Additional keyword arguments to pass to seaborn.scatterplot.

Returns

fig – The figure object.

Return type

matplotlib.figure.Figure or plotly.graph_objs.Figure

eco_helper.enrich.visualise.scatterplot(df: pandas.DataFrame, x: str, y: str, hue: Optional[str] = None, style: Optional[str] = None, **kwargs)¶

Visualize the enrichment data from the prerank or enrichr dictionaries by a scatterplot.

Parameters

df (pd.DataFrame) – The dataframe to visualize.
x (str) – The column to use as x-axis.
y (str) – The column to use as y-axis.
hue (str) – The column to use as hue.
style (str) – The column to use as style.
**kwargs – Additional keyword arguments to pass to seaborn.scatterplot.

Returns

fig – The figure object.

Return type

matplotlib.figure.Figure