Data Normalisation¶

This module provides the functionality to normalise raw counts data to TPM or CPM. It is the core of the eco_helper normalise command.

Note

This subcommand exclusively works with TSV (tab-delimited) formatted files!

Warning

Depending on the size of the expression matrix, this can be very memory consuming! If working on a cluster, make sure to use a node with enough memory.

Usage¶

>>> eco_helper normalise <norm> [--lengths <lengths>] [--gtf <gtf>] [--names] [--output <output>] <input>

where <norm> is the kind of normalisation to perform (either tpm or cpm), <input> is the input file, and <output> is the output file. By default, the normalised data is written to a file named after the input file with an added suffix (.tpm or .cpm). For tpm, the lengths of the transcripts must also be provided, which can be done using the --lengths option.

It is possible to provide a GTF file to the --gtf option, instead of a lengths file. In this case eco_helper will use gtftools to extract the lengths and use the merged transcript length. Also, using the --names option the gene names (symbols) can be used instead of gene ids.

Full CLI¶

The full command line of eco_helper normalise with all options is as follows:

usage: eco_helper normalise [-h] [-o OUTPUT] [-l LENGTHS] [-g GTF] [-n]
                        [-d DIGITS] [-log]
                        {tpm,cpm} input

Normalise raw counts data to TPM or CPM.

positional arguments:
{tpm,cpm}             The type of normalisation to perform. Can be either
                        'tpm' or 'cpm'.
input                 Input file.

options:
-h, --help            show this help message and exit
-o OUTPUT, --output OUTPUT
                        Output file. By default the same as the input with
                        added suffix. The suffix will be either '.cpm' or
                        '.tpm'
-l LENGTHS, --lengths LENGTHS
                        Lengths file. If not provided, the lengths will be
                        extracted from the GTF file.
-g GTF, --gtf GTF     Reference GTF file for transcript lengths and/or gene
                        names.
-n, --names           Use gene names instead of gene ids. This will replace
                        the gene ids (index) in the expression matrix and
                        lengths file with gene symbols from the GTF file or
                        lengths file (if provided). Note, if a length file is
                        provided then it must include gene names in the second
                        column!
-d DIGITS, --digits DIGITS
                        The number of digits to round the values to.
-log, --logscale      Use this to log-scale the normalised values.

Using Gene Names¶

A note on using gene names instead of gene ids: by default, eco_helper assumes that your input data uses Ensembl gene ids as primary identifiers in the first column of your expression matrix. It therefore uses the gene ids as the index for the lengths and names extracted from a given GTF file. If your data instead uses gene names (symbols), then no (tpm) normalisation is possible out of the box, because eco_helper matches lengths to genes through their index: if the data is indexed by gene names and the lengths by gene ids, no overlap will be found. To prevent this, use the --swap option when computing new lengths to swap the gene ids (first column, which is used for matching when normalising) with the gene names (second column). If you wish to use the second column as the identifier in your normalised output file, use the --names option.

Examples:
>>> eco_helper normalise tpm --swap --gtf gencode.GTF --output my_normalised.tsv my_data.tsv
In the above example, my_data.tsv uses gene names (symbols) instead of gene ids. Therefore, when computing new lengths from the GTF file, we specify --swap to make sure the lengths file and our data will match. However, my_normalised.tsv will use gene ids as index in the first column.
>>> eco_helper normalise tpm --names --gtf gencode.GTF --output my_normalised.tsv my_data.tsv
In this second example, my_data.tsv uses gene ids as identifiers. Therefore, we do not need to swap the ids and names when computing the lengths. However, my_normalised.tsv will use gene names as index in the first column instead of the gene ids, because we specified --names.

The procedure also works backwards, of course. We can use --swap and --names to start with a data file using gene names and generate one that uses gene ids.
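The index mismatch described above can be illustrated with plain pandas. This is only a minimal sketch with made-up toy identifiers and lengths (`ENSG_A`, `GeneA`, and so on are placeholders); it does not call eco_helper and only mimics the effect that swapping ids and names has on the overlap between the two tables.

```python
import pandas as pd

# Hypothetical toy data: counts indexed by gene names (symbols) ...
counts = pd.DataFrame(
    {"sample_1": [10, 20], "sample_2": [5, 0]},
    index=["GeneA", "GeneB"],
)
# ... and lengths indexed by gene ids, with names in a second column.
lengths = pd.DataFrame(
    {"gene_name": ["GeneA", "GeneB"], "length": [1000, 2000]},
    index=["ENSG_A", "ENSG_B"],
)

# Matching on the index finds nothing, because names never equal ids.
overlap = counts.index.intersection(lengths.index)
print(len(overlap))

# Swapping ids and names in the lengths table restores the overlap.
swapped = lengths.rename_axis("gene_id").reset_index().set_index("gene_name")
overlap_after_swap = counts.index.intersection(swapped.index)
print(len(overlap_after_swap))
```

Without the swap no gene can be matched to a length, which is why TPM normalisation is not possible in that situation.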

eco_helper.normalise.NormTable module¶

A class to read raw counts data from an expression matrix file and normalise the data. This is the main class of the eco_helper normalise command that will be called by the CLI.

class eco_helper.normalise.NormTable.NormTable(filename: str, **kwargs)¶

Bases: object

A class to read raw counts from an expression matrix file and normalise the data to either tpm or cpm.

Parameters: filename (str) – The input count table. By default this is assumed to be tab-delimited. Pass a sep argument to specify a different separator.

adopt_name_index()¶: Adopts the extracted name column of the lengths dataframe as the new dataframe index for both the lengths and counts data.

Note

This will only affect the raw and final counts (normalized), but it will not affect the original counts!

property counts¶

Returns the raw or TPM counts (if normalisation was performed).

Note

These are cropped to only those genes that have a corresponding length in the lengths file, and for which TPM conversion could therefore be performed. If you wish to access the original (pre-filtered) data, use the raw_data attribute instead.

Returns: counts – The raw or TPM counts.
Return type: np.ndarray

get()¶

Returns the table of normalized values.

Returns: The table.
Return type: pandas.DataFrame

get_lengths()¶

Returns the lengths of the features.

Returns: lengths – The lengths of the features.
Return type: pandas.Series or None

property ids: numpy.ndarray¶

Returns the IDs of the features.

Returns: ids – The IDs of the features.
Return type: np.ndarray

property lengths: numpy.ndarray¶

Returns the lengths of the features.

Returns: lengths – The lengths of the features.
Return type: numpy.ndarray or None

memorize()¶: Store the original (pre-filtered) and raw (unnormalised) counts.

property names: numpy.ndarray¶

Returns the names of the features.

Returns: names – The names of the features.
Return type: np.ndarray

property raw_counts¶

Returns the raw counts.

Returns: counts – The raw counts.
Return type: np.ndarray

property raw_data¶

Returns the originally provided counts data.

Note

This contains all the provided genes and their counts, including those for which no lengths are available.

Returns: raw_data – The originally provided counts data.
Return type: pandas.DataFrame

read(filename: str, sep: str = '\t', **kwargs) → pandas.DataFrame¶

Reads a table from a file.

Parameters

filename (str) – The input file.
sep (str, optional) – The separator of the table. The default is '\t' (tab).

Returns

df – The table.

Return type

pandas.DataFrame
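As a rough illustration of what such a reader does, the following sketch wraps pandas.read_csv with a tab as the default separator. The function name `read_table` and the toy data are hypothetical; this is not the actual implementation of read and only assumes that extra keyword arguments are passed on to pandas.

```python
import io
import pandas as pd

def read_table(source, sep="\t", **kwargs):
    """Read a delimited table into a DataFrame (hypothetical stand-in for read)."""
    return pd.read_csv(source, sep=sep, **kwargs)

# Toy tab-delimited expression matrix held in memory instead of a file.
tsv = "gene_id\tsample_1\tsample_2\nENSG_A\t10\t5\nENSG_B\t20\t0\n"
df = read_table(io.StringIO(tsv))

# The same data, comma-separated, read by passing a different sep.
csv_df = read_table(io.StringIO(tsv.replace("\t", ",")), sep=",")
```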

round(digits)¶

Round tpm values to a given number of digits.

Parameters: digits (int) – The number of digits to round to.

save(filename: str, use_names: bool = False)¶

Saves the table to a file.

Parameters

filename (str) – The output file.
use_names (bool) – Save the file with gene_names instead of gene_ids in the first column.

set_lengths(filename: str, which: Optional[str] = None, id_col: Optional[str] = None, name_col: Optional[str] = None, **kwargs)¶

Sets the lengths of the features.

Parameters

filename (str) – The file containing the lengths of the features.
which (str, optional) – The column name of the lengths. The default is None (in which case the last column is used).
id_col (str, optional) – The column name of the IDs. The default is None (in which case the first column is used).
name_col (str, optional) – The column name of the (gene) names. The default is None (in which case the second column is used). Note that even if your data file does not specify gene names, a “name column” will still be extracted. However, you can choose not to include that column later when saving the TPM-converted file.
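The positional defaults described above (first column for ids, second for names, last for lengths) can be sketched as follows. The helper `pick_length_columns` and the column names in the toy table are hypothetical and only illustrate the documented fallback behaviour; the real method may do more.

```python
import io
import pandas as pd

def pick_length_columns(df, which=None, id_col=None, name_col=None):
    """Select id, name and length columns using the documented positional defaults."""
    id_col = df.columns[0] if id_col is None else id_col        # first column
    name_col = df.columns[1] if name_col is None else name_col  # second column
    which = df.columns[-1] if which is None else which          # last column
    return df[id_col], df[name_col], df[which]

# Toy lengths table with two candidate length columns (made-up values).
tsv = (
    "gene_id\tgene_name\tmean\tmerged\n"
    "ENSG_A\tGeneA\t900\t1000\n"
    "ENSG_B\tGeneB\t1800\t2000\n"
)
table = pd.read_csv(io.StringIO(tsv), sep="\t")

ids, names, lengths = pick_length_columns(table)            # uses the last column
_, _, mean_lengths = pick_length_columns(table, which="mean")
```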

to_cpm(digits: int = 5, log: bool = False)¶

Normalise the raw counts to CPM.

Parameters

digits (int, optional) – The number of digits to round to. The default is 5.
log (bool, optional) – If True, the CPM values are logarithmically scaled. The default is False.

to_tpm(digits: int = 5, log: bool = False)¶

Normalise the raw counts to TPM.

Parameters

digits (int, optional) – The number of digits to round to. The default is 5.
log (bool, optional) – If True, the TPM values are logarithmically scaled. The default is False.
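Putting the methods above together, a TPM conversion from Python might look like the sketch below. It uses only the constructor and methods documented in this section, but the order of the calls and the wrapper function `counts_to_tpm` are assumptions and are not taken from the eco_helper source.

```python
try:
    from eco_helper.normalise.NormTable import NormTable
except ImportError:  # eco_helper is not installed in this environment
    NormTable = None

def counts_to_tpm(counts_file, lengths_file, output_file, use_names=False):
    """Normalise a counts table to TPM using only the methods documented above."""
    if NormTable is None:
        raise RuntimeError("eco_helper must be installed to run this sketch")
    table = NormTable(counts_file)           # read the raw counts (tab-delimited)
    table.set_lengths(lengths_file)          # ids: 1st column, lengths: last column
    table.to_tpm(digits=5, log=False)        # normalise and round
    table.save(output_file, use_names=use_names)
    return table.get()                       # pandas.DataFrame of normalised values
```

For CPM the set_lengths step should not be needed, since CPM does not depend on feature lengths; in that case to_cpm would replace to_tpm.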

eco_helper.normalise.funcs module¶

These are core functions of the normalise submodule.

eco_helper.normalise.funcs.add_gtf_gene_names(filename: str, outfile: str, swap_ids_and_names: bool = False, **kwargs)¶

Adds the gene names to the GTF file.

Parameters

filename (str) – The input GTF file.
outfile (str) – The output file.
swap_ids_and_names (bool, optional) – Whether to swap the IDs and names. The default is False. If True then the Ids (1st column) and names (2nd column by default) will be swapped so that names are the 1st column and IDs are the 2nd column.

eco_helper.normalise.funcs.array_to_cpm(array: numpy.ndarray, log: bool = True)¶

Convert raw counts to CPM.

Parameters

array (np.ndarray) – The raw counts. As a 2D ndarray.
log (bool, optional) – Whether to use log-scale. The default is True.

Returns

The CPM values.

Return type

np.ndarray
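For reference, CPM scales every sample so that its counts sum to one million. The sketch below shows this with NumPy. It assumes that rows are genes and columns are samples, and that log-scaling means log2 with a pseudocount of 1; both are assumptions, as is the function name `cpm_sketch`, and the actual array_to_cpm may differ.

```python
import numpy as np

def cpm_sketch(counts, log=False):
    """Counts per million: scale each column (sample) to a total of 1e6."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0, keepdims=True)  # total counts per sample
    cpm = counts / library_sizes * 1e6
    if log:
        cpm = np.log2(cpm + 1)  # assumed transform: log2 with pseudocount 1
    return cpm

# Toy matrix: 3 genes (rows) x 2 samples (columns).
counts = np.array([[10, 0], [30, 50], [60, 50]])
cpm = cpm_sketch(counts)
```

A sample whose counts sum to zero would lead to a division by zero here; the sketch does not handle that case.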

eco_helper.normalise.funcs.array_to_tpm(array: numpy.ndarray, lengths: numpy.ndarray, log: bool = False)¶

Convert raw counts to TPM.

Parameters

array (np.ndarray) – The raw counts. As a 2D ndarray.
lengths (np.ndarray) – The lengths of the features. As a 1D ndarray.
log (bool, optional) – Whether to use log-scale. The default is False.

Returns

The TPM values.

Return type

np.ndarray
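TPM first divides each count by the feature length and then scales every sample so that these length-normalised rates sum to one million. The following NumPy sketch illustrates this under the same assumptions as before (rows are genes, columns are samples, log-scaling is log2 with a pseudocount of 1); the function name `tpm_sketch` is hypothetical and the actual array_to_tpm may differ.

```python
import numpy as np

def tpm_sketch(counts, lengths, log=False):
    """Transcripts per million from raw counts and per-feature lengths."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float).reshape(-1, 1)  # one length per row
    rate = counts / lengths                                    # length-normalised counts
    tpm = rate / rate.sum(axis=0, keepdims=True) * 1e6
    if log:
        tpm = np.log2(tpm + 1)  # assumed transform: log2 with pseudocount 1
    return tpm

# Toy data: 2 genes x 2 samples, the second gene is twice as long.
counts = np.array([[10, 20], [20, 20]])
lengths = np.array([1000, 2000])
tpm = tpm_sketch(counts, lengths)
```

In the first sample both genes end up with the same TPM value, because the second gene has twice the counts but also twice the length.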

eco_helper.normalise.funcs.call_gtftools(filename: str, output: str, mode: str = 'l')¶

Calls gtftools from CLI to perform a computation. By default to calculate lengths.

Parameters

filename (str) – The input GTF file.
output (str) – The output file.
mode (str, optional) – The mode of the computation. The default is “l”. Any valid gtftools mode is allowed.
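A call like this presumably boils down to assembling a command line and running it as a subprocess. The sketch below only builds the argument list. The argument layout (mode flag, then output file, then input GTF) is an assumption based on the gtftools command-line usage, and the helper `build_gtftools_command` and the file names are hypothetical.

```python
import shlex

def build_gtftools_command(filename, output, mode="l"):
    """Build the argument list for a gtftools call (assumed layout: -<mode> <output> <input>)."""
    return ["gtftools", f"-{mode}", output, filename]

cmd = build_gtftools_command("gencode.gtf", "gene_lengths.tsv")
print(shlex.join(cmd))
# To actually run it (requires gtftools on the PATH):
#   subprocess.run(cmd, check=True)
```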

eco_helper.normalise.funcs.match_regex_pattern(pattern: str, df: pandas.DataFrame)¶

Matches a regex pattern to a dataframe using its “attributes” column.

Parameters

pattern (str) – The regex pattern.
df (pd.DataFrame) – The dataframe.

Returns

The matched values.

Return type

list
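As an illustration, gene names and ids can be pulled out of a GTF attributes column with a capture group. This is a minimal sketch, assuming that the pattern contains one capture group and that rows without a match yield None; the function name `match_attribute` and the toy rows are hypothetical and the real function may behave differently.

```python
import re
import pandas as pd

def match_attribute(pattern, df):
    """Return the first capture group of `pattern` for each row's attributes, or None."""
    compiled = re.compile(pattern)
    matches = []
    for attributes in df["attributes"]:
        found = compiled.search(attributes)
        matches.append(found.group(1) if found else None)
    return matches

# Toy GTF-like rows; the second one has no gene_name attribute.
gtf = pd.DataFrame({"attributes": [
    'gene_id "ENSG_A"; gene_name "GeneA";',
    'gene_id "ENSG_B";',
]})
names = match_attribute(r'gene_name "([^"]+)"', gtf)
ids = match_attribute(r'gene_id "([^"]+)"', gtf)
```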

eco_helper.normalise.funcs.round_values(values: numpy.ndarray, digits: int = 5)¶

Rounds an array of values to a certain number of digits.

Parameters

values (np.ndarray) – The raw values.
digits (int, optional) – The number of digits to round to. The default is 5.

Returns

The rounded values.

Return type

np.ndarray