Data Normalisation¶
This module provides the functionality to normalise raw counts data to TPM or CPM.
It is the core of the eco_helper normalise command.
Note
This subcommand exclusively works with TSV (tab-delimited) formatted files!
Warning
Depending on the size of the expression matrix, this can be very memory-intensive! If working on a cluster, make sure to use a node with enough memory.
Usage¶
>>> eco_helper normalise <norm> [--lengths <lengths>] [--gtf <gtf>] [--names] [--output <output>] <input>
where <norm> is the kind of normalisation to perform, which can be either tpm or cpm, <input> is the input file, and <output> is the output file. By default the normalised data is written to a file named like the input file with an added suffix (either '.tpm' or '.cpm').
For tpm, the lengths of the transcripts must also be provided.
This can be done using the --lengths option.
It is possible to provide a GTF file to the --gtf option, instead of a lengths file.
In this case eco_helper will use gtftools to extract the lengths and use the merged transcript length.
Also, using the --names option the gene names (symbols) can be used instead of gene ids.
Full CLI¶
The full command line of eco_helper normalise with all options is as follows:
usage: eco_helper normalise [-h] [-o OUTPUT] [-l LENGTHS] [-g GTF] [-n]
[-d DIGITS] [-log]
{tpm,cpm} input
Normalise raw counts data to TPM or CPM.
positional arguments:
{tpm,cpm} The type of normalisation to perform. Can be either
'tpm' or 'cpm'.
input Input file.
options:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
Output file. By default the same as the input with
added suffix. The suffix will be either '.cpm' or
'.tpm'
-l LENGTHS, --lengths LENGTHS
Lengths file. If not provided, the lengths will be
extracted from the GTF file.
-g GTF, --gtf GTF Reference GTF file for transcript lengths and/or gene
names.
-n, --names Use gene names instead of gene ids. This will replace
the gene ids (index) in the expression matrix and
lengths file with gene symbols from the GTF file or
lengths file (if provided). Note, if a length file is
provided then it must include gene names in the second
column!
-d DIGITS, --digits DIGITS
The number of digits to round the values to.
-log, --logscale Use this to log-scale the normalised values.
Using Gene Names¶
A note on using gene names instead of gene ids: By default eco_helper assumes that your input data uses Ensembl gene IDs as primary identifiers in the first column of your expression matrix.
Therefore, it will use the gene ids as index for extracted lengths and names from a given GTF file. If your data, however, works with gene names (symbols) instead of ids, then no (tpm) normalisation will be possible
because eco_helper is trying to match lengths to genes using their index. If the data works on gene names and the lengths on gene ids then no overlap will be found! To prevent this, use the --swap option when computing
new lengths to swap the gene ids (first column that is used for matching when normalising) with the gene names (second column). If you wish to use the second column for your normalised output file, then use the --names option.
Examples:
>>> eco_helper normalise tpm --swap --gtf gencode.GTF --output my_normalised.tsv my_data.tsv
In the above example, my_data.tsv uses gene names (symbols) instead of gene ids. Therefore, when computing new lengths from the GTF file we specify --swap to make sure the lengths file and our data will match. However, my_normalised.tsv will use gene ids as index in the first column.
>>> eco_helper normalise tpm --names --gtf gencode.GTF --output my_normalised.tsv my_data.tsv
In this second example, my_data.tsv uses gene ids as identifiers. Therefore, we do not need to swap the ids and names when computing the lengths. However, my_normalised.tsv will use gene names as index in the first column instead of the gene ids because we specified --names.
The procedure also works backwards, of course. We can use --swap and --names to start with a data file using gene names and generate one that uses gene ids.
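The matching problem described in this section can be pictured with a small pandas sketch. It is only an illustration under assumptions: the toy tables and the column names gene_id, gene_name and length are hypothetical, and eco_helper's own matching code may work differently.

```python
import pandas as pd

# Hypothetical toy data: the counts are indexed by gene names ...
counts = pd.DataFrame({"sample_1": [10, 30]}, index=["TP53", "BRCA1"])

# ... while the lengths are indexed by gene ids, with names in a second column.
lengths = pd.DataFrame(
    {"gene_name": ["TP53", "BRCA1"], "length": [2500, 7000]},
    index=["ENSG00000141510", "ENSG00000012048"],
)

# Matching on the index finds nothing, because names never equal ids.
overlap_ids = counts.index.intersection(lengths.index)

# Swapping ids and names in the lengths table makes the two indices comparable.
swapped = lengths.rename_axis("gene_id").reset_index().set_index("gene_name")
overlap_names = counts.index.intersection(swapped.index)

print(len(overlap_ids), len(overlap_names))
```

Here the first intersection is empty and the second contains both genes, which is the situation that swapping the first two columns of the lengths file is meant to resolve.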
eco_helper.normalise.NormTable module¶
A class to read raw counts data from an expression matrix file and normalise the data.
This is the main class of the eco_helper normalise command that will be called by the CLI.
- class eco_helper.normalise.NormTable.NormTable(filename: str, **kwargs)¶
Bases: object
A class to read raw counts from an expression matrix file and normalise the data to either tpm or cpm.
- Parameters
filename (str) – The input count table. By default this is assumed to be tab-delimited. Pass a sep argument to specify a different separator.
- adopt_name_index()¶
Adopts the extracted name column of the lengths dataframe as the new dataframe index for both the lengths and counts data.
Note
This will only affect the raw and final counts (normalized), but it will not affect the original counts!
- property counts¶
Returns the raw or TPM counts (if normalisation was performed).
Note
These are restricted to genes that have a corresponding length in the lengths file, since TPM conversion is only possible for those. If you wish to access the original (pre-filtered) data, use the raw_data attribute instead.
- Returns
counts – The raw or TPM counts.
- Return type
np.ndarray
- get()¶
Returns the table of normalized values.
- Returns
The table.
- Return type
pandas.DataFrame
- get_lengths()¶
Returns the lengths of the features.
- Returns
lengths – The lengths of the features.
- Return type
pandas.Series or None
- property ids: numpy.ndarray¶
Returns the IDs of the features.
- Returns
ids – The IDs of the features.
- Return type
np.ndarray
- property lengths: numpy.ndarray¶
Returns the lengths of the features.
- Returns
lengths – The lengths of the features.
- Return type
numpy.ndarray or None
- memorize()¶
Store the original (pre-filtered) and raw (unnormalised) counts.
- property names: numpy.ndarray¶
Returns the names of the features.
- Returns
names – The names of the features.
- Return type
np.ndarray
- property raw_counts¶
Returns the raw counts.
Note
These are restricted to genes that have a corresponding length in the lengths file, since TPM conversion is only possible for those. If you wish to access the original (pre-filtered) data, use the raw_data attribute instead.
- Returns
counts – The raw counts.
- Return type
np.ndarray
- property raw_data¶
Returns the originally provided counts data.
Note
This contains all the provided genes and their counts, including those for which no lengths are available.
- Returns
raw_data – The originally provided counts data.
- Return type
pandas.DataFrame
- read(filename: str, sep: str = '\t', **kwargs) pandas.DataFrame¶
Reads a table from a file.
- Parameters
filename (str) – The input file.
sep (str, optional) – The separator of the table. The default is '\t' (tab).
- Returns
df – The table.
- Return type
pandas.DataFrame
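As a rough picture of what reading such a table involves, the following sketch loads a tab-delimited count table with pandas. It does not call eco_helper; using pandas.read_csv and treating the first column as the index are assumptions made for illustration, as is the toy table itself.

```python
import io
import pandas as pd

# Hypothetical tab-delimited count table: gene ids in the first column,
# one column of raw counts per sample.
tsv_text = "gene_id\tsample_1\tsample_2\nENSG01\t10\t0\nENSG02\t30\t5\n"

# sep="\t" mirrors the documented default separator;
# index_col=0 (gene ids as index) is an assumption of this sketch.
table = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
print(table)
```

For a file that is not tab-delimited, the same call with a different sep value (for example ",") would be the equivalent of passing a sep argument as described above.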
- round(digits)¶
Round tpm values to a given number of digits.
- Parameters
digits (int) – The number of digits to round to.
- save(filename: str, use_names: bool = False)¶
Saves the table to a file.
- Parameters
filename (str) – The output file.
use_names (bool) – Save the file with gene_names instead of gene_ids in the first column.
- set_lengths(filename: str, which: Optional[str] = None, id_col: Optional[str] = None, name_col: Optional[str] = None, **kwargs)¶
Sets the lengths of the features.
- Parameters
filename (str) – The file containing the lengths of the features.
which (str, optional) – The column name of the lengths. The default is None (in which case the last column is used).
id_col (str, optional) – The column name of the IDs. The default is None (in which case the first column is used).
name_col (str, optional) – The column name of the (gene) names. The default is None (in which case the second column is used). Note, even if your datafile does not specify gene names, a “name column” will still be extracted. However, you can choose not to include that column later when saving the TPM-converted file.
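The column defaults described for set_lengths (first column for IDs, second for names, last for lengths) can be sketched as follows. The helper pick_columns and the toy lengths table, including its column names, are hypothetical and not part of eco_helper.

```python
from typing import Optional

import pandas as pd


def pick_columns(
    df: pd.DataFrame,
    which: Optional[str] = None,
    id_col: Optional[str] = None,
    name_col: Optional[str] = None,
) -> pd.DataFrame:
    """Select id, name and length columns, falling back to positional defaults."""
    id_col = df.columns[0] if id_col is None else id_col        # first column
    name_col = df.columns[1] if name_col is None else name_col  # second column
    which = df.columns[-1] if which is None else which          # last column
    return df[[id_col, name_col, which]]


# Hypothetical lengths table with several length estimates per gene.
lengths_table = pd.DataFrame(
    {
        "gene": ["ENSG01", "ENSG02"],
        "name": ["ABC1", "XYZ2"],
        "mean": [1500, 2200],
        "merged": [1800, 2600],
    }
)

picked = pick_columns(lengths_table)
explicit = pick_columns(lengths_table, which="mean")
print(list(picked.columns), list(explicit.columns))
```

With no arguments the last column is taken as the length, while passing which selects a specific length column by name.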
- to_cpm(digits: int = 5, log: bool = False)¶
Normalise the raw counts to CPM.
- Parameters
digits (int, optional) – The number of digits to round to. The default is 5.
log (bool, optional) – If True, the CPM values are logarithmically scaled. The default is False.
- to_tpm(digits: int = 5, log: bool = False)¶
Normalise the raw counts to TPM.
- Parameters
digits (int, optional) – The number of digits to round to. The default is 5.
log (bool, optional) – If True, the TPM values are logarithmically scaled. The default is False.
eco_helper.normalise.funcs module¶
These are core functions of the normalise submodule.
- eco_helper.normalise.funcs.add_gtf_gene_names(filename: str, outfile: str, swap_ids_and_names: bool = False, **kwargs)¶
Adds the gene names to the GTF file.
- Parameters
filename (str) – The input GTF file.
outfile (str) – The output file.
swap_ids_and_names (bool, optional) – Whether to swap the IDs and names. The default is False. If True then the Ids (1st column) and names (2nd column by default) will be swapped so that names are the 1st column and IDs are the 2nd column.
- eco_helper.normalise.funcs.array_to_cpm(array: numpy.ndarray, log: bool = True)¶
Convert raw counts to CPM.
- Parameters
array (np.ndarray) – The raw counts. As a 2D ndarray.
log (bool, optional) – Whether to use log-scale. The default is True.
- Returns
The CPM values.
- Return type
np.ndarray
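For orientation, the usual definition of CPM (counts per million) can be sketched with NumPy as below. This is not the eco_helper implementation: the function name cpm_sketch is hypothetical, the layout (genes as rows, samples as columns) is an assumption, and the log option is left out because this page does not state which base or pseudocount is used.

```python
import numpy as np


def cpm_sketch(counts: np.ndarray) -> np.ndarray:
    """Scale each sample (column) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    # Total number of counts per sample (the library size).
    library_sizes = counts.sum(axis=0, keepdims=True)
    return counts / library_sizes * 1e6


# Hypothetical raw counts: 3 genes (rows) x 2 samples (columns).
raw = np.array([[10, 0], [30, 5], [60, 15]])
cpm = cpm_sketch(raw)
print(cpm)
```

A sample whose counts sum to zero would produce NaN values in this sketch, so real data may need such columns to be handled separately.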
- eco_helper.normalise.funcs.array_to_tpm(array: numpy.ndarray, lengths: numpy.ndarray, log: bool = False)¶
Convert raw counts to TPM.
- Parameters
array (np.ndarray) – The raw counts. As a 2D ndarray.
lengths (np.ndarray) – The lengths of the features. As a 1D ndarray.
log (bool, optional) – Whether to use log-scale. The default is False.
- Returns
The TPM values.
- Return type
np.ndarray
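The usual definition of TPM (transcripts per million) first divides the counts by the feature length and then rescales each sample to one million. The sketch below follows that definition; it is not taken from eco_helper, the name tpm_sketch is hypothetical, and it assumes genes as rows, samples as columns, and lengths given in bases. Log-scaling is omitted because its exact form is not documented here.

```python
import numpy as np


def tpm_sketch(counts: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Length-normalise the counts, then scale each sample (column) to one million."""
    counts = np.asarray(counts, dtype=float)
    # Lengths in kilobases, shaped as a column so they broadcast across samples.
    lengths_kb = np.asarray(lengths, dtype=float)[:, np.newaxis] / 1000.0
    rate = counts / lengths_kb  # counts per kilobase
    return rate / rate.sum(axis=0, keepdims=True) * 1e6


# Hypothetical raw counts (3 genes x 2 samples) and gene lengths in bases.
raw = np.array([[10, 0], [30, 5], [60, 15]])
lengths = np.array([1000, 3000, 2000])
tpm = tpm_sketch(raw, lengths)
print(tpm)
```

In the first sample the per-kilobase rates are 10, 10 and 30, so the first two genes end up with equal TPM values even though their raw counts differ by a factor of three.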
- eco_helper.normalise.funcs.call_gtftools(filename: str, output: str, mode: str = 'l')¶
Calls gtftools from CLI to perform a computation. By default to calculate lengths.
- Parameters
filename (str) – The input GTF file.
output (str) – The output file.
mode (str, optional) – The mode of the computation. The default is “l”. Any valid gtftools mode is allowed.
- eco_helper.normalise.funcs.match_regex_pattern(pattern: str, df: pandas.DataFrame)¶
Matches a regex pattern to a dataframe using its “attributes” column.
- Parameters
pattern (str) – The regex pattern.
df (pd.DataFrame) – The dataframe.
- Returns
The matched values.
- Return type
list
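A minimal sketch of this kind of matching, assuming the dataframe holds GTF-style attribute strings in a column called "attributes" and that the pattern contains one capture group. The function match_attribute and the toy rows are hypothetical; the actual behaviour of match_regex_pattern (for example how non-matching rows are handled) may differ.

```python
import re

import pandas as pd


def match_attribute(pattern: str, df: pd.DataFrame) -> list:
    """Return the first capture group per row of df["attributes"], or None if no match."""
    compiled = re.compile(pattern)
    matches = []
    for attributes in df["attributes"]:
        found = compiled.search(attributes)
        matches.append(found.group(1) if found else None)
    return matches


# Hypothetical rows resembling the attributes field of a GTF file.
gtf_rows = pd.DataFrame(
    {
        "attributes": [
            'gene_id "ENSG01"; gene_name "ABC1";',
            'gene_id "ENSG02"; gene_name "XYZ2";',
            'gene_id "ENSG03";',
        ]
    }
)

ids = match_attribute(r'gene_id "([^"]+)"', gtf_rows)
names = match_attribute(r'gene_name "([^"]+)"', gtf_rows)
print(ids, names)
```

The third row has no gene_name attribute, so this sketch returns None for it rather than dropping the row, which keeps the ids and names lists aligned.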
- eco_helper.normalise.funcs.round_values(values: numpy.ndarray, digits: int = 5)¶
Rounds an array of values to a certain number of digits.
- Parameters
values (np.ndarray) – The raw values.
digits (int, optional) – The number of digits to round to. The default is 5.
- Returns
The rounded values.
- Return type
np.ndarray