Data Re-Formatting

This module provides the functionality to re-format data columns and/or rows by regex-based substitutions. It is the core of the eco_helper format command.

Usage

>>> eco_helper format [--index] [--names] [--columns <columns>] [--output <output>] [--pseudo] [--format <format>] <input>

where <input> is the input file and <output> is the output file. By default the reformatted data will be written to the same file as the input file. Using the --index option, the data index will be re-formatted. Using the --names option, the data headers (column names) will be re-formatted. <columns> can be any number of specific columns present within the input file.

The --pseudo option can be used to re-format only the index and column headers of a data file. In this case only these parts of the file are read, without loading the entire data, which saves memory. This is intended for large data files that would otherwise consume considerable resources or time. Note that when --pseudo is specified, the --columns option is ignored, because no column data is read.

A file specifying a Python dictionary of regex patterns can be passed to the --format option. Alternatively, eco_helper offers the "EcoTyper" format, which simply replaces "-" by "." and " " (space) by "_". To use it, pass --format EcoTyper.
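As a rough illustration of what regex-based substitution with the "EcoTyper" format amounts to, the sketch below applies the mapping documented for eco_helper.format.formats.EcoTyper to a few labels using only the standard re module. The helper apply_formats is hypothetical and not part of eco_helper; the actual implementation may differ.

```python
import re

# The substitutions documented for the "EcoTyper" format.
ECOTYPER = {" ": "_", "-": "."}


def apply_formats(label: str, formats: dict) -> str:
    """Apply each regex pattern -> replacement pair in turn (hypothetical helper)."""
    for pattern, replacement in formats.items():
        label = re.sub(pattern, replacement, label)
    return label


labels = ["T-cell CD8", "Sample-01", "B cell"]
formatted = [apply_formats(label, ECOTYPER) for label in labels]
print(formatted)
```

Because the dictionary keys are treated as regex patterns in this sketch, characters with a special meaning in regular expressions (such as "." or "+") would need to be escaped in a custom formats dictionary.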

Full CLI

The full command line of eco_helper format with all options is as follows:

usage: eco_helper format [-h] [-o OUTPUT] [-f FORMAT] [-s SUFFIX] [-i]
                        [-iname INDEXNAME] [-noid] [-n]
                        [-c COLUMNS [COLUMNS ...]] [-p] [-sep SEPARATOR] [-e]
                        [-ee] [-a]
                        input

Fix annotations in columns, index, and column names of tabular data files such
as expression matrices and annotation files.

positional arguments:
input                 The input file.

options:
-h, --help            show this help message and exit
-o OUTPUT, --output OUTPUT
                        The output path. By default the file is saved to the
                        same path it was read from (thereby overwriting the
                        previous one!).
-f FORMAT, --format FORMAT
                        A file specifying a dictionary of regex patterns for
                        replacement.
-s SUFFIX, --suffix SUFFIX
                        A suffix to add to the output file. This will not
                        affect the file format and only serves to add
                        additional information to the filename.
-i, --index           Use this if the index should be re-formatted.
-iname INDEXNAME, --indexname INDEXNAME
                        Use this to specify a name which the index should be
                        given a name in the output file. Since the index is
                        turned to a regular data column, the (replacement)
                        index will not be written anymore.
-noid, --noindex      Use this if the index should not be written to the
                        output file.
-n, --names           Use this if the column names (headers) should be re-
                        formatted.
-c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Specify any number of columns within the annotation
                        file to reformat values in.
-p, --pseudo          Use this to only pseudo-read the given file. This is
                        useful when the datafiles are very large to save
                        memory.
-sep SEPARATOR, --separator SEPARATOR
                        Use this to specify the separator to use when reading
                        the file. By default, the separator is guessed from
                        the file extension. Otherwise `tsv` (for tab), `csv`
                        (for comma), or `txt` (for space) can be specified.
-e, --expression      A preset for expression matrices equivalent to '--
                        index --names --pseudo'
-ee, --ecoexpression  A preset for EcoTyper expression matrices equivalent
                        to '--index --names --pseudo --format EcoTyper'
-a, --annotation      A preset for EcoTyper annotation files corresponding
                        to '--index --indexname ID --columns CellType Sample
                        --format EcoTyper'

eco_helper.format.Formatter module

The main class that handles re-formatting input datafile index, columns, and headers.

class eco_helper.format.Formatter.Formatter(formats: Optional[dict] = None)

Bases: object

A class to read tabular data files and re-format the index, column names (headers), and/or specific columns using regex substitution.

Parameters

formats (dict) – A dictionary of invalid characters which must be replaced with a valid character. By default the pre-set EcoTyper dictionary is used.

get()

Returns the dataframe (actual or pseudo).

index_to_column(colname: str)

Converts the index of the dataframe to a column.

Parameters

colname (str) – The name of the column to store the index.

read_table(filename: str, sep: Optional[str] = None, pseudo: bool = False, **kwargs)

Reads a tabular data file.

Parameters
  • filename (str) – The name of the tabular data file.

  • sep (str) – The tabular format whose separator to use. If not provided, the separator will be guessed from the file extension. Otherwise, csv (comma only), tsv, and txt can be used.

  • pseudo (bool) – If True, only the index and column names of the file will be read into a PseudoDataFrame. Since this will NOT store any actual data, only the index and column names can be re-formatted in this case!
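The separator lookup described above can be pictured roughly as follows. This is a minimal sketch, assuming the mapping given in the CLI help (tsv for tab, csv for comma, txt for space); guess_separator and SEPARATORS are hypothetical names, and the error handling for unknown extensions is an assumption rather than eco_helper's documented behaviour.

```python
from pathlib import Path
from typing import Optional

# Mapping taken from the CLI help: tsv (tab), csv (comma), txt (space).
SEPARATORS = {"tsv": "\t", "csv": ",", "txt": " "}


def guess_separator(filename: str, sep: Optional[str] = None) -> str:
    """Use the explicit format name if given, else the file extension (hypothetical helper)."""
    key = sep if sep is not None else Path(filename).suffix.lstrip(".")
    try:
        return SEPARATORS[key.lower()]
    except KeyError:
        # Assumed behaviour: refuse to guess for unknown formats.
        raise ValueError(f"Cannot determine a separator for {filename!r}") from None


print(repr(guess_separator("counts.tsv")))
print(repr(guess_separator("table.dat", sep="csv")))
```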

reformat(index: bool, names: bool, columns: list)

Reformats parts of the read dataframe.

write_table(filename: str, suffix: Optional[str] = None, **kwargs)

Writes the dataframe to a tabular data file.

Parameters
  • filename (str) – The name of the output file.

  • suffix (str) – Any suffix to add to the filename. This will NOT affect the data format in any way. For instance, filename = "myfile.tsv" and suffix = ".gz" will result in a file named myfile.tsv.gz, but it will still be a regular tsv file and not a compressed one! To save the file properly, either include a tabular format such as .tsv in the filename or specify a sep argument through the kwargs.

eco_helper.format.funcs module

These are core functions to reformat datafiles.

eco_helper.format.funcs.read_formats_file(filename: str) → dict

Reads a file containing a dictionary of invalid characters to valid characters.

Parameters

filename (str) – The name of the file containing the dictionary.

Returns

A dictionary of invalid characters to valid characters.

Return type

dict
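The exact layout of a formats file is not spelled out here. As a sketch only, assuming the file simply contains a Python dictionary literal, such a file could be parsed safely with ast.literal_eval. The function load_formats below is hypothetical and stands in for, rather than reproduces, read_formats_file.

```python
import ast
import os
import tempfile


def load_formats(filename: str) -> dict:
    """Parse a file holding a Python dict literal (assumed layout, hypothetical helper)."""
    with open(filename) as handle:
        formats = ast.literal_eval(handle.read())
    if not isinstance(formats, dict):
        raise ValueError("The formats file must contain a dictionary")
    return formats


# Write a small example file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("{' ': '_', '-': '.'}\n")

formats = load_formats(tmp.name)
os.remove(tmp.name)
print(formats)
```

ast.literal_eval only evaluates literals, so a formats file cannot execute arbitrary code in this sketch.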

eco_helper.format.Pseudo module

These are classes to pseudo-read large data files only by their first column (index) and their first line (column headers). These classes are intended to work with the core Formatter class and imitate the file reading and workflow of a real pandas dataframe, without actually reading or storing data.

class eco_helper.format.Pseudo.Pseudo(*args: Any, **kwargs: Any)

Bases: Series

A class to imitate the dataframe columns and indices. It is just a pd.Series, but giving the subclasses distinct names makes the code easier to understand.

class eco_helper.format.Pseudo.PseudoColumns(*args: Any, **kwargs: Any)

Bases: Pseudo

A pseudo-columns class to imitate the dataframe columns.

Parameters
  • values (iterable) – The names of the columns.

  • name (str) – The name of the underlying pandas Series (not used).

class eco_helper.format.Pseudo.PseudoDataFrame(source: str, sep='\t', **kwargs)

Bases: object

A class to imitate the methods and attributes of a pandas DataFrame that are relevant for re-formatting, without actually storing any of its data.

Parameters
  • source (str) – The data source file to read.

  • sep (str) – The separator. By default tab.

read(filename: str, index_col=0, index_has_header=False, sep='\t', **kwargs) → None

Read a data source file and get the column (first line) names and the index column.

Note

The column names must be in the first line, no comments may be present in the file!

Parameters
  • filename (str) – The path to the data source file.

  • index_col (int) – The index column. By default the first column.

  • index_has_header (bool) – Set to True if the index column has a header.

  • sep (str) – The separator. By default tab.
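The idea behind pseudo-reading can be sketched with the standard library alone: take the column names from the first line and only the index field from every following line, discarding the rest. The function pseudo_read below is a hypothetical illustration under that assumption, not the PseudoDataFrame implementation; in particular, it keeps the header of the index column as part of the column names.

```python
import io


def pseudo_read(handle, sep: str = "\t", index_col: int = 0):
    """Collect column names (first line) and index values without keeping the data."""
    header = handle.readline().rstrip("\n").split(sep)
    index = []
    for line in handle:
        if line.strip():  # skip blank lines
            index.append(line.rstrip("\n").split(sep)[index_col])
    return header, index


# A tiny in-memory table standing in for a large expression matrix.
table = io.StringIO(
    "gene\tSample-1\tSample-2\n"
    "ACT-B\t1.0\t2.0\n"
    "GAP DH\t3.0\t4.0\n"
)
columns, index = pseudo_read(table)
print(columns)
print(index)
```

Only one line is held in memory at a time, which is why this approach stays cheap even for very large files.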

replace_delims()

Removes any whitespace characters used for delimitation from the index and columns.

to_csv(filename: Optional[str] = None, sep='\t', **kwargs)

Write the edited column names and indices to a csv file.

Parameters
  • filename (str) – The path to the output file.

  • sep (str) – The separator. By default tab.

class eco_helper.format.Pseudo.PseudoIndex(*args: Any, **kwargs: Any)

Bases: Pseudo

A pseudo-index class to imitate the dataframe index.

Parameters
  • values (iterable) – The values of the index.

  • name (str) – The name of the index.

eco_helper.format.formats

Pre-defined substitution formats.

eco_helper.format.formats.EcoTyper = {' ': '_', '-': '.'}

The default substitutions to make data headers and index conform to EcoTyper requirements.

eco_helper.format.formats.available_formats = {'EcoTyper': {' ': '_', '-': '.'}}

The available formats to use when reformatting data.