Data Re-Formatting¶
This module provides the functionality to re-format data columns and/or rows by regex-based substitutions.
It is the core of the eco_helper format command.
Usage¶
>>> eco_helper format [--index] [--names] [--columns <columns>] [--output <output>] [--pseudo] [--format <format>] <input>
where <input> is the input file and <output> is the output file. By default the reformatted data will be written to the same file as the input file.
Using the --index option, the data index will be re-formatted.
Using the --names option, the data headers (column names) will be re-formatted.
<columns> can be any number of specific columns present within the input file.
The --pseudo option can be used to exclusively re-format the index and column headers of a data file.
In this case only these parts of the file will be read without loading the entire data and thus saving memory.
This is intended for large data files that would otherwise consume large resources or time.
Note that when specifying --pseudo, other options such as --index or --columns are ignored.
A file specifying a Python dictionary of regex patterns can be passed to the --format option.
Alternatively, eco_helper offers the “EcoTyper” format, which simply replaces “-” with “.” and “ ” (space) with “_”.
To use this simply pass --format EcoTyper.
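As a minimal sketch of what such a substitution dictionary does, the following applies each pattern in turn with Python's `re.sub`. The helper name `apply_formats` and the example labels are hypothetical and not part of eco_helper; only the EcoTyper mapping itself is taken from this page.

```python
import re

# The EcoTyper mapping as documented: "-" becomes "." and " " becomes "_".
ECOTYPER = {" ": "_", "-": "."}


def apply_formats(label: str, formats: dict) -> str:
    """Apply every regex pattern -> replacement pair to a single label.

    Hypothetical helper for illustration; eco_helper's internals may differ.
    """
    for pattern, replacement in formats.items():
        label = re.sub(pattern, replacement, label)
    return label


labels = ["HLA-DRA", "T cell", "CD8 T-cell"]
print([apply_formats(label, ECOTYPER) for label in labels])
# → ['HLA.DRA', 'T_cell', 'CD8_T.cell']
```

Because the keys are treated as regular expressions, characters with a special regex meaning (such as "." or "+") would need escaping if they are meant literally.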
Full CLI¶
The full command line of eco_helper format with all options is as follows:
usage: eco_helper format [-h] [-o OUTPUT] [-f FORMAT] [-s SUFFIX] [-i]
                         [-iname INDEXNAME] [-noid] [-n]
                         [-c COLUMNS [COLUMNS ...]] [-p] [-sep SEPARATOR] [-e]
                         [-ee] [-a]
                         input

Fix annotations in columns, index, and column names of tabular data files such
as expression matrices and annotation files.

positional arguments:
  input                 The input file.

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        The output path. By default the file is saved to the
                        same path it was read from (thereby overwriting the
                        previous one!).
  -f FORMAT, --format FORMAT
                        A file specifying a dictionary of regex patterns for
                        replacement.
  -s SUFFIX, --suffix SUFFIX
                        A suffix to add to the output file. This will not
                        affect the file format and only serves to add
                        additional information to the filename.
  -i, --index           Use this if the index should be re-formatted.
  -iname INDEXNAME, --indexname INDEXNAME
                        Use this to specify a name which the index should be
                        given a name in the output file. Since the index is
                        turned to a regular data column, the (replacement)
                        index will not be written anymore.
  -noid, --noindex      Use this if the index should not be written to the
                        output file.
  -n, --names           Use this if the column names (headers) should be re-
                        formatted.
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Specify any number of columns within the annotation
                        file to reformat values in.
  -p, --pseudo          Use this to only pseudo-read the given file. This is
                        useful when the datafiles are very large to save
                        memory.
  -sep SEPARATOR, --separator SEPARATOR
                        Use this to specify the separator to use when reading
                        the file. By default, the separator is guessed from
                        the file extension. Otherwise `tsv` (for tab), `csv`
                        (for comma), or `txt` (for space) can be specified.
  -e, --expression      A preset for expression matrices equivalent to '--
                        index --names --pseudo'
  -ee, --ecoexpression  A preset for EcoTyper expression matrices equivalent
                        to '--index --names --pseudo --format EcoTyper'
  -a, --annotation      A preset for EcoTyper annotation files corresponding
                        to '--index --indexname ID --columns CellType Sample
                        --format EcoTyper'
eco_helper.format.Formatter module¶
The main class that handles re-formatting input datafile index, columns, and headers.
- class eco_helper.format.Formatter.Formatter(formats: Optional[dict] = None)¶
Bases: object
A class to read tabular data files and re-format the index, column names (headers), and/or specific columns using regex substitution.
- Parameters
formats (dict) – A dictionary of invalid characters which must be replaced with a valid character. By default the pre-set EcoTyper dictionary is used.
- get()¶
Returns the dataframe (actual or pseudo).
- index_to_column(colname: str)¶
Converts the index of the dataframe to a column.
- Parameters
colname (str) – The name of the column to store the index.
- read_table(filename: str, sep: Optional[str] = None, pseudo: bool = False, **kwargs)¶
Reads a tabular data file.
- Parameters
filename (str) – The name of the tabular data file.
sep (str) – The tabular format whose separator to use. If not provided, the separator will be guessed from the file extension. Otherwise, csv (comma only), tsv, and txt can be used.
pseudo (bool) – If True, only the index and column names are read into a PseudoDataFrame. Since this does NOT store any actual data, only the index and column names can be re-formatted in this case!
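The separator guessing described for read_table can be sketched roughly as follows. The mapping of format names to separators comes from the CLI help for -sep/--separator; the function name `guess_separator` and the choice to raise an error for unknown extensions are assumptions made for illustration, since this page does not say how eco_helper handles that case.

```python
from pathlib import Path
from typing import Optional

# Format names and separators as listed in the CLI help for -sep/--separator.
SEPARATORS = {"tsv": "\t", "csv": ",", "txt": " "}


def guess_separator(filename: str, sep: Optional[str] = None) -> str:
    """Return the separator for an explicit format name or the file extension.

    Hypothetical helper; raising on unknown formats is an assumption.
    """
    key = sep if sep is not None else Path(filename).suffix.lstrip(".")
    try:
        return SEPARATORS[key.lower()]
    except KeyError:
        raise ValueError(f"Cannot infer a separator from {key!r}") from None


print(repr(guess_separator("expression.tsv")))         # guessed from the extension
print(repr(guess_separator("matrix.dat", sep="csv")))  # explicit format name wins
```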
- reformat(index: bool, names: bool, columns: list)¶
Reformats parts of the read dataframe.
- write_table(filename: str, suffix: Optional[str] = None, **kwargs)¶
Writes the dataframe to a tabular data file.
- Parameters
filename (str) – The name of the output file.
suffix (str) – Any suffix to add to the filename. This will NOT affect the data format in any way. For instance, filename = "myfile.tsv" and suffix = ".gz" will result in a file named myfile.tsv.gz, but it will still be a regular tsv file, and not a compressed one! It is important to either include a tabular format such as .tsv in the filename or specify a sep argument through the kwargs in order to be able to save the file properly.
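To illustrate the caveat about suffixes, here is a small standalone sketch that does not use eco_helper itself. It assumes the suffix is simply appended to the filename (as the myfile.tsv.gz example suggests) and checks that the resulting file is plain text rather than gzip data. The helper `with_suffix` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def with_suffix(filename: str, suffix: Optional[str] = None) -> str:
    """Append the suffix to the filename unchanged (assumed behaviour)."""
    return filename if suffix is None else filename + suffix


with tempfile.TemporaryDirectory() as tmp:
    target = with_suffix(str(Path(tmp) / "myfile.tsv"), ".gz")
    # Write ordinary tab-separated text; nothing here compresses the data.
    Path(target).write_text("gene\tsample_1\nA1BG\t3\n")
    # Real gzip files start with the two magic bytes 0x1f 0x8b.
    is_gzip = Path(target).read_bytes()[:2] == b"\x1f\x8b"
    name = Path(target).name

print(name, is_gzip)
# → myfile.tsv.gz False
```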
eco_helper.format.funcs module¶
These are core functions to reformat datafiles.
- eco_helper.format.funcs.read_formats_file(filename: str) → dict¶
Reads a file containing a dictionary of invalid characters to valid characters.
- Parameters
filename (str) – The name of the file containing the dictionary.
- Returns
A dictionary of invalid characters to valid characters.
- Return type
dict
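The exact syntax that read_formats_file expects is not specified on this page. As a sketch under the assumption that the file contains a single Python dictionary literal, such a file could be parsed safely with `ast.literal_eval`. The function name `load_formats`, the file name, and the example patterns are hypothetical.

```python
import ast
import re
import tempfile
from pathlib import Path


def load_formats(filename: str) -> dict:
    """Parse a file whose entire content is a Python dict literal.

    Hypothetical stand-in for illustration, not eco_helper's implementation.
    """
    formats = ast.literal_eval(Path(filename).read_text())
    if not isinstance(formats, dict):
        raise ValueError("The formats file must contain a dictionary.")
    for pattern in formats:
        re.compile(pattern)  # fail early if a key is not a valid regex
    return formats


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_formats.txt"
    # Example content: collapse whitespace runs to "_" and map "-" or "/" to ".".
    path.write_text(r"{r'\s+': '_', r'[-/]': '.'}")
    formats = load_formats(str(path))

print(formats)
```

Using `ast.literal_eval` rather than `eval` means only literals are accepted, so a formats file cannot execute arbitrary code.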
eco_helper.format.Pseudo module¶
These are classes to pseudo-read large data files only by their first column (index) and their first line (column headers). These classes are intended to work with the core Formatter class and imitate the file reading and workflow of a real pandas dataframe, without actually reading or storing data.
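The idea of pseudo-reading can be sketched without pandas: iterate over the file line by line, keep the first line as column names, and keep only the text before the first separator of every following line. This is an illustrative sketch, not the PseudoDataFrame implementation; the function name `pseudo_read` is hypothetical, and it assumes the header is on the first line with no comment lines.

```python
import io


def pseudo_read(handle, sep="\t"):
    """Collect header names and first-column values without keeping the data."""
    header = next(handle).rstrip("\n").split(sep)
    # Only the text before the first separator of each row is retained.
    index = [line.split(sep, 1)[0].rstrip("\n") for line in handle if line.strip()]
    return header, index


# An in-memory stand-in for a large tab-separated expression matrix.
table = io.StringIO("gene\tsample-1\tsample-2\nHLA-DRA\t5\t7\nCD8 A\t0\t2\n")
columns, index = pseudo_read(table)
print(columns)  # → ['gene', 'sample-1', 'sample-2']
print(index)    # → ['HLA-DRA', 'CD8 A']
```

Each line is discarded right after its first field is extracted, so memory use grows with the number of rows and columns names only, not with the full matrix.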
- class eco_helper.format.Pseudo.Pseudo(*args: Any, **kwargs: Any)¶
Bases: Series
A class to imitate the dataframe columns and indices. It is just a pd.Series, but giving it distinct names makes the code easier to understand.
- class eco_helper.format.Pseudo.PseudoColumns(*args: Any, **kwargs: Any)¶
Bases: Pseudo
A pseudo-columns class to imitate the dataframe columns.
- Parameters
values (iterable) – The names of the columns.
name (str) – The name of the underlying pandas Series (not used).
- class eco_helper.format.Pseudo.PseudoDataFrame(source: str, sep='\t', **kwargs)¶
Bases: object
A class to imitate the methods and attributes of a pandas DataFrame that are relevant for re-formatting, without actually storing any of its data.
- Parameters
source (str) – The data source file to read.
sep (str) – The separator. By default tab.
- read(filename: str, index_col=0, index_has_header=False, sep='\t', **kwargs) → None¶
Read a data source file and get the column (first line) names and the index column.
Note
The column names must be in the first line, no comments may be present in the file!
- Parameters
filename (str) – The path to the data source file.
index_col (int) – The index column. By default the first column.
index_has_header (bool) – Set to True if the index column has a header.
sep (str) – The separator. By default tab.
- replace_delims()¶
Removes any whitespace characters used for delimitation from the index and columns.
- to_csv(filename: Optional[str] = None, sep='\t', **kwargs)¶
Write the edited column names and indices to a csv file.
- Parameters
filename (str) – The path to the output file.
sep (str) – The separator. By default tab.
eco_helper.format.formats¶
Pre-defined substitution formats.
- eco_helper.format.formats.EcoTyper = {' ': '_', '-': '.'}¶
The default substitutions to make data headers and index conform to EcoTyper requirements.
- eco_helper.format.formats.available_formats = {'EcoTyper': {' ': '_', '-': '.'}}¶
The available formats to use when reformatting data.