Module utilities.table_utils
Utility functions for table extraction
Functions
def drop_keys_from_rows(table: T_INT_RESULT, drop_keys: list[Callable[[str], bool]] | None) ‑> ~T_INT_RESULT
-
Drop keys from table rows. Processes the InterpreterKwArgs drop_key_checks list to determine which keys should be removed.
Args
table
:list[dict[str, str]] | SubtableParser
- table rows
drop_keys
:list[Callable[[str], bool]] | None
- list of functions that each key is passed to; if any function returns True, the key is removed
Returns
type[table]
- table rows with any key matched by a drop_keys check removed
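Example (illustrative sketch only, assuming plain list[dict[str, str]] input; the real function also accepts a SubtableParser and is driven by the drop_key_checks setting):

    from typing import Callable

    def _drop_keys_sketch(
        table: list[dict[str, str]],
        drop_keys: list[Callable[[str], bool]] | None,
    ) -> list[dict[str, str]]:
        # Keep only the keys that no drop_keys check flags.
        if not drop_keys:
            return table
        return [
            {k: v for k, v in row.items() if not any(check(k) for check in drop_keys)}
            for row in table
        ]

    rows = [{"Medication": "med 1", "Audit Info": "aud 1"}]
    _drop_keys_sketch(rows, [lambda key: "Audit" in key])
    # [{'Medication': 'med 1'}]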
def find_split_rows(dict_list_rows: dict[str, list[str]], split_table_columns: list[str], min_match_ratio: float = 0.4) ‑> dict[str, list[str]]
-
Detect and correct raw tables created from "nested" table representations.
This function updates dict_list_rows as if the originally processed data was in its unnested form.
Args
dict_list_rows
:dict[str, list[str]]
- dict whose keys are column headers and values are lists of column values.
split_table_columns
:list[str]
- if every value in a table row matches one of these strings, that row is removed and used as the keys for values in the next row.
min_match_ratio
:float
- minimum exact character match ratio required to initiate a row split. Implemented to prevent long text row values from triggering splits, e.g. "attempted to contact patient via Email" should not trigger a split just because "Email" appears in split_table_columns.
Returns
dict[str, list[str]]
- corrected table rows.
Example
>>> dict_list_rows = {
...     "Heading1": ["Value1", "Heading3", "Value3", "Heading5", "Value5"],
...     "Heading2": ["Value2", "Heading4", "Value4", "Heading6", "Value6"]
... }
>>> split_table_columns = ["Heading3", "Heading4", "Heading5", "Heading6"]
>>> find_split_rows(dict_list_rows, split_table_columns)
defaultdict(<class 'list'>, {'Heading1': ['Value1'], 'Heading2': ['Value2'], 'Heading3': ['Value3'], 'Heading4': ['Value4'], 'Heading5': ['Value5'], 'Heading6': ['Value6']})
def listify_list_dict(list_dict: dict[str, list[str]]) ‑> list[dict[str, str]]
-
Inverse of utils.dictify_dict_list. Convert a dict of lists to a list of dicts. Each dict in the list will have keys from the original dict and values from the corresponding index in the original dict's values.
Args
list_dict
:dict[str, list[str]]
- dict whose keys are column headers and values are lists of column values.
Returns
list[dict[str, str]]
- list of dicts with keys from list_dict and values from the corresponding index in the original dict's values.
Example
>>> list_dict = {
...     "column1": ["val1", "val2"],
...     "column2": ["val3", "val4"]
... }
>>> listify_list_dict(list_dict)
[{'column1': 'val1', 'column2': 'val3'}, {'column1': 'val2', 'column2': 'val4'}]
def prepend_heading(prepend_text, this_heading)
-
Helper function for roll_headings and roll_heading_rows.
def roll_heading_rows(values_lists: list[list[str]], split_cols: list[str], min_match_ratio: float = 0.4, debug: bool = False) ‑> list[list[str]]
-
IN-PLACE MANIPULATION OF ARGUMENT "values_lists". Performs the same function as roll_headings but applies the transformation to ALL rows whose values match entries in the split_column_values field in table_specs.
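Hypothetical before/after illustration (the data, the fragment placement, and the exact merge behavior shown below are assumptions; the real merge rules live in prepend_heading):

    values_lists = [
        ["", "Rate/Dose/", "", "Administering", ""],                    # rolled heading fragments
        ["Medication", "Volume", "Action", "User", "Audit"],
        ["medication 1", "dose 1", "action 1", "admin user 1", "aud 1"],
    ]
    roll_heading_rows(values_lists, split_cols=["Rate/Dose/", "Administering"])
    # values_lists is modified in place, roughly to:
    # [["Medication", "Rate/Dose/Volume", "Action", "Administering User", "Audit"],
    #  ["medication 1", "dose 1", "action 1", "admin user 1", "aud 1"]]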
def roll_headings(lines: list[str], split_cols: Sequence[str] = (), min_column_len: int = 5) ‑> tuple[bool, list[str]]
-
Usage
Detect cases when column headings roll onto a second line, e.g.:
              Rate/Dose/              Administering
Medication    Volume      Action      User            Audit
medication 1  dose 1      action 1    admin user 1    aud 1
medication 2  dose 2      action 2    admin user 2    aud 2
should have headings: Medication, Rate/Dose/Volume, Action, Administering User, and Audit.
Function
Checks the number of headings detected for lines[0] and lines[1] and compares that to the max number of fields detected on all other lines. If the number of headings in lines[0] < the max number of fields detected for the remaining lines <= the number of headings in lines[1], append values from lines[0] to the correct column heading from lines[1] and return (True, [amended headings]). Otherwise, return (False, [first line headings]).
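A simplified sketch of that rule, assuming columns are separated by runs of two or more spaces; the library's positional alignment and prepend_heading behavior are only approximated here:

    import re

    def _roll_headings_sketch(lines: list[str]) -> tuple[bool, list[str]]:
        def fields(line: str) -> list[tuple[int, str]]:
            # (start column, text) for each field; fields are separated by 2+ spaces
            return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]

        first, second = fields(lines[0]), fields(lines[1])
        max_fields = max((len(fields(ln)) for ln in lines[2:]), default=0)

        # Decision rule from the docstring: lines[0] has too few headings while
        # lines[1] has enough to cover every data row.
        if not (len(first) < max_fields <= len(second)):
            return False, [text for _, text in first]

        # Attach each lines[0] fragment to the lines[1] heading whose start column
        # is closest (a stand-in for the library's positional alignment).
        merged = [text for _, text in second]
        for pos, frag in first:
            idx = min(range(len(second)), key=lambda i: abs(second[i][0] - pos))
            sep = "" if frag.endswith("/") else " "
            merged[idx] = frag + sep + merged[idx]
        return True, merged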
def split_and_stack_columns(lines: list[str], table_name: str, **kwargs)
-
Helper function for splitting multicolumn input lines.
Args
lines
:list[str]
- List of lines in the table.
table_name
:str
- Name of the table being interpreted.
KwArgs
debug
:bool
- if True, log the input and output lines
min_split_spaces
:int
- minimum number of spaces for triggering splits
force_page_breaks
:list[Callable[[str], bool]]
- list of functions that lines will be passed to. If any function returns True, a page break will be forced at that line. Serves as an arg to utils.columns() when vertically partitioning lines into columns.
Example
>>> unsplit = [
...     "Column1 Entry:           Column2 Entry:",
...     "   Value                    Value",
...     "Column1 Entry:           Column2 Entry:",
...     "   Value                    Value",
... ]
>>> split_and_stack_columns(unsplit, "TestTable")
['Column1 Entry:', 'Value', 'Column1 Entry:', 'Value', 'Column2 Entry:', 'Value', 'Column2 Entry:', 'Value']
def split_column_check(page_table: list[dict[str, str]], split_distance: int = 5) ‑> list[dict[str, str]]
-
Check for columns that should be split into two by looking for stretches of spaces longer than split_distance, e.g. a column named:
Column Header          That Should Split
should really be two columns named Column Header and That Should Split.
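Sketch of the header-splitting rule described above (the real function operates on whole page_table rows; only the split-on-long-space-run step is shown, and the helper name is hypothetical):

    import re

    def _split_header_sketch(header: str, split_distance: int = 5) -> list[str]:
        # Split on any run of spaces longer than split_distance characters.
        return [part for part in re.split(r" {%d,}" % (split_distance + 1), header) if part]

    _split_header_sketch("Column Header          That Should Split")
    # ['Column Header', 'That Should Split']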
def standardize_list_dict(list_dict: dict[str, list[str]]) ‑> dict[str, list[str]]
-
Find the mode length of the lists in list_dict.values() and add, remove, or combine elements in the list for any key whose current list length does not equal the mode, until all value lists have a length equal to the mode.
Args
list_dict
:dict[str, list[str]]
- dict representing a table of values for example: { "column1": ["val1", "val2", …], "column2": ["val1", "val2"], … }
Returns
dict[str, list[str]]
- as above, with all value-list lengths standardized
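Simplified sketch of the standardization step (this version only pads with empty strings or truncates; the real function can also combine elements instead of dropping them):

    from statistics import mode

    def _standardize_sketch(list_dict: dict[str, list[str]]) -> dict[str, list[str]]:
        # Pad short lists and truncate long ones to the mode length.
        target = mode(len(v) for v in list_dict.values())
        return {k: (v + [""] * (target - len(v)))[:target] for k, v in list_dict.items()}

    _standardize_sketch({"column1": ["val1", "val2"], "column2": ["val1"]})
    # {'column1': ['val1', 'val2'], 'column2': ['val1', '']}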
def unmatched_char_ratio(arr_1d, strip_list, debug=False)
-
Sum of the lengths of entries in arr_1d after removing all instances of the strings in strip_list, divided by the sum of the lengths of the original entries in arr_1d.
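Illustrative sketch of the calculation (the helper name is hypothetical). A row made up entirely of split_table_columns entries yields a low ratio, while long free text that merely contains one of those strings stays high and will not trigger a split:

    def _unmatched_char_ratio_sketch(arr_1d: list[str], strip_list: list[str]) -> float:
        # Remove every occurrence of each strip_list string, then compare lengths.
        total = sum(len(s) for s in arr_1d)
        stripped = 0
        for s in arr_1d:
            for target in strip_list:
                s = s.replace(target, "")
            stripped += len(s)
        return stripped / total if total else 0.0

    _unmatched_char_ratio_sketch(["Heading3", "Heading4"], ["Heading3", "Heading4"])
    # 0.0
    _unmatched_char_ratio_sketch(["attempted to contact patient via Email"], ["Email"])
    # ~0.87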
Classes
class InterpreterKwArgs (*args, **kwargs)
-
The set of kwargs that are valid across all the interpreter functions in table_interpreters.py. Some settings (e.g. drop_key_checks, debug_table, skip_line_checks, etc.) are implemented by the @interpreter_check decorator and are thereby universal, but most are only applicable in one or two interpreter functions. Consult the definition of your chosen interpreter function in table_interpreters.py to determine if a given setting has been implemented.
Attributes
debug_table
:list[str]
- list of table names for which debug output should be enabled
drop_key_checks
:list[Callable[[str], bool]]
- list of functions that keys will be passed to. If any function returns True, the key will be dropped from the table.
value_append_separator
:str
- string to use to separate values when appending them together
roll_keys
:list[tuple[str, …]]
- list of tuples of keys that are allowed to collect data across multiple lines. Each list entry is a tuple to allow for multiple keys to be rolled together, e.g.:
" Procedure: Start of Proc Desc     Diagnosis: Start of Dx Desc"
"            End of Proc Desc                  End of Dx Desc"
Used in fields_interpreter only.
min_split_spaces
:int
- minimum number of spaces for triggering splits in key/value data, individual values, and table columns.
force_save_keys
:list[str]
- list of keys that should always be saved even if they are not followed by a ":". Used in fields_interpreter only.
min_val_length
:int
- minimum length for a value to be considered valid
skip_line_checks
:list[Callable[[str], bool]]
- list of functions that lines will be passed to. If any function returns True, the line will be skipped.
split_table_columns
:list[str]
- if every value in a table row matches one of these strings, that row is removed and used as the keys for values in the next row. Applied universally during table cleaning. Also a TableSpec-level setting.
force_page_breaks
:list[Callable[[str], bool]]
- list of functions that lines will be passed to. If any function returns True, a page break will be forced at that line. Serves as an arg to utils.columns() when vertically partitioning lines into columns.
roll_on_ending_colon
:bool
- if True, a line that ends with a colon will trigger 'roll_keys' behavior automatically.
roll_on_titles
:bool
- if True, a line that contains no ":" and is formatted in title or upper case will trigger 'roll_keys' behavior automatically.
subtable_parser
:Callable[…, SubtableParser]
- use with subtable_interpreter to define behaviors and perform custom processing on tables that may or may not have repetitive sections corresponding to independent value sets, e.g. an insurance table that may or may not list a primary, secondary, and tertiary.
min_rows
:int
- minimum number of value lines that a table must have to be eligible for interpretation.
regex_expr
:re.Pattern
- compiled regex expression used by regex_interpreter. The joined lines are passed to this pattern's finditer method at interpretation time and the key/value pairs returned in each match's groupdict() are added to the table; ergo, the pattern MUST have named groups.
NOTE: This class was originally implemented as InterpreterArgsUpdate.
Ancestors
- builtins.dict
Class variables
var debug_table : list[str]
var drop_key_checks : list[collections.abc.Callable[[str], bool]]
var force_page_breaks : list[collections.abc.Callable[[str], bool]]
var force_save_keys : list[str]
var min_rows : int
var min_split_spaces : int
var min_val_length : int
var regex_expr : re.Pattern
var roll_keys : list[tuple[str, ...]]
var roll_on_ending_colon : bool
var roll_on_titles : bool
var skip_line_checks : list[collections.abc.Callable[[str], bool]]
var split_table_columns : list[str]
var subtable_parser : collections.abc.Callable[..., SubtableParser]
var value_append_separator : str
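Example construction (illustrative only; since InterpreterKwArgs subclasses dict, keys can be supplied as keyword arguments, and which keys a given interpreter honors depends on its implementation in table_interpreters.py):

    import re

    interpreter_kwargs = InterpreterKwArgs(
        debug_table=["Medications"],
        drop_key_checks=[lambda key: key.startswith("Unnamed")],
        skip_line_checks=[lambda line: not line.strip()],
        min_split_spaces=3,
        split_table_columns=["Heading3", "Heading4", "Heading5", "Heading6"],
        # regex_expr is only consulted by regex_interpreter and MUST use named groups
        regex_expr=re.compile(r"(?P<key>[A-Za-z ]+):\s*(?P<value>.+)"),
    )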
class InterpreterSwap (table_name: ForwardRef('str'), interpreter: ForwardRef('Callable[[list[str], str], SubtableParser | list[dict[str, str]]]'), interpreter_kwargs: ForwardRef('InterpreterKwArgs') = {})
-
Specify an alternative interpreter to use for tables with specific titles.
Args
table_name
:str
- the table name to which the alternate interpreter should be applied. In practice, this value is used as the pattern in an re.match operation to allow for partial and multi-table matches.
interpreter
:Callable
- the alternate interpreter function to be applied in lieu of the TableSpec interpreter
interpreter_kwargs
:InterpreterKwArgs
- (optional) each supplied key will override the corresponding key in the TableSpec's section level interpreter_kwargs.
Ancestors
- builtins.tuple
Instance variables
var interpreter : collections.abc.Callable[[list[str], str], SubtableParser | list[dict[str, str]]]
-
Alias for field number 1
var interpreter_kwargs : InterpreterKwArgs
-
Alias for field number 2
var table_name : str
-
Alias for field number 0
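Example (hypothetical; fields_interpreter is assumed to be one of the interpreter functions in table_interpreters.py):

    swap = InterpreterSwap(
        table_name="Insurance",   # used as an re.match pattern, so "Insurance Details" would also match
        interpreter=fields_interpreter,
        interpreter_kwargs=InterpreterKwArgs(min_split_spaces=2),
    )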
class SubtableParser (sub_interpreter: Callable[[list[str], str], list[dict[str, str]]], condense_results: bool = False, fill_value: str = '', bypass_func: Callable[[list[str]], bool] = <function SubtableParser.<lambda>>, window_func: Callable[[list[str]], list[int]] = <function SubtableParser.<lambda>>, lines_func: Callable[[list[str], tuple[int, int]], list[str]] = <function SubtableParser.<lambda>>, title_func: Callable[[int, list[str], tuple[int, int]], str] = <function SubtableParser.<lambda>>, multicolumn: bool = False)
-
Determines if the lines collected for a table are a single table or a collection of subtables and interprets them accordingly.
If subtables are found, interpret each separately and store the results in self.data with the subtable titles as keys and the interpretation results for each as values. If condense_results is True, combine the results from all subtables into a single list of dicts after standardizing the keys in all rows.
Args
sub_interpreter
:Callable[[list[str], str], list[dict[str, str]]]
-
function to interpret each subtable
condense_results
:bool
- if True, combine the results of each subtable into a single list of dicts
fill_value
:str
- value to use for missing keys in each row
bypass_func
:Callable[[list[str]], bool]
- function to determine if the table should be interpreted as a single table. If True, the original lines are passed directly to sub_interpreter.
window_func
:Callable[[list[str]], list[int]]
- function to determine the start index of each subtable. Pairwise tuples of this output become the members of the windows attribute, which in turn are used as input to lines_func.
lines_func
:Callable[[list[str], tuple[int, int]], list[str]]
- function to extract the lines for a subtable from the original lines
title_func
:Callable[[int, list[str], tuple[int, int]], str]
- function to determine the title of each subtable
multicolumn
:bool
- if True, split and stack columns before processing
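Illustrative configuration for an insurance table like the one mentioned above (every callable below is an assumption about the source layout, not a library default; fields_interpreter is assumed to come from table_interpreters.py). A zero-argument factory like this is what the subtable_parser kwarg expects:

    SUB_TITLES = ("Primary", "Secondary", "Tertiary")

    def insurance_subtable_parser() -> SubtableParser:
        return SubtableParser(
            sub_interpreter=fields_interpreter,
            condense_results=True,
            fill_value="",
            # Treat the table as a single table if no subtable titles are present.
            bypass_func=lambda lines: not any(ln.strip().startswith(SUB_TITLES) for ln in lines),
            # Subtables start at each title line; the final index closes the last window.
            window_func=lambda lines: [i for i, ln in enumerate(lines)
                                       if ln.strip().startswith(SUB_TITLES)] + [len(lines)],
            # A window is a (start, stop) pair; skip the title line itself.
            lines_func=lambda lines, window: lines[window[0] + 1:window[1]],
            title_func=lambda idx, lines, window: lines[window[0]].strip(),
        )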
Ancestors
- collections.UserDict
- collections.abc.MutableMapping
- collections.abc.Mapping
- collections.abc.Collection
- collections.abc.Sized
- collections.abc.Iterable
- collections.abc.Container
Methods
def updated_instance(self, **kwargs) ‑> SubtableParser
-
Return a new instance of SubtableParser after updating any attributes supplied in kwargs.
def validate_all(self) ‑> str
-
Ensure proper types for all entries.
class SwappingInterpreterKwArgs (*args, **kwargs)
-
Extends InterpreterKwArgs with two additional keys that are only applicable in the interpreter_kwargs of a TableSpec.
Attributes
interpreter_swaps
:list[InterpreterSwap]
- list of InterpreterSwap instances. Each instance specifies an alternative interpreter to apply to a specific table or tables.
source_id
:str
- identifier for the source of the table. Typically the filename where the source section originated.
Ancestors
- builtins.dict
Class variables
var debug_table : list[str]
var drop_key_checks : list[collections.abc.Callable[[str], bool]]
var force_page_breaks : list[collections.abc.Callable[[str], bool]]
var force_save_keys : list[str]
var interpreter_swaps : list[InterpreterSwap]
var min_rows : int
var min_split_spaces : int
var min_val_length : int
var regex_expr : re.Pattern
var roll_keys : list[tuple[str, ...]]
var roll_on_ending_colon : bool
var roll_on_titles : bool
var skip_line_checks : list[collections.abc.Callable[[str], bool]]
var source_id : str
var split_table_columns : list[str]
var subtable_parser : collections.abc.Callable[..., SubtableParser]
var value_append_separator : str
class TableEndCheck (func: ForwardRef('Callable[[Sequence[str]], bool]'), tables: ForwardRef('Sequence[str]') = ())
-
Tests the remaining lines in the source section with func if the currently collecting table's title appears in tables or tables isn't supplied. If func returns True, data collection stops and the search for the next table begins.
Attributes
func
:Callable[[Sequence[str]], bool]
- args[0] is all lines remaining
tables
:Sequence[str]
- only apply this check when this is an empty sequence or the current table's title appears in the sequence
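Hypothetical example (the trigger text and the assumption that the first remaining line is the next line to be collected are illustrative):

    end_check = TableEndCheck(
        func=lambda remaining: bool(remaining)
        and remaining[0].strip().startswith("Electronically Signed"),
        tables=("Medications",),
    )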
Ancestors
- builtins.tuple
Instance variables
var func : collections.abc.Callable[[collections.abc.Sequence[str]], bool]
-
Alias for field number 0
var tables : collections.abc.Sequence[str]
-
Alias for field number 1
class TableStartCheck (title_offset: ForwardRef('int') = 0, func: ForwardRef('Callable[[list[str]], bool]') = <function TableStartCheck.<lambda>>, preset_title: ForwardRef('str | None') = None, data_start_offset: ForwardRef('int') = 1)
-
Defines a test function for triggering data collection for a table and the offset from the triggered position for the table's title and first line.
Usage
As a member of a start_checks list in a TableSpec. See the table_specs.py module in the specs subpackage for examples.
NOTE: the default data_start_offset of 1 currently results in a frustrating side effect, namely that the line at index 0 is completely ignored by the data collection process outside of its potential use as a title. Frustrated by an apparently missing line in the source section's residual? This is your problem. It's a known issue and may be addressed in the future. In the interim, you should adjust func to trigger on the subsequent line and update data_start_offset and title_offset to 0 and -1 respectively.
Attributes
title_offset
:int
- offset from the triggered line index for the table's title. If the title is on the same line as the trigger, this value should be 0.
func
:Callable[[list[str]], bool]
- test function that returns True to trigger data collection for a new table.
preset_title
:str
- if not None, the title of the table will be set to this value. This is useful for tables that have no title in the source data.
data_start_offset
:int
- offset from the triggered line index for the first line of data. If table data begins on the line immediately following the trigger, this value should be 1 (default).
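Hypothetical example using the defaults (assumes func receives the lines from the current scan position onward, so lines[0] is the candidate trigger line): trigger on the title line itself, take the title from that same line, and start collecting data on the next line. Per the NOTE above, triggering on the first data line instead, with data_start_offset=0 and title_offset=-1, avoids losing the line at index 0.

    start_check = TableStartCheck(
        title_offset=0,
        func=lambda lines: bool(lines) and lines[0].strip().upper() == "MEDICATION ADMINISTRATION",
        preset_title=None,
        data_start_offset=1,
    )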
Ancestors
- builtins.tuple
Instance variables
var data_start_offset : int
-
Alias for field number 3
var func : collections.abc.Callable[[list[str]], bool]
-
Alias for field number 1
var preset_title : str | None
-
Alias for field number 2
var title_offset : int
-
Alias for field number 0