Module utilities.table_utils
Utility functions for table extraction
Functions
def drop_keys_from_rows(table: T_INT_RESULT, drop_keys: list[Callable[[str], bool]] | None) ‑> ~T_INT_RESULT
-
Drop keys from table rows. Processes the InterpreterKwArgs drop_key_checks list to determine which keys should be removed.
Args
table
:list[dict[str, str]] | SubtableParser
- table rows
drop_keys
:list[Callable[[str], bool]] | None
- list of functions that each key is passed to; if any function returns True, the key is removed
Returns
type[table]
- table rows with any key matched by a drop_keys check removed
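Example (illustrative sketch only, assuming plain list[dict[str, str]] input; the real function also accepts a SubtableParser and is driven by the drop_key_checks setting):

    from typing import Callable

    def _drop_keys_sketch(
        table: list[dict[str, str]],
        drop_keys: list[Callable[[str], bool]] | None,
    ) -> list[dict[str, str]]:
        # Keep only the keys that no drop_keys check flags.
        if not drop_keys:
            return table
        return [
            {k: v for k, v in row.items() if not any(check(k) for check in drop_keys)}
            for row in table
        ]

    rows = [{"Medication": "med 1", "Audit Info": "aud 1"}]
    _drop_keys_sketch(rows, [lambda key: "Audit" in key])
    # [{'Medication': 'med 1'}]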
def find_split_rows(dict_list_rows: dict[str, list[str]], split_table_columns: list[str], min_match_ratio: float = 0.4) ‑> dict[str, list[str]]
-
Detect and correct raw tables created from "nested" table representations.
This function updates dict_list_rows as if the originally processed data was in its unnested form.
Args
dict_list_rows
:dict[str, list[str]]
- dict whose keys are column headers and values are lists of column values.
split_table_columns
:list[str]
- if every value in a table row matches one of these strings, that row is removed and used as the keys for values in the next row.
min_match_ratio
:float
- minimum exact character match ratio required to initiate a row split. Implemented to prevent long text row values from triggering splits, e.g. "attempted to contact patient via Email" should not trigger a split just because "Email" appears in split_table_columns.
Returns
dict[str, list[str]]
- corrected table rows.
Example
>>> dict_list_rows = {
...     "Heading1": ["Value1", "Heading3", "Value3", "Heading5", "Value5"],
...     "Heading2": ["Value2", "Heading4", "Value4", "Heading6", "Value6"]
... }
>>> split_table_columns = ["Heading3", "Heading4", "Heading5", "Heading6"]
>>> find_split_rows(dict_list_rows, split_table_columns)
defaultdict(<class 'list'>, {'Heading1': ['Value1'], 'Heading2': ['Value2'], 'Heading3': ['Value3'], 'Heading4': ['Value4'], 'Heading5': ['Value5'], 'Heading6': ['Value6']})
def listify_list_dict(list_dict: dict[str, list[str]]) ‑> list[dict[str, str]]
-
Inverse of utils.dictify_dict_list. Convert a dict of lists to a list of dicts. Each dict in the list will have keys from the original dict and values from the corresponding index in the original dict's values.
Args
list_dict
:dict[str, list[str]]
- dict whose keys are column headers and values are lists of column values.
Returns
list[dict[str, str]]
- list of dicts with keys from list_dict and values from the corresponding index in the original dict's values.
Example
>>> list_dict = {
...     "column1": ["val1", "val2"],
...     "column2": ["val3", "val4"]
... }
>>> listify_list_dict(list_dict)
[{'column1': 'val1', 'column2': 'val3'}, {'column1': 'val2', 'column2': 'val4'}]
def prepend_heading(prepend_text, this_heading)
-
Helper function for roll_headings and roll_heading_rows.
def roll_heading_rows(values_lists: list[list[str]], split_cols: list[str], min_match_ratio: float = 0.4, debug: bool = False) ‑> list[list[str]]
-
IN-PLACE MANIPULATION OF ARGUMENT "values_lists". Performs the same function as roll_headings but applies the transformation to ALL rows whose values match entries in the split_column_values field in table_specs.
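Hypothetical before/after illustration (the data, the fragment placement, and the exact merge behavior shown below are assumptions; the real merge rules live in prepend_heading):

    values_lists = [
        ["", "Rate/Dose/", "", "Administering", ""],                    # rolled heading fragments
        ["Medication", "Volume", "Action", "User", "Audit"],
        ["medication 1", "dose 1", "action 1", "admin user 1", "aud 1"],
    ]
    roll_heading_rows(values_lists, split_cols=["Rate/Dose/", "Administering"])
    # values_lists is modified in place, roughly to:
    # [["Medication", "Rate/Dose/Volume", "Action", "Administering User", "Audit"],
    #  ["medication 1", "dose 1", "action 1", "admin user 1", "aud 1"]]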
def roll_headings(lines: list[str], split_cols: Sequence[str] = (), min_column_len: int = 5) ‑> tuple[bool, list[str]]
-
Usage
Detect cases when column headings roll onto a second line, e.g.:
              Rate/Dose/              Administering
Medication    Volume      Action      User            Audit
medication 1  dose 1      action 1    admin user 1    aud 1
medication 2  dose 2      action 2    admin user 2    aud 2
should have headings: Medication, Rate/Dose/Volume, Action, Administering User, and Audit.
Function
Checks the number of headings detected for lines[0] and lines[1] and compares that to the max number of fields detected on all other lines. If the number of headings in lines[0] < the max number of fields detected for the remaining lines <= the number of headings in lines[1], append values from lines[0] to the correct column heading from lines[1] and return (True, [amended headings]). Otherwise, return (False, [first line headings]).
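A simplified sketch of that rule, assuming columns are separated by runs of two or more spaces; the library's positional alignment and prepend_heading behavior are only approximated here:

    import re

    def _roll_headings_sketch(lines: list[str]) -> tuple[bool, list[str]]:
        def fields(line: str) -> list[tuple[int, str]]:
            # (start column, text) for each field; fields are separated by 2+ spaces
            return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]

        first, second = fields(lines[0]), fields(lines[1])
        max_fields = max((len(fields(ln)) for ln in lines[2:]), default=0)

        # Decision rule from the docstring: lines[0] has too few headings while
        # lines[1] has enough to cover every data row.
        if not (len(first) < max_fields <= len(second)):
            return False, [text for _, text in first]

        # Attach each lines[0] fragment to the lines[1] heading whose start column
        # is closest (a stand-in for the library's positional alignment).
        merged = [text for _, text in second]
        for pos, frag in first:
            idx = min(range(len(second)), key=lambda i: abs(second[i][0] - pos))
            sep = "" if frag.endswith("/") else " "
            merged[idx] = frag + sep + merged[idx]
        return True, merged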
def split_and_stack_columns(lines: list[str], table_name: str, **kwargs)
-
Helper function for splitting multicolumn input lines.
Args
lines
:list[str]
- List of lines in the table.
table_name
:str
- Name of the table being interpreted.
KwArgs
debug
:bool
- if True, log the input and output lines
min_split_spaces
:int
- minimum number of spaces for triggering splits
force_page_breaks
:list[Callable[[str], bool]]
- list of functions that lines will be passed to. If any function returns True, a page break will be forced at that line. Serves as an arg to utils.columns() when vertically partitioning lines into columns.
Example
>>> unsplit = [
...     "Column1 Entry:           Column2 Entry:",
...     "   Value                    Value",
...     "Column1 Entry:           Column2 Entry:",
...     "   Value                    Value",
... ]
>>> split_and_stack_columns(unsplit, "TestTable")
['Column1 Entry:', 'Value', 'Column1 Entry:', 'Value', 'Column2 Entry:', 'Value', 'Column2 Entry:', 'Value']
def split_column_check(page_table: list[dict[str, str]], split_distance: int = 5) ‑> list[dict[str, str]]
-
Check for columns that should be split into two by looking for stretches of spaces longer than split_distance, e.g. a column named:
Column Header          That Should Split
should really be two columns named Column Header and That Should Split.
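Sketch of the header-splitting rule described above (the real function operates on whole page_table rows; only the split-on-long-space-run step is shown, and the helper name is hypothetical):

    import re

    def _split_header_sketch(header: str, split_distance: int = 5) -> list[str]:
        # Split on any run of spaces longer than split_distance characters.
        return [part for part in re.split(r" {%d,}" % (split_distance + 1), header) if part]

    _split_header_sketch("Column Header          That Should Split")
    # ['Column Header', 'That Should Split']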
def standardize_list_dict(list_dict: dict[str, list[str]]) ‑> dict[str, list[str]]
-
Find the mode length of the lists in list_dict.values() and add, remove, or combine elements in the list for any key whose current list length does not equal the mode, until all value lists have a length equal to the mode.
Args
list_dict
:dict[str, list[str]]
- dict representing a table of values for example: { "column1": ["val1", "val2", …], "column2": ["val1", "val2"], … }
Returns
dict[str, list[str]]
- as above, with all value-list lengths standardized
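Simplified sketch of the standardization step (this version only pads with empty strings or truncates; the real function can also combine elements instead of dropping them):

    from statistics import mode

    def _standardize_sketch(list_dict: dict[str, list[str]]) -> dict[str, list[str]]:
        # Pad short lists and truncate long ones to the mode length.
        target = mode(len(v) for v in list_dict.values())
        return {k: (v + [""] * (target - len(v)))[:target] for k, v in list_dict.items()}

    _standardize_sketch({"column1": ["val1", "val2"], "column2": ["val1"]})
    # {'column1': ['val1', 'val2'], 'column2': ['val1', '']}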
def unmatched_char_ratio(arr_1d, strip_list, debug=False)
-
Sum of the lengths of entries in arr_1d after removing all instances of the strings in strip_list, divided by the sum of the lengths of the original entries in arr_1d.
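Illustrative sketch of the calculation (the helper name is hypothetical). A row made up entirely of split_table_columns entries yields a low ratio, while long free text that merely contains one of those strings stays high and will not trigger a split:

    def _unmatched_char_ratio_sketch(arr_1d: list[str], strip_list: list[str]) -> float:
        # Remove every occurrence of each strip_list string, then compare lengths.
        total = sum(len(s) for s in arr_1d)
        stripped = 0
        for s in arr_1d:
            for target in strip_list:
                s = s.replace(target, "")
            stripped += len(s)
        return stripped / total if total else 0.0

    _unmatched_char_ratio_sketch(["Heading3", "Heading4"], ["Heading3", "Heading4"])
    # 0.0
    _unmatched_char_ratio_sketch(["attempted to contact patient via Email"], ["Email"])
    # ~0.87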
Classes
class InterpreterKwArgs (*args, **kwargs)
-
The set of kwargs that are valid across all the interpreter functions in table_interpreters.py. Some settings (e.g. drop_key_checks, debug_table, skip_line_checks, etc.) are implemented by the @interpreter_check decorator and are thereby universal, but most are only applicable in one or two interpreter functions. Consult the definition of your chosen interpreter function in table_interpreters.py to determine if a given setting has been implemented.
Attributes
debug_table
:list[str]
- list of table names for which debug output should be enabled
drop_key_checks
:list[Callable[[str], bool]]
- list of functions that keys will be passed to. If any function returns True, the key will be dropped from the table.
value_append_separator
:str
- string to use to separate values when appending them together
roll_keys
:list[tuple[str, …]]
- list of tuples of keys that are allowed to collect data across multiple lines. Each list entry is a tuple to allow for multiple keys to be rolled together, e.g.:
" Procedure: Start of Proc Desc     Diagnosis: Start of Dx Desc"
"            End of Proc Desc                  End of Dx Desc"
Used in fields_interpreter only.
min_split_spaces
:int
- minimum number of spaces for triggering splits in key/value data, individual values, and table columns.
force_save_keys
:list[str]
- list of keys that should always be saved even if they are not followed by a ":". Used in fields_interpreter only.
min_val_length
:int
- minimum length for a value to be considered valid
skip_line_checks
:list[Callable[[str], bool]]
- list of functions that lines will be passed to. If any function returns True, the line will be skipped.
split_table_columns
:list[str]
- if every value in a table row matches one of these strings, that row is removed and used as the keys for values in the next row. Applied universally during table cleaning. Also a TableSpec-level setting.
force_page_breaks
:list[Callable[[str], bool]]
- list of functions that lines will be passed to. If any function returns True, a page break will be forced at that line. Serves as an arg to utils.columns() when vertically partitioning lines into columns.
roll_on_ending_colon
:bool
- if True, a line that ends with a colon will trigger 'roll_keys' behavior automatically.
roll_on_titles
:bool
- if True, a line that contains no ":" and is formatted in title or upper case will trigger 'roll_keys' behavior automatically.
subtable_parser
:Callable[…, SubtableParser]
- use with subtable_interpreter to define behaviors and perform custom processing on tables that may or may not have repetitive sections corresponding to independent value sets, e.g. an insurance table that may or may not list a primary, secondary, and tertiary.
min_rows
:int
- minimum number of value lines that a table must have to be eligible for interpretation.
regex_expr
:re.Pattern
- compiled regex expression used by regex_interpreter. The joined lines are passed to this pattern's finditer method at interpretation time and the key/value pairs returned in each match's groupdict() are added to the table; ergo, the pattern MUST have named groups.
NOTE: This class was originally implemented as InterpreterArgsUpdate.
Ancestors
- builtins.dict
Class variables
var debug_table : list[str]
var drop_key_checks : list[collections.abc.Callable[[str], bool]]
var force_page_breaks : list[collections.abc.Callable[[str], bool]]
var force_save_keys : list[str]
var min_rows : int
var min_split_spaces : int
var min_val_length : int
var regex_expr : re.Pattern
var roll_keys : list[tuple[str, ...]]
var roll_on_ending_colon : bool
var roll_on_titles : bool
var skip_line_checks : list[collections.abc.Callable[[str], bool]]
var split_table_columns : list[str]
var subtable_parser : collections.abc.Callable[..., SubtableParser]
var value_append_separator : str
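Example construction (illustrative only; since InterpreterKwArgs subclasses dict, keys can be supplied as keyword arguments, and which keys a given interpreter honors depends on its implementation in table_interpreters.py):

    import re

    interpreter_kwargs = InterpreterKwArgs(
        debug_table=["Medications"],
        drop_key_checks=[lambda key: key.startswith("Unnamed")],
        skip_line_checks=[lambda line: not line.strip()],
        min_split_spaces=3,
        split_table_columns=["Heading3", "Heading4", "Heading5", "Heading6"],
        # regex_expr is only consulted by regex_interpreter and MUST use named groups
        regex_expr=re.compile(r"(?P<key>[A-Za-z ]+):\s*(?P<value>.+)"),
    )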
class InterpreterSwap (table_name: ForwardRef('str'), interpreter: ForwardRef('Callable[[list[str], str], SubtableParser | list[dict[str, str]]]'), interpreter_kwargs: ForwardRef('InterpreterKwArgs') = {})
-
Specify an alternative interpreter to use for tables with specific titles.
Args
table_name
:str
- the table name to which the alternate interpreter should be applied. In practice, this value is used as the pattern in an re.match operation to allow for partial and multi-table matches.
interpreter
:Callable
- the alternate interpreter function to be applied in lieu of the TableSpec interpreter
interpreter_kwargs
:InterpreterKwArgs
- (optional) each supplied key will override the corresponding key in the TableSpec's section level interpreter_kwargs.
Ancestors
- builtins.tuple
Instance variables
var interpreter : collections.abc.Callable[[list[str], str], SubtableParser | list[dict[str, str]]]
-
Alias for field number 1
var interpreter_kwargs : InterpreterKwArgs
-
Alias for field number 2
var table_name : str
-
Alias for field number 0
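Example (hypothetical; fields_interpreter is assumed to be one of the interpreter functions in table_interpreters.py):

    swap = InterpreterSwap(
        table_name="Insurance",   # used as an re.match pattern, so "Insurance Details" would also match
        interpreter=fields_interpreter,
        interpreter_kwargs=InterpreterKwArgs(min_split_spaces=2),
    )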
class SubtableParser (sub_interpreter: Callable[[list[str], str], list[dict[str, str]]], condense_results: bool = False, fill_value: str = '', bypass_func: Callable[[list[str]], bool] = <function SubtableParser.<lambda>>, window_func: Callable[[list[str]], list[int]] = <function SubtableParser.<lambda>>, lines_func: Callable[[list[str], tuple[int, int]], list[str]] = <function SubtableParser.<lambda>>, title_func: Callable[[int, list[str], tuple[int, int]], str] = <function SubtableParser.<lambda>>, multicolumn: bool = False)
-
Determines if the lines collected for a table are a single table or a collection of subtables and interprets them accordingly.
If subtables are found, interpret each separately and store the results in self.data with the subtable titles as keys and the interpretation results for each as values. If condense_results is True, combine the results from all subtables into a single list of dicts after standardizing the keys in all rows.
Args
sub_interpreter
:Callable[[list[str], str], list[dict[str, str]]]
-
function to interpret each subtable
condense_results
:bool
- if True, combine the results of each subtable into a single list of dicts
fill_value
:str
- value to use for missing keys in each row
bypass_func
:Callable[[list[str]], bool]
- function to determine if the table should be interpreted as a single table. If True, the original lines are passed directly to sub_interpreter.
window_func
:Callable[[list[str]], list[int]]
- function to determine the start index of each subtable. Pairwise tuples of this output become the members of the windows attribute, which in turn are used as input to lines_func.
lines_func
:Callable[[list[str], tuple[int, int]], list[str]]
- function to extract the lines for a subtable from the original lines
title_func
:Callable[[int, list[str], tuple[int, int]], str]
- function to determine the title of each subtable
multicolumn
:bool
- if True, split and stack columns before processing
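Illustrative configuration for an insurance table like the one mentioned above (every callable below is an assumption about the source layout, not a library default; fields_interpreter is assumed to come from table_interpreters.py). A zero-argument factory like this is what the subtable_parser kwarg expects:

    SUB_TITLES = ("Primary", "Secondary", "Tertiary")

    def insurance_subtable_parser() -> SubtableParser:
        return SubtableParser(
            sub_interpreter=fields_interpreter,
            condense_results=True,
            fill_value="",
            # Treat the table as a single table if no subtable titles are present.
            bypass_func=lambda lines: not any(ln.strip().startswith(SUB_TITLES) for ln in lines),
            # Subtables start at each title line; the final index closes the last window.
            window_func=lambda lines: [i for i, ln in enumerate(lines)
                                       if ln.strip().startswith(SUB_TITLES)] + [len(lines)],
            # A window is a (start, stop) pair; skip the title line itself.
            lines_func=lambda lines, window: lines[window[0] + 1:window[1]],
            title_func=lambda idx, lines, window: lines[window[0]].strip(),
        )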
Ancestors
- collections.UserDict
- collections.abc.MutableMapping
- collections.abc.Mapping
- collections.abc.Collection
- collections.abc.Sized
- collections.abc.Iterable
- collections.abc.Container
Methods
def updated_instance(self, **kwargs) ‑> SubtableParser
-
Return a new instance of SubtableParser after updating any attributes supplied in kwargs.
def validate_all(self) ‑> str
-
Ensure proper types for all entries.
class SwappingInterpreterKwArgs (*args, **kwargs)
-
Extends InterpreterKwArgs with two additional keys that are only applicable in the interpreter_kwargs of a TableSpec.
Attributes
interpreter_swaps
:list[InterpreterSwap]
- list of InterpreterSwap instances. Each instance specifies an alternative interpreter to apply to a specific table or tables.
source_id
:str
- identifier for the source of the table. Typically the filename where the source section originated.
Ancestors
- builtins.dict
Class variables
var debug_table : list[str]
var drop_key_checks : list[collections.abc.Callable[[str], bool]]
var force_page_breaks : list[collections.abc.Callable[[str], bool]]
var force_save_keys : list[str]
var interpreter_swaps : list[InterpreterSwap]
var min_rows : int
var min_split_spaces : int
var min_val_length : int
var regex_expr : re.Pattern
var roll_keys : list[tuple[str, ...]]
var roll_on_ending_colon : bool
var roll_on_titles : bool
var skip_line_checks : list[collections.abc.Callable[[str], bool]]
var source_id : str
var split_table_columns : list[str]
var subtable_parser : collections.abc.Callable[..., SubtableParser]
var value_append_separator : str
class TableEndCheck (func: ForwardRef('Callable[[Sequence[str]], bool]'), tables: ForwardRef('Sequence[str]') = ())
-
Tests the remaining lines in the source section with func if the currently collecting table's title appears in tables or tables isn't supplied. If func returns True, data collection stops and the search for the next table begins.
Attributes
func
:Callable[[Sequence[str]], bool]
- args[0] is all lines remaining
tables
:Sequence[str]
- only apply this check when this is an empty sequence or the current table's title appears in the sequence
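Hypothetical example (the trigger text and the assumption that the first remaining line is the next line to be collected are illustrative):

    end_check = TableEndCheck(
        func=lambda remaining: bool(remaining)
        and remaining[0].strip().startswith("Electronically Signed"),
        tables=("Medications",),
    )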
Ancestors
- builtins.tuple
Instance variables
var func : collections.abc.Callable[[collections.abc.Sequence[str]], bool]
-
Alias for field number 0
var tables : collections.abc.Sequence[str]
-
Alias for field number 1
class TableStartCheck (title_offset: ForwardRef('int') = 0, func: ForwardRef('Callable[[list[str]], bool]') = <function TableStartCheck.<lambda>>, preset_title: ForwardRef('str | None') = None, data_start_offset: ForwardRef('int') = 1)
-
Defines a test function for triggering data collection for a table and the offset from the triggered position for the table's title and first line.
Usage
As a member of a start_checks list in a TableSpec. See the table_specs.py module in the specs subpackage for examples.
NOTE: the default data_start_offset of 1 currently results in a frustrating side effect, namely that the line at index 0 is completely ignored by the data collection process outside of its potential use as a title. Frustrated by an apparently missing line in the source section's residual? This is your problem. It's a known issue and may be addressed in the future. In the interim, you should adjust func to trigger on the subsequent line and update data_start_offset and title_offset to 0 and -1 respectively.
Attributes
title_offset
:int
- offset from the triggered line index for the table's title. If the title is on the same line as the trigger, this value should be 0.
func
:Callable[[list[str]], bool]
- test function that returns True to trigger data collection for a new table.
preset_title
:str
- if not None, the title of the table will be set to this value. This is useful for tables that have no title in the source data.
data_start_offset
:int
- offset from the triggered line index for the first line of data. If table data begins on the line immediately following the trigger, this value should be 1 (default).
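Hypothetical example using the defaults (assumes func receives the lines from the current scan position onward, so lines[0] is the candidate trigger line): trigger on the title line itself, take the title from that same line, and start collecting data on the next line. Per the NOTE above, triggering on the first data line instead, with data_start_offset=0 and title_offset=-1, avoids losing the line at index 0.

    start_check = TableStartCheck(
        title_offset=0,
        func=lambda lines: bool(lines) and lines[0].strip().upper() == "MEDICATION ADMINISTRATION",
        preset_title=None,
        data_start_offset=1,
    )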
Ancestors
- builtins.tuple
Instance variables
var data_start_offset : int
-
Alias for field number 3
var func : collections.abc.Callable[[list[str]], bool]
-
Alias for field number 1
var preset_title : str | None
-
Alias for field number 2
var title_offset : int
-
Alias for field number 0