Module utilities.utils

Utility functions with global scope

Functions

def as_date(str_or_dt: datetime.date | datetime.datetime | str, date_only=False, as_dt=False, futr=False, warn=True)

Convert a date, datetime, or string into a formatted date string or datetime object.

This function converts the input str_or_dt into a date string formatted according to gvars.DATE_FORMAT. If date_only is True, it returns only the date part with '00:00:00' as the time. If as_dt is True, it returns a datetime object instead of a string. If the input cannot be converted, it returns an empty string and logs a warning if warn is True. It also handles future dates by adjusting them if futr is False.

Args

str_or_dt : date | datetime | str
The input date, datetime, or string to convert.
date_only : bool
Optional. If True, returns only the date part. Defaults to False.
as_dt : bool
Optional. If True, returns a datetime object. Defaults to False.
futr : bool
Optional. If False, adjusts future dates to be within the past 100 years. Defaults to False.
warn : bool
Optional. If True, logs a warning when conversion fails. Defaults to True.

Returns

str | datetime
The formatted date string or datetime object, or an empty string if conversion fails.

Example

>>> as_date("2020-12-31T23:59")
'2020-12-31 23:59:00'
>>> as_date("2020-12-31", date_only=True)
'2020-12-31'
>>> as_date("2020-12-31", as_dt=True)
datetime.datetime(2020, 12, 31, 0, 0)
>>> as_date("not a date")
''
def as_name(name_string: str | vStr | list[str | vStr]) ‑> vStr

Returns name_string formatted as a name via the nameparser.HumanName class.

If name_string is passed as a list, call as_name() recursively on each element and return the most frequent as_name() result.

Args

name_string : vStr | str | list[str | vStr]
The input name string or list of name strings to format.

Returns

vStr
The formatted name string.

Example

>>> as_name("Primary: John Doe")
'DOE, JOHN'
>>> as_name(["John Doe", "Jane Smith", "John Doe"])
'DOE, JOHN'
def as_phone(phone_string: str | list[str]) ‑> str

Convert an input str to international format or return an empty string if invalid.

If a list is supplied, returns the most frequently occurring successfully converted value after converting each element individually.

Args

phone_string : str | list[str]
The input phone number string or list of phone number strings to process.

Returns

str
The phone number in standard international format, or an empty string if the input is invalid.

Example

>>> as_phone("123-456-7890")  # this is NOT a valid US phone number
''
>>> as_phone("843-563-7890")  # this IS a valid US phone number
'+1 843-563-7890'
>>> as_phone(["123-456-7890", "123-456-7890", "843-563-7890"])
'+1 843-563-7890'
def as_time(time_str: datetime.datetime | str, default_date: str = '', with_date: bool = True, strict: bool = False) ‑> str | vStr

Format time info in the input to standard hh24:mm format, e.g. 19:05. The date is included in the output by default if one is present.

Use has_time() (not this function) to determine if a string contains time info.

Args

time_str : datetime | str
The time string or datetime to format.
default_date : str
Optional. The date to use if time_str does not include a date. Overridden by gvars.DEFAULT_DATE if no value is supplied.
with_date : bool
Optional. If True, include the date in the output. Defaults to True.
strict : bool
Optional. If True, return an empty string if default_date is falsy and time_str does not include a date. Defaults to False.

Returns

str
'YYYY-MM-DD hh24:mm' if with_date is True and time_str includes a date; 'hh24:mm' if with_date is False or time_str does not include a date; an empty string if time_str contains no time data at all.

Example

>>> as_time("2020-12-31T23:59")
'2020-12-31 23:59'
>>> as_time("23:59", default_date="2020-12-31")
'2020-12-31 23:59'
>>> as_time("23:59", with_date=False)
'23:59'
>>> as_time("not a time")
''
def check_indent(line: str | collections.abc.Sequence[str], has_indent: int, strict: bool) ‑> bool

Check if a line is indented by a specified number of characters.

Args

line : str | Sequence[str]
The input line or sequence of lines to check.
has_indent : int
The number of characters to check for indentation.
strict : bool
If True, checks for exact indentation. If False, checks for at least the specified indentation.

Returns

bool
True if the line meets the indentation criteria, False otherwise.

Example

>>> check_indent("    indented line", 4, True)
True
>>> check_indent("      indented line", 4, False)
True
>>> check_indent("  indented line", 4, True)
False
def check_start(line: str | collections.abc.Sequence[str], has_start: str, strict: bool) ‑> bool

Check if a line starts with a specified string, with optional leading spaces.

Args

line : str | Sequence[str]
The input line or sequence of lines to check.
has_start : str
The string to check for at the start of the line.
strict : bool
If True, checks for an exact match at the start. If False, allows additional leading spaces before the start string.

Returns

bool
True if the line starts with the specified string, False otherwise.

Example

>>> check_start("    start of line", "start", False)
True
>>> check_start("start of line", "start", True)
True
>>> check_start("    start of line", "start", True)
False
def close_log()

Write buffered log data to the log file on disk.

def columns(lines: list[str], min_rows: int = 2, min_split_spaces: int = 1, **kwargs) ‑> list[list[str]]

Segment a list of lines of text representing a table into logical columns.

Args

lines : list[str]
A list of strings representing the rows of a table.
min_rows : int
Optional. Minimum number of rows to consider for a page break. Defaults to 2.
min_split_spaces : int
Optional. Minimum number of contiguous spaces to consider for a column split. Defaults to 1.

KwArgs

force_page_breaks : list[Callable[[str], bool]]
Optional. List of functions to which each line is passed. If any returns True, parse lines from that point forward as an independent area in the table.
debug : bool
Optional. Enable debug logging. Defaults to False.
min_column_width : int
Optional. Minimum width of columns. Defaults to 5.
min_val_length : int
Optional. Alias for min_column_width.
split_table_columns : list
Optional. List of predefined column names that should not be 'munged' when condensing the output to the header columns.

Returns

list[list[str]]
A list of lists, where each inner list represents a row with its columns split appropriately.

Example

>>> lines = [
...     ' COL1        COL2        COL3        COL4',
...     ' VALUE1-1   VALUE2-1    VALUE3-1    VALUE4-1',
...     'VAL1-2      VAL2-2     VAL3-2        VAL4-2',
... ]
>>> columns(lines)
[['COL1', 'COL2', 'COL3', 'COL4'],
 ['VALUE1-1', 'VALUE2-1', 'VALUE3-1', 'VALUE4-1'],
 ['VAL1-2', 'VAL2-2', 'VAL3-2', 'VAL4-2']]
def condense_to_headings(raw_vals_list: list[list[str]], split_cols: list[str] | None = None) ‑> list[list[str]]

Combine values for which the column header is empty with the previous column value.

This function processes a list of rows, combining values in columns where the header is an empty string with the previous column's value. It then removes the columns with empty headers. This is useful for condensing tables where values that contain multiple spaces extend beyond their respective column heading.

Args

raw_vals_list : list[list[str]]
A list of rows, where each row is a list of column values.
split_cols : list[str]
Optional. A list of predefined column names used to identify additional header rows in wrapped table formats.

Returns

list[list[str]]
A list of rows with condensed column values and headers.

Example

>>> raw_vals_list = [
...     ['heading1', '', 'heading2', '', 'heading3'],
...     ['value1-1', 'continued', 'value2-1', 'continued', 'value3-1'],
...     ['value1-2', '', 'value2-2', '', 'value3-2'],
... ]
>>> split_cols = []
>>> condense_to_headings(raw_vals_list, split_cols)
[['heading1', 'heading2', 'heading3'],
 ['value1-1 continued', 'value2-1 continued', 'value3-1'],
 ['value1-2', 'value2-2', 'value3-2']]
def contiguous_ints(ints: collections.abc.Collection[int]) ‑> list[tuple[int, int]]

Given a collection of integers, return a list of (first, last) tuples, one for each maximal run of consecutive integers present in the input.

Args

ints : Collection[int]
The collection of integers to process.

Returns

list[tuple[int, int]]
A list of tuples, each containing the start and end integers of a contiguous sequence.

Example

>>> lst = [1, 3, 2, 5, 6, 7, 8, 9, 12]
>>> contiguous_ints(lst)
[(1, 3), (5, 9), (12, 12)]
def deduped_key(value_dict: dict[str, typing.Any], this_key: str, value: Any = None) ‑> str

Ensure a key is unique within a dictionary, adjusting it if necessary.

If value_dict contains this_key, append a numeric suffix or, if present, replace the array index in this_key with increasing numerical strings until it is no longer present in value_dict. Replaces occurrences of '[*]' in this_key with '[0]' by default.

Args

value_dict : dict[str, Any]
The dictionary to check for key uniqueness.
this_key : str
The key to ensure is unique.
value : Any
Optional. The value associated with the key. If supplied and equal to value_dict[this_key], return this_key without deduping.

Returns

str
A unique key that is not present in the dictionary.

Example

>>> value_dict = {"Provider[0]": "Dr. Smith, MD", "Diagnosis": "GERD"}
>>> deduped_key(value_dict, "Provider[*]")
'Provider[1]'
>>> deduped_key(value_dict, "Provider[0]", "Dr. Smith, MD")
'Provider[0]'
>>> deduped_key(value_dict, "Diagnosis", "Acid Reflux")
'Diagnosis.0'
def dictify_dict_list(dict_list: list[dict[str, str]]) ‑> dict[str, list[str]]

Transforms a list of dicts into a dict of lists with standard lengths.

Args

dict_list : list[dict[str, str]]
A list of dictionaries representing the rows of an extracted table.

Returns

dict[str, list[str]]
A dictionary with the set of keys across all input dicts as keys and a list of strings of a common length representing the value from each input dict 'row'.

Example

>>> dictify_dict_list(
...     [
...         {"key1": "value1", "key2": "value2"},
...         {"key1": "value1", "key2": "value2", "extra key": "I'm extra!"},
...         {"key1": "value1", "key2": "value2"},
...     ]
... )
{'key1': ['value1', 'value1', 'value1'], 'key2': ['value2', 'value2', 'value2'], 'extra key': ['', "I'm extra!", '']}
def dq_field_colon_ranges(line: str) ‑> list[tuple[int, int]]

List start and stop positions of ranges that cannot contain field colons.

Currently excludes all text within parentheticals, such as the ':' following 'Left' in the string 'Procedure: Amputation (Left: Toe)'.

Args

line : str
The input string to analyze.

Returns

list[tuple[int, int]]
A list of tuples, each containing the start and stop positions of ranges that cannot contain field colons.

Example

>>> dq_field_colon_ranges("Procedure: Amputation (Left: Toe)")
[(22, 33)]
def exec_on_exit(*args, **kwargs)

Wrapper function for forcing execution of a function when IPython execution completes, or upon interpreter exit for command-line execution.
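
Example

One possible invocation, shown only as a sketch: it assumes the callable to defer is passed as the first positional argument and that any remaining arguments are forwarded to it when it runs. Pairing it with close_log here is purely illustrative.

>>> _ = exec_on_exit(close_log)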

def extract_numerics(string: str | list[str], as_list: bool = False, joined: bool = False, tiebreak_func: collections.abc.Callable[[str, str], str] = <function <lambda>>) ‑> str | list[str]

Extract numeric substrings from a given string or list of strings.

If the input is a list of strings, the list is joined into a single string prior to extraction.

Args

string : str | list[str]
The input string or list of strings to process.
as_list : bool
Optional. If True, returns a list of numeric substrings. If False, returns the most frequently occurring numeric substring or the concatenated numeric substrings if joined is True. Defaults to False.
joined : bool
Optional. If True, concatenates all numeric substrings into a single string. Only used if as_list is False. Defaults to False.
tiebreak_func : Callable[[str, str], str]
Optional. A function to break ties between numeric substrings of the same frequency. Defaults to a function that returns the longer substring. Ignored if either as_list or joined is set to True.

Returns

str | list[str]
A list of numeric substrings or a numeric string.

Example

>>> extract_numerics("abc123def456", as_list=True)
['123', '456']
>>> extract_numerics("abc123def456", joined=True)
'123456'
>>> tiebreak_func = lambda x, y: y if int(y) > int(x) else x
>>> extract_numerics(["abc123", "def456", "ghi98"], tiebreak_func=tiebreak_func)
'456'
def find_page_breaks(lines: collections.abc.Sequence[str], min_rows: int, force_page_breaks: list[collections.abc.Callable[[str], bool]] | None = None) ‑> list[tuple[int, int]]

Find page breaks by looking for repeating column headers.

This function identifies page breaks in a list of lines by detecting repeating column headers. It returns a list of tuples, each representing the start and end indices of a page in the lines. The function also considers forced page breaks provided by the user.

Args

lines : list[str]
A list of strings representing the lines to analyze.
min_rows : int
The minimum number of rows required for a valid page.
force_page_breaks : list[Callable[[str], bool]]
Optional. A list of functions that take a line as input and return True to force a page break.

Returns

list[tuple[int, int]]
A list of tuples, each containing the start and end indices of a 'page' in the lines.

Example

>>> lines = [
...     'Header1  Header2  Header3',
...     'Value1-1 Value2-1 Value3-1',
...     'Value1-2 Value2-2 Value3-2',
...     '     Header1   Header2  Header3',
...     '     Value1-3  Value2-3 Value3-3',
... ]
>>> find_page_breaks(lines, min_rows=1)
[(0, 3), (3, 5)]
def first_field_colon_idx(line: str) ‑> int

Find the index of the first colon not surrounded by numbers.

This function identifies the index of the first colon in a string that is not part of a time structure (e.g., "##:##") or contained within a parenthetical.

Args

line : str
The input string to analyze.

Returns

int
The index of the first colon not surrounded by numbers or within excluded ranges. Returns -1 if no such colon is found.

Example

>>> first_field_colon_idx("(Left: Toe) at 12:30    Diagnosis: Diabetes Mellitus")
33
def flatten_list_nest(list_nest: Any) ‑> Any

Recursively flatten a nested list into a single-depth list.

This function takes a nested list and returns a new list with all nested elements flattened to a single depth.

Args

list_nest
The nested list to flatten.

Returns

Any
A single-depth list containing all elements from the nested list. If list_nest is 'falsy' or not a list, it is returned unchanged (hence the Any return type).

Example

>>> flatten_list_nest([1, [2, [3, 4], 5], 6])
[1, 2, 3, 4, 5, 6]
def has_time(val: datetime.datetime | str)

Determine if a value contains time information.

This function checks if the given value, either a datetime object or a string, contains interpretable hours and minutes of a datetime. It handles short date formats and strings that may not initially appear to contain time information.

Args

val : datetime | str
The value to check.

Returns

bool
True if the value contains time information, False otherwise.

Example

>>> has_time("2020-12-31T23:59")
True
>>> has_time("2020-12-31")
False
>>> has_time(datetime(2020, 12, 31, 23, 59))
True
def is_date(val: datetime.date | datetime.datetime | str)

Determine if a value can be interpreted as a date.

This function checks if the given value is a date, datetime, or a string that can be interpreted as a date. It handles short date formats and strings that include additional information such as age.

Args

val : date | datetime | str
The value to check.

Returns

bool
True if the value can be interpreted as a date, False otherwise.

Example

>>> is_date("12/31/2020")
True
>>> is_date("12/31/2020 (age 30 yrs)")
True
>>> is_date("not a date")
False
def key_value_split(key_val_string: str) ‑> tuple[str, typing.Any]

Split a string into a key-value pair at the first 'field colon'.

See first_field_colon_idx() for additional details. Recursively processes the value if the initially extracted value contains additional field colons.

Args

key_val_string : str
The input string to split into a key-value pair.

Returns

tuple[str, Any]
A tuple containing the key and value split at the first 'field colon'.

Example

>>> key_val_string = "Procedure: Amputation (Left: Toe) at 12:30"
>>> key_value_split(key_val_string)
('Procedure', 'Amputation (Left: Toe) at 12:30')
def key_value_strings(line: str) ‑> list[str]

Extract key-value pairs from a string.

This function processes a given line and returns a list of strings, each representing a key-value pair. It first inserts a space after any colon that is immediately followed by an uppercase letter or digit. Then, it splits the line using the split_unlikely_fields() function and applies a regular expression to find key-value pairs.

Args

line : str
The input string to be processed.

Returns

list[str]
A list of strings, each containing a key-value pair.

Example

>>> line = 'A:B  C:D'
>>> key_value_strings(line)
['A: B', 'C: D']
def last_field_colon_idx(line: str) ‑> int

Find the index of the last colon not surrounded by numbers.

This function identifies the index of the last colon in a string that is not part of a time structure (e.g., "##:##") or contained within a parenthetical.

Args

line : str
The input string to analyze.

Returns

int
The index of the last colon not surrounded by numbers or within excluded ranges. Returns -1 if no such colon is found.

Example

>>> last_field_colon_idx("Procedure: Amputation (Left: Toe) at 12:30")
9
def lindent(line: str) ‑> int

Convenience function to count the number of leading spaces in a string.
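
Example

A minimal illustration, assuming only literal space characters at the start of the string are counted:

>>> lindent("    indented line")
4
>>> lindent("no indent")
0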

def line_splits(line: str) ‑> list[tuple[int, str]]

Find column names from a line of text and return (start index, column name) tuples.

This function processes a given line of text to identify column names based on a regular expression. It returns a list of tuples, each containing the starting index of the column and the column text. If no matches are found, it returns a single tuple with the entire line. Preprocessing via split_unlikely_fields() allows for desired splits when spacing is otherwise inadequate to trigger a split based on gvars.LINE_SPLIT_REGEX alone.

Args

line : str
The input line of text to be processed.

Returns

list[tuple[int, str]]
A list of tuples, each containing the starting index and the column text.

Example

>>> line_splits("Begin End         Start Stop")
[(0, 'Begin'), (6, 'End'), (18, 'Start'), (24, 'Stop')]
def line_startswith_any(line: str | collections.abc.Sequence[str], starts_list: collections.abc.Sequence[str], strict=False) ‑> bool

Check if a line starts with any string from a list of start strings.

Args

line : str | Sequence[str]
The input line or sequence of lines to check. If not a string, checks occur against line[0].
starts_list : Sequence[str]
A list of start strings to check against the line.
strict : bool
Optional. If False, allows additional leading spaces. If True, the result is equivalent to line.startswith(tuple(starts_list)). Defaults to False.

Returns

bool
True if the line starts with any string from the starts_list, False otherwise.

Example

>>> line_startswith_any("    start of line", ["start", "begin"], strict=False)
True
>>> line_startswith_any("start of line", ["start", "begin"], strict=True)
True
>>> line_startswith_any("    start of line", ["start", "begin"], strict=True)
False
def lstripped_char_array(lines: collections.abc.Sequence[str], max_indent: int = -1) ‑> numpy.ndarray

Convert a list of strings to a character array and drop leading space columns up to a maximum indent.

Args

lines : list[str]
A list of strings to be converted and stripped.
max_indent : int
The maximum number of leading space columns to remove. Supply -1 to remove all leading space columns. Defaults to -1.

Returns

np.ndarray
The left stripped character array for the supplied lines.

Example

>>> lines = ["    line1", "    line2", "    line3"]
>>> lstripped_char_array(lines, max_indent=3)
array([[' ', 'l', 'i', 'n', 'e', '1'],
       [' ', 'l', 'i', 'n', 'e', '2'],
       [' ', 'l', 'i', 'n', 'e', '3']], dtype='<U1')
def match_ratio(string_1: collections.abc.Sequence, string_2: collections.abc.Sequence, zero_if_empty=False) ‑> float

Convenience function for calling the standard library's difflib.SequenceMatcher.ratio().
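
Example

An illustrative call, assuming the wrapper defers directly to difflib.SequenceMatcher with no preprocessing of its arguments (round() is used here only to keep the expected output stable):

>>> round(match_ratio("kitten", "sitten"), 2)
0.83
>>> match_ratio("kitten", "kitten")
1.0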

def most_freq_element(check_lists, tiebreak_func=<function vstr_confidence_tiebreak>, xform: collections.abc.Callable[[typing.Any], typing.Union[str, vStr, typing.Any]] = utilities.v_str.vStr)

Return the most frequent element from a list of lists.

Args

check_lists : list[list[Any]]
A list of lists containing elements to check.
tiebreak_func : Callable[[Any, Any], Any]
A function to break ties between elements with the same frequency. Default is vstr_confidence_tiebreak().
xform : Callable[[Any], str | vStr | Any]
A transformation function applied to each element. Defaults to vStr.

Returns

Any
The most frequent element after applying the transformation and tiebreak functions.

Example

>>> check_lists = [["a", "b", "a"], ["a", "c", "b", "b"]]
>>> tiebreak_func = lambda *args: sorted(args)[0]
>>> xform = str.upper
>>> most_freq_element(check_lists, tiebreak_func, xform)
'A'
>>> tiebreak_func = lambda *args: sorted(args)[-1]
>>> most_freq_element(check_lists, tiebreak_func, xform)
'B'
def multiline_check_start(lines: collections.abc.Sequence[str], line_start_tuples: collections.abc.Sequence[LineStart | tuple[int, str, bool, bool]]) ‑> bool

Test the start of multiple indexes in a list of strings.

This function checks whether specified lines in a sequence of strings start with given substrings. It allows for both strict and non-strict matching, and compares the results against expected truth values.

Args

lines : Sequence[str]
A sequence of strings to check.
line_start_tuples : Sequence[LineStart | tuple[int, str, bool, bool]]
A sequence of tuples. See LineStart docstring for additional details.

Returns

bool
True if all specified LineStart checks match their defined truth targets.

Example

>>> lines = ["    start of line", "another line", "  yet another line"]
>>> line_start_tuples = [(0, "start", False, True), (1, "another", True, True)]
>>> multiline_check_start(lines, line_start_tuples)
True
def pbar(iterable: collections.abc.Iterable[~T] = (), total: int = None, desc: str = None, postfix: str = None) ‑> collections.abc.Iterable[~T]

Set the global PBAR variable to a new tqdm instance for the given iterable.

This function initializes a new tqdm progress bar for the provided iterable and assigns it to the global PBAR variable. If PBAR is already an instance of tqdm, it is closed before creating a new instance. If logging to the console is enabled (gvars.LOG_CONSOLE), the function returns the iterable without creating a progress bar.

Args

iterable : Iterable[gvars.T]
Optional. The iterable to wrap with a progress bar. Defaults to an empty tuple.
total : int
Optional. The total number of iterations. If not provided, it will be inferred from the iterable.
desc : str
Optional. A description to display alongside the progress bar.
postfix : str
Optional. Additional information to display at the end of the progress bar.

Returns

Iterable[gvars.T]
The iterable wrapped with a tqdm progress bar, or the original iterable if logging to the console is enabled.

Example

>>> for item in pbar(range(10), desc="Processing"):
...     pass
def pbprint(msg, post=False, level=LogLevel.INFO, n=None)

Print a message to the tqdm progress bar or log it.

If gvars.LOG_CONSOLE is False, this function prints the message to the tqdm progress bar. It also forwards the message to logprint for recording in the log file or stdout. The msg parameter can be any object and will be returned to the caller unchanged when the log operation completes. This is convenient for logging the output of a function call even if that output is required in later operations.

Args

msg : Any
The message to print or log.
post : bool
Optional. If True, sets the message as the postfix of the progress bar. Defaults to False.
level : LogLevel
Optional. The log level for the message. Defaults to LogLevel.INFO.
n : int
Optional. The number of iterations to update the progress bar by. Defaults to None.

Returns

Any
The original message, unchanged.
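
Example

A sketch of the pass-through behavior described above, assuming console logging is disabled so nothing is written to stdout; the message text is illustrative only.

>>> value = pbprint("42 records loaded")
>>> value
'42 records loaded'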
def set_log_path(log_dir: str = None, log_path: str = None, overwrite: bool = False, log_console: bool = None)

Initialize global log objects, creating the path to the desired log location if it doesn't exist.

This function sets up the global logging configuration, including the log file path and console logging options. It ensures that the log directory exists and creates the log file if it does not already exist. If logging to the console is enabled or changed, it closes the existing log and updates the configuration.

Args

log_dir : str
Optional. The directory where the log file should be created. If not provided, the default directory is used.
log_path : str
Optional. The name of the log file. If not provided, a default name based on gvars.LOG_NAME and gvars.RUN_ID is used.
overwrite : bool
Optional. If True, the log file is overwritten. If False, log entries are appended to the existing file. Defaults to False.
log_console : bool
Optional. If True, enables logging to the console. If False, disables console logging. If not provided, the current console logging setting is used.
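
Example

A hypothetical setup call; the ./logs directory is an illustrative choice, not a project default.

>>> set_log_path(log_dir="./logs", overwrite=True, log_console=False)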
def split_unlikely_fields(line: str) ‑> list[tuple[int, str]]

Split a line by unlikely fields defined by a regular expression.

This function splits a given line into a list of tuples based on matches of the gvars.UNLIKELY_FIELDS_REGEX regular expression to account for inadequate spacing between columnar data. line_splits() should be called to fully segment the input line.

Args

line : str
The input string to be split.

Returns

list[tuple[int, str]]
A list of tuples where each tuple contains the starting index and the corresponding segment of the split string.

Example

>>> split_unlikely_fields("Begin End         Start Stop")
[(0, 'Begin'), (6, 'End         Start'), (24, 'Stop')]
>>> line_splits("Begin End         Start Stop")
[(0, 'Begin'), (6, 'End'), (18, 'Start'), (24, 'Stop')]
def strtobool(val: vStr | str | int | bool, strict: bool = True) ‑> bool

Convert a string representation of truth to a boolean value.

Args

val : vStr | str | int | bool
The value to convert to a boolean.
strict : bool
Optional. If True, raises a ValueError for invalid truth values. If False, assumes False for invalid values. Defaults to True.

Returns

bool
The boolean representation of the input value.

Raises

ValueError
If strict is True and the input value is not a valid truth representation.

Example

>>> strtobool("yes")
True
>>> strtobool("no")
False
>>> strtobool(1)
True
>>> strtobool(0)
False
>>> strtobool("invalid", strict=True)
Traceback (most recent call last):
    ...
ValueError: invalid truth value: invalid
def transpose_lists(iterable_of_iterable: collections.abc.Iterable[collections.abc.Iterable[~T]]) ‑> list[list[~T]]

Transpose a list of lists.

Args

iterable_of_iterable : Iterable[Iterable[gvars.T]]
An iterable of iterables to transpose.

Returns

list[list[gvars.T]]
A transposed list of lists.

Example

>>> iterable_of_iterable = [
...     ["a", "b", "c"],
...     ["1", "2", "3"],
...     ["4", "5", "6"]
... ]
>>> transpose_lists(iterable_of_iterable)
[['a', '1', '4'], ['b', '2', '5'], ['c', '3', '6']]
def trim_leading_spaces(lines: list[str]) ‑> list[str]

Remove leading columns consisting only of spaces from a list of strings.

This function processes a list of strings, converts them to a character array, and removes all leading columns that consist entirely of spaces. It returns the modified list of strings with leading spaces removed.

Args

lines : list[str]
The input list of strings to process.

Returns

list[str]
The list of strings with leading columns of spaces removed.

Example

>>> trim_leading_spaces(lines=["    abc", "  def", "    ghi"])
['  abc', 'def', '  ghi']
def trim_vertical_spaces(lines: str | collections.abc.Sequence[str], max_space: int = 5) ‑> str

Remove excess vertical whitespace between column values in a multiline string.

This function processes a multiline string representation of tabular values and removes excess vertical whitespace between column values. It ensures that the spacing between columns does not exceed the specified maximum space.

Args

lines : str | Sequence[str]
A multiline string or a sequence of strings representing the lines to process.
max_space : int
Optional. The maximum allowed vertical space between columns. Defaults to 5.

Returns

str
The processed multiline string with excess vertical whitespace removed.

Example

>>> lines = "col1    col2    col3\nval1    val2    val3"
>>> trim_vertical_spaces(lines, max_space=2)
'col1  col2  col3\nval1  val2  val3'
def tuple_w_next(val_list: collections.abc.Sequence[~T], final_val: ~T) ‑> list[tuple[~T, ~T]]

Return a list of tuples pairing each element of val_list with the next, with the final element paired with final_val.

Args

val_list : Sequence[gvars.T]
The input list of values to process.
final_val : gvars.T
The value to pair with the last element of val_list.

Returns

list[tuple[gvars.T, gvars.T]]
The list of paired value tuples.

Example

>>> tuple_w_next([1, 2, 3], 4)
[(1, 2), (2, 3), (3, 4)]
def us_state_abbr(state_string: str) ‑> str

Convert full state names and 4-char abbreviations to 2-char US state abbreviations.

Args

state_string : str
The state string to be converted.

Returns

str
The 2-character US state abbreviation.

Example

>>> us_state_abbr("California")
'CA'
>>> us_state_abbr("Cali")
'CA'
>>> us_state_abbr("CA")
'CA'
def vertical_split_idxs(char_arr: numpy.ndarray, min_v_split: int, min_len: int = 5) ‑> list[int]

Get the indices of the middlemost member of each contiguous all-spaces column set.

Args

char_arr : np.ndarray
A NumPy character array representing the lines of text.
min_v_split : int
The minimum number of contiguous space columns required for a split.
min_len : int
Optional. The minimum length of non-space content required for a valid split. Defaults to 5.

Returns

list[int]
A list of indexes representing the middlemost columns of contiguous space columns. Always includes 0 as the first element and the subarray length as the final element.

Example

>>> lines = ['aaaa       bbbbb    ccc   dddd',
...          'aaa          bbb    cc    ddd ',
...          'aaaaaa             cccc       ']
>>> char_arr = np.array([list(line) for line in lines])
>>> vertical_split_idxs(char_arr, min_v_split=2, min_len=2)
[0, 8, 17, 24, 30]
def vstr_confidence_tiebreak(value1: str | vStr, value2: str | vStr) ‑> str | vStr

Select the vStr argument with the highest confidence.

Args

value1 : str | vStr
first value to compare
value2 : str | vStr
second value to compare

Returns

str | vStr
The value with the highest confidence if both are vStrs. If only one is a vStr, return the vStr. If both are vStrs of equal confidence OR neither value is a vStr, return the value having the greatest length.
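
Example

A minimal illustration using plain strings, which fall through to the length comparison described above:

>>> vstr_confidence_tiebreak("short", "a longer value")
'a longer value'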

Classes

class FileContentsEntry (*args, **kwargs)

A typed dict for entries in a pdf_extractor.py function output.

Attributes

lines : list[str]
a list containing all lines of extracted text.
src_docs : list[str]
a list, one per line in lines, detailing the PDF of origin for each line.
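
Example

A minimal entry constructed directly, since a typed dict is instantiated like a plain dict; the line text and file name are illustrative only.

>>> entry = FileContentsEntry(lines=["Patient: John Doe"], src_docs=["chart_01.pdf"])
>>> entry["src_docs"]
['chart_01.pdf']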

Ancestors

  • builtins.dict

Class variables

var lines : list[str]
var src_docs : list[str]
class KeyDeduper

Creates a reference used to deduplicate keys in a dict comprehension.

Methods

def clear(self) ‑> bool

Call to reinitialize deduping during nested comprehensions.

Returns

bool
always returns True to allow resets in outer loop conditionals.
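
Example

A minimal illustration of the always-True return, which lets clear() sit inside an outer-loop conditional of a nested comprehension; the no-argument construction is an assumption.

>>> deduper = KeyDeduper()
>>> deduper.clear()
True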
class LineStart (offset: int, startswith: str | tuple[str, ...], strict: bool = True, truth_target: bool = True)

Line start tuple. Used with multiline_check_start to define which line indexes should start with which strings.

Attributes

offset : int
Line index in the list of lines to check.
startswith : str | tuple[str, ...]
String (or tuple of strings) to check for at the start of the line.
strict : bool
If True, the line must start with startswith exactly. If False, case is ignored and additional leading spaces are allowed before startswith. Default is True.
truth_target : bool
True when the line must match startswith. False when the line must NOT match startswith. Default is True.
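
Example

An instance equivalent to the plain tuple (0, "start", False, True) used in the multiline_check_start example:

>>> LineStart(offset=0, startswith="start", strict=False, truth_target=True)
LineStart(offset=0, startswith='start', strict=False, truth_target=True)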

Ancestors

  • builtins.tuple

Instance variables

var offset : int

Alias for field number 0

var startswith : str | tuple[str, ...]

Alias for field number 1

var strict : bool

Alias for field number 2

var truth_target : bool

Alias for field number 3