Module utilities.utils
Utility functions with global scope
Functions
def as_date(str_or_dt: datetime.date | datetime.datetime | str, date_only=False, as_dt=False, futr=False, warn=True)
-
Convert a date, datetime, or string into a formatted date string or datetime object.
This function converts the input str_or_dt into a date string formatted according to gvars.DATE_FORMAT. If date_only is True, it returns only the date part with '00:00:00' as the time. If as_dt is True, it returns a datetime object instead of a string. If the input cannot be converted, it returns an empty string and logs a warning if warn is True. It also handles future dates by adjusting them if futr is False.
Args
str_or_dt
:date | datetime | str
- The input date, datetime, or string to convert.
date_only
:bool
- Optional. If True, returns only the date part. Defaults to False.
as_dt
:bool
- Optional. If True, returns a datetime object. Defaults to False.
futr
:bool
- Optional. If False, adjusts future dates to be within the past 100 years. Defaults to False.
warn
:bool
- Optional. If True, logs a warning when conversion fails. Defaults to True.
Returns
str | datetime
- The formatted date string or datetime object, or an empty string if conversion fails.
Example
>>> as_date("2020-12-31T23:59")
'2020-12-31 23:59:00'
>>> as_date("2020-12-31", date_only=True)
'2020-12-31'
>>> as_date("2020-12-31", as_dt=True)
datetime.datetime(2020, 12, 31, 0, 0)
>>> as_date("not a date")
''
def as_name(name_string: str | vStr | list[str | vStr]) ‑> vStr
-
Returns name_string formatted as a name via the nameparser.HumanName class.
If name_string is passed as a list, call as_name() recursively on each element and return the most frequent as_name() result.
Args
name_string
:vStr | str | list[str | vStr]
- The input name string or list of name strings to format.
Returns
vStr
- The formatted name string.
Example
>>> as_name("Primary: John Doe")
'DOE, JOHN'
>>> as_name(["John Doe", "Jane Smith", "John Doe"])
'DOE, JOHN'
def as_phone(phone_string: str | list[str]) ‑> str
-
Convert an input str to international format or return an empty string if invalid.
If a list is supplied, each element is converted individually and the most frequently occurring successfully converted value is returned.
Args
phone_string
:str | list[str]
- The input phone number string or list of phone number strings to process.
Returns
str
- The phone number in standard international format, or an empty string if the input is invalid.
Example
>>> as_phone("123-456-7890")  # this is NOT a valid US phone number
''
>>> as_phone("843-563-7890")  # this IS a valid US phone number
'+1 843-563-7890'
>>> as_phone(["123-456-7890", "123-456-7890", "843-563-7890"])
'+1 843-563-7890'
def as_time(time_str: datetime.datetime | str, default_date: str = '', with_date: bool = True, strict: bool = False) ‑> str | vStr
-
Format time info in the input to standard hh24:mm format, e.g. 19:05. Includes the date in the output by default if such is present.
Use has_time() (not this function) to determine if a string contains time info.
Args
time_str
:datetime | str
- time string to format
with_date
:bool
- include date in output. Default True.
default_date
:str
- date to use if time_str does not include a date. Overridden by gvars.DEFAULT_DATE if no value is supplied.
strict
:bool
- if True, return empty string if default_date is falsy and time_str does not include a date. Default is False.
Returns
str
- YYYY-MM-DD hh24:mm if with_date==True and time_str includes a date, hh24:mm if with_date=False or time_str does not include a date, and an empty string if time_str does not contain time data at all.
Example
>>> as_time("2020-12-31T23:59")
'2020-12-31 23:59'
>>> as_time("23:59", default_date="2020-12-31")
'2020-12-31 23:59'
>>> as_time("23:59", with_date=False)
'23:59'
>>> as_time("not a time")
''
def check_indent(line: str | collections.abc.Sequence[str], has_indent: int, strict: bool) ‑> bool
-
Check if a line is indented by a specified number of characters.
Args
line
:str | Sequence[str]
- The input line or sequence of lines to check.
has_indent
:int
- The number of characters to check for indentation.
strict
:bool
- If True, checks for exact indentation. If False, checks for at least the specified indentation.
Returns
bool
- True if the line meets the indentation criteria, False otherwise.
Example
>>> check_indent("    indented line", 4, True)
True
>>> check_indent("      indented line", 4, False)
True
>>> check_indent("      indented line", 4, True)
False
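The strict and non-strict modes described above can be sketched in a few lines. This is only an illustration: `check_indent_sketch` is a hypothetical name, and the handling of non-string input (checking `line[0]`) is an assumption borrowed from the `line_startswith_any` documentation rather than confirmed behavior of `check_indent`.

```python
from __future__ import annotations

from collections.abc import Sequence


def check_indent_sketch(line: str | Sequence[str], has_indent: int, strict: bool) -> bool:
    """Return True if the line's leading-space count meets the indent rule."""
    if not isinstance(line, str):
        # Assumption: for a sequence of lines, only the first line is checked.
        line = line[0]
    # Count leading spaces only (tabs are not treated as indentation here).
    indent = len(line) - len(line.lstrip(" "))
    # strict: exact match; non-strict: at least the requested indent.
    return indent == has_indent if strict else indent >= has_indent


print(check_indent_sketch("    indented line", 4, True))     # True
print(check_indent_sketch("      indented line", 4, False))  # True
print(check_indent_sketch("      indented line", 4, True))   # False
```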
def check_start(line: str | collections.abc.Sequence[str], has_start: str, strict: bool) ‑> bool
-
Check if a line starts with a specified string, with optional leading spaces.
Args
line
:str | Sequence[str]
- The input line or sequence of lines to check.
has_start
:str
- The string to check for at the start of the line.
strict
:bool
- If True, checks for an exact match at the start. If False, allows additional leading spaces before the start string.
Returns
bool
- True if the line starts with the specified string, False otherwise.
Example
>>> check_start(" start of line", "start", False)
True
>>> check_start("start of line", "start", True)
True
>>> check_start(" start of line", "start", True)
False
def close_log()
-
Write buffered log data to the log file on disk.
def columns(lines: list[str], min_rows: int = 2, min_split_spaces: int = 1, **kwargs) ‑> list[list[str]]
-
Segment a list of lines of text representing a table into logical columns.
Args
lines
:list[str]
- A list of strings representing the rows of a table.
min_rows
:int
- Optional. Minimum number of rows to consider for a page break. Defaults to 2.
min_split_spaces
:int
- Optional. Minimum number of contiguous spaces to consider for a column split. Defaults to 1.
KwArgs
force_page_breaks
:list[Callable[[str], bool]]
- Optional. List of functions to which each line is passed. If any returns True, parse lines from that point forward as an independent area in the table.
debug
:bool
- Optional. Enable debug logging. Defaults to False.
min_column_width
:int
- Optional. Minimum width of columns. Defaults to 5.
min_val_length
:int
- Optional. Alias for min_column_width.
split_table_columns
:list
- Optional. List of predefined column names that should not be 'munged' when condensing the output to the header columns.
Returns
list[list[str]]
- A list of lists, where each inner list represents a row with its columns split appropriately.
Example
>>> lines = [
...     ' COL1 COL2 COL3 COL4',
...     ' VALUE1-1 VALUE2-1 VALUE3-1 VALUE4-1',
...     'VAL1-2 VAL2-2 VAL3-2 VAL4-2',
... ]
>>> columns(lines)
[['COL1', 'COL2', 'COL3', 'COL4'],
 ['VALUE1-1', 'VALUE2-1', 'VALUE3-1', 'VALUE4-1'],
 ['VAL1-2', 'VAL2-2', 'VAL3-2', 'VAL4-2']]
def condense_to_headings(raw_vals_list: list[list[str]], split_cols: list[str] | None = None) ‑> list[list[str]]
-
Combine values for which the column header is empty with the previous column value.
This function processes a list of rows, combining values in columns where the header is an empty string with the previous column's value. It then removes the columns with empty headers. This is useful for condensing tables where values that contain multiple spaces extend beyond their respective column heading.
Args
raw_vals_list
:list[list[str]]
- A list of rows, where each row is a list of column values.
split_cols
:list[str]
- Optional. A list of predefined column names used to identify additional header rows in wrapped table formats.
Returns
list[list[str]]
- A list of rows with condensed column values and headers.
Example
>>> raw_vals_list = [
...     ['heading1', '', 'heading2', '', 'heading3'],
...     ['value1-1', 'continued', 'value2-1', 'continued', 'value3-1'],
...     ['value1-2', '', 'value2-2', '', 'value3-2'],
... ]
>>> split_cols = []
>>> condense_to_headings(raw_vals_list, split_cols)
[['heading1', 'heading2', 'heading3'],
 ['value1-1 continued', 'value2-1 continued', 'value3-1'],
 ['value1-2', 'value2-2', 'value3-2']]
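The merging rule described for condense_to_headings can be illustrated with a small standalone sketch. It assumes the first row is the header row and ignores `split_cols` entirely; `condense_to_headings_sketch` is a hypothetical name, not the module's implementation.

```python
def condense_to_headings_sketch(raw_vals_list: list[list[str]]) -> list[list[str]]:
    """Fold values under empty headers into the previous column, then drop those columns."""
    header = raw_vals_list[0]  # assumption: the first row holds the headings
    condensed = []
    for row in raw_vals_list:
        new_row: list[str] = []
        for heading, value in zip(header, row):
            if heading or not new_row:
                # A real heading (or the very first column) starts a new output column.
                new_row.append(value)
            elif value:
                # Empty heading: append the overflow text to the previous column.
                new_row[-1] = f"{new_row[-1]} {value}".strip()
        condensed.append(new_row)
    return condensed


rows = [
    ["heading1", "", "heading2", "", "heading3"],
    ["value1-1", "continued", "value2-1", "continued", "value3-1"],
    ["value1-2", "", "value2-2", "", "value3-2"],
]
for row in condense_to_headings_sketch(rows):
    print(row)
```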
def contiguous_ints(ints: collections.abc.Collection[int]) ‑> list[tuple[int, int]]
-
Given a list of integers, return a list of tuples representing the first and last integer for any section of the list that corresponds to a set of contiguous integers.
Args
ints
:Collection[int]
- A collection of integers to process.
Returns
list[tuple[int, int]]
- A list of tuples, each containing the start and end integers of a contiguous sequence.
Example
>>> lst = [1, 3, 2, 5, 6, 7, 8, 9, 12]
>>> contiguous_ints(lst)
[(1, 3), (5, 9), (12, 12)]
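One way to produce runs like those in the example is to sort the values and extend the current run whenever the next value is exactly one greater. The sketch below uses a hypothetical name, and the sorting and de-duplication step is an assumption inferred from the example, where the unordered input 1, 3, 2 yields the run (1, 3).

```python
from collections.abc import Collection


def contiguous_ints_sketch(ints: Collection[int]) -> list[tuple[int, int]]:
    """Group integers into (first, last) tuples of consecutive values."""
    runs: list[tuple[int, int]] = []
    # Assumption: input order and duplicates do not matter.
    for value in sorted(set(ints)):
        if runs and value == runs[-1][1] + 1:
            # Extend the current run by one.
            runs[-1] = (runs[-1][0], value)
        else:
            # Start a new run containing only this value.
            runs.append((value, value))
    return runs


print(contiguous_ints_sketch([1, 3, 2, 5, 6, 7, 8, 9, 12]))
# [(1, 3), (5, 9), (12, 12)]
```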
def deduped_key(value_dict: dict[str, typing.Any], this_key: str, value: Any = None) ‑> str
-
Ensure a key is unique within a dictionary, adjusting it if necessary.
If value_dict contains this_key, append a numeric suffix or, if present, replace the array index in this_key with increasing numerical strings until it is no longer present in value_dict. Replaces occurrences of '[*]' in this_key with '[0]' by default.
Args
value_dict
:dict[str, Any]
- The dictionary to check for key uniqueness.
this_key
:str
- The key to ensure is unique.
value
:Any
- Optional. The value associated with the key. If supplied and equal to value_dict[this_key], return this_key without deduping.
Returns
str
- A unique key that is not present in the dictionary.
Example
>>> value_dict = {"Provider[0]": "Dr. Smith, MD", "Diagnosis": "GERD"}
>>> deduped_key(value_dict, "Provider[*]")
'Provider[1]'
>>> deduped_key(value_dict, "Provider[0]", "Dr. Smith, MD")
'Provider[0]'
>>> deduped_key(value_dict, "Diagnosis", "Acid Reflux")
'Diagnosis.0'
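The deduplication rules above can be approximated as follows. This is a sketch under stated assumptions, not the module's code: `deduped_key_sketch` is a hypothetical name, only the first `[n]` index in the key is incremented, and the numeric suffix format (`.0`, `.1`, ...) is inferred from the example output.

```python
import re
from typing import Any


def deduped_key_sketch(value_dict: dict[str, Any], this_key: str, value: Any = None) -> str:
    """Return a key that does not collide with existing keys in value_dict."""
    key = this_key.replace("[*]", "[0]")  # documented default for wildcard indexes
    if key not in value_dict:
        return key
    if value is not None and value_dict[key] == value:
        # Same key and same value: treat as the same entry, no deduping.
        return key
    match = re.search(r"\[(\d+)\]", key)
    if match:
        # Assumption: bump the first array index until the key is free.
        prefix, tail = key[: match.start()], key[match.end():]
        index = int(match.group(1))
        while key in value_dict:
            index += 1
            key = f"{prefix}[{index}]{tail}"
        return key
    # No array index: append an increasing numeric suffix.
    suffix = 0
    while f"{key}.{suffix}" in value_dict:
        suffix += 1
    return f"{key}.{suffix}"


existing = {"Provider[0]": "Dr. Smith, MD", "Diagnosis": "GERD"}
print(deduped_key_sketch(existing, "Provider[*]"))               # Provider[1]
print(deduped_key_sketch(existing, "Diagnosis", "Acid Reflux"))  # Diagnosis.0
```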
def dictify_dict_list(dict_list: list[dict[str, str]]) ‑> dict[str, list[str]]
-
Transforms a list of dicts into a dict of lists with standard lengths.
Args
dict_list
:list[dict[str, str]]
- A list of dictionaries representing the rows of an extracted table.
Returns
dict[str, list[str]]
- A dictionary with the set of keys across all input dicts as keys and a list of strings of a common length representing the value from each input dict 'row'.
Example
>>> dictify_dict_list(
...     [
...         {"key1": "value1", "key2": "value2"},
...         {"key1": "value1", "key2": "value2", "extra key": "I'm extra!"},
...         {"key1": "value1", "key2": "value2"},
...     ]
... )
{'key1': ['value1', 'value1', 'value1'], 'key2': ['value2', 'value2', 'value2'], 'extra key': ['', "I'm extra!", '']}
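A minimal sketch of the same transformation, assuming that a missing key is filled with an empty string (as the example output suggests) and that keys keep their first-seen order. `dictify_dict_list_sketch` is a hypothetical name.

```python
def dictify_dict_list_sketch(dict_list: list[dict[str, str]]) -> dict[str, list[str]]:
    """Turn a list of row dicts into a dict of equal-length column lists."""
    keys: list[str] = []
    for row in dict_list:
        for key in row:
            if key not in keys:
                keys.append(key)  # preserve first-seen key order
    # Every column gets one entry per row; absent values become ''.
    return {key: [row.get(key, "") for row in dict_list] for key in keys}


rows = [
    {"key1": "value1", "key2": "value2"},
    {"key1": "value1", "key2": "value2", "extra key": "I'm extra!"},
    {"key1": "value1", "key2": "value2"},
]
print(dictify_dict_list_sketch(rows)["extra key"])
# ['', "I'm extra!", '']
```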
def dq_field_colon_ranges(line: str) ‑> list[tuple[int, int]]
-
List start and stop positions of ranges that cannot contain field colons.
Currently excludes all text within parentheticals, such as the ':' following 'Left' in the string 'Procedure: Amputation (Left: Toe)'.
Args
line
:str
- The input string to analyze.
Returns
list[tuple[int, int]]
- A list of tuples, each containing the start and stop positions of ranges that cannot contain field colons.
Example
>>> dq_field_colon_ranges("Procedure: Amputation (Left: Toe)")
[(22, 33)]
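The (start, stop) pairs in the example are consistent with the span of each parenthetical, with an exclusive stop index. A regular-expression sketch, assuming non-nested parentheses and using a hypothetical function name:

```python
import re


def dq_field_colon_ranges_sketch(line: str) -> list[tuple[int, int]]:
    """Return (start, stop) spans of parentheticals; stop is exclusive."""
    # Assumption: parentheticals are not nested.
    return [(m.start(), m.end()) for m in re.finditer(r"\([^()]*\)", line)]


print(dq_field_colon_ranges_sketch("Procedure: Amputation (Left: Toe)"))
# [(22, 33)]
```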
def exec_on_exit(*args, **kwargs)
-
Wrapper function for forcing execution of a function when IPython execution completes or upon interpreter exit for cmd line execution.
def extract_numerics(string: str | list[str], as_list: bool = False, joined: bool = False, tiebreak_func: collections.abc.Callable[[str, str], str] = <function <lambda>>) ‑> str | list[str]
-
Extract numeric substrings from a given string or list of strings.
If the input is a list of strings, the list is joined into a single string prior to extraction.
Args
string
:str | list[str]
- The input string or list of strings to process.
as_list
:bool
- Optional. If True, returns a list of numeric substrings. If False, returns the most frequently occurring numeric substring or the concatenated numeric substrings if joined is True. Defaults to False.
joined
:bool
- Optional. If True, concatenates all numeric substrings into a single string. Only used if as_list is False. Defaults to False.
tiebreak_func
:Callable[[str, str], str]
- Optional. A function to break ties between numeric substrings of the same frequency. Defaults to a function that returns the longer substring. Ignored if either as_list or joined is set to True.
Returns
str | list[str]
- A list of numeric substrings or a numeric string.
Example
>>> extract_numerics("abc123def456", as_list=True)
['123', '456']
>>> extract_numerics("abc123def456", joined=True)
'123456'
>>> tiebreak_func = lambda x, y: y if int(y) > int(x) else x
>>> extract_numerics(["abc123", "def456", "ghi98"], tiebreak_func=tiebreak_func)
'456'
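The interaction between as_list, joined, and tiebreak_func can be sketched as below. Several details are assumptions made for illustration: "numeric substring" is taken to mean a run of digits, list input is joined with a space, and an empty string is returned when no digits are found. `extract_numerics_sketch` is a hypothetical name.

```python
from __future__ import annotations

import re
from collections import Counter
from collections.abc import Callable
from functools import reduce


def extract_numerics_sketch(
    string: str | list[str],
    as_list: bool = False,
    joined: bool = False,
    tiebreak_func: Callable[[str, str], str] = lambda x, y: y if len(y) > len(x) else x,
) -> str | list[str]:
    if isinstance(string, list):
        string = " ".join(string)  # assumption: the separator is a single space
    numerics = re.findall(r"\d+", string)  # assumption: digits only, no decimals
    if as_list:
        return numerics
    if joined:
        return "".join(numerics)
    if not numerics:
        return ""  # assumption: nothing found yields an empty string
    counts = Counter(numerics)
    top = max(counts.values())
    tied = [num for num, count in counts.items() if count == top]
    # Resolve ties pairwise with the supplied function.
    return reduce(tiebreak_func, tied)


bigger = lambda x, y: y if int(y) > int(x) else x
print(extract_numerics_sketch(["abc123", "def456", "ghi98"], tiebreak_func=bigger))
# 456
```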
def find_page_breaks(lines: collections.abc.Sequence[str], min_rows: int, force_page_breaks: list[collections.abc.Callable[[str], bool]] | None = None) ‑> list[tuple[int, int]]
-
Find page breaks by looking for repeating column headers.
This function identifies page breaks in a list of lines by detecting repeating column headers. It returns a list of tuples, each representing the start and end indices of a page in the lines. The function also considers forced page breaks provided by the user.
Args
lines
:list[str]
- A list of strings representing the lines to analyze.
min_rows
:int
- The minimum number of rows required for a valid page.
force_page_breaks
:list[Callable[[str], bool]]
- Optional. A list of functions that take a line as input and return True to force a page break.
Returns
list[tuple[int, int]]
- A list of tuples, each containing the start and end indices of a 'page' in the lines.
Example
>>> lines = [
...     'Header1 Header2 Header3',
...     'Value1-1 Value2-1 Value3-1',
...     'Value1-2 Value2-2 Value3-2',
...     ' Header1 Header2 Header3',
...     ' Value1-3 Value2-3 Value3-3',
... ]
>>> find_page_breaks(lines, min_rows=1)
[(0, 3), (3, 5)]
def first_field_colon_idx(line: str) ‑> int
-
Find the index of the first colon not surrounded by numbers.
This function identifies the index of the first colon in a string that is not part of a time structure (e.g., "##:##") or contained within a parenthetical.
Args
line
:str
- The input string to analyze.
Returns
int
- The index of the first colon not surrounded by numbers or within
excluded ranges. Returns -1 if no such colon is found.
Example
>>> first_field_colon_idx("(Left: Toe) at 12:30 Diagnosis: Diabetes Mellitus")
33
def flatten_list_nest(list_nest: Any) ‑> Any
-
Recursively flatten a nested list into a single-depth list.
This function takes a nested list and returns a new list with all nested elements flattened to a single depth.
Args
list_nest
- The nested list to flatten.
Returns
Any
- A single-depth list containing all elements from the nested list.
If list_nest is 'falsy' or not a list, return it unchanged (hence
the
Any
return type.)
Example
>>> flatten_list_nest([1, [2, [3, 4], 5], 6])
[1, 2, 3, 4, 5, 6]
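The recursion and the pass-through of non-list input can be sketched as follows. `flatten_list_nest_sketch` is a hypothetical name, and only `list` instances are treated as nesting here, which is an assumption.

```python
from typing import Any


def flatten_list_nest_sketch(list_nest: Any) -> Any:
    """Flatten nested lists to a single depth; return other input unchanged."""
    if not list_nest or not isinstance(list_nest, list):
        # Falsy values and non-lists are returned as-is.
        return list_nest
    flat: list[Any] = []
    for item in list_nest:
        if isinstance(item, list):
            # Recurse into nested lists and splice their elements in place.
            flat.extend(flatten_list_nest_sketch(item))
        else:
            flat.append(item)
    return flat


print(flatten_list_nest_sketch([1, [2, [3, 4], 5], 6]))
# [1, 2, 3, 4, 5, 6]
```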
def has_time(val: datetime.datetime | str)
-
Determine if a value contains time information.
This function checks if the given value, either a datetime object or a string, contains interpretable hours and minutes of a datetime. It handles short date formats and strings that may not initially appear to contain time information.
Args
val
:datetime | str
- The value to check.
Returns
bool
- True if the value contains time information, False otherwise.
Example
>>> has_time("2020-12-31T23:59")
True
>>> has_time("2020-12-31")
False
>>> has_time(datetime(2020, 12, 31, 23, 59))
True
def is_date(val: datetime.date | datetime.datetime | str)
-
Determine if a value can be interpreted as a date.
This function checks if the given value is a date, datetime, or a string that can be interpreted as a date. It handles short date formats and strings that include additional information such as age.
Args
val
:date | datetime | str
- The value to check.
Returns
bool
- True if the value can be interpreted as a date, False otherwise.
Example
>>> is_date("12/31/2020")
True
>>> is_date("12/31/2020 (age 30 yrs)")
True
>>> is_date("not a date")
False
def key_value_split(key_val_string: str) ‑> tuple[str, typing.Any]
-
Split a string into a key-value pair at the first 'field colon'.
See first_field_colon_idx() for additional details. Recursively processes the value if the initially extracted value contains additional field colons.
Args
key_val_string
:str
- The input string to split into a key-value pair.
Returns
tuple[str, Any]
- A tuple containing the key and value split at the first 'field colon'.
Example
>>> key_val_string = "Procedure: Amputation (Left: Toe) at 12:30"
>>> key_value_split(key_val_string)
('Procedure', 'Amputation (Left: Toe) at 12:30')
def key_value_strings(line: str) ‑> list[str]
-
Extract key-value pairs from a string.
This function processes a given line and returns a list of strings, each representing a key-value pair. It first inserts a space after any colon that is immediately followed by an uppercase letter or digit. Then, it splits the line using the split_unlikely_fields() function and applies a regular expression to find key-value pairs.
Args
line
:str
- The input string to be processed.
Returns
list[str]
- A list of strings, each containing a key-value pair.
Example
>>> line = 'A:B C:D'
>>> key_value_strings(line)
['A: B', 'C: D']
def last_field_colon_idx(line: str) ‑> int
-
Find the index of the last colon not surrounded by numbers.
This function identifies the index of the last colon in a string that is not part of a time structure (e.g., "##:##") or contained within a parenthetical.
Args
line
:str
- The input string to analyze.
Returns
int
- The index of the last colon not surrounded by numbers or within excluded ranges. Returns -1 if no such colon is found.
Example
>>> last_field_colon_idx("Procedure: Amputation (Left: Toe) at 12:30")
9
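The "field colon" rule shared by first_field_colon_idx() and last_field_colon_idx() can be sketched as a filter over all colons in the line. The function names below are hypothetical, and two details are assumptions: a colon counts as part of a time only when it has a digit on both sides, and parentheticals are not nested.

```python
import re


def field_colon_idxs_sketch(line: str) -> list[int]:
    """Indexes of colons that are neither inside a time nor inside parentheses."""
    excluded = [(m.start(), m.end()) for m in re.finditer(r"\([^()]*\)", line)]
    idxs = []
    for i, char in enumerate(line):
        if char != ":":
            continue
        # Time-like colon: a digit directly before and directly after.
        in_time = 0 < i < len(line) - 1 and line[i - 1].isdigit() and line[i + 1].isdigit()
        in_parens = any(start <= i < stop for start, stop in excluded)
        if not in_time and not in_parens:
            idxs.append(i)
    return idxs


def first_field_colon_idx_sketch(line: str) -> int:
    idxs = field_colon_idxs_sketch(line)
    return idxs[0] if idxs else -1


def last_field_colon_idx_sketch(line: str) -> int:
    idxs = field_colon_idxs_sketch(line)
    return idxs[-1] if idxs else -1


print(last_field_colon_idx_sketch("Procedure: Amputation (Left: Toe) at 12:30"))
# 9
```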
def lindent(line: str) ‑> int
-
Convenience function to count the spaces at the start of a string.
def line_splits(line: str) ‑> list[tuple[int, str]]
-
Find column names from a line of text and return (start index, column name) tuples.
This function processes a given line of text to identify column names based on a regular expression. It returns a list of tuples, each containing the starting index of the column and the column text. If no matches are found, it returns a single tuple with the entire line. Preprocessing via split_unlikely_fields() allows for desired splits when spacing is otherwise inadequate to trigger a split based on gvars.LINE_SPLIT_REGEX alone.
Args
line
:str
- The input line of text to be processed.
Returns
list[tuple[int, str]]
- A list of tuples, each containing the starting index
and the column text.
Example
>>> line_splits("Begin End         Start Stop")
[(0, 'Begin'), (6, 'End'), (18, 'Start'), (24, 'Stop')]
def line_startswith_any(line: str | collections.abc.Sequence[str], starts_list: collections.abc.Sequence[str], strict=False) ‑> bool
-
Check if a line starts with any string from a list of start strings.
Args
line
:str | Sequence[str]
- The input line or sequence of lines to check. If not a string, checks occur against line[0].
starts_list
:Sequence[str]
- A list of start strings to check against the line.
strict
:bool
- Optional. If False, allows additional leading spaces. If True, the result is equivalent to line.startswith(tuple(starts_list)). Defaults to False.
Returns
bool
- True if the line starts with any string from starts_list, False otherwise.
Example
>>> line_startswith_any(" start of line", ["start", "begin"], strict=False)
True
>>> line_startswith_any("start of line", ["start", "begin"], strict=True)
True
>>> line_startswith_any(" start of line", ["start", "begin"], strict=True)
False
def lstripped_char_array(lines: collections.abc.Sequence[str], max_indent: int = -1) ‑> numpy.ndarray
-
Convert a list of strings to a character array and drop leading space columns up to a maximum indent.
Args
lines
:list[str]
- A list of strings to be converted and stripped.
max_indent
:int
- The maximum number of leading space columns to remove. Supply -1 to remove all leading space columns. Defaults to -1.
Returns
np.ndarray
- The left stripped character array for the supplied lines.
Example
>>> lines = ["    line1", "    line2", "    line3"]
>>> lstripped_char_array(lines, max_indent=3)
array([[' ', 'l', 'i', 'n', 'e', '1'],
       [' ', 'l', 'i', 'n', 'e', '2'],
       [' ', 'l', 'i', 'n', 'e', '3']], dtype='<U1')
def match_ratio(string_1: collections.abc.Sequence, string_2: collections.abc.Sequence, zero_if_empty=False) ‑> float
-
Convenience function for calling the standard library's difflib.SequenceMatcher.ratio().
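For orientation, a wrapper of this kind might look like the sketch below. `SequenceMatcher` comes from the standard library's `difflib` module; the meaning given to `zero_if_empty` (return 0.0 when either input is empty) is an assumption based on the parameter name, and `match_ratio_sketch` is a hypothetical name.

```python
from collections.abc import Sequence
from difflib import SequenceMatcher


def match_ratio_sketch(string_1: Sequence, string_2: Sequence, zero_if_empty: bool = False) -> float:
    """Similarity ratio in [0, 1] between two sequences."""
    if zero_if_empty and (not string_1 or not string_2):
        # Assumption: empty input should not count as a match.
        return 0.0
    return SequenceMatcher(None, string_1, string_2).ratio()


print(match_ratio_sketch("abcd", "bcde"))  # 0.75
print(match_ratio_sketch("", ""))          # 1.0 (difflib treats two empty sequences as identical)
```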
def most_freq_element(check_lists, tiebreak_func=<function vstr_confidence_tiebreak>, xform: collections.abc.Callable[[typing.Any], typing.Union[str, vStr, typing.Any]] = utilities.v_str.vStr)
-
Return the most frequent element from a list of lists.
Args
check_lists
:list[list[Any]]
- A list of lists containing elements to check.
tiebreak_func
:Callable[[Any, Any], Any]
- A function to break ties between elements with the same frequency. Default is vstr_confidence_tiebreak().
xform
:Callable[[Any], str | vStr | Any]
- A transformation function applied to each element. Defaults to vStr.
Returns
Any
- The most frequent element after applying the transformation and tiebreak functions.
Example
>>> check_lists = [["a", "b", "a"], ["a", "c", "b", "b"]]
>>> tiebreak_func = lambda *args: sorted(args)[0]
>>> xform = str.upper
>>> most_freq_element(check_lists, tiebreak_func, xform)
'A'
>>> tiebreak_func = lambda *args: sorted(args)[-1]
>>> most_freq_element(check_lists, tiebreak_func, xform)
'B'
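A frequency count followed by a pairwise tiebreak reproduces the behavior shown in the example. The sketch below is illustrative only: `most_freq_element_sketch` is a hypothetical name, `str` stands in for the `vStr` default, and returning None for empty input is an assumption.

```python
from collections import Counter
from collections.abc import Callable, Iterable
from functools import reduce
from itertools import chain
from typing import Any


def most_freq_element_sketch(
    check_lists: Iterable[Iterable[Any]],
    tiebreak_func: Callable[[Any, Any], Any],
    xform: Callable[[Any], Any] = str,
) -> Any:
    """Most frequent transformed element across all inner lists."""
    counts = Counter(xform(item) for item in chain.from_iterable(check_lists))
    if not counts:
        return None  # assumption: no elements means no result
    top = max(counts.values())
    tied = [item for item, count in counts.items() if count == top]
    # Reduce the tied candidates to one winner, two at a time.
    return reduce(tiebreak_func, tied)


lists = [["a", "b", "a"], ["a", "c", "b", "b"]]
print(most_freq_element_sketch(lists, lambda *args: sorted(args)[0], str.upper))   # A
print(most_freq_element_sketch(lists, lambda *args: sorted(args)[-1], str.upper))  # B
```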
def multiline_check_start(lines: collections.abc.Sequence[str], line_start_tuples: collections.abc.Sequence[LineStart | tuple[int, str, bool, bool]]) ‑> bool
-
Test the start of multiple indexes in a list of strings.
This function checks whether specified lines in a sequence of strings start with given substrings. It allows for both strict and non-strict matching, and compares the results against expected truth values.
Args
lines
:Sequence[str]
- A sequence of strings to check.
line_start_tuples
:Sequence[LineStart | tuple[int, str, bool, bool]]
- A sequence of tuples. See LineStart docstring for additional details.
Returns
bool
- True if all specified LineStart checks match their defined truth targets.
Example
>>> lines = [" start of line", "another line", " yet another line"]
>>> line_start_tuples = [(0, "start", False, True), (1, "another", True, True)]
>>> multiline_check_start(lines, line_start_tuples)
True
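The way each tuple is checked against its line can be sketched with a small named tuple. Both names below are hypothetical stand-ins for `LineStart` and `multiline_check_start`; the sketch only models the leading-space relaxation of non-strict mode and omits the case-insensitive matching mentioned in the LineStart docstring.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import NamedTuple


class LineStartSketch(NamedTuple):
    offset: int
    startswith: str | tuple[str, ...]
    strict: bool = True
    truth_target: bool = True


def multiline_check_start_sketch(lines: Sequence[str], line_start_tuples: Sequence[tuple]) -> bool:
    """True only if every check matches its expected truth value."""
    for item in line_start_tuples:
        offset, startswith, strict, truth_target = LineStartSketch(*item)
        line = lines[offset]
        if not strict:
            # Non-strict: ignore leading spaces before comparing.
            line = line.lstrip(" ")
        if line.startswith(startswith) != truth_target:
            return False
    return True


lines = [" start of line", "another line", " yet another line"]
checks = [(0, "start", False, True), (1, "another", True, True)]
print(multiline_check_start_sketch(lines, checks))  # True
```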
def pbar(iterable: collections.abc.Iterable[~T] = (), total: int = None, desc: str = None, postfix: str = None) ‑> collections.abc.Iterable[~T]
-
Set the global PBAR variable to a new tqdm instance for the given iterable.
This function initializes a new tqdm progress bar for the provided iterable and assigns it to the global PBAR variable. If PBAR is already an instance of tqdm, it is closed before creating a new instance. If logging to the console is enabled (gvars.LOG_CONSOLE), the function returns the iterable without creating a progress bar.
Args
iterable
:Iterable[gvars.T]
- Optional. The iterable to wrap with a progress bar. Defaults to an empty tuple.
total
:int
- Optional. The total number of iterations. If not provided, it will be inferred from the iterable.
desc
:str
- Optional. A description to display alongside the progress bar.
postfix
:dict
- Optional. A dictionary of additional information to display at the end of the progress bar.
Returns
Iterable[gvars.T]
- The iterable wrapped with a tqdm progress bar, or the original iterable if logging to the console is enabled.
Example
>>> for item in pbar(range(10), desc="Processing"):
...     pass
def pbprint(msg, post=False, level=LogLevel.INFO, n=None)
-
Print a message to the tqdm progress bar or log it.
If gvars.LOG_CONSOLE is False, this function prints the message to the tqdm progress bar. It also forwards the message to logprint for recording in the log file or stdout. The msg parameter can be any object and will be returned to the caller unchanged when the log operation completes. This is convenient for logging the output of a function call even if that output is required in later operations.
Args
msg
:Any
- The message to print or log.
post
:bool
- Optional. If True, sets the message as the postfix of the progress bar. Defaults to False.
level
:LogLevel
- Optional. The log level for the message. Defaults to LogLevel.INFO.
n
:int
- Optional. The number of iterations to update the progress bar by. Defaults to None.
Returns
Any
- The original message, unchanged.
def set_log_path(log_dir: str = None, log_path: str = None, overwrite: bool = False, log_console: bool = None)
-
Initialize global log objects, creating the path to the desired log location if it doesn't exist.
This function sets up the global logging configuration, including the log file path and console logging options. It ensures that the log directory exists and creates the log file if it does not already exist. If logging to the console is enabled or changed, it closes the existing log and updates the configuration.
Args
log_dir
:str
- Optional. The directory where the log file should be created. If not provided, the default directory is used.
log_path
:str
- Optional. The name of the log file. If not provided, a default name based on gvars.LOG_NAME and gvars.RUN_ID is used.
overwrite
:bool
- Optional. If True, the log file is overwritten. If False, log entries are appended to the existing file. Defaults to False.
log_console
:bool
- Optional. If True, enables logging to the console. If False, disables console logging. If not provided, the current console logging setting is used.
def split_unlikely_fields(line: str) ‑> list[tuple[int, str]]
-
Split a line by unlikely fields defined by a regular expression.
This function splits a given line into a list of tuples based on matches of the gvars.UNLIKELY_FIELDS_REGEX regular expression to account for inadequate spacing between columnar data. line_splits() should be called to fully segment the input line.
Args
line
:str
- The input string to be split.
Returns
list[tuple[int, str]]
- A list of tuples where each tuple contains the starting index and the corresponding segment of the split string.
Example
>>> split_unlikely_fields("Begin End         Start Stop")
[(0, 'Begin'), (6, 'End         Start'), (24, 'Stop')]
>>> line_splits("Begin End         Start Stop")
[(0, 'Begin'), (6, 'End'), (18, 'Start'), (24, 'Stop')]
def strtobool(val: vStr | str | int | bool, strict: bool = True) ‑> bool
-
Convert a string representation of truth to a boolean value.
Args
val
:vStr | str | int | bool
- The value to convert to a boolean.
strict
:bool
- Optional. If True, raises a ValueError for invalid truth values. If False, assumes False for invalid values. Defaults to True.
Returns
bool
- The boolean representation of the input value.
Raises
ValueError
- If
strict
is True and the input value is not a valid truth representation.
Example
>>> strtobool("yes")
True
>>> strtobool("no")
False
>>> strtobool(1)
True
>>> strtobool(0)
False
>>> strtobool("invalid", strict=True)
Traceback (most recent call last):
    ...
ValueError: invalid truth value: invalid
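The strict and lenient modes can be sketched as follows. The accepted spellings are an assumption modeled on the classic `distutils.util.strtobool` sets and may differ from what this module actually recognizes; `strtobool_sketch` is a hypothetical name.

```python
from typing import Any

# Assumed truth spellings (modeled on distutils.util.strtobool).
TRUE_VALUES = {"y", "yes", "t", "true", "on", "1"}
FALSE_VALUES = {"n", "no", "f", "false", "off", "0"}


def strtobool_sketch(val: Any, strict: bool = True) -> bool:
    """Convert a truth-like value to bool; raise or fall back to False if unrecognized."""
    text = str(val).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    if strict:
        raise ValueError(f"invalid truth value: {val}")
    return False


print(strtobool_sketch("yes"))                    # True
print(strtobool_sketch(0))                        # False
print(strtobool_sketch("invalid", strict=False))  # False
```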
def transpose_lists(iterable_of_iterable: collections.abc.Iterable[collections.abc.Iterable[~T]]) ‑> list[list[~T]]
-
Transpose a list of lists.
Args
iterable_of_iterable
:Iterable[Iterable[gvars.T]]
- An iterable of iterables to transpose.
Returns
list[list[gvars.T]]
- A transposed list of lists.
Example
>>> iterable_of_iterable = [
...     ["a", "b", "c"],
...     ["1", "2", "3"],
...     ["4", "5", "6"]
... ]
>>> transpose_lists(iterable_of_iterable)
[['a', '1', '4'], ['b', '2', '5'], ['c', '3', '6']]
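A transpose of this kind is commonly written with `zip(*rows)`. The sketch below uses a hypothetical name; note that `zip` truncates to the shortest row, and whether transpose_lists does the same for ragged input is not stated here.

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def transpose_lists_sketch(iterable_of_iterable: Iterable[Iterable[T]]) -> list[list[T]]:
    """Swap rows and columns; ragged rows are truncated to the shortest one."""
    return [list(column) for column in zip(*iterable_of_iterable)]


print(transpose_lists_sketch([["a", "b", "c"], ["1", "2", "3"], ["4", "5", "6"]]))
# [['a', '1', '4'], ['b', '2', '5'], ['c', '3', '6']]
```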
def trim_leading_spaces(lines: list[str]) ‑> list[str]
-
Remove leading columns consisting only of spaces from a list of strings.
This function processes a list of strings, converts them to a character array, and removes all leading columns that consist entirely of spaces. It returns the modified list of strings with leading spaces removed.
Args
lines
:list[str]
- The input list of strings to process.
Returns
list[str]
- The list of strings with leading columns of spaces removed.
Example
>>> trim_leading_spaces(lines=[" abc", " def", " ghi"])
[' abc', 'def', ' ghi']
def trim_vertical_spaces(lines: str | collections.abc.Sequence[str], max_space: int = 5) ‑> str
-
Remove excess vertical whitespace between column values in a multiline string.
This function processes a multiline string representation of tabular values and removes excess vertical whitespace between column values. It ensures that the spacing between columns does not exceed the specified maximum space.
Args
lines
:str | Sequence[str]
- A multiline string or a sequence of strings representing the lines to process.
max_space
:int
- Optional. The maximum allowed vertical space between columns. Defaults to 5.
Returns
str
- The processed multiline string with excess vertical whitespace removed.
Example
>>> lines = "col1 col2 col3\nval1 val2 val3"
>>> trim_vertical_spaces(lines, max_space=2)
'col1 col2 col3\nval1 val2 val3'
def tuple_w_next(val_list: collections.abc.Sequence[~T], final_val: ~T) ‑> list[tuple[~T, ~T]]
-
Return a list of consecutive value tuples from the input list, pairing the final val_list element with final_val.
Args
val_list
:Sequence[gvars.T]
- The input list of values to process.
final_val
:gvars.T
- The value to pair with the last element of val_list.
Returns
list[tuple[gvars.T, gvars.T]]
- The list of paired value tuples.
Example
>>> tuple_w_next([1, 2, 3], 4)
[(1, 2), (2, 3), (3, 4)]
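The pairing can be sketched by zipping the list against a copy shifted by one position, with final_val appended to the shifted copy. `tuple_w_next_sketch` is a hypothetical name, and returning an empty list for empty input is an assumption.

```python
from collections.abc import Sequence
from typing import TypeVar

T = TypeVar("T")


def tuple_w_next_sketch(val_list: Sequence[T], final_val: T) -> list[tuple[T, T]]:
    """Pair each element with its successor; the last element pairs with final_val."""
    following = [*val_list[1:], final_val]
    return list(zip(val_list, following))


print(tuple_w_next_sketch([1, 2, 3], 4))
# [(1, 2), (2, 3), (3, 4)]
```

Pairs of this shape are convenient for turning a list of boundary indexes into (start, stop) slices.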
def us_state_abbr(state_string: str) ‑> str
-
Convert full state names and 4-char abbreviations to 2-char US state abbreviations.
Args
state_string
:str
- The state string to be converted.
Returns
str
- The 2-character US state abbreviation.
Example
>>> us_state_abbr("California")
'CA'
>>> us_state_abbr("Cali")
'CA'
>>> us_state_abbr("CA")
'CA'
def vertical_split_idxs(char_arr: numpy.ndarray, min_v_split: int, min_len: int = 5) ‑> list[int]
-
Get the indices of the middlemost member of each contiguous all-spaces column set.
Args
char_arr
:np.ndarray
- A NumPy character array representing the lines of text.
min_v_split
:int
- The minimum number of contiguous space columns required for a split.
min_len
:int
- Optional. The minimum length of non-space content required for a valid split. Defaults to 5.
Returns
list[int]
- A list of indexes representing the middlemost columns of contiguous space columns. Always includes 0 as the first element and the subarray length as the final element.
Example
>>> lines = ['aaaa bbbbb ccc dddd',
...          'aaa bbb cc ddd ',
...          'aaaaaa cccc ']
>>> char_arr = np.array([list(line) for line in lines])
>>> vertical_split_idxs(char_arr, min_v_split=2, min_len=2)
[0, 8, 17, 24, 30]
def vstr_confidence_tiebreak(value1: str | vStr, value2: str | vStr) ‑> str | vStr
-
Select the vStr argument with the highest confidence.
Args
value1
:str | vStr
- first value to compare
value2
:str | vStr
- second value to compare
Returns
str | vStr
- The value with the highest confidence if both are vStrs. If only one is a vStr, return the vStr. If both are vStrs of equal confidence OR neither value is a vStr, return the value having the greatest length.
Classes
class FileContentsEntry (*args, **kwargs)
-
A typed dict for entries in a pdf_extractor.py function output.
Attributes
lines
:list[str]
- a list containing all lines of extracted text.
src_docs
:list[str]
- a list, one per line in lines, detailing the PDF of origin for each line.
Ancestors
- builtins.dict
Class variables
var lines : list[str]
var src_docs : list[str]
class KeyDeduper
-
Creates a reference used to deduplicate keys in a dict comprehension.
Methods
def clear(self) ‑> bool
-
Call to reinitialize deduping during nested comprehensions.
Returns
bool
- always returns True to allow resets in outer loop conditionals.
class LineStart (offset: int, startswith: str | tuple[str, ...], strict: bool = True, truth_target: bool = True)
-
Line start tuple. Used with multiline_check_start to define which line indexes should start with which strings.
Attributes
offset
:int
- line index in list of lines to check
startswith
:str | tuple[str, ...]
- string (or tuple of strings) to check for at start of line; referred to as line_start in the descriptions below
strict
:bool
- If True, line must start with line_start exactly. If False, case is ignored and indent must be >= line_start. Default is True.
truth_target
:bool
- True when line must match line_start. False when line must NOT match line_start. Default is True.
Ancestors
- builtins.tuple
Instance variables
var offset : int
-
Alias for field number 0
var startswith : str | tuple[str, ...]
-
Alias for field number 1
var strict : bool
-
Alias for field number 2
var truth_target : bool
-
Alias for field number 3