Package utilities

Shared extractor utility functions

Sub-modules

utilities.aws_utils

Container for send functions called via aws_specs and aws_specs_dev

utilities.azure_ocr_integrator

Function call for Microsoft Azure OCR requests

utilities.client_utils

Container for send functions called via client_specs and client_specs_dev

utilities.env_var_action

EnvVarAction class definition

utilities.hl7_utils

Parse HL7 data into pdf extractor schedule/demographic format.

utilities.json_decoders

Custom JSON decoders

utilities.library_utils

Tools for constructing and managing 'pdf_library' objects

utilities.log_utils

Exception handler with logging and dynamic continuation

utilities.managed_fields

FieldManager and ManagedField class and related utilities. See specs/_bases/_fields.py for examples of well commented, globally applicable …

utilities.pdf_utils

Utility functions for raw pdf, CSV, and other delimited file processing.

utilities.protocols

Define protocols for inaccessible (due to circular imports) external classes

utilities.section_utils

Utility functions and classes used by section_extractor.py and section_specs.py

utilities.table_interpreters

Library of "interpreter" functions called during table extraction that convert free-text lines into a raw table format

utilities.table_utils

Utility functions for table extraction

utilities.transform_utils

Utility functions used by table_transformer.py, transform_specs.py, and summary_specs.py

utilities.utils

Utility functions with global scope

utilities.v_str

String implementation used to associate source context and confidence with information extracted from the PDF

utilities.value_cache_dict

Useful in extending functionality for comprehensions by allowing the current iteration to reference the results of prior iteration(s). See …

Classes

class AzureOCRIntegrator (max_retries: int = 10, timeout: int = 60)

Extract text from pdf images via calls to Azure Cognitive Services OCR API.

Class variables

var client : azure.cognitiveservices.vision.computervision._computer_vision_client.ComputerVisionClient | None
var last_op_id : str
var last_read_results : list[azure.cognitiveservices.vision.computervision.models._models_py3.ReadResult]
var max_retries : int
var timeout : int

Methods

def create_client(self, azure_secret_name: str) ‑> None

Create an Azure ComputerVisionClient based on the supplied AWS secret.

If the azure_secret_name parameter is empty or does not point to a valid AWS secret, self.client is set to None and no image text will be extracted.

Args

azure_secret_name
the name of the secret defined in AWS for this facility's subscription key.

Environment Variables

AWS_ACCESS_KEY_ID
the access key ID for an account with access to the secret.
AWS_SECRET_ACCESS_KEY
the secret access key for an account with access to the secret.
AWS_SESSION_TOKEN (optional)
required if the selected account is not allowed the non-interactive login permission. Blank otherwise.
AWS_REGION (optional)
defaults to "us-east-1" if not set.

def ocr_pages(self, pdf: bytes, pages: list[int], debug_path: pathlib.Path | None = None) ‑> list[str]

Read the text from supplied pdf page indices.

Args

pdf
bytes of a pdf file
pages
list of pdf page indices to OCR

Returns

list[str]
list of strings containing structured text extracted from each supplied page index.
class CacheDictCheck (check_name: str, key_arg: str | int, value_arg: str | int, key_check: collections.abc.Callable[[typing.Any], bool], format_key: collections.abc.Callable[[typing.Any], collections.abc.Hashable] = builtins.str, value_check: collections.abc.Callable[[typing.Any], bool] = <function CacheDictCheck.<lambda>>, preprocess_value: collections.abc.Callable[[typing.Any], typing.Any] = <function CacheDictCheck.<lambda>>, concat_value: collections.abc.Callable[[typing.Any, typing.Any], typing.Any] = <function CacheDictCheck.<lambda>>, format_arg_value: collections.abc.Callable[[typing.Any, typing.Any], typing.Any] = <function CacheDictCheck.<lambda>>, cache_factory: collections.abc.Callable[[], typing.Any] = builtins.str, reset_arg: str | int | None = None, *, reset_check: collections.abc.Callable[[typing.Any, typing.Any], bool] = <function CacheDictCheck.<lambda>>)

Defines a cache_check for use with the ValueCacheDict decorator.

Defines the tests for determining whether an arg or kwarg should be added to the cache, how the value should be formatted prior to being added, how to combine it with previously cached values, and how the value in the cache should be reincorporated into the values of the args and kwargs supplied in the call before they are forwarded to the wrapped function.

Attributes

check_name : str
key for this check's cache in the cache_dict of the @ValueCacheDict wrapped function.
key_arg : str | int
the argument to the wrapped function that will be used to generate the cache dict key. An int represents the zero based index of a postional arg (see note below). A str must reference a valid keyword argument.
value_arg : str | int
the argument to the wrapped function that will be used to create values for the cache.
key_check : Callable[[Any], bool]
given the resolved value of the 'key_arg' (see ValueCacheDict), return True to trigger caching.
format_key : Callable[[Any], Hashable]
given the resolved value of the 'key_arg', return a cache dict key. (Default is str)
value_check : Callable[[Any], bool]
given the resolved value of the 'value_arg' (see ValueCacheDict), return True to trigger caching. (Default is lambda _: True)
preprocess_value : Callable[[Any], Any]
given the resolved value of the 'value_arg', return a modified representation fit for caching. (Default is lambda x: x)
concat_value : Callable[[Any, Any], Any]
given the cached value and the preprocessed new value, return a new cached value. (Default is lambda x, y: x + y)
format_arg_value : Callable[[Any, Any], Any]
given the original resolved value of the 'value_arg' and the new cache value, return the value of 'value_arg' to use when calling the wrapped function. (Default is lambda _, y: y, which passes the new cache value.)
cache_factory : Callable[[], Any]
assigned to the "default_factory" attribute of the internal defaultdict serving as the cache for the duration of this check. (Default is str)
reset_arg : str | int | None
the argument to the wrapped function that will be evaluated to determine whether the cache should be reset. If None, the cache will never reset.
reset_check : Callable[[Any, Any], bool]
given the previously resolved value of the 'reset_arg' and the current resolved value, return True to reset the cache. (Default is lambda x, y: x != y)

NOTE: the arg at position zero of a class or instance method will be "cls" or "self", so the arguments in the actual function call will begin at position 1 for those use cases.

Class variables

var cache_factory : collections.abc.Callable[[], typing.Any]

str(object='') -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to 'strict'.

var cached : dict[collections.abc.Hashable, typing.Any]
var check_name : str
var format_key : collections.abc.Callable[[typing.Any], collections.abc.Hashable]

var key_arg : str | int
var key_check : collections.abc.Callable[[typing.Any], bool]
var reset_arg : str | int | None
var reset_value : Any
var value_arg : str | int

Methods

def concat_value(x, y) ‑> collections.abc.Callable[[typing.Any, typing.Any], typing.Any]
def format_arg_value(_, y) ‑> collections.abc.Callable[[typing.Any, typing.Any], typing.Any]
def preprocess_value(x) ‑> collections.abc.Callable[[typing.Any], typing.Any]
def reset_check(x, y) ‑> collections.abc.Callable[[typing.Any, typing.Any], bool]
def value_check(_) ‑> collections.abc.Callable[[typing.Any], bool]
class ContextRule (title: str, controller_keys: list[str], patterns: list[re.Pattern], applied_context: Sequence[str], is_active: Callable[[KeyGroups], Any], candidate_filter: Callable[[KeyGroups, Any], KeyGroups] = <function ContextRule.<lambda>>)

Conditionally modify the preferred point of origin (context) and/or manipulate the list of candidate values for all standard keys matched by a list of regular expressions. See FieldManager._apply_context_rules() for the implementation. See specs/_bases/_fields.py for examples of well commented, globally applicable ContextRules that reduce DB/CSV and PDF/DB/CSV summaries. These global context rules are available from the specs subpackage, i.e.:

    import specs
    specs.CSV_DB_CONTEXT_RULE  # context rule for CSV/DB summaries

Args

title : str
a descriptive name used for logging and maintainability purposes.
controller_keys : list[str]
A list of standard keys whose standard_key_groups entries will determine if the rule is activated.
patterns : list[re.Pattern]
A list of regular expressions that determine the set of standard keys who are subject to the rule's actions.
applied_context : list[str]
the new context_priority value for matched keys if the supplied is_active function returns True (boolean type only).
is_active : Callable[[KeyGroups], Any]
A function that takes the subset of the FieldManager's standard_key_groups that matches the rule's controller_keys and determines if the rule is in effect (bool) or an alternative input for a custom candidate_filter in advanced applications.
candidate_filter : Callable[[KeyGroups, Any], KeyGroups]
A function that takes the subset of the FieldManager's standard_key_groups that matches the rule's keys and the result from the is_active func and returns the subset of the FieldManager's standard_key_groups that are eligible to appear in the output. Optional. Defaults to a null operation (i.e. lambda x, _: x).
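The tuple shape described above can be sketched with a NamedTuple (a hypothetical mirror for illustration only; the rule name, pattern, and is_active test below are invented, not from the package):

```python
import re
from typing import Any, Callable, NamedTuple, Sequence

# In the real package, KeyGroups is {standard_key: {contextual_key: value}}
KeyGroups = dict

class ContextRuleSketch(NamedTuple):
    """Hypothetical mirror of ContextRule's tuple fields."""
    title: str
    controller_keys: list
    patterns: list
    applied_context: Sequence
    is_active: Callable[[KeyGroups], Any]
    candidate_filter: Callable[[KeyGroups, Any], KeyGroups] = lambda x, _: x

# Illustrative rule: activate when any controller key has a CSV-context candidate
rule = ContextRuleSketch(
    title="prefer CSV schedule data when a CSV MRN is present",
    controller_keys=["patient_info.mrn"],
    patterns=[re.compile(r"^schedule\.")],
    applied_context=("CSV", "DB"),
    is_active=lambda groups: any(
        "|CSV|" in key for group in groups.values() for key in group
    ),
)

controller_subset = {"patient_info.mrn": {"patient_info.mrn|CSV|0": "12345"}}
print(rule.is_active(controller_subset))  # True
```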

Ancestors

  • builtins.tuple

Instance variables

var applied_context : collections.abc.Sequence[str]

Alias for field number 3

var candidate_filter : collections.abc.Callable[[dict[str, dict[str, str | vStr]], typing.Any], dict[str, dict[str, str | vStr]]]

Alias for field number 5

var controller_keys : list[str]

Alias for field number 1

var is_active : collections.abc.Callable[[dict[str, dict[str, str | vStr]]], typing.Any]

Alias for field number 4

var patterns : list[re.Pattern]

Alias for field number 2

var title : str

Alias for field number 0

class DocuVisionIntegratorProtocol (*args, **kwargs)

Define externally accessed methods for the DocuVisionIntegrator class. See ./app/integrators/docuvision_integrator.py for class implementation.

Ancestors

  • typing.Protocol
  • typing.Generic

Subclasses

Class variables

var api_key : str
var base_path : str
var base_url : str
var page_map : dict[str, dict[str, list[int]]]
var split_by_pid : bool

Instance variables

prop pdf_library : dict[str, typing.Any]

Dict of {case_doc_id: PDFLibEntry} where each entry represents the combined_pdf of a DocuVisionCase created by a child DocuVisionTask instance.

Merged into S3Batch.pdf_library in aws_s3_batch.py for file matching and other downstream processes.

prop pdfs_by_doc_id : dict[str, bytes]

Returns a dict of pdf bytes split and/or concatenated by task pids.

prop results : dict[str, list[dict[str, str]]]

Process responses from all child tasks to produce results according to the current instance settings.

Methods

def create_tasks(self, documents: dict[str, typing.Any] | None = None)

Obtain upload location and post PDFs. If mock_ids were defined, create mock tasks for each id and collect the existing responses.

Args

documents : dict[str, lu.PDFLibProto]
dict of {filename: PDFLibProto} where PDFLibProto is a namedtuple of (body, meta) where body is a bytes object and meta is a dict of metadata for the pdf. Optional. Extends self.documents if supplied.
def job_dict_entries(self, extracted_data: dict[str, dict[str, typing.Any]]) ‑> dict[str, dict[str, typing.Any]]

Dict of {job_id: job_dict} for all tasks in self._tasks where each job_dict contains values for db columns 'input', 'comments', and 'note'.

Called by aws_s3_batch.py to recombine the values for the columns noted above with their corresponding output from table_transformer.py.

Args

extracted_data : dict[str, dict[str, Any]]
TableTransformer output data supplied from aws_s3_batch.S3Batch.transformed.

Returns

dict[str, dict[str, Any]]
dict of {job_id: job_dict} for all tasks in self._tasks.
def reset(self, **kwargs)

Reset the dataclass to prepare for a new facility by clearing all documents, tasks, results, and internal variables.

KwArgs

documents : dict[str, lu.PDFLibProto]
dict of {filename: PDFLibProto} where PDFLibProto is a namedtuple of (body, meta) where body is a bytes object and meta is a dict of metadata for the pdf.
page_map : dict[str, dict[str, list[int]]]
dict of {new_doc_id: {old_doc_id: [page_nums]}} where new_doc_id is the doc_id for the combined pdf created by docuvision and old_doc_id is the doc_id for the original pdf. page_nums is a list of page numbers from the original pdf that were included in the combined pdf.
out_dir : str
path to directory where output json files will be written. If None, no files will be written.
split_by_pid : bool
if True, docuvision will split each pdf into separate documents based on patient id. If False, docuvision will combine all pdfs into a single document.
fail_on_error : bool
if True, raise an error if any task fails to post or any response fails to be collected. Defaults to False.
mock_ids : list[int]
list of task_ids to use for mock responses. Defaults to [].
default_dos : str
default date of service to use if no date of service is extracted from the facesheet. Defaults to gvars.DEFAULT_DOS.
api_secret_name : str
name of secret in AWS secretsmanager containing the base_url, base_path, and api_key values for the docuvision API.
dv_preferred_networks : list[str] | None
list of preferred Docuvision Neural Networks. Created for facilities where people manually upload 1-page PDFs
table_converters : dict[str, Callable[[list[str]], list[dict[str, str]]]]
dict of converter functions for processing '*Table' labels returned by DV-1.
dv_required_page_types : set[str]
if supplied, a case will only be created for a pid if at least one of the pages assigned to that pid has a type in this set.
def tables_for(self, doc_id: str, section: str = 'DocuVision', sep: str = '.') ‑> dict[str, list[dict[str, str]]]

Get results for the supplied doc_id in a tabular format suitable for downstream processing in table_transformer.py.

Args

doc_id : str
doc_id for the document to retrieve results for.
section : str
section name to use for the table. Defaults to "DocuVision".
sep : str
separator to use for table keys. Defaults to ".".

Returns

dict[str, list[dict[str, str]]]
dict of {table_name: [table_rows]} where each table_row is a dict of {label: value}.
def task_attr_list(self, attr: str) ‑> list

List the specified attribute for all tasks in self._tasks.

class EnvVarAction (option_strings, dest, default=None, type=None, **kwargs)

Custom argparse action to override command line args with environment variable values if environment variables have been set. To work, the environment variable name must equal (cmd line arg namespace name).upper(). For example:

    cmd line arg        namespace name      env var name
    ------------        --------------      ------------
    --log-console       log_console         LOG_CONSOLE
    --aws-s3-bucket     aws_s3_bucket       AWS_S3_BUCKET

For boolean type args, this action works similarly to the 'store_true' built-in action, i.e. the cmd line arg functions as a switch to set the namespace value to True (unless the environment variable is defined in which case the env value will be used).
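The override mechanics can be sketched with a minimal argparse.Action (a simplified, hypothetical stand-in for EnvVarAction that ignores the boolean-switch behavior):

```python
import argparse
import os

class EnvOverrideAction(argparse.Action):
    """Sketch: the env var named dest.upper() overrides the CLI value."""

    def __init__(self, option_strings, dest, default=None, type=None, **kwargs):
        env_val = os.environ.get(dest.upper())
        if env_val is not None:
            default = type(env_val) if type else env_val
        super().__init__(option_strings, dest, default=default, type=type, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        env_val = os.environ.get(self.dest.upper())
        if env_val is not None:
            # the env var wins even when the flag was passed explicitly
            values = self.type(env_val) if self.type else env_val
        setattr(namespace, self.dest, values)

os.environ["AWS_S3_BUCKET"] = "bucket-from-env"
parser = argparse.ArgumentParser()
parser.add_argument("--aws-s3-bucket", action=EnvOverrideAction)
args = parser.parse_args(["--aws-s3-bucket", "bucket-from-cli"])
print(args.aws_s3_bucket)  # bucket-from-env
```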

Ancestors

  • argparse.Action
  • argparse._AttributeHolder
class FieldManager (managed_fields: dict[str, ManagedField], standard_keys: list[str], context_rules: Sequence[ContextRule] = (), context_sep: str = '|', output_dir: str | None = None, patient_id: str = 'Unknown', raw_summary: dict[str, vStr] = <factory>)

Apply the settings defined in the ManagedField objects to the data in a flattened summary dictionary from the output of a TableTransformer instance.

Args

managed_fields : dict[str, ManagedField]
A dictionary mapping keys to ManagedField objects.
standard_keys : list[str]
A list of standard keys used in the data typically defined in a summary_spec (see specs/summary_specs.py and table_transformer.py).
context_rules : Sequence[ContextRule]
A list of context rules to apply to the data. ContextRules allow for conditional modification of the preferred point of origin (context) and/or manipulation of the candidate values for a standard key subset.
context_sep : str
A string used to separate keys and their context. Defaults to "|". Example key: "schedule.startTime|CSV|0".
output_dir : str | None
An optional file path prefix (end in "/" to target a folder) If supplied, f"{output_dir}{patient_id}_managed_fields.json" will be created after calling the reduce() method.
patient_id : str
an identifier for the patient being processed. Used in debug output filenames and logging. Defaults to 'Unknown'.
raw_summary : dict[str, vStr]
A dict containing raw summary data for one patient. Defaults to an empty dict to facilitate templating (see __call__() docstring)

Attributes

standard_key_groups : KeyGroups
A mapping between standard keys and groups of corresponding context bearing candidate key value pairs. Has form {'<standard key>': {'<standard key><sep><context>': '<value>', ...}, ...}
context_groups : dict[str, KeyGroups]
The inverse of standard_key_groups (kind of). Groups raw data by context. Initialized as a defaultdict(dict). Used in list and grouped element merge operations. See _GROUPED_ELEMENTS. Has form {'<parent>': {'<context>': {'<final element>': '<value>', ...}, ...}, ...}.
context_group_priorities : dict[str, list[str]]
Records the _group_context of the first candidate selected for a standard_key in a context group to ensure that the same context is prioritized for subsequent group members.
list_manager : dict[str, list[tuple[int, int]]]
init==False. A dictionary used to track the output list indexes of "array type" standard keys.
output : dict[str, vStr]
init==False. A dictionary containing the reduced output. Empty until the reduce() method is called.

During post-initialization, the supplied raw_summary (dict[str, vStr]) is compiled into standard_key_groups (dict[str, dict[str, vStr]]) where the keys of raw_summary appear as subkeys within standard_key_groups with standard_keys members as parent keys. E.g., raw_summary entries:

    {
        "schedule.diagnosis|PDF.Anesthesia Record.Case Summary|0": "Right hip displacement",
        "schedule.diagnosis|PDF.Operative Note|7": "Displacement of right hip",
    }

would be compiled into standard_key_groups entries:

    {
        "schedule.diagnosis": {
            "schedule.diagnosis|PDF.Anesthesia Record.Case Summary|0": "Right hip displacement",
            "schedule.diagnosis|PDF.Operative Note|7": "Displacement of right hip",
        }
    }

The same is true, generally speaking, for array type keys, with the additional step of replacing the wildcard in the standard key definition with the proper list index using the list_manager dictionary. E.g., raw_summary entries:

    {
        "patient_info.insurance[0].company|PDF.Anesthesia Record.Active Insurance|0": "Aetna",
        "patient_info.insurance[0].company|PDF.Anesthesia Record.Active Insurance|1": "BCBS",
        "patient_info.insurance[1].company|PDF.Anesthesia Record.Active Insurance|0": "MDCR",
    }

would be compiled into standard_key_groups entries:

    {
        "patient_info.insurance[0].company": {
            "patient_info.insurance[0].company|PDF.Anesthesia Record.Active Insurance|0": "Aetna",
        },
        "patient_info.insurance[1].company": {
            "patient_info.insurance[0].company|PDF.Anesthesia Record.Active Insurance|1": "BCBS",
        },
        "patient_info.insurance[2].company": {
            "patient_info.insurance[1].company|PDF.Anesthesia Record.Active Insurance|0": "MDCR",
        },
    }

by virtue of their combined list and table indexes (0, 0), (0, 1), and (1, 0) respectively.
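The grouping step, ignoring list-index management, can be sketched as follows (the helper name is hypothetical; the real post-initialization also rewrites array-type indexes via list_manager):

```python
from collections import defaultdict

def build_key_groups_sketch(raw_summary, sep="|"):
    """Group flat 'standard_key|context|idx' entries by standard key.

    Simplified sketch: splits each contextual key on the first separator
    and buckets the entry under its standard-key prefix.
    """
    groups = defaultdict(dict)
    for contextual_key, value in raw_summary.items():
        standard_key = contextual_key.split(sep, 1)[0]
        groups[standard_key][contextual_key] = value
    return dict(groups)

raw = {
    "schedule.diagnosis|PDF.Anesthesia Record.Case Summary|0": "Right hip displacement",
    "schedule.diagnosis|PDF.Operative Note|7": "Displacement of right hip",
}
grouped = build_key_groups_sketch(raw)
print(len(grouped["schedule.diagnosis"]))  # 2
```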

The reduce() method iterates over each standard field in standard_key_groups, selecting the proper final standard key value from the list of candidates according to the ManagedField definition. If the field is not managed and present in standard_keys, the default ManagedField will be applied (selects the most frequent value in the list of candidates).

Class variables

var context_group_priorities : dict[str, list[str]]
var context_groups : dict[str, dict[str, dict[str, str | vStr]]]
var context_rules : collections.abc.Sequence[ContextRule]
var context_sep : str
var list_manager : dict[str, list[tuple[int, int]]]
var managed_fields : dict[str, ManagedField]
var output : dict[str, str | vStr]
var output_dir : str | None
var patient_id : str
var raw_summary : dict[str, vStr]
var standard_key_groups : dict[str, dict[str, str | vStr]]
var standard_keys : list[str]

Methods

def expand_dependencies(self, this_field: ManagedField) ‑> collections.abc.Sequence[str]

Allow a 'non-list' field to depend on all values collected for a list field.

If a non-list field depends upon a list type field, replace the original wildcarded list key reference supplied in its dependencies with the list of indexed keys for which we actually collected data.

Example

Given a raw_summary containing data for three anesthesia providers, and a standard field "schedule.surgeon" with dependency "schedule.anesthesiaStaff[*].provider", expand the wildcarded list dependency into entries for the three provider values collected and return:

    [
        "schedule.anesthesiaStaff[0].provider",
        "schedule.anesthesiaStaff[1].provider",
        "schedule.anesthesiaStaff[2].provider",
    ]

Args

this_field : ManagedField
the managed field to process

Returns

list[str]
the original dependencies if the supplied field is itself a list field; otherwise, all wildcarded list type dependency references replaced with the list of indexed keys generated by the dependency's collected values in _build_standard_key_groups().
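The wildcard expansion can be sketched as a standalone helper (hypothetical; the real method reads the indexed keys from internal list_manager state rather than taking them as a parameter):

```python
import re

def expand_wildcard_deps(dependencies, collected_keys):
    """Replace 'key[*].field' deps with the indexed keys actually collected."""
    expanded = []
    for dep in dependencies:
        if "[*]" in dep:
            # turn the escaped wildcard into a digit matcher: [*] -> [\d+]
            pattern = re.compile(re.escape(dep).replace(r"\[\*\]", r"\[\d+\]") + "$")
            expanded.extend(k for k in collected_keys if pattern.match(k))
        else:
            expanded.append(dep)
    return expanded

deps = ["schedule.anesthesiaStaff[*].provider"]
collected = [f"schedule.anesthesiaStaff[{i}].provider" for i in range(3)]
print(expand_wildcard_deps(deps, collected))
```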
def reduce(self)

Reduces self.standard_key_groups by applying the settings defined in self.managed_fields.

This method iterates over each standard field in self.standard_key_groups. If the field is not managed and present in self.standard_keys, the default ManagedField will be applied (selects the most frequent value in the list of candidates). If the field is managed, its dependencies are recursively reduced and a final value is generated according to the preprocess and xform functions defined in its ManagedField definition.

Args

_input : KeyGroups | None
The input dictionary to reduce. If None, the method uses the standard_key_groups attribute.
_recursing_for : list[str]
A list of standard keys that are being recursively reduced due to their presence as a dependency in a different standard key's ManagedField definition. Used to avoid infinite recursion.

Returns

dict
The reduced dictionary.
class LogExHandler (continue_on_error: bool, default_return=False, *, is_generator=False, **kwargs)

A decorator class to handle exceptions, log them, and optionally continue execution.

This decorator can be used to wrap functions and handle exceptions by logging them along with their trace information. It provides options to continue execution or stop it based on the configuration. It also supports generator functions.

Usage

>>> @LogExHandler(True)
... def your_func(arg_1, arg_2, **kwargs):
...     pass

Args

continue_on_error : bool
If False, raises an additional exception after saving the log to disk to stop execution.
default_return : Optional
If a literal, returns the literal. If a string, checks if 'self.', 'args' or 'kwargs' is in default_return. If so, sets return_val = eval(default_return). If is_generator is False, this can be used to return an argument supplied to the original function for continued processing. If is_generator is True, eval(default_return) should generate a new and complete "args" object. This new "args" object will then be passed to another instance of the generator to allow processing to continue in a fully transparent fashion for the caller. NOTE: the new "args" object should be modified from the original to exclude the object that resulted in the original error. Otherwise, exceptions will be logged until depth >= gvars.MAX_EX_DEPTH.

KwArgs

is_generator : bool
Set to True if decorating a generator function that utilizes the "yield" statement. Otherwise, the LogExHandler wrapper immediately returns the generator object to the caller without evaluating any of the internal generator logic, and exceptions raised during iteration will not be caught or logged by this decorator. Modifies "default_return" behavior (see above).
max_depth : int
Set the maximum number of attempts at re-entering a wrapped generator. No effect when is_generator==False. Default is gvars.MAX_EX_DEPTH.
exit_code : int
Sets a custom exit code in implementations where continue_on_error=False. Should be a positive integer. Defaults to 1.
notify : bool
If True, capture logged exceptions in gvars.NOTIFICATIONS to send an email notification at the conclusion of the run. Default is False.
**kwargs
Additional arbitrary keyword args are captured as class attributes. Allows the caller to pass in classes and/or objects not present in this module for use when evaluating a custom default_return (see above).

Example

>>> import contextlib, io  # NOTE: required for stdout redirect during doctest evaluation
>>> @LogExHandler(continue_on_error=True, default_return=0)
... def example_function(x, y):
...     return x / y
>>> with contextlib.redirect_stdout(io.StringIO()):
...     test = example_function(10, 0)
>>> test
0
class LogExOverrideError (*args, **kwargs)

Raise this exception to force all nested LogExHandlers to propagate an error upward to the first wrapped function with continue_on_error=False.

Ancestors

  • builtins.Exception
  • builtins.BaseException
class ManagedField (standard_key: str, context_priority: Sequence[str] = ('CSV', 'PDF', 'DB'), xform: Callable[[str | vStr], str | vStr] = <function ManagedField.<lambda>>, generated: bool = False, dependencies: Sequence[str] = (), preprocess: Callable[[str | vStr, dict[str, str | vStr]], str | vStr] = <function ManagedField.<lambda>>, reducer: Callable[[Sequence[str | vStr]], str | vStr] = <function most_freq_element>)

Standardize, recombine, generate, and prioritize extracted data based on point of origin, current value, and pre-defined dependencies.

Useful for establishing enums, rejecting bad inputs, and accepting postprocedure data over preprocedure data, among other use cases. Use the standardize_field_value() function for establishing enums. Implemented at the client and facility level of specs/client_specs.py. Global definitions for the ClaimMaker UI use case are available as members of the specs subpackage, i.e.:

    import specs
    specs.BASE_MANAGED_FIELDS  # baseline field definitions

Args

standard_key : str
a key from a summary_specs entry
context_priority : Sequence[str]
A list used to prioritize the value selected from a list of vStr candidates based on vStr.ctx and/or point of origin provided in a summary key. Default is ("CSV", "PDF", "DB").
xform : Callable[[str | vStr], str | vStr]
transform to apply to produce a standard output from the provided input, if required. Defaults to a null operation (i.e. lambda x: x) to allow context only operations.
generated : bool
if true, calculate a value for this key from its dependencies and add it to the summary output even if the key is absent in the original summary provided to the FieldManager.
dependencies : Sequence[str]
list of standard field names that will be used to augment or construct the managed output
preprocess : Callable[[vStr, dict[str, vStr]], vStr]
takes raw value and the dict of dependency values as input. preps raw value for xform. defaults to a null operation (i.e. lambda x, _: x).
reducer : Callable[[Sequence[vStr]], vStr]
takes a list of candidate vStr values as input and returns the final output for this field. Defaults to most_freq_element().
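Context-priority selection over contextual keys like "schedule.startTime|CSV|0" might look like this sketch (the helper name is hypothetical, and the real FieldManager also runs the configured reducer over the surviving candidates):

```python
def pick_by_context(candidates, context_priority=("CSV", "PDF", "DB"), sep="|"):
    """Return the first candidate whose point of origin matches priority order."""
    for ctx in context_priority:
        for key, value in candidates.items():
            # e.g. 'PDF' from 'schedule.startTime|PDF.Anesthesia Record|0'
            origin = key.split(sep)[1].split(".")[0]
            if origin == ctx:
                return key, value
    return next(iter(candidates.items()))  # fall back to any candidate

candidates = {
    "schedule.startTime|DB|0": "07:30",
    "schedule.startTime|PDF.Anesthesia Record|0": "07:45",
}
print(pick_by_context(candidates)[1])  # 07:45
```

With the default ("CSV", "PDF", "DB") priority and no CSV candidate present, the PDF value outranks the DB value.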

Class variables

var context_priority : collections.abc.Sequence[str]
var dependencies : collections.abc.Sequence[str]
var generated : bool
var standard_key : str

Methods

def clone(self, list_idx: str) ‑> ManagedField

Clone this instance after replacing '*' with the supplied list_idx in standard_key and all dependencies.

Args

list_idx : str
an integer string e.g. "1"

Returns

ManagedField
a new ManagedField instance
def preprocess(x, _) ‑> collections.abc.Callable[[str | vStr, dict[str, str | vStr]], str | vStr]
def reduce(self, candidates: dict[str, str | vStr]) ‑> tuple[str, str | vStr]

Call self.reducer and return the key and value of the selected candidate.

Args

candidates
dictionary of candidate keys and values

Returns

tuple[str, str | vStr]
the selected candidate key and value
def reducer(check_lists, tiebreak_func=<function vstr_confidence_tiebreak>, xform: collections.abc.Callable[[typing.Any], typing.Union[str, vStr, typing.Any]] = utilities.v_str.vStr) ‑> collections.abc.Callable[[collections.abc.Sequence[str | vStr]], str | vStr]

Return the most frequent element from a list of lists.

Args

check_lists : list[list[Any]]
A list of lists containing elements to check.
tiebreak_func : Callable[[Any, Any], Any]
A function to break ties between elements with the same frequency. Default is vstr_confidence_tiebreak.
xform : Callable[[Any], str | vStr | Any]
A transformation function applied to each element. Defaults to vStr.

Returns

Any
The most frequent element after applying the transformation and tiebreak functions.

Example

>>> check_lists = [["a", "b", "a"], ["a", "c", "b", "b"]]
>>> tiebreak_func = lambda *args: sorted(args)[0]
>>> xform = str.upper
>>> most_freq_element(check_lists, tiebreak_func, xform)
'A'
>>> tiebreak_func = lambda *args: sorted(args)[-1]
>>> most_freq_element(check_lists, tiebreak_func, xform)
'B'
def xform(x) ‑> collections.abc.Callable[[str | vStr], str | vStr]
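The doctest behavior of most_freq_element() shown above can be reproduced with a Counter-based sketch (a simplified reimplementation for illustration, not the package's code):

```python
from collections import Counter

def most_freq_sketch(check_lists, tiebreak_func=lambda *args: sorted(args)[0],
                     xform=str):
    """Most frequent element across all lists; ties go to tiebreak_func."""
    counts = Counter(xform(el) for lst in check_lists for el in lst)
    best = max(counts.values())
    tied = sorted(el for el, n in counts.items() if n == best)
    return tied[0] if len(tied) == 1 else tiebreak_func(*tied)

check_lists = [["a", "b", "a"], ["a", "c", "b", "b"]]
print(most_freq_sketch(check_lists, xform=str.upper))                      # A
print(most_freq_sketch(check_lists, lambda *a: sorted(a)[-1], str.upper))  # B
```

After str.upper is applied, "A" and "B" each appear three times, so the tiebreak function decides the winner.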
class ProviderIntegratorProtocol (*args, **kwargs)

Define externally accessed methods for the ProviderIntegrator class

Ancestors

  • typing.Protocol
  • typing.Generic

Subclasses

Class variables

var api_url : str
var full_name : str | vStr
var is_anes_provider : bool
var mode : str | None
var npi : str | vStr
var public_only : bool

Instance variables

prop last_api_response : dict[str, typing.Any] | None

last public API query response

prop last_url_params : dict[str, typing.Any]

last public API query parameters

prop query_name

name of last query function

Methods

def search(self, is_anes_provider: bool, full_name: str | vStr, npi: str | vStr = '', mode: str | None = None) ‑> tuple[vStr, vStr]

Progressively search the local DB and public API with the supplied provider data and return (a) fully populated vStr objects for the provider name and NPI upon a successful lookup or (b) an 'original value only' vStr object for the provider name and a "null" vStr upon failure.

class ValueCacheDict (cache_checks: collections.abc.Sequence[CacheDictCheck], **kwargs)

A decorator providing flexible caching options for use in comprehensions.

Initially developed to provide a means to cache the parts of a full address when street address and city/state/zip are found in different sections of a form and thus labeled with independent bounding boxes, this class and its supporting class CacheDictCheck are implemented for general use. Useful in extending the basic functionality of comprehensions by allowing the current iteration to reference the results of prior iteration(s).

Args

cache_checks : Sequence[CacheDictCheck]
one CacheDictCheck instance per cached parameter. See CacheDictCheck for details.
ex_handler : Callable[[Exception], bool]
a user defined function to which all raised exceptions are passed. Return True to continue processing. Return False (default) to raise the exception to the calling thread. Default is lambda _: False.

Attributes

cache_dict : dict[Any, Any]
this attribute will be added to functions decorated with this class to provide convenient access to the cached attributes of each CacheDictCheck.
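The core idea, caching one argument keyed by another so that later comprehension iterations see earlier results, can be sketched as a much-simplified, hypothetical decorator (the real class drives configurable CacheDictCheck tests instead of hard-coding the checks):

```python
from collections import defaultdict
from functools import wraps

def value_cache_sketch(key_arg: int, value_arg: int):
    """Concatenate value_arg per key_arg and forward the cached value."""
    def decorator(func):
        cache = defaultdict(str)

        @wraps(func)
        def wrapper(*args):
            key, val = args[key_arg], args[value_arg]
            if val:                      # value_check: cache non-empty values
                cache[key] += val        # concat_value: simple concatenation
            new_args = list(args)
            new_args[value_arg] = cache[key]  # format_arg_value: pass the cache
            return func(*new_args)

        wrapper.cache_dict = cache       # convenient access, as documented above
        return wrapper
    return decorator

@value_cache_sketch(key_arg=0, value_arg=1)
def assemble(label, fragment):
    return fragment

fragments = [("addr", "123 Main St, "), ("addr", ""), ("addr", "Springfield, IL")]
results = [assemble(k, v) for k, v in fragments]
print(results[-1])  # 123 Main St, Springfield, IL
```

The second iteration receives the cached street address even though its own input fragment was empty, mirroring the split-address use case described above.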

Instance variables

prop cache_dict : dict[collections.abc.Hashable, dict[collections.abc.Hashable, typing.Any]]

dict of dicts of the form {[check.check_name]: [check.cached], ...}

class vStr (_str, context: str = '', confidence: float = 0.0, force_type: bool | None = None, og_value: str | None = None, verified: bool | None = None)

Built-in string extended with attributes used in data validation.

Attributes

ctx : str
point of origin data for the value (context arg)
con : float
confidence that the value is correct (confidence arg)
frc : bool
force all descendants to class vStr (force_type arg)
ogv : str
originalValue in data_entry_fields (og_value arg)
tru : bool
isVerified in data_entry_fields (verified arg)

Ancestors

  • builtins.str

Static methods

def cat(*args) ‑> vStr

Equivalent to vStr.jn("", args); emulates ''.join(args)

def from_data_entry_dict(de_dict: dict[str, Any], verify_all: bool = False) ‑> vStr

Return a vStr constructed from a data_entry_fields object from the claimmaker DB and/or a validated claimmaker job dict

def from_nested(val: Any, is_verified: bool = False) ‑> str | vStr

Used when manually created cases contain data in nested columns with no corresponding entry in the data_entry_fields column.

Returns a 1.0-confidence vStr with context 'DB.User' for 'non-null' vals, i.e. when val != type(val)()

def jn(sep: str, iterable: Iterable[str | vStr]) ‑> vStr

emulates str.join()

def merge_attrs(iterable: Iterable[str | vStr]) ‑> tuple[str, float, bool, str | None, bool]

Merge vStr attrs from multiple instances. Returns a pipe-delimited string for ctx and the mean for con, or an empty string / 0.0 if no vStr is present in the input.

Instance variables

var con
var ctx
prop data_entry_dict

return a data_entry_fields dict representing this vStr

var frc
prop is_verified : bool

True if value has been validated by a user.

var ogv
var tru

Methods

def extend_context(self, suffix: str) ‑> str

Append suffix to current ctx value

def format(self, *args, **kwargs) ‑> vStr

If any of the supplied args or kwargs is of type vStr, return the str.format result as a vStr having attributes merged from all vStr inputs. If none of the inputs are of type vStr, behaves like str.format().

def mutation_decorator(self, mutation_func: Callable[..., str])

Decorates underlying str mutation functions (e.g. upper, lower, strip, split, etc.) to restore ctx and con attributes post mutation

def prepend_context(self, prefix: str) ‑> str

Prepend prefix to current ctx value

def replace_context(self, new_context: str) ‑> str

Replace current ctx value with new_context

def set_custom_attrs(self, context: str, confidence: float, force_type: bool, og_value: str, verified: bool)

set custom attributes ("ctx", "con", "frc", "ogv", "tru")
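The basic mechanics of a metadata-bearing str subclass like vStr can be sketched as follows (a hypothetical minimal version; the real class also restores attributes after str mutations via mutation_decorator and carries the frc/ogv/tru attributes):

```python
class VStrSketch(str):
    """str subclass carrying extraction context and confidence."""

    def __new__(cls, value, context="", confidence=0.0):
        obj = super().__new__(cls, value)
        obj.ctx = context        # point of origin, e.g. 'PDF.Facesheet'
        obj.con = confidence     # extraction confidence in [0.0, 1.0]
        return obj

    def extend_context(self, suffix):
        # str subclasses are immutable, so return a new instance
        return VStrSketch(str(self), self.ctx + suffix, self.con)

name = VStrSketch("John Doe", context="PDF.Facesheet", confidence=0.92)
print(name, name.ctx, name.con)  # John Doe PDF.Facesheet 0.92
# Plain str methods drop the attributes unless restored, as vStr does:
print(name.upper())              # JOHN DOE
```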