Module utilities.client_utils

Container for send functions called via client_specs and client_specs_dev

Functions

def csv_demo_group_preprocess(demos: dict[str, PDFLibProto], val_sep_override: str = ',', date_column: str = 'date', has_header_row: bool = True) ‑> dict[str, RecordsLibEntry]

Concatenate the bytes from all demo bodies and extract a list of static DOSs to associate with each demo record.

Args

demos : Mapping[str, PDFLibProto]
PDFLibEntry objects to concatenate
val_sep_override : str
Defaults to ",". If explicitly set to None, the separator is inferred as "," or "|", whichever yields the most fields when splitting the first line of the source.
date_column : str
Optional. The column to use for DOS extraction. Defaults to "date".
has_header_row : bool
Default is True. Set to false if source docs do not include header rows.

Returns

dict[str, lu.RecordsLibEntry]
a new library containing a single entry with concatenated bytes from all inputs with at least one value row.
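
Where a client's demo exports deviate from the defaults, the overrides are typically bound with functools.partial when the function is wired in as a group_preprocess. A minimal sketch, assuming pipe-delimited demo files with the DOS in a "dos" column (the import paths and filename pattern are illustrative):

    import re
    from functools import partial

    from utilities.client_utils import S3FileGroup, csv_demo_group_preprocess
    from utilities.library_utils import S3FileType  # illustrative import path

    # Collect *.csv demo files and compile them with a pipe separator.
    demo_group = S3FileGroup(
        file_type=S3FileType.DEMOS,
        filename_test_expr=re.compile(r"^(?P<fileId>.*)\.(?i:csv)$"),
        group_preprocess=partial(
            csv_demo_group_preprocess,
            val_sep_override="|",
            date_column="dos",
        ),
    )
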
def csv_schedule_group_preprocess(schedules: dict[str, PDFLibProto], val_sep_override: str = ',', file_type_override: None | S3FileType = None, date_column: str = 'date', has_header_row: bool = True) ‑> dict[str, RecordsLibEntry]

Concatenate the bytes from all schedule bodies and extract a list of static DOSs and s3 destination keys to associate with each schedule record.

Args

schedules : dict[str, PDFLibProto]
PDFLibEntry objects to concatenate
val_sep_override : str
Optional. Override the default value separator for csv files. Defaults to ",".
file_type_override : lu.S3FileType
Optional. Supply an alternate value to override the default SCHDL file_type (e.g. for OTHCSV inputs)
date_column : str
Optional. The column to use for DOS extraction. Defaults to "date".
has_header_row : bool
Default is True. Set to false if source docs do not include header rows.

Returns

dict[str, lu.RecordsLibEntry]
a new library containing a single entry with concatenated bytes from all inputs with at least one value row.
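
The file_type_override is likewise bound with functools.partial; a minimal sketch for routing miscellaneous CSV inputs as OTHCSV (the enum import path and column name are illustrative):

    from functools import partial

    from utilities.client_utils import csv_schedule_group_preprocess
    from utilities.library_utils import S3FileType  # illustrative import path

    # Compile generic CSV drops under the OTHCSV file type instead of SCHDL.
    othcsv_preprocess = partial(
        csv_schedule_group_preprocess,
        file_type_override=S3FileType.OTHCSV,
        date_column="appt_date",  # hypothetical column name
    )
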
def db_push_dict(*, extracted_dict: dict[str, typing.Any], client_path: str, provider_facility: str, first_dos: str, **kwargs) ‑> dict[str, bool]

Send extracted data to the ClaimMaker DB. All kwargs are supplied by S3Batch.send().

Args

extracted_dict : dict[str, Any]
extracted data for all patients.
client_path : str
s3 client folder / ClaimMaker DB 'provider_name'.
provider_facility : str
facility_name in ClaimMaker DB.
first_dos : str
earliest date of service allowed to post to ClaimMaker DB. Must be in "YYYY-MM-DD" format.

Returns

dict[str, bool]
dict of form {filename: True, …} to indicate successful extraction for all keys in extracted_dict.
def dict_to_disk(*, extracted_dict: dict[str, typing.Any], client_path: str, provider_facility: str, claimmaker_support: bool, file_name_path: str, **kwargs) ‑> dict[str, bool]

Save extracted json data to a local file.

This function must be supplied as a partial to define kwarg file_name_path, e.g. send_func=partial(dict_to_disk, file_name_path="./output/extract.json"), when used in a facility_spec.
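
A minimal sketch of that wiring (the surrounding facility_spec is assumed; only the send_func binding is prescribed here):

    from functools import partial

    from utilities.client_utils import dict_to_disk

    # Dump extracts to a local JSON file instead of posting anywhere.
    send_func = partial(dict_to_disk, file_name_path="./output/extract.json")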

Args

extracted_dict : dict[str, Any]
All extracted data. Supplied by S3Batch.send().
client_path : str
s3 client folder / ClaimMaker DB 'provider_name'. Supplied by S3Batch.send().
provider_facility : str
facility_name in ClaimMaker DB. Supplied by S3Batch.send().
claimmaker_support : bool
Mock create_batches if True. Supplied by S3Batch.send().
file_name_path : str
Local file path where output is written. MUST be set by partial.

Returns

dict[str, bool]
dict of form {filename: False, …} to prevent keys from moving in S3 for locally dumped output.
def docvis_group_preprocess(pdf_library: dict[str, PDFLibProto]) ‑> dict[str, PDFLibProto]

Post PDFs to the docuvision API. No change to pdf_library.

Args

pdf_library
the library as collected from the unprocessed folder

Returns

dict[str, lu.PDFLibProto]
the original library.
def docvis_group_preprocess_precombined(pdf_library: dict[str, PDFLibProto]) ‑> dict[str, PDFLibProto]

Post PDFs to the docuvision API. No change to pdf_library.

Args

pdf_library
the library as collected from the unprocessed folder

Returns

dict[str, lu.PDFLibProto]
the original library.
def dummy_send(extracted_dict: dict[str, typing.Any], **kwargs) ‑> dict[str, bool]

No DB send for json_to_s3 execution. Creates a dummy dict of form {source key 1: True, source key 2: True, …} to leverage the copy/delete functions in S3Batch.send().
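
Functionally this reduces to something like the following sketch (assuming the keys of extracted_dict are the source keys):

    from typing import Any

    def dummy_send_sketch(extracted_dict: dict[str, Any], **kwargs) -> dict[str, bool]:
        """Mark every source key successful so S3Batch.send() still copies/deletes files."""
        return {source_key: True for source_key in extracted_dict}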

def file_preprocess_split_by_first_line(pdf_ref: PDFLibReference, as_references: bool = True, name_mrn_regex: re.Pattern = re.compile('^Patient: (.*) \\(MRN: (.*)\\) - Printed by .*')) ‑> dict[str, PDFLibProto]

Splits the PDF when name_mrn_regex matches the first line of a page.

Args

pdf_ref
a reference to the source PDF to split
as_references
Set to False to convert children to lu.PDFLibEntry. Defaults to True.
name_mrn_regex
regex pattern for the first line; group 1 is the patient name, group 2 is the MRN. Defaults to r"^Patient: (.*) \(MRN: (.*)\) - Printed by .*".

Returns

dict[str, lu.PDFLibProto]
entries for pdf_library
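
A different header line can be accommodated by binding a custom pattern with functools.partial (a sketch; the header format shown is hypothetical):

    import re
    from functools import partial

    from utilities.client_utils import file_preprocess_split_by_first_line

    # Hypothetical export whose pages begin "Name: <patient>  MRN: <mrn>".
    split_on_name_mrn = partial(
        file_preprocess_split_by_first_line,
        name_mrn_regex=re.compile(r"^Name: (.*)\s+MRN: (\S+)"),
    )
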
def file_preprocess_split_by_outline(pdf_ref: PDFLibReference, split_check: collections.abc.Callable[[PDFLibReference], bool] = <function <lambda>>, designator_func: collections.abc.Callable[[list[str]], str] = <function designator_from_demographics>, outline_item_check: collections.abc.Callable[[pypdf.generic._data_structures.Destination | list[pypdf.generic._data_structures.Destination | list[pypdf.generic._data_structures.Destination]]], bool] = <function <lambda>>, as_references: bool = True) ‑> dict[str, PDFLibProto]

Split a pdf according to entries in its navigation pane.

Initially implemented to handle lump op note exports from Epic.

Args

pdf_ref
a reference to the source PDF to split
split_check
a test to determine whether this PDF is eligible for splitting. If False is returned, the pdf_ref is cast to an entry and returned unmodified for downstream handling.
designator_func
given the lines extracted from the first page of a child pdf, return a string to serve as the child's filename.
outline_item_check
PDFs are split at every top-level outline entry by default. If more than one top-level entry exists per patient, supply a custom function to select which entry should trigger the split. E.g., given an outline with "Demographics Patient 1", "Anesthesia Summary Patient 1", "Demographics Patient 2", "Anesthesia Summary Patient 2", etc., a custom outline item check can split at only the "Demographics Patient #" items (see the sketch following this function's Returns).
as_references
If true, children are returned as PDFLibReferences rather than PDFLibEntries.

Returns

dict[str, lu.PDFLibProto]
entries for pdf_library
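
For the "Demographics Patient #" scenario described under outline_item_check, a custom check might look like this sketch (pypdf's Destination exposes a title attribute; nested lists of Destinations are rejected by the isinstance test):

    from functools import partial

    from pypdf.generic import Destination

    from utilities.client_utils import file_preprocess_split_by_outline

    def is_demographics_item(item) -> bool:
        """Split only at top-level 'Demographics Patient N' outline entries."""
        return isinstance(item, Destination) and item.title.startswith("Demographics")

    split_at_demographics = partial(
        file_preprocess_split_by_outline,
        outline_item_check=is_demographics_item,
    )
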
def pdf_schedule_group_preprocess(pdf_library: dict[str, PDFLibProto], column_split_line_regex: re.Pattern, first_value_line_regex: re.Pattern, column_to_value_regexes: list[re.Pattern], final_value_line_regex: re.Pattern | None = None, **kwargs) ‑> dict[str, PDFLibProto]

Extract and compile pdf schedule data into a single RecordsLibEntry.

Given a dict of PDFLibProto objects, extract rows from each PDF, using regular expressions to identify rows and columns and to extract values, then append a RecordsLibEntry carrying the extracted_data. Converts all source PDFs to type MANUAL so they post on placeholder cases.

Args

pdf_library : dict[str, lu.PDFLibProto]
dict of file_id, PDFLibProto pairs containing the entries for all schedule PDFs
column_split_line_regex : re.Pattern
regex pattern of the line that will be used to set the column spans.
first_value_line_regex : re.Pattern
regex pattern to match first line of a row
column_to_value_regexes : list[re.Pattern]
list of regex patterns to extract values from columns.
final_value_line_regex : re.Pattern | None
optional regex pattern to match the final line of a row. If None (default), the current row ends when the next line matches first_value_line_regex. If supplied, the current row ends when the current line matches final_value_line_regex, and that line is included in the row.

KwArgs

join_func : Callable[[Iterable[str]], str]
method for joining the string segments collected for a given column into a single string. The default trims leading and trailing spaces from each segment and joins the segments with a single space UNLESS the first segment ends or the second segment begins with "/" or "-", in which case they are joined without a separator. This avoids representations like "04/20/ 2024" when a date wraps onto the next line, allowing proper date interpretation (see the sketch following this function's Returns).
debug_path : Path
path to write text extraction debug info from pypdf
debug : bool
write internal debug data to log (also triggered when debug_path is supplied)

Returns

dict[str, lu.PDFLibProto]
the extended / updated pdf_library
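
The default joining rule described under join_func can be pictured with this sketch (an illustration of the stated behavior, not the actual implementation):

    from collections.abc import Iterable

    def join_segments_sketch(segments: Iterable[str]) -> str:
        """Join trimmed segments with single spaces, gluing "/" and "-" boundaries."""
        joined = ""
        for segment in (s.strip() for s in segments):
            if not joined:
                joined = segment
            elif joined.endswith(("/", "-")) or segment.startswith(("/", "-")):
                joined += segment  # e.g. "04/20/" + "2024" -> "04/20/2024"
            else:
                joined += " " + segment
        return joined
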
def s3_push_dict(*, extracted_dict: dict[str, typing.Any], s3_bucket: str, client_path: str, facility_path: str, extracts_path: str, json_dumps_kwargs: dict[str, typing.Any] | None = None, **kwargs) ‑> dict[str, bool]

Send extracted data to s3 in json format.

This function must be supplied as a partial to define kwarg extracts_path, e.g. send_func=partial(s3_push_dict, extracts_path="extracts"), when used in a facility_spec.

Args

extracted_dict : dict[str, Any]
All extracted data. Supplied by S3Batch.send().
s3_bucket : str
s3 bucket name. Supplied by S3Batch.send().
client_path : str
s3 client folder. Supplied by S3Batch.send().
facility_path : str
s3 facility subfolder. Supplied by S3Batch.send().
extracts_path : str
s3 extracts subfolder. MUST be set by partial.
json_dumps_kwargs : dict[str, Any]
Optional kwargs for json.dumps. Defaults to {"indent": 2}.

Returns

dict[str, bool]
dict of form {filename: True, …} to indicate successful extraction for all keys in extracted_dict.
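
As with dict_to_disk, functools.partial supplies the required kwarg; json_dumps_kwargs can be bound the same way (a minimal sketch):

    from functools import partial

    from utilities.client_utils import s3_push_dict

    # Write compact JSON extracts to the "extracts" subfolder.
    send_func = partial(
        s3_push_dict,
        extracts_path="extracts",
        json_dumps_kwargs={"separators": (",", ":")},
    )
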
def unmatched_pdfs_group_preprocess(pdf_library: dict[str, PDFLibProto]) ‑> dict[str, PDFLibProto]

Extend raw library with unmatched PDFs previously posted to a claimmaker DB.

Args

pdf_library
the library as collected from the unprocessed folder

Returns

dict[str, lu.PDFLibProto]
the library extended with file references for unmatched files previously posted to the claimmaker DB.

Classes

class S3FileGroup (file_type: S3FileType, filename_test_expr: re.Pattern = re.compile('^(?P<fileId>.*)\\.(?i:pdf)$'), groupdict_xforms: dict[str, collections.abc.Callable[[str], str]] = <factory>, file_preprocess: collections.abc.Callable[[PDFLibReference], collections.abc.Mapping[str, PDFLibProto]] = <function _default_file_preprocess>, group_preprocess: collections.abc.Callable[[dict[str, PDFLibProto]], collections.abc.Mapping[str, PDFLibProto]] = <function _default_group_preprocess>, retain_for_reprocess_days: int = 0, check_results: bool | None = None, fallback_dos: collections.abc.Callable[[PDFLibReference], str] = <function _default_fallback_dos>)

Collects and preprocesses library_utils.PDFLibProto objects with filenames matching a predefined regular expression.

Defines both 'file' and 'group' preprocessing functions. Individual file preprocessing occurs in add_entries as each file is collected. Group preprocessing occurs in finalize and operates on the set of all files collected. See specs.builtin_client_specs for example instances. See facility_pdf_library in aws_extractor.py for usage.
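
A representative instantiation might look like the following sketch (the filename pattern, transform, and import paths are illustrative; see specs.builtin_client_specs for real instances):

    import re

    from utilities.client_utils import S3FileGroup, file_preprocess_split_by_first_line
    from utilities.library_utils import S3FileType  # illustrative import path

    # Collect PDFs named "<fileId>_<dos>.pdf", split lumped exports per patient,
    # and keep entries around for a week to match late-arriving demographics.
    pdf_group = S3FileGroup(
        file_type=S3FileType.PDF,
        filename_test_expr=re.compile(
            r"^(?P<fileId>.+)_(?P<dos>\d{4}-\d{2}-\d{2})\.(?i:pdf)$"
        ),
        groupdict_xforms={"dos": str.strip},
        file_preprocess=file_preprocess_split_by_first_line,
        retain_for_reprocess_days=7,
    )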

Attributes

file_type : S3FileType
Filetype being assigned to a file that matched this regex
filename_test_expr : re.Pattern
the most important attribute: the regex that selects files for this group
groupdict_xforms : dict[str, Callable[[str], str]]
defines transformations to apply to named groups in the re.Match object produced by filename_test_expr (if any)
file_preprocess : Callable[[PDFLibReference], Mapping[str, PDFLibProto]]
method called for preprocessing each file's data (i.e. split operations)
group_preprocess : Callable[[dict[str, PDFLibProto]], Mapping[str, PDFLibProto]]
method called for preprocessing all files collected for the group (i.e. combine operations). Must be used for schedule/demo/other csv type inputs to produce a single lu.RecordsLibEntry object. Defaults to an S3FileType-based selector for common algorithms; for example, a group assigned file_type SCHDL will call csv_schedule_group_preprocess() by default.
retain_for_reprocess_days : int
number of days to keep an entry in 'unprocessed'. most commonly used to retain schedules for matching with late posting demographics records.
check_results : bool | None
indicates whether entries in this group must be correlated with an analysis_jobs record to be considered successfully processed and moved in S3. If None (default), groups having S3FileType.PDF will default to True and all other file types will default to False.
fallback_dos : Callable[[PDFLibReference], str]
method for determining DoS when it was not found in the filename. Uses lu.numerics_to_dos by default if 'dos' is not already present in meta.
keys_versions : list
list of KeyVersionTuples for collected files. Used to manage s3 transfers from the unprocessed/ s3 folder to processed/ or failed/.
library : dict[str, PDFLibProto]
dict of files in the group; entries need not be PDFs

Class variables

var check_results : bool | None
var file_type : S3FileType
var filename_test_expr : re.Pattern
var groupdict_xforms : dict[str, collections.abc.Callable[[str], str]]
var keys_versions : list[KeyVersionTuple]
var library : dict[str, PDFLibProto]
var retain_for_reprocess_days : int

Methods

def add_entries(self, s3_key: str, version_list: list[DateVersIsEmptyTuple]) ‑> bool

Given an s3_key, check the filename for a filename_test_expr match. If it matches, create and add PDFLibProto object(s) to self.library according to file_preprocess.

Calls the file_preprocess method to create PDFLibProto object(s) and the fallback_dos method to assign a default date of service, then adds the s3_key and version_id to self.keys_versions. See specs.builtin_client_specs for example instances. See facility_pdf_library in aws_extractor.py for usage.

Args

s3_key : str
An s3 key for an extract target
version_list : list[DateVersIsEmptyTuple]
File versions for s3_key.

Returns

bool
True if the file was added to the library, False otherwise.
def fallback_dos(entry: PDFLibReference) ‑> str

Default 'fallback_dos' function for the S3FileGroup dataclass.

Args

entry : PDFLibReference
PDFLibReference to the file being processed.

Returns

str
Default date of service (DoS) for the file being processed.
def file_preprocess(pdf_lib_ref: PDFLibReference) ‑> collections.abc.Mapping[str, PDFLibProto]

Default 'file_preprocess' function for the S3FileGroup dataclass. Automatically returns PDFLibEntry for PDF files and PDFLibReference for all other files.

Args

pdf_lib_ref : PDFLibReference
Reference to the file being processed.

Returns

dict[str, PDFLibProto]
PDFLibProto object with filename key for the file being processed.
def finalize(self) ‑> collections.abc.Mapping[str, PDFLibProto]

Call after collecting all keys into the group to apply the 'group_preprocess' method and return the finalized library.

def group_preprocess(pdf_library: dict[str, PDFLibProto]) ‑> collections.abc.Mapping[str, PDFLibProto]

Default 'group_preprocess' function for the S3FileGroup dataclass.

Calls 'csv_schedule_group_preprocess' for S3FileTypes SCHDL and OTHCSV, 'csv_demo_group_preprocess' for S3FileType DEMOS, and returns the unmodified input for all other file types.
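
The dispatch amounts to something like this sketch (the real default is bound to the group instance; the enum import path is illustrative):

    from collections.abc import Mapping

    from utilities.client_utils import (
        csv_demo_group_preprocess,
        csv_schedule_group_preprocess,
    )
    from utilities.library_utils import S3FileType  # illustrative import path

    def default_group_preprocess_sketch(file_type: S3FileType, pdf_library: dict) -> Mapping:
        """Pick a common compile algorithm based on the group's file_type."""
        if file_type in (S3FileType.SCHDL, S3FileType.OTHCSV):
            return csv_schedule_group_preprocess(pdf_library)
        if file_type is S3FileType.DEMOS:
            return csv_demo_group_preprocess(pdf_library)
        return pdf_library  # all other types pass through unmodified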