Module utilities.client_utils
Container for send functions called via client_specs and client_specs_dev
Functions
def csv_demo_group_preprocess(demos: dict[str, PDFLibProto], val_sep_override: str = ',', date_column: str = 'date', has_header_row: bool = True) ‑> dict[str, RecordsLibEntry]
-
Concatenate the bytes from all demo bodies and extract a list of static DOSs to associate with each demo record.
Args
demos
:Mapping[str, PDFLibProto]
- PDFLibEntry objects to concatenate
val_sep_override
:str
- Optional. Overrides the inferred separator. When not overridden, the separator is chosen as "," or "|", whichever yields the greater number of fields when splitting the first line of the source.
date_column
:str
- Optional. The column to use for DOS extraction. Defaults to "date".
has_header_row
:bool
- Default is True. Set to false if source docs do not include header rows.
Returns
dict[str, RecordsLibEntry]
- a new library containing a single entry with the concatenated bytes of all inputs having at least one value row.
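The separator inference described for val_sep_override amounts to the following minimal sketch (an illustration of the documented rule, not the library's internal code):

    # Illustration only: pick whichever candidate separator yields more
    # fields when splitting the first line of the source.
    def infer_separator(first_line: str) -> str:
        return max((",", "|"), key=lambda sep: len(first_line.split(sep)))

    print(infer_separator("last|first|mrn|date"))  # -> "|"
    print(infer_separator("last,first,mrn,date"))  # -> ","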
def csv_schedule_group_preprocess(schedules: dict[str, PDFLibProto], val_sep_override: str = ',', file_type_override: None | S3FileType = None, date_column: str = 'date', has_header_row: bool = True) ‑> dict[str, RecordsLibEntry]
-
Concatenate the bytes from all schedule bodies and extract a list of static DOSs and s3 destination keys to associate with each schedule record.
Args
schedules
:dict[str, PDFLibProto]
- PDFLibEntry objects to concatenate
val_sep_override
:str
- Optional. Override the default value separator for csv files. Defaults to ",".
file_type_override
:lu.S3FileType | None
- Optional. Supply an alternate value to override the default SCHDL file_type (e.g. for OTHCSV inputs).
date_column
:str
- Optional. The column to use for DOS extraction. Defaults to "date".
has_header_row
:bool
- Default is True. Set to false if source docs do not include header rows.
Returns
dict[str, RecordsLibEntry]
- a new library containing a single entry with the concatenated bytes of all inputs having at least one value row.
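As a hedged configuration sketch, file_type_override and date_column can be bound with a partial when assigning group_preprocess on an S3FileGroup (assuming lu aliases library_utils, as elsewhere in these docs; the filename regex and column name are hypothetical):

    import re
    from functools import partial
    import library_utils as lu  # assumed alias, per references elsewhere in this doc
    from utilities.client_utils import S3FileGroup, csv_schedule_group_preprocess

    # Hypothetical group for miscellaneous CSV inputs: reuse the schedule
    # preprocessor, but tag output with the OTHCSV file type and pull the
    # DOS from a nonstandard column.
    other_csv_group = S3FileGroup(
        file_type=lu.S3FileType.OTHCSV,
        filename_test_expr=re.compile(r"^(?P<fileId>.*)\.(?i:csv)$"),
        group_preprocess=partial(
            csv_schedule_group_preprocess,
            file_type_override=lu.S3FileType.OTHCSV,
            date_column="dos",  # hypothetical column name
        ),
    )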
def db_push_dict(*, extracted_dict: dict[str, typing.Any], client_path: str, provider_facility: str, first_dos: str, **kwargs) ‑> dict[str, bool]
-
Send extracted data to the ClaimMaker DB. All kwargs are supplied by S3Batch.send().
Args
extracted_dict
:dict[str, Any]
- extracted data for all patients.
client_path
:str
- s3 client folder / ClaimMaker DB 'provider_name'.
provider_facility
:str
- facility_name in ClaimMaker DB.
first_dos
:str
- earliest date of service allowed to post to ClaimMaker DB. Must be in "YYYY-MM-DD" format.
Returns
dict[str, bool]
- dict of form {filename: True, …} to indicate successful extraction for all keys in extracted_dict.
def dict_to_disk(*, extracted_dict: dict[str, typing.Any], client_path: str, provider_facility: str, claimmaker_support: bool, file_name_path: str, **kwargs) ‑> dict[str, bool]
-
Save extracted json data to a local file.
This function must be supplied as a partial to define the kwarg file_name_path, e.g. send_func=partial(dict_to_disk, file_name_path="./output/extract.json"), when used in a facility_spec.
Args
extracted_dict
:dict[str, Any]
- all extracted data. supplied by S3Batch.send()
client_path
:str
- s3 client folder / ClaimMaker DB 'provider_name'. Supplied by S3Batch.send().
provider_facility
:str
- facility_name in ClaimMaker DB. supplied by S3Batch.send()
claimmaker_support
:bool
- mock create_batches if True. supplied by S3Batch.send()
file_name_path
:str
- Local file path where output is written. MUST be set by partial.
Returns
dict[str, bool]
- dict of form {filename: False, …} to prevent keys from moving in S3 for locally dumped output.
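A minimal usage sketch of the partial pattern described above (the output path is illustrative):

    from functools import partial
    from utilities.client_utils import dict_to_disk

    # Bind the required file_name_path; all remaining kwargs are supplied
    # by S3Batch.send() at send time.
    send_func = partial(dict_to_disk, file_name_path="./output/extract.json")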
def docvis_group_preprocess(pdf_library: dict[str, PDFLibProto]) ‑> dict[str, PDFLibProto]
-
Post PDFs to the docuvision API. No change to pdf_library.
Args
pdf_library
- the library as collected from the unprocessed folder
Returns
dict[str, lu.PDFLibProto]
- the original library.
def docvis_group_preprocess_precombined(pdf_library: dict[str, PDFLibProto]) ‑> dict[str, PDFLibProto]
-
Post PDFs to the docuvision API. No change to pdf_library.
Args
pdf_library
- the library as collected from the unprocessed folder
Returns
dict[str, lu.PDFLibProto]
- the original library.
def dummy_send(extracted_dict: dict[str, typing.Any], **kwargs) ‑> dict[str, bool]
-
No DB send for json_to_s3 execution. Creates a dummy dict of the form {source key 1: True, source key 2: True, …} to leverage the copy/delete functions in S3Batch.send().
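The contract described above amounts to something like this sketch (an illustration, not the shipped implementation):

    from typing import Any

    # Mark every source key True so S3Batch.send() runs its copy/delete
    # bookkeeping even though no DB send occurs.
    def dummy_send_sketch(extracted_dict: dict[str, Any], **kwargs) -> dict[str, bool]:
        return {source_key: True for source_key in extracted_dict}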
def file_preprocess_split_by_first_line(pdf_ref: PDFLibReference, as_references: bool = True, name_mrn_regex: re.Pattern = re.compile('^Patient: (.*) \\(MRN: (.*)\\) - Printed by .*')) ‑> dict[str, PDFLibProto]
-
Splits the PDF when name_mrn_regex matches the first line of a page.
Args
pdf_ref
- a reference to the source PDF to split
as_references
- Set to False to convert children to lu.PDFLibEntry. Defaults to True.
name_mrn_regex
- Regex pattern for the first line; group 1 is the patient name, group 2 is the MRN. Defaults to r"^Patient: (.*) \(MRN: (.*)\) - Printed by .*".
Returns
dict[str, lu.PDFLibProto]
- entries for pdf_library
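To adapt the splitter to a different header line, name_mrn_regex can be overridden via partial; the pattern below is hypothetical:

    import re
    from functools import partial
    from utilities.client_utils import file_preprocess_split_by_first_line

    # Hypothetical first-line format "Pt: DOE, JANE [123456]": group 1
    # captures the patient name and group 2 the MRN, mirroring the group
    # layout of the default pattern.
    split_func = partial(
        file_preprocess_split_by_first_line,
        name_mrn_regex=re.compile(r"^Pt: (.*) \[(\d+)\]"),
    )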
def file_preprocess_split_by_outline(pdf_ref: PDFLibReference, split_check: collections.abc.Callable[[PDFLibReference], bool] = <function <lambda>>, designator_func: collections.abc.Callable[[list[str]], str] = <function designator_from_demographics>, outline_item_check: collections.abc.Callable[[pypdf.generic._data_structures.Destination | list[pypdf.generic._data_structures.Destination | list[pypdf.generic._data_structures.Destination]]], bool] = <function <lambda>>, as_references: bool = True) ‑> dict[str, PDFLibProto]
-
Split a pdf according to entries in its navigation pane.
Initially implemented to handle lump op note exports from Epic.
Args
pdf_ref
- a reference to the source PDF to split
split_check
- A test to determine whether this PDF is eligible for splitting. If False is returned, the pdf_ref is cast to an entry and returned unmodified for downstream handling.
designator_func
- given the lines extracted from the first page of a child pdf, return a string to serve as the child's filename.
outline_item_check
- PDFs are split at every top-level outline entry by default. If more than one top-level entry exists per patient, supply a custom function to select which entries should trigger a split. E.g. a custom check can split only at the "Demographics Patient #" items given an outline of "Demographics Patient 1", "Anesthesia Summary Patient 1", "Demographics Patient 2", "Anesthesia Summary Patient 2", etc. See the sketch after this entry.
as_references
- If true, children are returned as PDFLibReferences rather than PDFLibEntries.
Returns
dict[str, lu.PDFLibProto]
- entries for pdf_library
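The "Demographics Patient #" example could be expressed roughly as follows (a sketch; it assumes top-level outline items expose pypdf's Destination.title, while nested lists represent sub-outlines that should not trigger splits):

    from functools import partial
    from utilities.client_utils import file_preprocess_split_by_outline

    # Split only at top-level "Demographics ..." outline items; nested
    # lists (sub-outlines) and other titles do not trigger a split.
    def demographics_only(item) -> bool:
        return not isinstance(item, list) and str(item.title).startswith("Demographics")

    split_func = partial(
        file_preprocess_split_by_outline,
        outline_item_check=demographics_only,
    )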
def pdf_schedule_group_preprocess(pdf_library: dict[str, PDFLibProto], column_split_line_regex: re.Pattern, first_value_line_regex: re.Pattern, column_to_value_regexes: list[re.Pattern], final_value_line_regex: re.Pattern | None = None, **kwargs) ‑> dict[str, PDFLibProto]
-
Extract and compile PDF schedule data into a single RecordsLibEntry.
Given a dict of PDFLibProto objects, extract rows from the PDFs, using regular expressions to identify rows and columns and to extract values, then append a RecordsLibEntry carrying the extracted data. Converts all source PDFs to type MANUAL so they post on placeholder cases.
Args
pdf_library
:dict[str, lu.PDFLibProto]
- dict of file_id to PDFLibProto containing the entries for all schedule PDFs
column_split_line_regex
:re.Pattern
- regex pattern of the line that will be used to set the column spans.
first_value_line_regex
:re.Pattern
- regex pattern to match first line of a row
column_to_value_regexes
:list[re.Pattern]
- list of regex patterns to extract values from columns.
final_value_line_regex
:re.Pattern | None
- Optional regex pattern to match the final line of a row. If None (default), the current row ends when the next line matches first_value_line_regex. If supplied, the current row ends when the current line matches final_value_line_regex, and that line is included in the row.
KwArgs
join_func
:Callable[[Iterable[str]], str]
- Method for joining the string segments collected for a given column into a single string. The default trims leading and trailing spaces from each segment and joins segments with a single space UNLESS the first segment ends, or the second begins, with "/" or "-", in which case the segments are joined without a separator. This allows proper date interpretation by avoiding representations like "04/20/ 2024" when a date wraps onto the next line. See the sketch after this entry.
debug_path
:Path
- path to write text extraction debug info from pypdf
debug
:bool
- write internal debug data to log (also triggered when debug_path is supplied)
Returns
dict[str, lu.PDFLibProto]
- the extended / updated pdf_library
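The default join behavior described under join_func can be sketched as follows (an illustration of the documented rule, not the shipped code):

    from collections.abc import Iterable

    # Join column segments with single spaces, except across "/" or "-"
    # boundaries, so wrapped dates like "04/20/" + "2024" rejoin cleanly.
    def default_join_sketch(segments: Iterable[str]) -> str:
        out = ""
        for seg in (s.strip() for s in segments):
            if not out:
                out = seg
            elif out.endswith(("/", "-")) or seg.startswith(("/", "-")):
                out += seg
            else:
                out += " " + seg
        return out

    print(default_join_sketch(["04/20/", "2024"]))  # -> "04/20/2024"
    print(default_join_sketch(["SMITH,", "JOHN"]))  # -> "SMITH, JOHN"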
def s3_push_dict(*, extracted_dict: dict[str, typing.Any], s3_bucket: str, client_path: str, facility_path: str, extracts_path: str, json_dumps_kwargs: dict[str, typing.Any] | None = None, **kwargs) ‑> dict[str, bool]
-
Send extracted data to s3 in json format.
This function must be supplied as a partial to define the kwarg extracts_path, e.g. send_func=partial(s3_push_dict, extracts_path="extracts"), when used in a facility_spec.
Args
extracted_dict
:dict[str, Any]
- all extracted data. supplied by S3Batch.send()
s3_bucket
:str
- s3 bucket name. supplied by S3Batch.send()
client_path
:str
- s3 client folder. supplied by S3Batch.send()
facility_path
:str
- s3 facility subfolder. supplied by S3Batch.send()
extracts_path
:str
- s3 extracts subfolder. MUST be set by partial.
json_dumps_kwargs
:dict[str, Any]
- optional kwargs for json.dumps. Defaults to {"indent": 2}.
Returns
dict[str, bool]
- dict of form {filename: True, …} to indicate successful extraction for all keys in extracted_dict.
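A usage sketch of the partial pattern with custom JSON formatting (values illustrative):

    from functools import partial
    from utilities.client_utils import s3_push_dict

    # Bind the required extracts_path and adjust JSON output; the remaining
    # kwargs (bucket, client/facility paths, data) come from S3Batch.send().
    send_func = partial(
        s3_push_dict,
        extracts_path="extracts",
        json_dumps_kwargs={"indent": 2, "sort_keys": True},
    )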
def unmatched_pdfs_group_preprocess(pdf_library: dict[str, PDFLibProto]) ‑> dict[str, PDFLibProto]
-
Extend the raw library with unmatched PDFs previously posted to a ClaimMaker DB.
Args
pdf_library
- the library as collected from the unprocessed folder
Returns
dict[str, lu.PDFLibProto]
- the library extended with file references for unmatched files previously posted to the ClaimMaker DB.
Classes
class S3FileGroup (file_type: S3FileType, filename_test_expr: re.Pattern = re.compile('^(?P<fileId>.*)\\.(?i:pdf)$'), groupdict_xforms: dict[str, collections.abc.Callable[[str], str]] = <factory>, file_preprocess: collections.abc.Callable[[PDFLibReference], collections.abc.Mapping[str, PDFLibProto]] = <function _default_file_preprocess>, group_preprocess: collections.abc.Callable[[dict[str, PDFLibProto]], collections.abc.Mapping[str, PDFLibProto]] = <function _default_group_preprocess>, retain_for_reprocess_days: int = 0, check_results: bool | None = None, fallback_dos: collections.abc.Callable[[PDFLibReference], str] = <function _default_fallback_dos>)
-
Collects and preprocesses library_utils.PDFLibProto objects with filenames matching a predefined regular expression.
Defines both 'file' and 'group' preprocessing functions. Individual file preprocessing occurs in add_entries as each file is collected. Group preprocessing occurs in finalize and operates on the set of all files collected. See specs.builtin_client_specs for example instances. See facility_pdf_library in aws_extractor.py for usage.
Attributes
file_type
:S3FileType
- Filetype being assigned to a file that matched this regex
filename_test_expr
:re.Pattern
- The most important attribute: the regex that selects the files for this group
groupdict_xforms
:dict[str, Callable[[str], str]]
- defines transformations to apply to named groups in the re.Match object produced by filename_test_expr (if any)
file_preprocess
:Callable[[PDFLibReference], Mapping[str, PDFLibProto]]
- method called for preprocessing each file's data (i.e. split operations)
group_preprocess
:Callable[[dict[str, PDFLibProto]], Mapping[str, PDFLibProto]]
- Method called for preprocessing all files collected for the group (i.e. combine operations). Must be used for schedule/demo/other CSV-type inputs to produce a single lu.RecordsLibEntry object. Defaults to an S3FileType-based selector for common algorithms; for example, a group assigned file_type SCHDL will call csv_schedule_group_preprocess() by default.
retain_for_reprocess_days
:int
- number of days to keep an entry in 'unprocessed'. most commonly used to retain schedules for matching with late posting demographics records.
check_results
:bool | None
- indicates whether entries in this group must be correlated with an analysis_jobs record to be considered successfully processed and moved in S3. If None (default), groups having S3FileType.PDF will default to True and all other file types will default to False.
fallback_dos
:Callable[[PDFLibReference], str]
- Method for determining the DoS when it is not found in the filename. Uses lu.numerics_to_dos by default if 'dos' is not already present in meta.
keys_versions
:list
- list of KeyVersionTuples for collected files. Used to manage s3 transfers from the unprocessed/ s3 folder to processed/ or failed/.
library
:dict[str, PDFLibProto]
- dict of files in the group; entries need not be PDFs
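Putting the attributes together, a hedged instantiation sketch (the filename pattern and transform are hypothetical; see specs.builtin_client_specs for real instances):

    import re
    import library_utils as lu  # assumed alias, per references elsewhere in this doc
    from utilities.client_utils import S3FileGroup

    # Hypothetical group matching "<fileId>_<dos>.pdf" filenames; the
    # groupdict_xforms entry normalizes the captured fileId to uppercase.
    demo_pdf_group = S3FileGroup(
        file_type=lu.S3FileType.PDF,
        filename_test_expr=re.compile(
            r"^(?P<fileId>.+)_(?P<dos>\d{4}-\d{2}-\d{2})\.(?i:pdf)$"
        ),
        groupdict_xforms={"fileId": str.upper},
        retain_for_reprocess_days=0,
    )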
Class variables
var check_results : bool | None
var file_type : S3FileType
var filename_test_expr : re.Pattern
var groupdict_xforms : dict[str, collections.abc.Callable[[str], str]]
var keys_versions : list[KeyVersionTuple]
var library : dict[str, PDFLibProto]
var retain_for_reprocess_days : int
Methods
def add_entries(self, s3_key: str, version_list: list[DateVersIsEmptyTuple]) ‑> bool
-
Given an s3_key, check the filename for a filename_test_expr match. If it matches, create and add PDFLibProto object(s) to self.library according to file_preprocess.
Calls the file_preprocess method to create PDFLibProto object(s) and the fallback_dos method to assign a default date of service, and adds the s3_key and version_id to self.keys_versions. See specs.builtin_client_specs for example instances. See facility_pdf_library in aws_extractor.py for usage.
Args
s3_key
:str
- An s3 key for an extract target
version_list
:list[DateVersIsEmptyTuple]
- File versions for s3_key.
Returns
bool
- True if the file was added to the library, False otherwise.
def fallback_dos(entry: PDFLibReference) ‑> str
-
Default 'fallback_dos' function for the S3FileGroup dataclass.
Args
entry
:PDFLibReference
- PDFLibReference to the file being processed.
Returns
str
- Default date of service (DoS) for the file being processed.
def file_preprocess(pdf_lib_ref: PDFLibReference) ‑> collections.abc.Mapping[str, PDFLibProto]
-
Default 'file_preprocess' function for the S3FileGroup dataclass. Automatically returns PDFLibEntry for PDF files and PDFLibReference for all other files.
Args
pdf_lib_ref
:PDFLibReference
- Reference to the file being processed.
Returns
dict[str, PDFLibProto]
- PDFLibProto object with filename key for the file being processed.
def finalize(self) ‑> collections.abc.Mapping[str, PDFLibProto]
-
Call after collecting all keys into the group to apply the 'group_preprocess' method and return the finalized library.
def group_preprocess(pdf_library: dict[str, PDFLibProto]) ‑> collections.abc.Mapping[str, PDFLibProto]
-
Default 'group_preprocess' function for the S3FileGroup dataclass.
Calls csv_schedule_group_preprocess for S3FileTypes SCHDL and OTHCSV, csv_demo_group_preprocess for S3FileType DEMOS, and returns the unmodified input for all other file types. See the sketch below.
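The dispatch described here amounts to something like the following sketch (illustrative; the actual selector is the dataclass default):

    import library_utils as lu  # assumed alias, per references elsewhere in this doc
    from utilities.client_utils import (
        csv_demo_group_preprocess,
        csv_schedule_group_preprocess,
    )

    # Route CSV-like groups to the matching preprocessor; all other file
    # types pass through unchanged.
    def default_group_preprocess_sketch(file_type, pdf_library):
        if file_type in (lu.S3FileType.SCHDL, lu.S3FileType.OTHCSV):
            return csv_schedule_group_preprocess(pdf_library)
        if file_type is lu.S3FileType.DEMOS:
            return csv_demo_group_preprocess(pdf_library)
        return pdf_library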