Module aws_s3_batch
Manages S3 and database interactions for the PDF extractor. Called from extract_s3.py.
Classes
class S3Batch (pdf_library: dict[str, PDFLibProto], keys_versions: set[KeyVersionTuple] | list[KeyVersionTuple], failed_only: bool = False, **kwargs)
-
Container for processing all PDFs for a single facility.
Exposes methods to trigger execution of all pdf-extractor modules and send extracted data to DB, S3, or disk.
Creates a new S3Batch instance.
Args
pdf_library
:dict[str, lu.PDFLibProto]
- dict of PDFLibProto objects representing PDFs to be extracted.
keys_versions
:lu.KeysVersions
- set of lu.KeyVersionTuple objects representing S3 objects to be copied and deleted.
failed_only
:bool
- Optional. If True, copy_failed_objects will be called and self.keys_versions will be updated to remove successfully copied objects. Defaults to False.
KwArgs
See S3BatchArgs.
Instance variables
var original_filenames
-
The set of filenames for S3 keys in the original keys_versions object.
var stitch_map
-
A mapping between source pdf entries and stitching destinations.
WARNING: Accessing this property before calling S3Batch.extract_pdfs() on this S3Batch instance may result in unexpected behavior.
prop transformed : dict[str, dict[str, typing.Any]]
-
Return self.xfmr.output if it is available and is a dict.
Methods
def add_placeholders(self)
-
Add placeholder cases for the set of dates of service found in the meta dicts of self.placeholder_documents and add corresponding PDFs to their input->'entities' lists.
def add_xform_debug_output(self, pdf_summary: dict[str, dict[str, typing.Any]] | None = None, pdf_deduped: dict[str, dict[str, typing.Any]] | None = None, pdf_xformed: dict[str, dict[str, typing.Any]] | None = None)
-
Add pre_summary, summary, and transformed output to debug_output.
def check_insurances(self)
-
Update and assign insurance IDs for extracted insurance data.
For each extracted insurance, search common_db.shared.insurance_providers for a matching entry and update the extracted value with standard DB values for company name, plan name, phone, address, and insuranceID. If no record is found, set the confidence of all extracted fields to 0 and leave insuranceID blank.
Environment Variable/FacilitySpec Setting: INSURANCE_INTEGRATION_MODE/insurance_integration_mode: "0" or "1". If "0", disable this routine entirely. If "1" (default), execute the routine as described above. NOTE: The FacilitySpec setting overrides env var if not None.
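The precedence rule in the note above can be sketched as follows. This is a minimal illustration, not the module's actual code: resolve_integration_mode is a hypothetical helper, and it assumes that an unset FacilitySpec setting is represented as None.

```python
import os
from typing import Optional


def resolve_integration_mode(
    spec_value: Optional[str],
    env_var: str = "INSURANCE_INTEGRATION_MODE",
    default: str = "1",
) -> str:
    """Pick the effective mode: FacilitySpec value, then env var, then default."""
    # Hypothetical helper: a non-None FacilitySpec setting always wins.
    if spec_value is not None:
        return spec_value
    # Otherwise fall back to the environment variable, or the documented default "1".
    return os.environ.get(env_var, default)
```

With this shape, a facility can pin its own mode while other facilities continue to follow the process-wide environment variable.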
def check_providers(self) ‑> bool
-
Find NPIs for extracted provider names.
Search local db provider store and public API for extracted surgeon and anesthesiaStaff provider names and assign NPIs accordingly.
Environment Variable/FacilitySpec Setting: PROVIDER_INTEGRATION_MODE/provider_integration_mode: "0", "1" (default), or "2". If "0", skip NPI lookups altogether. If "1", set the confidence of any unmatched provider name to 0.0 and return the as-found name and an empty, 0.0-confidence vStr for NPI. If "2", return as-found values but update the local DB store with the lookup result (used during startup to initialize the local DB table). NOTE: The FacilitySpec setting overrides env var if not None.
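As a rough sketch of the mode values described above, the three settings could be modeled as an enum, with the mode "1" treatment of an unmatched name written out. The names ProviderMode, Scored, and unmatched_mode_1 are hypothetical; Scored is only a stand-in for vStr, whose real definition is not shown here.

```python
from dataclasses import dataclass
from enum import Enum


class ProviderMode(Enum):
    """Hypothetical labels for the documented mode strings."""
    SKIP = "0"        # no NPI lookups at all
    MATCH = "1"       # default: zero the confidence of unmatched names
    SEED_STORE = "2"  # keep as-found values, update the local DB store


@dataclass
class Scored:
    """Stand-in for vStr (assumed here to be a value plus a confidence)."""
    value: str
    confidence: float


def unmatched_mode_1(name: Scored) -> tuple[Scored, Scored]:
    """Mode "1" result for a provider name with no NPI match."""
    # Keep the as-found name but drop its confidence to 0.0,
    # and pair it with an empty, 0.0-confidence NPI.
    return Scored(name.value, 0.0), Scored("", 0.0)
```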
def collect_docuvision_results(self) ‑> bool
-
Save results from docuvision PDFs to self.section_dict.
def copy_failed_objects(self, failed_keys: set[KeyVersionTuple] | list[KeyVersionTuple]) ‑> set[KeyVersionTuple] | list[KeyVersionTuple]
-
Copy objects to S3 failed folder.
def copy_object(self, key_vers: KeyVersionTuple, dest: str) ‑> bool
-
Copy a single S3 object from one key to another.
def copy_processed_objects(self, push_results: dict) ‑> tuple[set[KeyVersionTuple] | list[KeyVersionTuple], set[KeyVersionTuple] | list[KeyVersionTuple]]
-
Returns two sets of tuples of (str, str). The first contains (key, version) pairs for files that didn't extract properly or failed to meet the file naming conventions for extractable files. The second contains (key, version) pairs for files that encountered an error during the S3 copy process itself.
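The two-collection return value described above can be pictured with the sketch below. It is not the actual implementation: KeyVersion is a hypothetical stand-in for KeyVersionTuple, copy_fn is an injected placeholder for the S3 copy, and the sketch assumes push_results maps S3 keys to a success flag, which this reference does not confirm.

```python
from typing import Callable, Iterable, NamedTuple


class KeyVersion(NamedTuple):
    """Hypothetical stand-in for KeyVersionTuple: (key, version)."""
    key: str
    version: str


def partition_results(
    keys_versions: Iterable[KeyVersion],
    push_results: dict[str, bool],
    copy_fn: Callable[[KeyVersion], bool],
) -> tuple[set[KeyVersion], set[KeyVersion]]:
    """Return (failed extraction or naming, failed during copy)."""
    failed_extract: set[KeyVersion] = set()
    failed_copy: set[KeyVersion] = set()
    for kv in keys_versions:
        if not push_results.get(kv.key, False):
            # Missing or False result: did not extract or was not extractable.
            failed_extract.add(kv)
        elif not copy_fn(kv):
            # Extracted fine, but the copy step itself reported an error.
            failed_copy.add(kv)
    return failed_extract, failed_copy
```

Keeping the two failure kinds apart lets a caller retry copy errors without re-running extraction.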
def delete_objects(self, keys_vers_to_del: set[KeyVersionTuple] | list[KeyVersionTuple]) ‑> str
-
Permanently deletes all S3 object keys in keys_vers_to_del.
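How this method talks to S3 is not shown in this reference. As an illustration only, the sketch below builds request payloads in the shape accepted by the boto3 S3 client's delete_objects call (the Delete argument), split into groups of at most 1000 keys, which is the per-request limit for that API. KeyVersion and build_delete_payloads are hypothetical names.

```python
from typing import Iterable, NamedTuple


class KeyVersion(NamedTuple):
    """Hypothetical stand-in for KeyVersionTuple: (key, version)."""
    key: str
    version: str


def build_delete_payloads(
    keys_vers: Iterable[KeyVersion], batch_size: int = 1000
) -> list[dict]:
    """Build one Delete payload per batch of versioned keys."""
    # Sort so that the batches are deterministic even when given a set.
    items = [{"Key": kv.key, "VersionId": kv.version} for kv in sorted(keys_vers)]
    return [
        {"Objects": items[i : i + batch_size], "Quiet": True}
        for i in range(0, len(items), batch_size)
    ]


# Each payload could then be sent with a boto3 S3 client, for example:
#   client.delete_objects(Bucket=bucket_name, Delete=payload)
```

Including VersionId removes that specific version, which is what makes the deletion permanent on a versioned bucket.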
def drop_duplicate_cases(self)
-
Drop duplicate cases from self.transformed and update the input entities of the retained case with a 'SUPERSEDED' copy of the source PDF of the duplicate case.
def extract_pdfs(self) ‑> bool
-
Calls pdf_extractor.py functions.
def extract_sections(self, debug=False) ‑> bool
-
Calls section_extractor.py and table_extractor.py functions.
def map_results_to_s3_keys(self) ‑> dict[str, bool]
-
Updates send_results to include keys for pdfs that were either split into multiple cases or stitched to another pdf during processing.
def match_extracts(self, src_job_ids: collections.abc.Sequence[int] = (-1,)) ‑> bool
-
Match transformer output with:
- data from existing DB cases having the same date of service as an extracted record, and
- schedule and demographic data obtained from CSVs and other discrete references.
Args
src_job_ids
:Sequence[int]
- Optional. List of analysis_jobs ids to include in addition to those identified via date of service. Defaults to (-1,), i.e. no additional ids are included.
def match_schedule_to_demographics(self) ‑> dict[str, dict[str, vStr]]
-
Extract, match, and reformat data found in discrete schedule and demographic references and save the results to self.matching_data.
def merge_docvis(self)
-
Merge input, comments, and note from docuvision integrator if required.
def post_docuvision_tasks(self, split_by_pid=True) ‑> bool
-
Create tasks for multipatient docuvision PDFs.
def send(self)
-
Call the send_func defined in the facility spec to save extracted data to DB, S3, or disk.
def transform_tables(self, auto=True, src_job_ids: collections.abc.Sequence[int] = (-1,)) ‑> bool
-
Calls table_transformer.py functions.
class S3BatchArgs (*args, **kwargs)
-
Kwargs template for the S3Batch class.
Attributes
bucket
:str
- S3 bucket containing files to extract
facility_spec
:sp.FacilitySpec
- facility (e.g. chaph) entry from specs/client_specs
log_console
:bool
- if True (default), log to console. If False, log to file.
log_dir
:str
- where to save logs if log_console is False
log_name_template
:str
- log name base for batch (e.g. "chaph_[PREFIX]")
facility_path
:str
- facility_spec key for client_specs
client_path
:str
- synonym for client; organization providing PDFs for processing
section_specs
:dict[str, Any]
- optional override for standard section_specs
summary_specs
:dict[str, Any]
- optional override for standard summary_specs
table_specs
:dict[str, Any]
- optional override for standard table_specs
transform_specs
:dict[str, Any]
- optional override for standard transform_specs
match_specs
:dict[str, Any]
- optional override for standard match_specs
Ancestors
- builtins.dict
Class variables
var bucket : str
var client_path : str
var facility_path : str
var facility_spec : FacilitySpec
var log_console : bool
var log_dir : str
var log_name_template : str
var match_specs : dict[str, MatchSpec]
var section_specs : dict[str, SectionSpec]
var summary_specs : dict[str, typing.Any]
var table_specs : dict[str, dict[str, TableSpec]]
var transform_specs : dict[str, dict[str, dict[str, dict[str, dict[str, dict[str, typing.Any]]]]]]
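The listing above (a dict ancestor plus typed class variables) is consistent with a TypedDict used as a kwargs template. The sketch below shows that pattern on a subset of the documented keys. BatchArgsSketch and log_destination are hypothetical, and the ".log" file naming is an assumption made only for illustration.

```python
from typing import Any, TypedDict


class BatchArgsSketch(TypedDict, total=False):
    """Illustrative subset of the documented S3BatchArgs keys."""
    bucket: str
    log_console: bool
    log_dir: str
    log_name_template: str


def log_destination(**kwargs: Any) -> str:
    """Show how a kwargs template might be read, using the logging keys."""
    # At runtime a TypedDict is a plain dict, so .get() with defaults works.
    args = BatchArgsSketch(**kwargs)
    if args.get("log_console", True):  # documented default: log to console
        return "console"
    # Assumed naming scheme: <log_dir>/<log_name_template>.log
    return f'{args.get("log_dir", ".")}/{args.get("log_name_template", "batch")}.log'
```

A caller would pass the same keyword arguments it hands to S3Batch, for example log_destination(log_console=False, log_dir="logs", log_name_template="chaph_[PREFIX]").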