Module aws_extractor

Manages S3 and database interactions for the PDF extractor. Called from extract_s3.py.

Functions

def configure_db_from_secret(client_spec: ClientSpecStatic | dict[str, FacilitySpec])

Pull connection info from AWS and connect to a ClaimMaker DB instance.

Connection info is pulled from AWS secrets manager. client_spec setting 'db_secret_name' defines the secret to be pulled.

Raises

SystemExit(98)
Failed to retrieve the secret specified by 'db_secret_name' from AWS.
SystemExit(99)
Failed to connect to the DB using the connection information found in the specified secret. Raised if and only if (gvars.NO_DB and gvars.DISABLE_DB) == False. Otherwise, a warning is logged and execution is allowed to continue.

Args

client_spec : sp.ClientSpec
spec for the current client selected from this run's client_specs object.
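
The exit-code behavior described under Raises can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: fetch_secret and connect are hypothetical callables standing in for the AWS Secrets Manager lookup and the DB connection, and the no_db / disable_db parameters stand in for gvars.NO_DB and gvars.DISABLE_DB.

```python
import logging
import sys
from typing import Any, Callable

log = logging.getLogger("aws_extractor_sketch")


def configure_db_sketch(
    client_spec: dict[str, Any],
    fetch_secret: Callable[[str], dict[str, Any]],  # hypothetical secret lookup
    connect: Callable[[dict[str, Any]], Any],       # hypothetical DB connector
    no_db: bool = False,       # stands in for gvars.NO_DB
    disable_db: bool = False,  # stands in for gvars.DISABLE_DB
) -> Any:
    """Return a connection, or None when a connection failure is tolerated."""
    try:
        secret = fetch_secret(client_spec["db_secret_name"])
    except Exception:
        log.error("could not retrieve the DB secret")
        sys.exit(98)

    try:
        return connect(secret)
    except Exception:
        # Fatal unless both flags are set, mirroring
        # (NO_DB and DISABLE_DB) == False in the description above.
        if not (no_db and disable_db):
            log.error("could not connect to the DB")
            sys.exit(99)
        log.warning("DB connection failed; continuing without a DB")
        return None
```

Passing the lookup and connector in as arguments keeps the sketch runnable without AWS credentials; the real function presumably performs both steps itself.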
def extract_buckets(**kwargs) ‑> collections.abc.Iterator[aws_s3_batch.S3Batch]

Process all clients and facilities defined in the client_specs object supplied for this run.

KwArgs

aws_s3_bucket : str
The source AWS S3 Bucket name.
no_db : bool
Somewhat of a misnomer as features have evolved. If true, the extraction is performed in a "read only" mode. SELECTs against DB resources ARE allowed, but writes TO the DB are NOT. S3 keys will be pulled but not updated or created. Defaults to False.
log_console : bool
Send all log data to the console stdout and stderr. Default is False.
log_dir : str
If log_console is False, sets the folder where log files are saved. Defaults to the current working directory.
client_list : list[str]
The list of client keys to process during the run. Optional. Defaults to all clients in the current client_specs object.
facility_list : list[str]
The list of facility keys to process during the run. Optional. Defaults to all facilities for all clients defined in the current client_specs object.
client_specs : ClientSpecs
Optional. Overrides builtin_client_specs if supplied.
match_specs : MatchSpecs
Optional. Overrides builtin_match_specs if supplied.
section_specs : SectionSpecs
Optional. Overrides builtin_section_specs if supplied.
table_specs : TableSpecs
Optional. Overrides builtin_table_specs if supplied.
transform_specs : TransformSpecs
Optional. Overrides builtin_transform_specs if supplied.
summary_specs : SummarySpecs
Optional. Overrides builtin_summary_specs if supplied.
max_keys : int
When set to a positive integer, sets the default for the maximum number of AWS S3 object keys to process per facility. sp.FacilitySpec max_keys settings override the value in "positive integer" mode. A value of -1 triggers recursive operations. In this mode, batches of FacilitySpec.get('max_keys', 1000) S3 keys are processed in series until all keys in the facility S3 folder have been processed. Default is 1000, i.e. the maximum number of keys allowed in an AWS boto3 'list objects' response.
api_secret_name : str
Optional override for all 'api_secret_name' client_spec entries. Used to avoid charges to clients when the container is executing in PROD for non-PROD purposes, e.g. debugging a prior production run or preloading data for a new client.
output_dir : str
Forces debug output to save to disk at the supplied path. Overrides FacilitySpec setting output_dir for all facilities processed during the run. Does NOT interfere with DB or S3 interactions if otherwise enabled.
initialize : bool
If True, sets the global variables for the run and initializes the log file. If False, skips initialization and assumes that the global variables have already been set and the log file has already been initialized. Defaults to True.

Yields

s3b.S3Batch
A batch object for each facility processed during the run.
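
The two max_keys modes described in KwArgs can be summarized with a small sketch. The function name resolve_max_keys and the plain-dict facility spec are assumptions made for illustration; only the precedence rules come from the description above. Values other than a positive integer or -1 are not covered by that description and are not handled specially here.

```python
from typing import Any

S3_LIST_LIMIT = 1000  # documented default: the 'list objects' response limit


def resolve_max_keys(run_max_keys: int, facility_spec: dict[str, Any]) -> tuple[int, bool]:
    """Return (batch_size, process_all_keys) for one facility.

    Hypothetical helper: illustrates the documented precedence only.
    """
    if run_max_keys == -1:
        # Recursive mode: batches of the facility's max_keys (default 1000)
        # are processed in series until the facility folder is exhausted.
        return facility_spec.get("max_keys", S3_LIST_LIMIT), True
    # Positive-integer mode: the run-level value is the default and a
    # facility-level max_keys setting overrides it.
    return facility_spec.get("max_keys", run_max_keys), False


batch_size, process_all = resolve_max_keys(-1, {"max_keys": 200})
```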
def extract_cases(**kwargs)

Split a single PDF into cases or process all pages under a single case using docuvision.

Called for on-demand runs triggered by uploads to the to-extract/ area of S3 (manual UI uploads and scanner service uploads).

KwArgs

aws_s3_bucket : str
The source AWS S3 Bucket name.
no_db : bool
Somewhat of a misnomer as features have evolved. If true, the extraction is performed in a "read only" mode. SELECTs against DB resources ARE allowed, but writes TO the DB are NOT. S3 keys will be pulled but not updated or created. Defaults to False.
log_console : bool
Send all log data to the console stdout and stderr. Default is False.
log_dir : str
If log_console is False, sets the folder where log files are saved. Defaults to the current working directory.
client_specs : ClientSpecs
Optional. Overrides builtin_client_specs if supplied.
section_specs : SectionSpecs
Optional. Overrides builtin_section_specs if supplied.
table_specs : TableSpecs
Optional. Overrides builtin_table_specs if supplied.
transform_specs : TransformSpecs
Optional. Overrides builtin_transform_specs if supplied.
summary_specs : SummarySpecs
Optional. Overrides builtin_summary_specs if supplied.
api_secret_name : str
Optional override for all 'api_secret_name' client_spec entries. Used to avoid charges to clients when the container is executing in PROD for non-PROD purposes, e.g. debugging a prior production run or preloading data for a new client.
def extract_cases_failed(job_id: int, keys_versions: list[KeyVersionTuple], batchargs: dict[str, typing.Any])

Clean up if a fatal error occurs during an on-demand run.

Args

job_id : int
analysis_jobs.id of the placeholder case (when splitting by DocuVision PID) or patient case (when processing a manual upload) that failed.
keys_versions : list[lu.KeyVersionTuple]
list of keys and versions that were processing when the error occurred.
batchargs : dict[str, Any]
batchargs that were used to instantiate the S3Batch object that was processing the keys when the error occurred.
def facility_data(bucket: str, s3_prefix: str, max_keys: int = 1000) ‑> dict[str, list[DateVersIsEmptyTuple]]

Get the list of S3 objects for the target facility from AWS S3.

Permanently deletes all key versions if a "DeleteMarker" is the most recent version. Sets gvars.INCOMPLETE_CLIENTS and gvars.INCOMPLETE_FACILITIES if gvars.PROCESS_ALL_KEYS is True and the number of keys returned for the target facility exceeds max_keys.

Args

bucket : str
The AWS S3 bucket name.
s3_prefix : str
The prefix for the target facility.
max_keys : int
The maximum number of keys to process. Defaults to 1000.

Returns

dict[str, list[lu.DateVersIsEmptyTuple]]
A dict whose keys are the S3 keys of the objects for the target facility. Each value is a list of tuples of the form (last modified date, s3 version id). The last modified date is the first element so that the list sorts chronologically, which ensures that the most recent version is always the final element in the returned list.
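
A short sketch of why the date-first tuple layout is convenient. The sample data and the latest_versions helper are hypothetical. The type name DateVersIsEmptyTuple suggests an additional field that the description above does not cover, so the sketch uses only the two documented elements.

```python
from datetime import datetime, timezone

# Hypothetical data in the documented shape:
# S3 key -> list of (last modified date, S3 version id) tuples.
versions = {
    "facility-a/report.pdf": [
        (datetime(2024, 3, 2, tzinfo=timezone.utc), "v2"),
        (datetime(2024, 1, 5, tzinfo=timezone.utc), "v1"),
        (datetime(2024, 6, 9, tzinfo=timezone.utc), "v3"),
    ],
}


def latest_versions(data: dict[str, list[tuple[datetime, str]]]) -> dict[str, str]:
    """Map each S3 key to the version id of its most recent version."""
    latest = {}
    for key, entries in data.items():
        # Tuples compare element by element, so sorting orders by date first
        # and the newest version ends up last.
        ordered = sorted(entries)
        latest[key] = ordered[-1][1]
    return latest


newest = latest_versions(versions)
```

Because the returned lists are described as already ordered, a caller could take the final element directly; the explicit sort here only makes the ordering rule visible.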
def facility_pdf_library(bucket: str, prefix: str, specs: FacilitySpec) ‑> tuple[dict[str, PDFLibProto], set[KeyVersionTuple] | list[KeyVersionTuple], set[KeyVersionTuple] | list[KeyVersionTuple]]

Calls facility_data() and iterates over the resulting dict to pull the most recent PDF version from S3.

(1) The filename of each S3 key is tested to ensure it conforms to the file naming conventions defined in the FacilitySpec of the target facility and (2) filename metadata is collected and stored in the 'meta' property of the resulting PDFLibProto objects.

Args

bucket : str
The S3 bucket name.
prefix : str
The prefix for the S3 keys.
specs : sp.FacilitySpec
The FacilitySpec for the target facility.

Returns

tuple
A tuple containing:
    - pdf_library (dict[str, lu.PDFLibProto]): A dictionary with keys corresponding to S3 keys and
      values of form {'body': pdf bytes, 'meta': [dict of filename metadata]}.
    - keys_versions (lu.KeysVersions): A list of KeyVersionTuples for all objects having a
      filename matching a *_test_expr defined from the FacilitySpec.
    - bad_keys_versions (lu.KeysVersions): A list of KeyVersionTuples that failed to match
      any of the FacilitySpec test expressions.
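
The split into matching and non-matching keys might look roughly like the sketch below. It assumes that the *_test_expr settings hold regular expressions with named groups; the setting name chart_test_expr, the pattern, and the classify_keys helper are invented for illustration and do not come from the module.

```python
import re
from typing import Any

# Hypothetical FacilitySpec fragment; the setting name and pattern are invented.
spec = {
    "chart_test_expr": r"(?P<mrn>\d+)_(?P<dos>\d{8})\.pdf$",
}


def classify_keys(
    keys_versions: list[tuple[str, str]], facility_spec: dict[str, Any]
) -> tuple[dict[str, dict[str, str]], list[tuple[str, str]], list[tuple[str, str]]]:
    """Return (filename metadata by key, matching keys, non-matching keys)."""
    patterns = [
        re.compile(value)
        for name, value in facility_spec.items()
        if name.endswith("_test_expr")
    ]
    meta, good, bad = {}, [], []
    for key, version in keys_versions:
        filename = key.rsplit("/", 1)[-1]  # test the filename, not the prefix
        match = None
        for pattern in patterns:
            match = pattern.match(filename)
            if match:
                break
        if match is None:
            bad.append((key, version))
        else:
            good.append((key, version))
            meta[key] = match.groupdict()  # named groups become metadata
    return meta, good, bad


meta, good, bad = classify_keys(
    [("fac/123_20240101.pdf", "v1"), ("fac/notes.txt", "v9")], spec
)
```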
def get_facility_batchargs(client: str, facility: str, raw_specs: Union[FacilitySpec, Any]) ‑> aws_s3_batch.S3BatchArgs

Create the S3BatchArgs for this facility using the global template as a base and overriding with facility specific settings.

Args

client : str
The client currently extracting.
facility : str
The facility currently extracting.
raw_specs : dict[str, Any]
The FacilitySpec for the currently extracting facility.

Returns

s3b.S3BatchArgs
The batch arguments for the specified facility.
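
The "global template plus facility overrides" pattern can be sketched with plain dicts. GLOBAL_TEMPLATE, its contents, and the facility_batchargs helper are hypothetical, and the real S3BatchArgs type may differ; the sketch also assumes that only settings already present in the template are overridden.

```python
from typing import Any

# Hypothetical global template, as set_global_batchargs() might produce.
GLOBAL_TEMPLATE: dict[str, Any] = {
    "aws_s3_bucket": "example-bucket",
    "no_db": False,
    "max_keys": 1000,
}


def facility_batchargs(
    client: str,
    facility: str,
    raw_specs: dict[str, Any],
    template: dict[str, Any] = GLOBAL_TEMPLATE,
) -> dict[str, Any]:
    """Copy the template, then apply facility-specific overrides."""
    args = dict(template)  # shallow copy so the shared template is not mutated
    # Assumption: settings unknown to the template are ignored.
    args.update({k: v for k, v in raw_specs.items() if k in template})
    args["client"] = client
    args["facility"] = facility
    return args


args = facility_batchargs("acme", "north", {"max_keys": 250, "unrelated": 1})
```

Copying before updating matters because the same template is reused for every facility in the run.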
def send_error_notification(client: str)

If error notifications were generated, create a notify.Notifier instance and send an email summary of the errors via AWS SES.

Requires a ClaimMaker DB instance. Not supported for "s3 only" operations, e.g. revenue integrity or licensee containers.

def set_global_batchargs(**kwargs)

Create a global S3BatchArgs template to use for all facilities.

KwArgs

See extract_buckets() and extract_cases() KwArgs documentation.