Module aws_extractor
Manages S3 and database interactions for the PDF extractor. Called from extract_s3.py.
Functions
def configure_db_from_secret(client_spec: ClientSpecStatic | dict[str, FacilitySpec])
-
Pull connection info from AWS and connect to a ClaimMaker DB instance.
Connection info is pulled from AWS secrets manager. client_spec setting 'db_secret_name' defines the secret to be pulled.
Raises
SystemExit(98): Failed to retrieve the secret specified by 'db_secret_name' from AWS.
SystemExit(99): Failed to connect to the DB using the connection information found in the specified secret. Raised if and only if (gvars.NO_DB and gvars.DISABLE_DB) == False. Otherwise, a warning is logged and execution is allowed to continue.
Args
client_spec
:sp.ClientSpec
- spec for the current client selected from this run's client_specs object.
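The exit-code behavior described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `configure_db_sketch`, `fetch_secret`, and `connect` are hypothetical names, the secret is assumed to be a JSON string, and the NO_DB/DISABLE_DB escape hatch is assumed to apply only to the connection failure (exit code 99).

```python
import json
import sys
from typing import Any, Callable, Mapping, Optional


def configure_db_sketch(
    client_spec: Mapping[str, Any],
    fetch_secret: Callable[[str], str],
    connect: Callable[[dict], Any],
    no_db: bool = False,
    disable_db: bool = False,
) -> Optional[Any]:
    """Hypothetical stand-in for configure_db_from_secret()."""
    try:
        # 'db_secret_name' selects which secret to pull.
        # Assumption: the secret payload is a JSON document.
        conn_info = json.loads(fetch_secret(client_spec["db_secret_name"]))
    except Exception as exc:
        print(f"could not retrieve DB secret: {exc!r}", file=sys.stderr)
        sys.exit(98)

    try:
        return connect(conn_info)
    except Exception as exc:
        # Exit unless both flags are set; otherwise warn and continue.
        if not (no_db and disable_db):
            print(f"could not connect to DB: {exc!r}", file=sys.stderr)
            sys.exit(99)
        print(f"warning: DB unavailable, continuing: {exc!r}", file=sys.stderr)
        return None
```

Passing the secret lookup and the connection routine as callables keeps the sketch independent of any particular AWS or database client library.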
def extract_buckets(**kwargs) ‑> collections.abc.Iterator[aws_s3_batch.S3Batch]
-
Process all clients and facilities defined in the client_specs object supplied for this run.
KwArgs
aws_s3_bucket
:str
- The source AWS S3 Bucket name.
no_db
:bool
- Somewhat of a misnomer as features have evolved. If true, the extraction is performed in a "read only" mode. SELECTs against DB resources ARE allowed, but writes TO the DB are NOT. S3 keys will be pulled but not updated or created. Defaults to False.
log_console
:bool
- Send all log data to the console stdout and stderr. Default is False.
log_dir
:str
- If log_console is False, sets the folder where log files are saved. Defaults to the current working directory.
client_list
:list[str]
- The list of client keys to process during the run. Optional. Defaults to all clients in the current client_specs object.
facility_list
:list[str]
- The list of facility keys to process during the run. Optional. Defaults to all facilities for all clients defined in the current client_specs object.
client_specs
:ClientSpecs
- Optional. Overrides builtin_client_specs if supplied.
match_specs
:MatchSpecs
- Optional. Overrides builtin_match_specs if supplied.
section_specs
:SectionSpecs
- Optional. Overrides builtin_section_specs if supplied.
table_specs
:TableSpecs
- Optional. Overrides builtin_table_specs if supplied.
transform_specs
:TransformSpecs
- Optional. Overrides builtin_transform_specs if supplied.
summary_specs
:SummarySpecs
- Optional. Overrides builtin_summary_specs if supplied.
max_keys
:int
- When set to a positive integer, sets the default maximum number of AWS S3 object keys to process per facility. sp.FacilitySpec max_keys settings override the value in "positive integer" mode. A value of -1 triggers recursive operations. In this mode, batches of FacilitySpec.get('max_keys', 1000) S3 keys are processed in series until all keys in the facility S3 folder have been processed. Default is 1000, i.e. the maximum number of keys allowed in an AWS boto3 'list objects' response.
api_secret_name
:str
- Optional override for all 'api_secret_name' client_spec entries. Used to avoid charges to clients when the container is executing in PROD for non-PROD purposes, e.g. debugging a prior production run or preloading data for a new client.
output_dir
:str
- Forces debug output to save to disk at the supplied path. Overrides the FacilitySpec output_dir setting for all facilities processed during the run. Does NOT interfere with DB or S3 interactions if otherwise enabled.
initialize
:bool
- If True, sets the global variables for the run and initializes the log file. If False, skips initialization and assumes that the global variables have already been set and the log file has already been initialized. Defaults to True.
Yields
s3b.S3Batch
- A batch object for each facility processed during the run.
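The two max_keys modes described in the KwArgs above can be sketched as a small batching helper. This is an illustration under stated assumptions: `plan_batches` is a hypothetical name, the facility spec is treated as a plain dict, and any max_keys value in the spec is assumed to be a positive integer.

```python
from collections.abc import Iterator
from typing import Any, Optional


def plan_batches(
    all_keys: list[str],
    run_max_keys: int = 1000,
    facility_spec: Optional[dict[str, Any]] = None,
) -> Iterator[list[str]]:
    """Yield the key batches a facility would process (hypothetical helper)."""
    spec = facility_spec or {}
    if run_max_keys == -1:
        # Recursive mode: fixed-size batches in series until every key is covered.
        size = spec.get("max_keys", 1000)
        for start in range(0, len(all_keys), size):
            yield all_keys[start:start + size]
    else:
        # Positive integer mode: one batch; the facility setting overrides the run default.
        size = spec.get("max_keys", run_max_keys)
        yield all_keys[:size]
```

With five keys, a run-level max_keys of 3 produces one batch of three keys, while -1 combined with a facility max_keys of 2 produces batches of two, two, and one.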
def extract_cases(**kwargs)
-
Split a single PDF into cases or process all pages under a single case using DocuVision.
Called for on-demand runs triggered by uploads to the to-extract/ area of S3 (manual UI uploads and scanner service uploads).
KwArgs
aws_s3_bucket
:str
- The source AWS S3 Bucket name.
no_db
:bool
- Somewhat of a misnomer as features have evolved. If true, the extraction is performed in a "read only" mode. SELECTs against DB resources ARE allowed, but writes TO the DB are NOT. S3 keys will be pulled but not updated or created. Defaults to False.
log_console
:bool
- Send all log data to the console stdout and stderr. Default is False.
log_dir
:str
- If log_console is False, sets the folder where log files are saved. Defaults to the current working directory.
client_specs
:ClientSpecs
- Optional. Overrides builtin_client_specs if supplied.
section_specs
:SectionSpecs
- Optional. Overrides builtin_section_specs if supplied.
table_specs
:TableSpecs
- Optional. Overrides builtin_table_specs if supplied.
transform_specs
:TransformSpecs
- Optional. Overrides builtin_transform_specs if supplied.
summary_specs
:SummarySpecs
- Optional. Overrides builtin_summary_specs if supplied.
api_secret_name
:str
- Optional override for all 'api_secret_name' client_spec entries. Used to avoid charges to clients when the container is executing in PROD for non-PROD purposes, e.g. debugging a prior production run or preloading data for a new client.
def extract_cases_failed(job_id: int, keys_versions: list[KeyVersionTuple], batchargs: dict[str, typing.Any])
-
Clean up if a fatal error occurs during an on-demand run.
Args
job_id
:int
- analysis_jobs.id of the placeholder case (when splitting by DocuVision PID) or patient case (when processing a manual upload) that failed.
keys_versions
:list[lu.KeyVersionTuple]
- list of keys and versions that were processing when the error occurred.
batchargs
:dict[str, Any]
- batchargs that were used to instantiate the S3Batch object that was processing the keys when the error occurred.
def facility_data(bucket: str, s3_prefix: str, max_keys: int = 1000) ‑> dict[str, list[DateVersIsEmptyTuple]]
-
Get the list of S3 objects for the target facility from AWS S3.
Permanently deletes all key versions if a "DeleteMarker" is the most recent version. Sets gvars.INCOMPLETE_CLIENTS and gvars.INCOMPLETE_FACILITIES if gvars.PROCESS_ALL_KEYS is True and the number of keys returned for the target facility exceeds max_keys.
Args
bucket
:str
- The AWS S3 bucket name.
s3_prefix
:str
- The prefix for the target facility.
max_keys
:int
- The maximum number of keys to process. Defaults to 1000.
Returns
dict[str, list[lu.DateVersIsEmptyTuple]]
- A dict with keys corresponding to the s3 keys of the objects for the target facility. The values are a list of tuples of form (last modified date, s3 version id). Last modified is the first element to allow for easy sorting and ensures that the most recent version is always the final element in the returned list.
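The shape of the returned dict can be sketched as below. This is a rough illustration, not the module's code: `group_versions` is a hypothetical name, the input mimics the "Versions" and "DeleteMarkers" lists of an S3 object-version listing, and the tuples carry only the two fields described above. Keys whose latest entry is a delete marker are reported rather than deleted.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any


def group_versions(
    response: dict[str, list[dict[str, Any]]],
) -> tuple[dict[str, list[tuple[datetime, str]]], set[str]]:
    """Group object versions by key, oldest first (hypothetical helper)."""
    # Keys whose most recent version is a delete marker are excluded.
    deleted = {
        marker["Key"]
        for marker in response.get("DeleteMarkers", [])
        if marker.get("IsLatest")
    }
    grouped: dict[str, list[tuple[datetime, str]]] = defaultdict(list)
    for version in response.get("Versions", []):
        if version["Key"] in deleted:
            continue
        grouped[version["Key"]].append((version["LastModified"], version["VersionId"]))
    # Date-first tuples sort chronologically, so the newest version ends up last.
    for versions in grouped.values():
        versions.sort()
    return dict(grouped), deleted
```

Because the last modified date is the first tuple element, `versions[-1]` is always the most recent version of a key.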
def facility_pdf_library(bucket: str, prefix: str, specs: FacilitySpec) ‑> tuple[dict[str, PDFLibProto], set[KeyVersionTuple] | list[KeyVersionTuple], set[KeyVersionTuple] | list[KeyVersionTuple]]
-
Calls facility_data() and iterates over the resulting dict to pull the most recent PDF version from S3. (1) The filename of each S3 key is tested to ensure it conforms to the file naming conventions defined in the FacilitySpec of the target facility, and (2) filename metadata is collected and stored in the meta property of the resulting PDFLibProto objects.
Args
bucket
:str
- The S3 bucket name.
prefix
:str
- The prefix for the S3 keys.
specs
:sp.FacilitySpec
- The FacilitySpec for the target facility.
Returns
tuple
- A tuple containing:
- pdf_library (dict[str, lu.PDFLibProto]): A dictionary with keys corresponding to S3 keys and values of form {'body': pdf bytes, 'meta': [dict of filename metadata]}.
- keys_versions (lu.KeysVersions): A list of KeyVersionTuples for all objects having a filename matching a *_test_expr defined from the FacilitySpec.
- bad_keys_versions (lu.KeysVersions): A list of KeyVersionTuples that failed to match any of the FacilitySpec test expressions.
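The filename test and the three-part return value can be sketched as follows. Everything here is assumed for illustration: `build_library` and `fetch_body` are hypothetical names, the test expressions are taken to be regular expressions with named groups, and the input maps each S3 key to the version id of its most recent version.

```python
import re
from typing import Any, Callable


def build_library(
    latest: dict[str, str],
    test_exprs: list[str],
    fetch_body: Callable[[str, str], bytes],
) -> tuple[dict[str, dict[str, Any]], list[tuple[str, str]], list[tuple[str, str]]]:
    """Split keys into matching and non-matching filenames (hypothetical helper)."""
    library: dict[str, dict[str, Any]] = {}
    keys_versions: list[tuple[str, str]] = []
    bad_keys_versions: list[tuple[str, str]] = []
    for key, version_id in latest.items():
        filename = key.rsplit("/", 1)[-1]
        # Use the first test expression that matches the whole filename.
        match = next(
            (m for m in (re.fullmatch(expr, filename) for expr in test_exprs) if m),
            None,
        )
        if match is None:
            bad_keys_versions.append((key, version_id))
            continue
        keys_versions.append((key, version_id))
        # Named groups in the expression become the filename metadata.
        library[key] = {"body": fetch_body(key, version_id), "meta": match.groupdict()}
    return library, keys_versions, bad_keys_versions
```

Only keys that pass a filename test trigger a download, so non-conforming objects are recorded without fetching their bodies.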
def get_facility_batchargs(client: str, facility: str, raw_specs: Union[FacilitySpec, Any]) ‑> aws_s3_batch.S3BatchArgs
-
Create the S3BatchArgs for this facility using the global template as a base and overriding with facility specific settings.
Args
client
:str
- The client currently extracting.
facility
:str
- The facility currently extracting.
raw_specs
:dict[str, Any]
- The FacilitySpec for the currently extracting facility.
Returns
s3b.S3BatchArgs
- The batch arguments for the specified facility.
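The template-plus-override pattern can be sketched with plain dicts. This is only an illustration: `facility_batchargs_sketch` is a hypothetical name, the real S3BatchArgs structure is not shown here, and the sketch assumes that only settings already present in the global template may be overridden.

```python
import copy
from typing import Any


def facility_batchargs_sketch(
    client: str,
    facility: str,
    template: dict[str, Any],
    raw_specs: dict[str, Any],
) -> dict[str, Any]:
    """Overlay facility settings on a copy of the global template (hypothetical)."""
    # Copy first so one facility's overrides never leak into the next.
    args = copy.deepcopy(template)
    for name, value in raw_specs.items():
        if name in template:
            args[name] = value
    args["client"] = client
    args["facility"] = facility
    return args
```

Deep-copying the template matters when it holds mutable values such as lists or nested dicts, since a shallow copy would share them across facilities.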
def send_error_notification(client: str)
-
If error notifications were generated, create a notify.Notifier instance and send an email summary of the errors via AWS SES.
Requires a ClaimMaker DB instance. Not supported for "s3 only" operations, e.g. revenue integrity or licensee containers.
def set_global_batchargs(**kwargs)
-
Create a global S3BatchArgs template to use for all facilities.
KwArgs
See extract_buckets() and extract_cases() KwArgs documentation.