Module extract_s3

API for AWS and command line execution of aws_extractor.py

Purpose

Reads and processes the contents of an S3 "folder" (i.e. key prefix) to extract data from PDFs, CSVs, FWFs, and more. Extracted data can be inserted into a PostgreSQL database or written as JSON to S3 or to disk. Processed keys are moved to either a "processed" or a "failed" folder, depending on whether the extraction succeeded.

Usage

python extract_s3.py [cli options]

Use --help to display detailed documentation of the cli options.

Functions

def eval_specs(specs_dict: dict)

If a value in the specs pickle starts with 'eval::', perform eval() on the text following '::' and store the result back under the key the value came from.
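
The 'eval::' convention described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the name eval_specs_sketch, the in-place mutation, the return value, and the recursion into nested dictionaries are all assumptions. Because eval() executes arbitrary code, a pattern like this is only reasonable for spec files from a trusted source.

```python
EVAL_PREFIX = "eval::"


def eval_specs_sketch(specs_dict: dict) -> dict:
    """Resolve 'eval::' string values in place (hypothetical helper)."""
    for key, value in specs_dict.items():
        if isinstance(value, dict):
            # Assumption: nested spec dictionaries are resolved recursively.
            eval_specs_sketch(value)
        elif isinstance(value, str) and value.startswith(EVAL_PREFIX):
            # eval() runs arbitrary code: use only on trusted spec files.
            specs_dict[key] = eval(value[len(EVAL_PREFIX):])
    return specs_dict


specs = {
    "name": "plain text is left alone",
    "page_count": "eval::1 + 2",
    "nested": {"width": "eval::len('abcd')"},
}
resolved = eval_specs_sketch(specs)
```

After the call, resolved["page_count"] holds the integer 3 and resolved["nested"]["width"] holds 4, while the plain string is unchanged.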

def main()

Parse cli args and execute the appropriate aws_extractor algorithm.

If the --to-extract-key cli arg is supplied, extract_cases() is called to process only the --to-extract-key. Otherwise, extract_buckets() is called, and all clients and facilities defined in the client specs (either built in or supplied via cli arg) are processed in series.
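
The branching described above might look roughly like the sketch below. The two extract_* functions here are stand-in stubs written for this example (the real ones live in aws_extractor.py and their signatures are not shown in this document), and dispatch is a hypothetical name.

```python
import argparse


def extract_cases(**kwargs):
    # Stub standing in for the real function; returns a marker for illustration.
    return ("cases", kwargs["to_extract_key"])


def extract_buckets(**kwargs):
    # Stub standing in for the real function.
    return ("buckets", None)


def dispatch(args: argparse.Namespace):
    """Choose the extraction routine based on whether a single key was given."""
    kwargs = vars(args)  # namespace attributes as a plain dict
    if kwargs.get("to_extract_key"):
        return extract_cases(**kwargs)
    return extract_buckets(**kwargs)


# The key below is a made-up example value.
single = dispatch(argparse.Namespace(to_extract_key="incoming/report.pdf"))
everything = dispatch(argparse.Namespace(to_extract_key=None))
```

Passing the namespace through vars() keeps the extraction functions independent of argparse, so they can also be called directly with keyword arguments.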

def process_aws_creds(args)

Check the AWS environment variables and add them to the args namespace if present.
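
One way to do this is sketched below. The environment variable names are the standard AWS credential variables, but which variables the module actually reads, and the lower-cased attribute names used on the namespace, are assumptions made for this example. The function name add_aws_creds is hypothetical.

```python
import argparse
import os

# Standard AWS credential variable names; the module may check others.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def add_aws_creds(args: argparse.Namespace, environ=os.environ) -> argparse.Namespace:
    """Copy any AWS credential env vars that are set onto the namespace."""
    for name in AWS_ENV_VARS:
        value = environ.get(name)
        if value:
            # Assumption: attributes use the lower-cased variable name.
            setattr(args, name.lower(), value)
    return args


# A fake environment keeps the example independent of the real shell.
fake_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
creds_args = add_aws_creds(argparse.Namespace(), environ=fake_env)
```

Variables that are unset (AWS_SESSION_TOKEN in this example) simply produce no attribute on the namespace.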

def process_cli_args() -> argparse.Namespace

Create the command line arg parser, then evaluate and validate the args and resolve any conflicts with environment variable settings.

Returns

argparse.Namespace
command line arguments namespace supplied as kwargs to extract_buckets() or extract_cases() via **vars(args).

def process_client_specs(args)

Transmogrify pickle from --client-specs-file into valid client specs.

def process_s3_specs(args)

Transmogrify binary files from s3 into valid specs dictionaries.

Searches for args prefixed with 'aws_s3_key_' and ending with 'specs'. If found, the value is used as an s3 key to download a pickle file containing a specs dictionary. The dictionary is unpickled and evaluated to resolve any 'eval::' values. The resulting dictionary is stored in the args namespace with the 'aws_s3_key_' prefix removed, e.g. 'aws_s3_key_client_specs' becomes 'client_specs'.
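
The renaming step can be illustrated with the sketch below. To keep the example self-contained, the S3 download is replaced by a caller-supplied download callable (a hypothetical parameter, as is the name load_s3_specs), and the 'eval::' resolution step is omitted. Unpickling executes code embedded in the data, so this pattern is only appropriate for trusted buckets.

```python
import argparse
import pickle

PREFIX = "aws_s3_key_"
SUFFIX = "specs"


def load_s3_specs(args: argparse.Namespace, download) -> argparse.Namespace:
    """Replace S3 key args with the unpickled specs they point to.

    download is a callable taking an S3 key and returning the object's bytes.
    """
    # Copy the items first: setattr() below adds keys to the namespace dict.
    for name, s3_key in list(vars(args).items()):
        if name.startswith(PREFIX) and name.endswith(SUFFIX) and s3_key:
            specs = pickle.loads(download(s3_key))  # trusted data only
            setattr(args, name[len(PREFIX):], specs)
    return args


# An in-memory dict stands in for the bucket; key and contents are made up.
fake_bucket = {"specs/client.pkl": pickle.dumps({"acme": {"facilities": ["north"]}})}
s3_args = argparse.Namespace(aws_s3_key_client_specs="specs/client.pkl", verbose=True)
load_s3_specs(s3_args, download=fake_bucket.__getitem__)
```

After the call, s3_args.client_specs holds the unpickled dictionary, and args that do not match the prefix and suffix (verbose here) are untouched.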

Args

args : argparse.Namespace
command line arguments namespace.

def register_fatal_except_hook()

Set custom sys.excepthook while preserving call to original.
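
A common way to chain exception hooks is sketched below. What the module's custom hook actually does before delegating is not documented here, so the logging line, the log parameter, and the name register_hook_sketch are assumptions for illustration.

```python
import sys


def register_hook_sketch(log=print):
    """Install a hook that logs, then delegates to the previous hook."""
    original_hook = sys.excepthook  # capture whatever hook is active now

    def fatal_hook(exc_type, exc_value, exc_tb):
        # Assumption: the custom behavior is a one-line fatal log message.
        log(f"FATAL: {exc_type.__name__}: {exc_value}")
        # Preserve the original behavior (e.g. printing the traceback).
        original_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = fatal_hook
    return original_hook
```

Capturing the original hook in a closure, rather than calling sys.__excepthook__, means any hook installed earlier by another library still runs.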

def usage(parser: argparse.ArgumentParser)

Prints the argparse help and exits the program. Called when arg validation fails.

Args

parser : argparse.ArgumentParser
parser object to print help from.
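
A helper of this kind can be as small as the sketch below. The exit code of 2 (the code argparse itself uses for argument errors), the choice of stderr, and the name usage_sketch are assumptions; the module's actual behavior may differ.

```python
import argparse
import sys


def usage_sketch(parser: argparse.ArgumentParser, exit_code: int = 2):
    """Print the parser's help text and terminate the program."""
    parser.print_help(sys.stderr)  # print_help accepts an optional file object
    sys.exit(exit_code)  # raises SystemExit


# Demonstration: catch SystemExit so the example itself keeps running.
demo_parser = argparse.ArgumentParser(prog="extract_s3.py")
try:
    usage_sketch(demo_parser)
    exit_status = None
except SystemExit as exc:
    exit_status = exc.code
```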