Module integrators.docuvision_integrator

Handle API calls to DocuVision.

Classes

class DocuVisionIntegrator (documents: dict[str, utilities.library_utils.PDFLibProto] = <factory>, page_map: dict[str, dict[str, list[int]]] = <factory>, split_by_pid: bool = False, min_confidence: float = 0.1, api_timeout: int = 5400, out_dir: str | None = None, aws_region: str = 'us-east-1', api_secret_name: str = '', base_url: str = '', base_path: str = '', api_key: str = '', service: str = 'docuvision-1', model: str = 'base-medrec-anesthesia', jp_query: str = 'sort_by(\n [].{label:dotmap,value:text_ocr,confidence:confidence},\n &confidence\n )', mock_ids: list[int] = <factory>, fail_on_error: bool = False, default_dos: str = '2024-08-27', table_converters: dict[str, collections.abc.Callable[[list[str]], list[dict[str, str]]]] = <factory>, dv_preferred_networks: list[str] | None = None, dv_required_page_types: set[str] | None = None, *, result_formatter: collections.abc.Callable[..., tuple[str, dict[str, typing.Any]]] | None = None, response_reducer: collections.abc.Callable[..., list[dict[str, typing.Any]]] | None = None)

Post a batch of PDFs to the DocuVision API, poll for results, and preprocess the raw output, cleaning it up and reformatting it into the form expected by downstream processes.

If values are supplied for base_url, base_path, and api_key, the supplied values will override the corresponding values in the response from the boto3 secretsmanager client when api_secret_name is requested. If all 3 are supplied, no call to boto3 will occur at all, allowing for "sans-AWS" use cases.
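
The override behaviour described above can be illustrated with a minimal sketch. The resolve_api_info function and the fetch_secret callable are hypothetical names introduced for illustration only; the real class reads its secret through boto3, which is replaced here by a plain callable so the example stays self-contained.

```python
from collections.abc import Callable


def resolve_api_info(
    base_url: str,
    base_path: str,
    api_key: str,
    fetch_secret: Callable[[], dict[str, str]],
) -> dict[str, str]:
    """Merge explicitly supplied values over values from a secret store.

    Hypothetical helper: not part of the documented class.
    """
    supplied = {"base_url": base_url, "base_path": base_path, "api_key": api_key}
    if all(supplied.values()):
        # All three values were supplied, so the secret store is never contacted.
        return supplied
    secret = fetch_secret()
    # Non-empty supplied values take precedence over the secret's values.
    return {key: value or secret.get(key, "") for key, value in supplied.items()}


calls: list[int] = []


def fake_secret() -> dict[str, str]:
    # Stand-in for a secrets lookup; records each invocation.
    calls.append(1)
    return {"base_url": "https://secret.example", "base_path": "/v1", "api_key": "from-secret"}


full = resolve_api_info("https://dv.example", "/api", "abc123", fake_secret)
partial = resolve_api_info("", "", "abc123", fake_secret)
```

In this sketch the first call never touches fake_secret, while the second fetches the secret once and keeps only the supplied api_key.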

Basic field validation occurs on the first call to post_tasks (or on a manual call to check_api_info, which tests with _post_location()). The instantiating process can either pass a "documents" dict during init or populate the "documents" attribute manually with a filename:bytes entry for each PDF to be processed. Once all documents are loaded, the instantiating process should call the .post_tasks() method.

All documents will be posted and tasks submitted upon first call of the post_tasks() function.

All tasks will be polled and responses collected upon first access of the results property. Subsequent calls to results will return previously stored results.

Args

documents : dict[str, lu.PDFLibProto]
dict of {filename: PDFLibProto} where PDFLibProto is a namedtuple of (body, meta) where body is a bytes object and meta is a dict of metadata for the pdf.
page_map : dict[str, dict[str, list[int]]]
dict of {new_doc_id: {old_doc_id: [page_nums]}} where new_doc_id is the doc_id for the combined pdf created by docuvision and old_doc_id is the doc_id for the original pdf. page_nums is a list of page numbers from the original pdf that were included in the combined pdf.
split_by_pid : bool
if True, docuvision will split each pdf into separate documents based on patient id. If False, docuvision will combine all pdfs into a single document.
min_confidence : float
minimum confidence required for a result to be included in the final output. All results with confidence below this value will be dropped.
api_timeout : int
number of seconds to wait for a response from the docuvision API before raising an error.
out_dir : str | None
path to directory where output json files will be written. If None, no files will be written.
aws_region : str
AWS region to use for boto3 calls.
api_secret_name : str
name of secret in AWS secretsmanager containing the base_url, base_path, and api_key values for the docuvision API.
base_url : str
base url for docuvision API. If empty (the default), will be pulled from AWS secretsmanager.
base_path : str
base path for docuvision API. If empty (the default), will be pulled from AWS secretsmanager.
api_key : str
api key for docuvision API. If empty (the default), will be pulled from AWS secretsmanager.
service : str
docuvision service to use. Defaults to "docuvision-1".
model : str
docuvision model to use. Defaults to "base-medrec-anesthesia".
jp_query : str
jmespath query to use for result formatting. Defaults to: '''sort_by( [].{label:dotmap,value:text_ocr,confidence:confidence}, &confidence )'''
mock_ids : list[int]
list of task_ids to use for mock responses. Defaults to [].
fail_on_error : bool
if True, raise an error if any task fails to post or any response fails to be collected. Defaults to False.
default_dos : str
default date of service to use if no date of service is extracted from the facesheet. Defaults to gvars.DEFAULT_DOS.
result_formatter : Callable[..., tuple[str, dict[str, Any]]] | None
function to use for reformatting raw results. Keyword-only. Defaults to self._format_result.
response_reducer : Callable[..., list[dict[str, Any]]] | None
function to use for reducing raw results to final output. Keyword-only. Defaults to self._condense_response.
dv_preferred_networks : list[str] | None
list of preferred DocuVision neural networks. Created for facilities where people manually upload 1-page PDFs.
table_converters : dict[str, Callable[[list[str]], list[dict[str, str]]]]
dict of function references for processing '*Table' labels returned by DV-1. Each function takes a list[str] and returns a list[dict[str, str]].
dv_required_page_types : set[str] | None
if supplied, a case will only be created for a pid if at least one of the pages assigned to that pid has a type in this set.
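
To make the interaction of jp_query and min_confidence concrete, here is a pure-Python sketch of what the default query expresses: each raw item is projected to label, value, and confidence, and the rows are sorted by ascending confidence. The project_and_filter function, the sample data, and the order in which filtering and sorting are applied are all assumptions for illustration; the class itself evaluates the query with jmespath.

```python
from typing import Any


def project_and_filter(
    raw: list[dict[str, Any]], min_confidence: float = 0.1
) -> list[dict[str, Any]]:
    """Hypothetical equivalent of the default jp_query plus the confidence cut-off."""
    # Projection: dotmap -> label, text_ocr -> value, confidence -> confidence.
    projected = [
        {
            "label": item.get("dotmap"),
            "value": item.get("text_ocr"),
            "confidence": item.get("confidence"),
        }
        for item in raw
    ]
    # Drop rows whose confidence is missing or below the threshold.
    kept = [
        row
        for row in projected
        if row["confidence"] is not None and row["confidence"] >= min_confidence
    ]
    # Ascending sort by confidence, as sort_by(..., &confidence) does.
    return sorted(kept, key=lambda row: row["confidence"])


# Made-up sample records, not real API output.
raw_items = [
    {"dotmap": "Patient.Name", "text_ocr": "DOE, JANE", "confidence": 0.92, "page": 1},
    {"dotmap": "Patient.DOB", "text_ocr": "unreadable", "confidence": 0.04, "page": 1},
    {"dotmap": "Case.DOS", "text_ocr": "2024-08-27", "confidence": 0.55, "page": 2},
]
rows = project_and_filter(raw_items, min_confidence=0.1)
```

With the sample above, the 0.04 row is dropped and the remaining two rows come back with the lower-confidence one first.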

Ancestors

  • utilities.protocols.DocuVisionIntegratorProtocol
  • typing.Protocol
  • typing.Generic

Class variables

var api_key : str
var api_secret_name : str
var api_timeout : int
var aws_region : str
var base_path : str
var base_url : str
var default_dos : str
var documents : dict[str, utilities.library_utils.PDFLibProto]
var dv_preferred_networks : list[str] | None
var dv_required_page_types : set[str] | None
var fail_on_error : bool
var jp_query : str
var min_confidence : float
var mock_ids : list[int]
var model : str
var out_dir : str | None
var page_map : dict[str, dict[str, list[int]]]
var response_reducer : collections.abc.Callable[..., list[dict[str, typing.Any]]] | None
var result_formatter : collections.abc.Callable[..., tuple[str, dict[str, typing.Any]]] | None
var service : str
var split_by_pid : bool
var table_converters : dict[str, collections.abc.Callable[[list[str]], list[dict[str, str]]]]

Instance variables

prop doc_ex

Exception to utilize if an entire document fails to extract.

Controlled by self.fail_on_error. If True, raise an exception that will halt the entire process. If False, raise a ValueError that will be caught and logged by the LogExHandler decorator.

Returns

LogExOverrideError | ValueError
exception to raise
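
A minimal sketch of how a property like doc_ex might select an exception based on fail_on_error. ErrorPolicy and the local LogExOverrideError class are hypothetical stand-ins, and returning an exception class (rather than an instance) is an assumption made here.

```python
class LogExOverrideError(Exception):
    """Hypothetical stand-in for the project's halting exception type."""


class ErrorPolicy:
    """Hypothetical holder for the fail_on_error flag."""

    def __init__(self, fail_on_error: bool = False) -> None:
        self.fail_on_error = fail_on_error

    @property
    def doc_ex(self) -> type[Exception]:
        # Strict mode selects the halting exception; otherwise a ValueError.
        return LogExOverrideError if self.fail_on_error else ValueError


def extract(policy: ErrorPolicy, doc_id: str) -> None:
    # Simulate a document that fails to extract entirely.
    raise policy.doc_ex(f"extraction failed for {doc_id}")


try:
    extract(ErrorPolicy(fail_on_error=False), "doc-1")
except ValueError as exc:
    lenient_message = str(exc)
```

The caller raises whatever doc_ex returns, so the strictness decision lives in one place.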
prop pdf_library : dict[str, utilities.library_utils.PDFLibEntry]

Dict of {case_doc_id: PDFLibEntry} where each entry represents the combined_pdf of a DocuVisionCase created by a child DocuVisionTask instance.

Merged into S3Batch.pdf_library in aws_s3_batch.py for file matching and other downstream processes.

prop pdfs_by_doc_id : dict[str, bytes]

Returns a dict of pdf bytes split and/or concatenated by task pids.

prop results : dict[str, list[dict[str, str]]]

Process responses from all child tasks to produce results according to the current instance settings.

Methods

def check_api_info(self)

Test to see if api url, path, and key values have been provided and attempt to pull from AWS Secrets Manager via self.api_secret_name if not.

Raises

ValueError
if api_key, base_url, or base_path cannot be determined.
def create_tasks(self, documents: dict[str, utilities.library_utils.PDFLibProto] | None = None)

Obtain upload location and post PDFs. If mock_ids were defined, create mock tasks for each id and collect the existing responses.

Args

documents : dict[str, lu.PDFLibProto] | None
dict of {filename: PDFLibProto} where PDFLibProto is a namedtuple of (body, meta) where body is a bytes object and meta is a dict of metadata for the pdf. Optional. Extends self.documents if supplied.
def job_dict_entries(self, extracted_data: dict[str, dict[str, typing.Any]]) ‑> dict[str, dict[str, typing.Any]]

Dict of {job_id: job_dict} for all tasks in self._tasks where each job_dict contains values for db columns 'input', 'comments', and 'note'.

Called by aws_s3_batch.py to recombine the values for the columns noted above with their corresponding output from table_transformer.py.

Args

extracted_data : dict[str, dict[str, Any]]
TableTransformer output data supplied from aws_s3_batch.S3Batch.transformed.

Returns

dict[str, dict[str, Any]]
dict of {job_id: job_dict} for all tasks in self._tasks.
def reset(self, **kwargs)

Reset the dataclass to prepare for a new facility by clearing all documents, tasks, results, and internal variables.

KwArgs

documents : dict[str, lu.PDFLibProto]
dict of {filename: PDFLibProto} where PDFLibProto is a namedtuple of (body, meta) where body is a bytes object and meta is a dict of metadata for the pdf.
page_map : dict[str, dict[str, list[int]]]
dict of {new_doc_id: {old_doc_id: [page_nums]}} where new_doc_id is the doc_id for the combined pdf created by docuvision and old_doc_id is the doc_id for the original pdf. page_nums is a list of page numbers from the original pdf that were included in the combined pdf.
out_dir : str | None
path to directory where output json files will be written. If None, no files will be written.
split_by_pid : bool
if True, docuvision will split each pdf into separate documents based on patient id. If False, docuvision will combine all pdfs into a single document.
fail_on_error : bool
if True, raise an error if any task fails to post or any response fails to be collected. Defaults to False.
mock_ids : list[int]
list of task_ids to use for mock responses. Defaults to [].
default_dos : str
default date of service to use if no date of service is extracted from the facesheet. Defaults to gvars.DEFAULT_DOS.
api_secret_name : str
name of secret in AWS secretsmanager containing the base_url, base_path, and api_key values for the docuvision API.
dv_preferred_networks : list[str] | None
list of preferred DocuVision neural networks. Created for facilities where people manually upload 1-page PDFs.
table_converters : dict[str, Callable[[list[str]], list[dict[str, str]]]]
dict of function references for processing '*Table' labels returned by DV-1. Each function takes a list[str] and returns a list[dict[str, str]].
dv_required_page_types : set[str] | None
if supplied, a case will only be created for a pid if at least one of the pages assigned to that pid has a type in this set.
def tables_for(self, doc_id: str, section: str = 'DocuVision', sep: str = '.') ‑> dict[str, list[dict[str, str]]]

Get results for the supplied doc_id in a tabular format suitable for downstream processing in table_transformer.py.

Args

doc_id : str
doc_id for the document to retrieve results for.
section : str
section name to use for the table. Defaults to "DocuVision".
sep : str
separator to use for table keys. Defaults to ".".

Returns

dict[str, list[dict[str, str]]]
dict of {table_name: [table_rows]} where each table_row is a dict of {label: value}.
def task_attr_list(self, attr: str) ‑> list

List the specified attribute for all tasks in self._tasks.
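
As a rough illustration, collecting one attribute across all tasks can be sketched with getattr. FakeTask and this standalone task_attr_list function are hypothetical; the real method operates on self._tasks, and how it handles a missing attribute is not documented (this sketch would raise AttributeError).

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeTask:
    """Hypothetical stand-in for a child task object."""

    task_id: int
    status: str


def task_attr_list(tasks: list[FakeTask], attr: str) -> list[Any]:
    # Read the named attribute from every task, preserving task order.
    return [getattr(task, attr) for task in tasks]


tasks = [FakeTask(101, "done"), FakeTask(102, "pending")]
ids = task_attr_list(tasks, "task_id")
statuses = task_attr_list(tasks, "status")
```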