Module integrators.docuvision_task

Represents a single request to the DocuVision API.

Classes

class DocuVisionCase (pid: str, parent: DocuVisionTask, page_types: dict[int, str] = <factory>)

basic representation of a single claimmaker case produced from a DocuVisionTask instance

Class variables

var page_types : dict[int, str]
var parent : DocuVisionTask
var pid : str

Instance variables

prop case_doc_id : str

The filename for the object tied to the 'View PDF' button.

prop case_pdfs : dict[str, bytes]

dict of (filename, bytes) for pdfs in input entities

prop case_pdfs_text : dict[str, str]

dict of (filename, text) for pdfs in input entities

prop combined_pdf : bytes

PDF with all pages for case; tied to UI 'View PDF' button

prop is_placeholder : bool

convenience property for treating the placeholder case separately in comprehensions, etc.

prop job_input : dict[str, Any]

A properly formatted input column value for the case.

prop names : list[utils.vStr]

list of all identified PatientNames in this case's response

prop response : list[dict[str, Any]]

list of entries from parent.response.result.RESULT occurring on pages assigned to the pid of this case

prop scheduled_cases : dict[str, list[dict[str, str]]]

list of all identified PatientNames in LAST, FIRST and FIRST LAST format

prop type_pages : dict[str, list[int]]

reversal of page_types, returns list of pg #'s for each doc type
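The reversal described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation; the function name invert_page_types and the page type labels in the example are assumptions made for the sketch.

```python
from collections import defaultdict


def invert_page_types(page_types: dict[int, str]) -> dict[str, list[int]]:
    """Group page numbers by document type, with pages in ascending order."""
    type_pages: dict[str, list[int]] = defaultdict(list)
    for page_number in sorted(page_types):
        type_pages[page_types[page_number]].append(page_number)
    return dict(type_pages)


# Hypothetical page type labels, used only for illustration.
example = invert_page_types(
    {1: "anesthesia_record", 2: "face_sheet", 3: "anesthesia_record"}
)
```

Sorting the page numbers first keeps each list ordered regardless of the insertion order of page_types.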

class DocuVisionTask (pdf_entry: lu.PDFLibProto, api_url: str, api_key: str, split_by_pid: bool = False, service: str = 'docuvision-1', model: str = 'base-medrec-anesthesia', min_confidence: float = 0.1, task_id: int = 0, location: PostTaskLocation = <factory>, document: PostTaskDocument = <factory>, response: du.GetTaskResponse = <factory>, pdf_text_pages: list[str] = <factory>, children: list[DocuVisionCase] = <factory>, fail_on_error: bool = False, mock: bool = False, default_dos: str = '2024-08-27', required_page_types: set[str] | None = None)

Create and process a single docuvision request / response.

Attributes

pdf_entry
The PDFLibEntry object for the pdf we're processing.
api_url
The URL of the DocuVision API.
api_key
The API key for the DocuVision API.
split_by_pid
If True, split the PDF into individual child cases.
service
The DocuVision service to use.
model
The DocuVision model to use.
min_confidence
The minimum confidence level for the DocuVision response.
task_id
The DocuVision task ID.
location
The upload location for the PDF.
document
The document object for the DocuVision API request.
response
The DocuVision API response.
pdf_text_pages
The text of each page of the original PDF.
children
The child cases.
fail_on_error
If True, raise an error if the DocuVision API returns an error.
mock
If True, use the mock API.
created_db_job
If True, a placeholder job was created in the database.
default_dos
The DOS to use for the placeholder if no DateEncounter labels are returned.
required_page_types
If supplied, cases are dropped unless they have at least one page whose type appears in this set.
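The required_page_types rule can be read as a simple predicate over a case's page_types. The sketch below is one possible reading, assuming that None means "no filtering"; the helper name keep_case and the page type labels are hypothetical and do not come from the module.

```python
from __future__ import annotations


def keep_case(page_types: dict[int, str], required_page_types: set[str] | None) -> bool:
    """Return True if the case should be kept under the required_page_types rule."""
    if required_page_types is None:
        # Assumption: no set supplied means every case is kept.
        return True
    # Keep the case if at least one page has a type in the required set.
    return any(page_type in required_page_types for page_type in page_types.values())


# Hypothetical page type labels, used only for illustration.
kept = keep_case({1: "face_sheet", 2: "anesthesia_record"}, {"anesthesia_record"})
dropped = keep_case({1: "face_sheet"}, {"anesthesia_record"})
```

Under this reading an empty set drops every case, because no page type can match it.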

Class variables

var api_key : str
var api_url : str
var children : list[DocuVisionCase]
var created_db_job : bool
var default_dos : str
var document : PostTaskDocument
var fail_on_error : bool
var location : PostTaskLocation
var min_confidence : float
var mock : bool
var model : str
var pdf_entry : utilities.library_utils.PDFLibProto
var pdf_text_pages : list[str]
var required_page_types : set[str] | None
var response : GetTaskResponse
var service : str
var split_by_pid : bool
var task_id : int

Instance variables

prop child_job_entries : dict[str, dict[str, Any]]

get a dict of updated job representations to merge with other extracted data prior to pushing to DB/disk

prop db_job : dict[str, Any]

Find or create a placeholder case for this task in the database.

prop doc_id : str

The document ID of the original source PDF

prop file_id

The filename from the original document ID

prop fname_template

a template string for producing child pdf doc_ids. supply note_type and pid when calling format.
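As an illustration of how such a template might be used, the sketch below fills note_type and pid with str.format. The template value, the helper name child_doc_id, and the example pid are hypothetical; the real value of fname_template is not shown in this reference.

```python
# Hypothetical template; only the note_type and pid placeholders are taken
# from the description above.
fname_template = "incoming/2024/scan-001_{note_type}_{pid}.pdf"


def child_doc_id(template: str, note_type: str, pid: str) -> str:
    """Produce a child PDF doc_id by filling the two named placeholders."""
    return template.format(note_type=note_type, pid=pid)


doc_id = child_doc_id(fname_template, note_type="Scanned", pid="12345")
```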

prop job_entries

aggregated placeholder and child job entries

prop note_type

The noteType property value assigned to the "Original" input entity and the Complete Record case_pdf of all child cases.

If the original filename followed the UI file naming convention (_.pdf), use the noteType indicated in the filename. Otherwise, assign "Scanned".

prop original_doc_id_query : str

Return a jmespath query for searching an input entities array for references to the s3 key where the source pdf was originally uploaded.

prop page_map : dict[str, dict[str, list[int]]]

The collection of page map entries for all child cases.

Used when splitting by PID to segment the raw PDF and response into individual child cases.

prop pdf : bytes

bytes representing the original source PDF

prop pdf_library : dict[str, lu.PDFLibEntry]

A standard S3Batch pdf_library containing the PDFs of all child cases

prop pdf_reader : pu.PdfReader

A PdfReader for the original source PDF

prop pdfs_out : dict[str, bytes]

A dict of {document_id: bytes} for all child cases. Used to upload PDFs to S3.

prop placeholder : DocuVisionCase

The child to which the "Original" and "Summary" input entities are posted.

prop placeholder_scan_idx : int

next scan index to use when populating noteHeader

prop prefix

The S3 prefix of the original document ID

prop splitting : bool

True if splitting by PID or page map

Methods

def create_children(self, page_map: dict[str, dict[str, list[int]]] | None = None)

Build the page map for document splitting based on the calculated 'pid' in the API response.

def get(self) ‑> bool

Poll the API for the response.

If this is a mock task, find the original PDF object and download it from S3.

If the response indicates success:

- Collect the text from each page of the raw PDF into self.pdf_text_pages.
- Verify PIDs cascaded to pid_found==False pages via extracted text.
- Convert the response to a GetTaskResponse dataclass for downstream processing.
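Because get returns a bool, a caller could wrap it in a bounded polling loop. The sketch below assumes that True means the response is ready, which this reference does not state explicitly; wait_until_ready and its parameters are hypothetical names, and a stand-in callable is used in place of a real task.

```python
import time
from typing import Callable


def wait_until_ready(
    poll: Callable[[], bool], max_attempts: int = 10, delay_seconds: float = 0.0
) -> bool:
    """Call poll() until it returns True or the attempt budget is exhausted."""
    for attempt in range(max_attempts):
        if poll():
            return True
        if attempt < max_attempts - 1:
            # Pause between attempts, but not after the last one.
            time.sleep(delay_seconds)
    return False


# Stand-in for task.get: reports "not ready" twice, then "ready".
_answers = iter([False, False, True])
ready = wait_until_ready(lambda: next(_answers), max_attempts=5)
```

With a real task, the call would pass the bound method, for example wait_until_ready(task.get, max_attempts=30, delay_seconds=2.0).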

def hdr_for(self, input_id: str, is_placeholder: bool)

Get the noteHeader and noteType input entity properties for the provided input_id.

Args

input_id
The document ID of the input entity.
is_placeholder
If True, this is the placeholder case.

Returns

A dict defining the noteType and noteHeader input entity properties.

def post(self)

Obtain an upload location, upload the PDF, and post the task to the API.

def update_db_job_inputs(self, docvis_summary: list[dict[str, str]])

extend current db_job with entities from this task

class PostTaskDocument (name: str = '', sizeBytes: int = 0, dataKey: str = '', md5Sum: str = '', dataType: str = 'blob', mimeType: str = 'PDF', encodingType: str = 'base64/utf-8', model: str = 'base-medrec-anesthesia', confidenceInterval: float = 0.1, isPerformOCR: bool = True)

dataclass representation of the document object for a docuvision API request

Class variables

var confidenceInterval : float
var dataKey : str
var dataType : str
var encodingType : str
var isPerformOCR : bool
var md5Sum : str
var mimeType : str
var model : str
var name : str
var sizeBytes : int

Methods

def post_task_payload(self, service: str) ‑> dict[str, str | dict[str, str | dict[str, typing.Any]]]

request formatted for submission to docuvision API

class PostTaskLocation (url: str = '', fields: dict[str, str] = <factory>)

dataclass representation of an 'upload-locations' response

Class variables

var fields : dict[str, str]
var url : str