Module `integrators.docuvision_task`

Represents a single request to the DocuVision API.

Classes

class DocuVisionCase (pid: str, parent: DocuVisionTask, page_types: dict[int, str] = <factory>)

Basic representation of a single claimmaker case produced from a DocuVisionTask instance.

Class variables

var page_types : dict[int, str]
var parent : DocuVisionTask
var pid : str

Instance variables

prop case_doc_id : str: The filename for the object tied to the 'View PDF' button.
prop case_pdfs : dict[str, bytes]: dict of (filename, bytes) for pdfs in input entities
prop case_pdfs_text : dict[str, str]: dict of (filename, text) for pdfs in input entities
prop combined_pdf : bytes: PDF with all pages for case; tied to UI 'View PDF' button
prop is_placeholder : bool: convenience flag for treating the placeholder case separately in comprehensions, etc.
prop job_input : dict[str, Any]: A properly formatted input column value for the case.
prop names : list[utils.vStr]: list of all identified PatientNames in this case's response
prop response : list[dict[str, Any]]: list of entries from parent.response.result.RESULT occurring on pages assigned to the pid of this case
prop scheduled_cases : dict[str, list[dict[str, str]]]: list of all identified PatientNames in LAST, FIRST and FIRST LAST format
prop type_pages : dict[str, list[int]]: reversal of page_types, returns list of pg #'s for each doc type
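
The type_pages reversal described above can be illustrated with a minimal sketch. The helper name `invert_page_types` and the page-type labels are hypothetical and not taken from the module; the actual property may order or build its lists differently.

```python
from collections import defaultdict


def invert_page_types(page_types: dict[int, str]) -> dict[str, list[int]]:
    """Group page numbers by document type (hypothetical helper, not the module's code)."""
    pages_by_type: dict[str, list[int]] = defaultdict(list)
    # Walk pages in ascending order so each list of page numbers comes out sorted.
    for page_no in sorted(page_types):
        pages_by_type[page_types[page_no]].append(page_no)
    return dict(pages_by_type)


# The page-type labels here are made up for illustration.
example = invert_page_types({1: "anesthesia_record", 2: "facesheet", 3: "anesthesia_record"})
```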

class DocuVisionTask (pdf_entry: lu.PDFLibProto, api_url: str, api_key: str, split_by_pid: bool = False, service: str = 'docuvision-1', model: str = 'base-medrec-anesthesia', min_confidence: float = 0.1, task_id: int = 0, location: PostTaskLocation = <factory>, document: PostTaskDocument = <factory>, response: du.GetTaskResponse = <factory>, pdf_text_pages: list[str] = <factory>, children: list[DocuVisionCase] = <factory>, fail_on_error: bool = False, mock: bool = False, default_dos: str = '2024-08-27', required_page_types: set[str] | None = None)

Create and process a single docuvision request / response.

Attributes

pdf_entry: The PDFLibEntry object for the pdf we're processing.
api_url: The URL of the DocuVision API.
api_key: The API key for the DocuVision API.
split_by_pid: If True, split the PDF into individual child cases.
service: The DocuVision service to use.
model: The DocuVision model to use.
min_confidence: The minimum confidence level for the DocuVision response.
task_id: The DocuVision task ID.
location: The upload location for the PDF.
document: The document object for the DocuVision API request.
response: The DocuVision API response.
pdf_text_pages: The text of each page of the original PDF.
children: The child cases.
fail_on_error: If True, raise an error if the DocuVision API returns an error.
mock: If True, use the mock API.
created_db_job: If True, a placeholder job was created in the database.
default_dos: The DOS to use for the placeholder if no DateEncounter labels are returned.
required_page_types: If supplied, cases are dropped unless they have at least one page whose type appears in this set.
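
As a rough illustration of the required_page_types rule, the sketch below decides whether a case would be kept. The function name `keep_case` is hypothetical, and treating `None` as "no filtering" is an assumption based on the `set[str] | None` default; the module's own filtering may differ in detail.

```python
from typing import Optional


def keep_case(page_types: dict[int, str], required_page_types: Optional[set[str]]) -> bool:
    """Return True if the case survives the page-type filter (hypothetical helper)."""
    # Assumption: None means the filter was not supplied, so every case is kept.
    if required_page_types is None:
        return True
    # Keep the case if at least one of its pages has a required type.
    return any(page_type in required_page_types for page_type in page_types.values())
```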

Class variables

var api_key : str
var api_url : str
var children : list[DocuVisionCase]
var created_db_job : bool
var default_dos : str
var document : PostTaskDocument
var fail_on_error : bool
var location : PostTaskLocation
var min_confidence : float
var mock : bool
var model : str
var pdf_entry : utilities.library_utils.PDFLibProto
var pdf_text_pages : list[str]
var required_page_types : set[str] | None
var response : GetTaskResponse
var service : str
var split_by_pid : bool
var task_id : int

Instance variables

prop child_job_entries : dict[str, dict[str, Any]]: dict of updated job representations to merge with other extracted data prior to pushing to DB/disk
prop db_job : dict[str, Any]: Find or create a placeholder case for this task in the database.
prop doc_id : str: The document ID of the original source PDF
prop file_id: The filename from the original document ID
prop fname_template: a template string for producing child pdf doc_ids. supply note_type and pid when calling format.
prop job_entries: aggregated placeholder and child job entries
prop note_type: The noteType property value assigned to the "Original" input entity and the Complete Record case_pdf of all child cases.

If the original filename followed the UI file naming convention (_.pdf), use the noteType indicated in the filename. Otherwise, assign "Scanned".
prop original_doc_id_query : str: Return a jmespath query for searching an input entities array for references to the s3 key where the source pdf was originally uploaded.
prop page_map : dict[str, dict[str, list[int]]]: The collection of page map entries for all child cases.

Used when splitting by PID to segment the raw PDF and response into individual child cases.
prop pdf : bytes: bytes representing the original source PDF
prop pdf_library : dict[str, lu.PDFLibEntry]: A standard S3Batch pdf_library containing the PDFs of all child cases
prop pdf_reader : pu.PdfReader: A PdfReader for the original source PDF
prop pdfs_out : dict[str, bytes]: A dict of {document_id: bytes} for all child cases. Used to upload PDFs to S3.
prop placeholder : DocuVisionCase: The child to which the "Original" and "Summary" input entities are posted.
prop placeholder_scan_idx : int: next scan index to use when populating noteHeader
prop prefix: The S3 prefix of the original document ID
prop splitting : bool: True if splitting by PID or page map

Methods

def create_children(self, page_map: dict[str, dict[str, list[int]]] | None = None)

Build the page map for document splitting based on the calculated 'pid' in the API response.

def get(self) ‑> bool

Poll the API for the response.

If this is a mock task, find the original PDF object and download it from S3.

If the response indicates success:
- Collect the text from each page of the raw PDF into self.pdf_text_pages.
- Verify PIDs cascaded to pid_found==False pages via extracted text.
- Convert the response to a GetTaskResponse dataclass for downstream processing.
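
Because get() returns a bool, a caller would presumably invoke it repeatedly until the task completes. The sketch below shows one way to do that. It assumes, without confirmation from the source, that True means the response is ready; `wait_for_task`, `max_attempts` and `delay_s` are hypothetical names, not part of this module.

```python
import time
from typing import Callable


def wait_for_task(get: Callable[[], bool], max_attempts: int = 10, delay_s: float = 1.0) -> bool:
    """Call `get` until it returns True or attempts run out (hypothetical helper)."""
    for attempt in range(max_attempts):
        # Assumption: a True result means the response is available.
        if get():
            return True
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(delay_s)
    return False
```

With a task instance, this would be called as `wait_for_task(task.get)`.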

def hdr_for(self, input_id: str, is_placeholder: bool)

Get the noteHeader and noteType input entity properties for the provided input_id.

Args

input_id: The document ID of the input entity.
is_placeholder: If True, this is the placeholder case.

Returns

A dict defining the noteType and noteHeader input entity properties.

def post(self)

Obtain an upload location, upload the PDF, and post the task to the API.

def update_db_job_inputs(self, docvis_summary: list[dict[str, str]])

Extend the current db_job with entities from this task.

class PostTaskDocument (name: str = '', sizeBytes: int = 0, dataKey: str = '', md5Sum: str = '', dataType: str = 'blob', mimeType: str = 'PDF', encodingType: str = 'base64/utf-8', model: str = 'base-medrec-anesthesia', confidenceInterval: float = 0.1, isPerformOCR: bool = True)

Dataclass representation of the document in a DocuVision task request.

Class variables

var confidenceInterval : float
var dataKey : str
var dataType : str
var encodingType : str
var isPerformOCR : bool
var md5Sum : str
var mimeType : str
var model : str
var name : str
var sizeBytes : int

Methods

def post_task_payload(self, service: str) ‑> dict[str, str | dict[str, str | dict[str, typing.Any]]]: request formatted for submission to docuvision API
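
The sizeBytes and md5Sum fields presumably describe the uploaded PDF. The sketch below derives them from raw bytes. It assumes md5Sum is a hex digest (the source does not say whether hex or base64 is expected), and `describe_pdf` is a hypothetical helper rather than part of this module.

```python
import hashlib


def describe_pdf(name: str, pdf_bytes: bytes) -> dict[str, object]:
    """Build name/size/checksum metadata for a PDF (hypothetical helper)."""
    return {
        "name": name,
        "sizeBytes": len(pdf_bytes),
        # Assumption: the API expects a hex digest; it may expect base64 instead.
        "md5Sum": hashlib.md5(pdf_bytes).hexdigest(),
    }
```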

class PostTaskLocation (url: str = '', fields: dict[str, str] = <factory>)

Dataclass representation of an 'upload-locations' response.

Class variables

var fields : dict[str, str]
var url : str