Module integrators.docuvision_task

represents a single request to the docuvision api


class DocuVisionCase (pid: str, parent: DocuVisionTask, page_types: dict[int, str] = <factory>)

basic representation of a single claimmaker case produced from a DocuVisionTask instance

Class variables

var page_types : dict[int, str]
var parent : DocuVisionTask
var pid : str

Instance variables

prop case_doc_id : str

The filename for the object tied to the 'View PDF' button.

prop case_pdfs : dict[str, bytes]

dict of (filename, bytes) for pdfs in input entities

prop case_pdfs_text : dict[str, str]

dict of (filename, text) for pdfs in input entities

prop combined_pdf : bytes

PDF with all pages for case; tied to UI 'View PDF' button

prop is_placeholder : bool

convenience property for separate treatment of the placeholder case in comprehensions, etc.
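
As a rough illustration of the comprehension use mentioned above, the sketch below filters out the placeholder with a stand-in class. `Case` and the pid values are hypothetical and only mimic the `pid` and `is_placeholder` attributes; this is not the real DocuVisionCase.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical stand-in exposing only pid and is_placeholder."""
    pid: str
    is_placeholder: bool = False


cases = [Case("placeholder", True), Case("A1"), Case("B2")]

# Keep only the real (non-placeholder) cases.
real_pids = [c.pid for c in cases if not c.is_placeholder]

# Pick out the placeholder, or None if there is none.
placeholder = next((c for c in cases if c.is_placeholder), None)
```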

prop job_input : dict[str, Any]

A properly formatted input column value for the case.

prop names : list[utils.vStr]

list of all identified PatientNames in this case's response

prop response : list[dict[str, Any]]

list of entries from parent.response.result.RESULT occurring on pages assigned to the pid of this case

prop scheduled_cases : dict[str, list[dict[str, str]]]

list of all identified PatientNames in LAST, FIRST and FIRST LAST format

prop type_pages : dict[str, list[int]]

reversal of page_types, returns list of pg #'s for each doc type
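
A minimal sketch of the reversal described above, assuming page_types maps page number to document type. The helper name `invert_page_types` and the document type strings are illustrative and do not come from the module.

```python
from collections import defaultdict


def invert_page_types(page_types: dict[int, str]) -> dict[str, list[int]]:
    """Group page numbers by document type (hypothetical helper)."""
    type_pages: dict[str, list[int]] = defaultdict(list)
    # Visit pages in ascending order so each page list comes out sorted.
    for page, doc_type in sorted(page_types.items()):
        type_pages[doc_type].append(page)
    return dict(type_pages)


# Example doc type names are made up for illustration.
example = invert_page_types(
    {0: "AnesthesiaRecord", 1: "Demographics", 2: "AnesthesiaRecord"}
)
```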

class DocuVisionTask (pdf_entry: lu.PDFLibProto, api_url: str, api_key: str, split_by_pid: bool = False, service: str = 'docuvision-1', model: str = 'base-medrec-anesthesia', min_confidence: float = 0.1, task_id: int = 0, location: PostTaskLocation = <factory>, document: PostTaskDocument = <factory>, response: du.GetTaskResponse = <factory>, pdf_text_pages: list[str] = <factory>, children: list[DocuVisionCase] = <factory>, fail_on_error: bool = False, mock: bool = False, default_dos: str = '2024-08-27', required_page_types: set[str] | None = None)

Create and process a single docuvision request / response.


pdf_entry : The PDFLibEntry object for the pdf we're processing.
api_url : The URL of the DocuVision API.
api_key : The API key for the DocuVision API.
split_by_pid : If True, split the PDF into individual child cases.
service : The DocuVision service to use.
model : The DocuVision model to use.
min_confidence : The minimum confidence level for the DocuVision response.
task_id : The DocuVision task ID.
location : The upload location for the PDF.
document : The document object for the DocuVision API request.
response : The DocuVision API response.
pdf_text_pages : The text of each page of the original PDF.
children : The child cases.
fail_on_error : If True, raise an error if the DocuVision API returns an error.
mock : If True, use the mock API.
created_db_job : If True, a placeholder job was created in the database.
default_dos : The DOS to use for the placeholder if no DateEncounter labels are returned.
required_page_types : If supplied, cases will be dropped unless they have at least one page whose type appears in this set.
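
The required_page_types rule can be sketched as a simple predicate over a case's page_types. The helper name `keep_case` and the page type strings are hypothetical. The sketch also assumes that only None counts as "not supplied", so an empty set would drop every case.

```python
from typing import Optional


def keep_case(page_types: dict[int, str],
              required_page_types: Optional[set[str]]) -> bool:
    """Return True if the case should be kept (hypothetical helper)."""
    if required_page_types is None:
        # Assumption: no filter was supplied, so nothing is dropped.
        return True
    # Keep the case if at least one page has a required type.
    return any(t in required_page_types for t in page_types.values())


kept = keep_case({0: "Demographics", 1: "AnesthesiaRecord"}, {"AnesthesiaRecord"})
dropped = keep_case({0: "Demographics"}, {"AnesthesiaRecord"})
```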

Class variables

var api_key : str
var api_url : str
var children : list[DocuVisionCase]
var created_db_job : bool
var default_dos : str
var document : PostTaskDocument
var fail_on_error : bool
var location : PostTaskLocation
var min_confidence : float
var mock : bool
var model : str
var pdf_entry : utilities.library_utils.PDFLibProto
var pdf_text_pages : list[str]
var required_page_types : set[str] | None
var response : GetTaskResponse
var service : str
var split_by_pid : bool
var task_id : int

Instance variables

prop child_job_entries : dict[str, dict[str, Any]]

get a list of updated job representations to merge with other extracted data prior to pushing to DB/disk

prop db_job : dict[str, Any]

Find or create a placeholder case for this task in the database.

prop doc_id : str

The document ID of the original source PDF

prop file_id

The filename from the original document ID

prop fname_template

a template string for producing child pdf doc_ids. supply note_type and pid when calling format.
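
To illustrate how such a template might be used, the sketch below builds a format string with note_type and pid placeholders and fills it in. The path layout, the helper name `make_fname_template` and the sample values are assumptions; the source does not state the real format of child doc_ids.

```python
def make_fname_template(prefix: str, file_id: str) -> str:
    """Build a template with {note_type} and {pid} placeholders (hypothetical layout)."""
    stem = file_id[:-4] if file_id.endswith(".pdf") else file_id
    # Doubled braces survive the f-string as literal {note_type} / {pid}.
    return f"{prefix}/{stem}_{{note_type}}_{{pid}}.pdf"


template = make_fname_template("uploads/batch-01", "record.pdf")

# Supply note_type and pid when calling format, as the description says.
child_doc_id = template.format(note_type="Scanned", pid="12345")
```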

prop job_entries

aggregated placeholder and child job entries

prop note_type

The noteType property value assigned to the "Original" input entity and the Complete Record case_pdf of all child cases.

If the original filename followed the UI file naming convention (_.pdf), use the noteType indicated in the filename. Otherwise, assign "Scanned".

prop original_doc_id_query : str

Return a jmespath query for searching an input entities array for references to the s3 key where the source pdf was originally uploaded.

prop page_map : dict[str, dict[str, list[int]]]

The collection of page map entries for all child cases.

Used when splitting by PID to segment the raw PDF and response into individual child cases.

prop pdf : bytes

bytes representing the original source PDF

prop pdf_library : dict[str, lu.PDFLibEntry]

A standard S3Batch pdf_library containing the PDFs of all child cases

prop pdf_reader : pu.PdfReader

A PdfReader for the original source PDF

prop pdfs_out : dict[str, bytes]

A dict of {document_id: bytes} for all child cases. Used to upload PDFs to S3.

prop placeholder : DocuVisionCase

The child to which the "Original" and "Summary" input entities are posted.

prop placeholder_scan_idx : int

next scan index to use when populating noteHeader

prop prefix

The S3 prefix of the original document ID

prop splitting : bool

True if splitting by PID or page map

Methods


def create_children(self, page_map: dict[str, dict[str, list[int]]] | None = None)

Build the page map for document splitting based on the calculated 'pid' in the API response.
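
One way to picture the page map is as a nested grouping of page numbers. The sketch below assumes the outer key is the pid and the inner key is the page type, which the source does not confirm. The input tuples, the helper name `build_page_map` and the sample values are likewise hypothetical.

```python
def build_page_map(
    pages: list[tuple[int, str, str]],
) -> dict[str, dict[str, list[int]]]:
    """Group (page_num, pid, page_type) records into pid -> page_type -> pages."""
    page_map: dict[str, dict[str, list[int]]] = {}
    for page_num, pid, page_type in pages:
        # Create the nested containers on first sight of a pid / page type.
        page_map.setdefault(pid, {}).setdefault(page_type, []).append(page_num)
    return page_map


pages = [
    (0, "A1", "Demographics"),
    (1, "A1", "AnesthesiaRecord"),
    (2, "B2", "AnesthesiaRecord"),
    (3, "A1", "AnesthesiaRecord"),
]
page_map = build_page_map(pages)
```

A map of this shape could also be passed in explicitly through the page_map argument instead of being derived from the response.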

def get(self) ‑> bool

Poll the API for the response.

If this is a mock task, find the original PDF object and download it from S3.

If the response indicates success:

- Collect the text from each page of the raw PDF into self.pdf_text_pages.
- Verify PIDs cascaded to pid_found==False pages via extracted text.
- Convert the response to a GetTaskResponse dataclass for downstream processing.
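
The polling step could look roughly like the loop below. This is a generic sketch, not the module's implementation: `poll_until_done`, its parameters and the "PENDING"/"DONE" status values are all hypothetical, and the actual HTTP call is replaced by an injected callable.

```python
import time
from typing import Any, Callable, Optional


def poll_until_done(
    fetch: Callable[[], dict[str, Any]],
    is_done: Callable[[dict[str, Any]], bool],
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> Optional[dict[str, Any]]:
    """Call fetch() until is_done(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if is_done(result):
            return result
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    return None


# Fake API responses standing in for real polling results.
responses = iter([{"status": "PENDING"}, {"status": "PENDING"}, {"status": "DONE"}])
result = poll_until_done(
    lambda: next(responses),
    lambda r: r["status"] == "DONE",
    interval_s=0.0,
)
```

Returning None on timeout lets the caller decide whether to raise, which fits a flag such as fail_on_error.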

def hdr_for(self, input_id: str, is_placeholder: bool)

Get the noteHeader and noteType input entity properties for the provided input_id.


input_id : The document ID of the input entity.
is_placeholder : If True, this is the placeholder case.

Returns: A dict defining the noteType and noteHeader input entity properties.

def post(self)

Obtain an upload location, upload the PDF, and post the task to the API.

def update_db_job_inputs(self, docvis_summary: list[dict[str, str]])

extend current db_job with entities from this task

class PostTaskDocument (name: str = '', sizeBytes: int = 0, dataKey: str = '', md5Sum: str = '', dataType: str = 'blob', mimeType: str = 'PDF', encodingType: str = 'base64/utf-8', model: str = 'base-medrec-anesthesia', confidenceInterval: float = 0.1, isPerformOCR: bool = True)

dataclass representation of the document portion of a docuvision request

Class variables

var confidenceInterval : float
var dataKey : str
var dataType : str
var encodingType : str
var isPerformOCR : bool
var md5Sum : str
var mimeType : str
var model : str
var name : str
var sizeBytes : int

Methods


def post_task_payload(self, service: str) ‑> dict[str, str | dict[str, str | dict[str, typing.Any]]]

request formatted for submission to docuvision API
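
The name, sizeBytes and md5Sum fields can all be derived from the raw PDF bytes. The sketch below shows one way to do that with the standard library. The helper name `describe_pdf` is hypothetical, and it assumes md5Sum is a hex digest; the API may expect a different encoding.

```python
import hashlib


def describe_pdf(name: str, pdf: bytes) -> dict[str, object]:
    """Compute document metadata from raw bytes (hypothetical helper)."""
    return {
        "name": name,
        "sizeBytes": len(pdf),
        # Assumption: hex-encoded MD5; the real API may want base64 instead.
        "md5Sum": hashlib.md5(pdf).hexdigest(),
    }


meta = describe_pdf("record.pdf", b"hello")
```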

class PostTaskLocation (url: str = '', fields: dict[str, str] = <factory>)

dataclass representation of an 'upload-locations' response

Class variables

var fields : dict[str, str]
var url : str