Module integrators.docuvision_task
Represents a single request to the DocuVision API.
Classes
class DocuVisionCase (pid: str, parent: DocuVisionTask, page_types: dict[int, str] = <factory>)-
basic representation of a single claimmaker case produced from a DocuVisionTask instance
Class variables
var page_types : dict[int, str]
var parent : DocuVisionTask
var pid : str
Instance variables
prop case_doc_id : str-
The filename for the object tied to the 'View PDF' button.
prop case_pdfs : dict[str, bytes]-
dict of (filename, bytes) for pdfs in input entities
prop case_pdfs_text : dict[str, str]-
dict of (filename, text) for pdfs in input entities
prop combined_pdf : bytes-
PDF with all pages for case; tied to UI 'View PDF' button
prop is_placeholder : bool-
convenience function for separate treatment of placeholder case in comprehensions, etc.
prop job_input : dict[str, Any]-
A properly formatted input column value for the case.
prop names : list[utils.vStr]-
list of all identified PatientNames in this case's response
prop response : list[dict[str, Any]]-
list of entries from parent.response.result.RESULT occurring on pages assigned to the pid of this case
prop scheduled_cases : dict[str, list[dict[str, str]]]-
list of all identified PatientNames in LAST, FIRST and FIRST LAST format
prop type_pages : dict[str, list[int]]-
reversal of page_types, returns list of pg #'s for each doc type
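The reversal described for type_pages can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper name invert_page_types and the doc type names are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def invert_page_types(page_types: dict[int, str]) -> dict[str, list[int]]:
    """Group page numbers by doc type (hypothetical helper for illustration)."""
    pages_by_type: dict[str, list[int]] = defaultdict(list)
    # Walk the pages in ascending order so each list of page numbers is sorted.
    for page_number in sorted(page_types):
        pages_by_type[page_types[page_number]].append(page_number)
    return dict(pages_by_type)


# Doc type names below are made up for the example.
example = invert_page_types({1: "AnesthesiaRecord", 2: "FaceSheet", 3: "AnesthesiaRecord"})
```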
class DocuVisionTask (pdf_entry: lu.PDFLibProto, api_url: str, api_key: str, split_by_pid: bool = False, service: str = 'docuvision-1', model: str = 'base-medrec-anesthesia', min_confidence: float = 0.1, task_id: int = 0, location: PostTaskLocation = <factory>, document: PostTaskDocument = <factory>, response: du.GetTaskResponse = <factory>, pdf_text_pages: list[str] = <factory>, children: list[DocuVisionCase] = <factory>, fail_on_error: bool = False, mock: bool = False, default_dos: str = '2024-08-27', required_page_types: set[str] | None = None)-
Create and process a single docuvision request / response.
Attributes
pdf_entry- The PDFLibEntry object for the pdf we're processing.
api_url- The URL of the DocuVision API.
api_key- The API key for the DocuVision API.
split_by_pid- If True, split the PDF into individual child cases.
service- The DocuVision service to use.
model- The DocuVision model to use.
min_confidence- The minimum confidence level for the DocuVision response.
task_id- The DocuVision task ID.
location- The upload location for the PDF.
document- The document object for the DocuVision API request.
response- The DocuVision API response.
pdf_text_pages- The text of each page of the original PDF.
children- The child cases.
fail_on_error- If True, raise an error if the DocuVision API returns an error.
mock- If True, use the mock API.
created_db_job- If True, a placeholder job was created in the database.
default_dos- the DOS to use for the placeholder if no DateEncounter labels are returned.
required_page_types- if supplied, cases will be dropped unless they have at least one page whose type appears in this set.
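A minimal sketch of the required_page_types rule, assuming it is applied to each case's type_pages mapping. The function name keep_case is hypothetical, and treating None as "keep every case" is an assumption drawn from the "if supplied" wording.

```python
from __future__ import annotations


def keep_case(type_pages: dict[str, list[int]], required_page_types: set[str] | None) -> bool:
    """Return True if the case should be kept (hypothetical helper)."""
    # Assumption: no filter was supplied, so every case is kept.
    if required_page_types is None:
        return True
    # Keep the case if at least one required type has one or more pages.
    return any(type_pages.get(page_type) for page_type in required_page_types)
```

For example, keep_case({"FaceSheet": [1]}, {"AnesthesiaRecord"}) returns False because no page of the required type is present.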
Class variables
var api_key : str
var api_url : str
var children : list[DocuVisionCase]
var created_db_job : bool
var default_dos : str
var document : PostTaskDocument
var fail_on_error : bool
var location : PostTaskLocation
var min_confidence : float
var mock : bool
var model : str
var pdf_entry : utilities.library_utils.PDFLibProto
var pdf_text_pages : list[str]
var required_page_types : set[str] | None
var response : GetTaskResponse
var service : str
var split_by_pid : bool
var task_id : int
Instance variables
prop child_job_entries : dict[str, dict[str, Any]]-
dict of updated job representations to merge with other extracted data prior to pushing to DB/disk
prop db_job : dict[str, Any]-
Find or create a placeholder case for this task in the database.
prop doc_id : str-
The document ID of the original source PDF
prop file_id-
The filename from the original document ID
prop fname_template-
a template string for producing child pdf doc_ids. supply note_type and pid when calling format.
prop job_entries-
aggregated placeholder and child job entries
prop note_type-
The noteType property value assigned to the "Original" input entity and the Complete Record case_pdf of all child cases.
If the original filename followed the UI file naming convention (_ .pdf), use the noteType indicated in the filename. Otherwise, assign "Scanned".
prop original_doc_id_query : str-
Return a jmespath query for searching an input entities array for references to the s3 key where the source pdf was originally uploaded.
prop page_map : dict[str, dict[str, list[int]]]-
The collection of page map entries for all child cases.
Used when splitting by PID to segment the raw PDF and response into individual child cases.
prop pdf : bytes-
bytes representing the original source PDF
prop pdf_library : dict[str, lu.PDFLibEntry]-
A standard S3Batch pdf_library containing the PDFs of all child cases
prop pdf_reader : pu.PdfReader-
A PdfReader for the original source PDF
prop pdfs_out : dict[str, bytes]-
A dict of {document_id: bytes} for all child cases. Used to upload PDFs to S3.
prop placeholder : DocuVisionCase-
The child to which the "Original" and "Summary" input entities are posted.
prop placeholder_scan_idx : int-
next scan index to use when populating noteHeader
prop prefix-
The S3 prefix of the original document ID
prop splitting : bool-
True if splitting by PID or page map
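The page_map type, dict[str, dict[str, list[int]]], can be illustrated with a small sketch. It assumes the outer key is a pid and the inner key is a page type, which the type signature suggests but the source does not confirm. The helper build_page_map and the sample values are hypothetical.

```python
from __future__ import annotations


def build_page_map(pages: list[tuple[int, str, str]]) -> dict[str, dict[str, list[int]]]:
    """Build pid -> page type -> page numbers from (page_number, pid, page_type) rows.

    Hypothetical helper: the key layout is an assumption, not the module's code.
    """
    page_map: dict[str, dict[str, list[int]]] = {}
    for page_number, pid, page_type in pages:
        # Create the nested containers on first sight of a pid or page type.
        page_map.setdefault(pid, {}).setdefault(page_type, []).append(page_number)
    return page_map


page_map = build_page_map(
    [(1, "A", "FaceSheet"), (2, "A", "AnesthesiaRecord"), (3, "B", "FaceSheet")]
)
```

A mapping of this shape could be passed as the page_map argument of create_children, if the assumed key layout matches.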
Methods
def create_children(self, page_map: dict[str, dict[str, list[int]]] | None = None)-
build page map for document splitting based on calculated 'pid' in API response
def get(self) ‑> bool-
Poll the API for the response.
If this is a mock task, find the original PDF object and download it from S3.
If the response indicates success:
- Collect the text from each page of the raw PDF into self.pdf_text_pages.
- Verify PIDs cascaded to pid_found==False pages via extracted text.
- Convert the response to a GetTaskResponse dataclass for downstream processing.
def hdr_for(self, input_id: str, is_placeholder: bool)-
Get the noteHeader and noteType input entity properties for the provided input_id.
Args
input_id- The document ID of the input entity.
is_placeholder- If True, this is the placeholder case.
Returns
A dict defining the noteType and noteHeader input entity properties.
def post(self)-
Obtain an upload location, upload the PDF, and post the task to the API.
def update_db_job_inputs(self, docvis_summary: list[dict[str, str]])-
extend current db_job with entities from this task
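The post and get methods above suggest a submit-then-poll flow. The sketch below shows one way a caller might drive it, assuming get() returns True once the response is ready (the source only states that it returns bool). FakeTask and run_task are hypothetical stand-ins, not part of the module.

```python
import time


class FakeTask:
    """Stand-in with the same post()/get() shape; not the real DocuVisionTask."""

    def __init__(self, ready_after: int) -> None:
        self.polls = 0
        self.ready_after = ready_after
        self.posted = False

    def post(self) -> None:
        # The real method uploads the PDF and posts the task to the API.
        self.posted = True

    def get(self) -> bool:
        # Assumption: True means the response is available.
        self.polls += 1
        return self.posted and self.polls >= self.ready_after


def run_task(task, interval_s: float = 0.0, max_polls: int = 10) -> bool:
    """Post the task, then poll until it reports ready or the poll budget runs out."""
    task.post()
    for _ in range(max_polls):
        if task.get():
            return True
        time.sleep(interval_s)
    return False


task = FakeTask(ready_after=3)
finished = run_task(task)
```

Capping the number of polls keeps a stalled task from blocking the caller indefinitely; the interval and cap shown here are placeholders.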
class PostTaskDocument (name: str = '', sizeBytes: int = 0, dataKey: str = '', md5Sum: str = '', dataType: str = 'blob', mimeType: str = 'PDF', encodingType: str = 'base64/utf-8', model: str = 'base-medrec-anesthesia', confidenceInterval: float = 0.1, isPerformOCR: bool = True)-
dataclass representation of the document in a docuvision task request
Class variables
var confidenceInterval : floatvar dataKey : strvar dataType : strvar encodingType : strvar isPerformOCR : boolvar md5Sum : strvar mimeType : strvar model : strvar name : strvar sizeBytes : int
Methods
def post_task_payload(self, service: str) ‑> dict[str, str | dict[str, str | dict[str, typing.Any]]]-
request formatted for submission to docuvision API
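The name, sizeBytes and md5Sum fields can be derived from the raw PDF bytes. The sketch below is an illustration only: the helper describe_pdf is hypothetical, and the hex-digest encoding of md5Sum is an assumption, since the source does not say which encoding the API expects.

```python
from __future__ import annotations

import hashlib


def describe_pdf(name: str, pdf_bytes: bytes) -> dict[str, str | int]:
    """Compute size and checksum fields for a PDF (hypothetical helper)."""
    return {
        "name": name,
        "sizeBytes": len(pdf_bytes),
        # Assumption: hex digest; the real API may expect another encoding.
        "md5Sum": hashlib.md5(pdf_bytes).hexdigest(),
    }


fields = describe_pdf("example.pdf", b"%PDF-1.4 example")
```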
class PostTaskLocation (url: str = '', fields: dict[str, str] = <factory>)-
dataclass representation of an 'upload-locations' response
Class variables
var fields : dict[str, str]var url : str