Module integrators.docuvision_task
represents a single request to the docuvision api
Classes
class DocuVisionCase (pid: str, parent: DocuVisionTask, page_types: dict[int, str] = <factory>)
-
basic representation of a single claimmaker case produced from a DocuVisionTask instance
Class variables
var page_types : dict[int, str]
var parent : DocuVisionTask
var pid : str
Instance variables
prop case_doc_id : str
-
The filename for the object tied to the 'View PDF' button.
prop case_pdfs : dict[str, bytes]
-
dict of (filename, bytes) for pdfs in input entities
prop case_pdfs_text : dict[str, str]
-
dict of (filename, text) for pdfs in input entities
prop combined_pdf : bytes
-
PDF with all pages for case; tied to UI 'View PDF' button
prop is_placeholder : bool
-
convenience function for separate treatment of placeholder case in comprehensions, etc.
prop job_input : dict[str, Any]
-
A properly formatted input column value for the case.
prop names : list[utils.vStr]
-
list of all identified PatientNames in this case's response
prop response : list[dict[str, Any]]
-
list of entries from parent.response.result.RESULT occurring on pages assigned to the pid of this case
prop scheduled_cases : dict[str, list[dict[str, str]]]
-
list of all identified PatientNames in LAST, FIRST and FIRST LAST format
prop type_pages : dict[str, list[int]]
-
reversal of page_types, returns list of pg #'s for each doc type
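As a minimal sketch of the reversal described above: the helper name invert_page_types and the doc type labels are hypothetical and are not taken from this module. Grouping page numbers by doc type might look like this:

```python
from collections import defaultdict


def invert_page_types(page_types: dict[int, str]) -> dict[str, list[int]]:
    """Group page numbers by doc type (hypothetical helper, not the module's code)."""
    type_pages: dict[str, list[int]] = defaultdict(list)
    # Visit pages in ascending order so each page list comes out sorted.
    for page, doc_type in sorted(page_types.items()):
        type_pages[doc_type].append(page)
    return dict(type_pages)


# Illustrative doc type labels only; real labels come from the API response.
example = invert_page_types({1: "AnesthesiaRecord", 2: "FaceSheet", 3: "AnesthesiaRecord"})
```

Sorting the items first keeps each page list in ascending order regardless of the insertion order of page_types.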
class DocuVisionTask (pdf_entry: lu.PDFLibProto, api_url: str, api_key: str, split_by_pid: bool = False, service: str = 'docuvision-1', model: str = 'base-medrec-anesthesia', min_confidence: float = 0.1, task_id: int = 0, location: PostTaskLocation = <factory>, document: PostTaskDocument = <factory>, response: du.GetTaskResponse = <factory>, pdf_text_pages: list[str] = <factory>, children: list[DocuVisionCase] = <factory>, fail_on_error: bool = False, mock: bool = False, default_dos: str = '2024-08-27', required_page_types: set[str] | None = None)
-
Create and process a single docuvision request / response.
Attributes
pdf_entry
- The PDFLibEntry object for the pdf we're processing.
api_url
- The URL of the DocuVision API.
api_key
- The API key for the DocuVision API.
split_by_pid
- If True, split the PDF into individual child cases.
service
- The DocuVision service to use.
model
- The DocuVision model to use.
min_confidence
- The minimum confidence level for the DocuVision response.
task_id
- The DocuVision task ID.
location
- The upload location for the PDF.
document
- The document object for the DocuVision API request.
response
- The DocuVision API response.
pdf_text_pages
- The text of each page of the original PDF.
children
- The child cases.
fail_on_error
- If True, raise an error if the DocuVision API returns an error.
mock
- If True, use the mock API.
created_db_job
- If True, a placeholder job was created in the database.
default_dos
- the DOS to use for the placeholder if no DateEncounter labels are returned.
required_page_types
- if supplied, cases will be dropped unless they have at least one page whose type appears in this set.
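The required_page_types rule above can be sketched as a simple predicate. This is an illustration under assumptions, not the module's implementation: keep_case is a hypothetical helper, the doc type labels are made up, and treating None as "no filter" is inferred from the "if supplied" wording.

```python
from typing import Optional


def keep_case(page_types: dict[int, str], required_page_types: Optional[set[str]]) -> bool:
    """Return True if a case should be kept (hypothetical helper)."""
    if required_page_types is None:
        # No filter supplied: every case is kept (assumed behavior).
        return True
    # Keep the case only if at least one page has a required type.
    return any(doc_type in required_page_types for doc_type in page_types.values())


pages = {1: "FaceSheet", 2: "AnesthesiaRecord"}
kept = keep_case(pages, {"AnesthesiaRecord"})
dropped = keep_case(pages, {"OperativeNote"})
unfiltered = keep_case(pages, None)
```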
Class variables
var api_key : str
var api_url : str
var children : list[DocuVisionCase]
var created_db_job : bool
var default_dos : str
var document : PostTaskDocument
var fail_on_error : bool
var location : PostTaskLocation
var min_confidence : float
var mock : bool
var model : str
var pdf_entry : utilities.library_utils.PDFLibProto
var pdf_text_pages : list[str]
var required_page_types : set[str] | None
var response : GetTaskResponse
var service : str
var split_by_pid : bool
var task_id : int
Instance variables
prop child_job_entries : dict[str, dict[str, Any]]
-
get a dict of updated job representations to merge with other extracted data prior to pushing to DB/disk
prop db_job : dict[str, Any]
-
Find or create a placeholder case for this task in the database.
prop doc_id : str
-
The document ID of the original source PDF
prop file_id
-
The filename from the original document ID
prop fname_template
-
a template string for producing child pdf doc_ids. supply note_type and pid when calling format.
prop job_entries
-
aggregated placeholder and child job entries
prop note_type
-
The noteType property value assigned to the "Original" input entity and the Complete Record case_pdf of all child cases.
If the original filename followed the UI file naming convention (_ .pdf), use the noteType indicated in the filename. Otherwise, assign "Scanned".
prop original_doc_id_query : str
-
Return a jmespath query for searching an input entities array for references to the s3 key where the source pdf was originally uploaded.
prop page_map : dict[str, dict[str, list[int]]]
-
The collection of page map entries for all child cases.
Used when splitting by PID to segment the raw PDF and response into individual child cases.
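To make the nested shape of page_map concrete, here is a small sketch. It assumes the outer key is a PID and each inner dict maps a doc type to its page numbers; that layout is inferred from the type annotation, not confirmed by the source. The helper page_types_from_entry and all sample values are hypothetical.

```python
def page_types_from_entry(entry: dict[str, list[int]]) -> dict[int, str]:
    """Flatten one page map entry into a page -> doc type mapping (hypothetical helper)."""
    page_types: dict[int, str] = {}
    for doc_type, pages in entry.items():
        for page in pages:
            page_types[page] = doc_type
    # Return pages in ascending order for readability.
    return dict(sorted(page_types.items()))


# Assumed layout: {pid: {doc_type: [page numbers]}}; values are illustrative.
page_map = {
    "pid-001": {"AnesthesiaRecord": [1, 3], "FaceSheet": [2]},
    "pid-002": {"AnesthesiaRecord": [4]},
}
per_case = {pid: page_types_from_entry(entry) for pid, entry in page_map.items()}
```

Under this assumed layout, each flattened entry has the same shape as the page_types argument of DocuVisionCase.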
prop pdf : bytes
-
bytes representing the original source PDF
prop pdf_library : dict[str, lu.PDFLibEntry]
-
A standard S3Batch pdf_library containing the PDFs of all child cases
prop pdf_reader : pu.PdfReader
-
A PdfReader for the original source PDF
prop pdfs_out : dict[str, bytes]
-
A dict of {document_id: bytes} for all child cases. Used to upload PDFs to S3.
prop placeholder : DocuVisionCase
-
The child to which the "Original" and "Summary" input entities are posted.
prop placeholder_scan_idx : int
-
next scan index to use when populating noteHeader
prop prefix
-
The S3 prefix of the original document ID
prop splitting : bool
-
True if splitting by PID or page map
Methods
def create_children(self, page_map: dict[str, dict[str, list[int]]] | None = None)
-
build pagemap for document splitting based on calculated 'pid' in API response
def get(self) ‑> bool
-
Poll the API for the response.
If this is a mock task, find the original PDF object and download it from S3.
If the response indicates success:
- Collect the text from each page of the raw PDF into self.pdf_text_pages.
- Verify PIDs cascaded to pid_found==False pages via extracted text.
- Convert the response to a GetTaskResponse dataclass for downstream processing.
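Because get returns a bool, a caller might wrap it in a retry loop. The sketch below is only an illustration: wait_for is a hypothetical helper, and it assumes that a True return means the response is ready, which the source does not state explicitly.

```python
import time
from typing import Callable


def wait_for(get: Callable[[], bool], attempts: int = 10, delay_s: float = 0.0) -> bool:
    """Call get() until it returns True or the attempts run out (hypothetical helper)."""
    for _ in range(attempts):
        if get():
            return True
        # Pause between polls; a real caller would likely use a non-zero delay.
        time.sleep(delay_s)
    return False


# Stand-in for task.get: not ready twice, then ready.
replies = iter([False, False, True])
ready = wait_for(lambda: next(replies))

# A task that never becomes ready exhausts its attempts.
timed_out = wait_for(lambda: False, attempts=3)
```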
def hdr_for(self, input_id: str, is_placeholder: bool)
-
Get the noteHeader and noteType input entity properties for the provided input_id.
Args
input_id
- The document ID of the input entity.
is_placeholder
- If True, this is the placeholder case.
Returns
A dict defining the noteType and noteHeader input entity properties.
def post(self)
-
Obtain an upload location, upload the PDF, and post the task to the API.
def update_db_job_inputs(self, docvis_summary: list[dict[str, str]])
-
extend current db_job with entities from this task
class PostTaskDocument (name: str = '', sizeBytes: int = 0, dataKey: str = '', md5Sum: str = '', dataType: str = 'blob', mimeType: str = 'PDF', encodingType: str = 'base64/utf-8', model: str = 'base-medrec-anesthesia', confidenceInterval: float = 0.1, isPerformOCR: bool = True)
-
dataclass representation of the document portion of a docuvision task request
Class variables
var confidenceInterval : float
var dataKey : str
var dataType : str
var encodingType : str
var isPerformOCR : bool
var md5Sum : str
var mimeType : str
var model : str
var name : str
var sizeBytes : int
Methods
def post_task_payload(self, service: str) ‑> dict[str, str | dict[str, str | dict[str, typing.Any]]]
-
request formatted for submission to docuvision API
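The sizeBytes and md5Sum fields can be derived from the raw PDF bytes. The following is a sketch under assumptions: describe_pdf is a hypothetical helper, and the hex encoding of the digest is a guess, since the source does not say which encoding the API expects.

```python
import hashlib


def describe_pdf(name: str, data: bytes) -> dict[str, object]:
    """Derive document fields from raw bytes (hypothetical helper)."""
    return {
        "name": name,
        "sizeBytes": len(data),
        # Hex digest is an assumption; the API may expect another encoding.
        "md5Sum": hashlib.md5(data).hexdigest(),
    }


# Placeholder bytes standing in for a real PDF.
fields = describe_pdf("example.pdf", b"%PDF-1.4 example")
```

The resulting dict uses the same key names as the PostTaskDocument fields, so it could be unpacked into the dataclass constructor if the encoding assumption holds.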
class PostTaskLocation (url: str = '', fields: dict[str, str] = <factory>)
-
dataclass representation of an 'upload-locations' response
Class variables
var fields : dict[str, str]
var url : str