Module pdf_extractor

Usage

from pdf_extractor import extract_multi_patient_pdfs, extract_directory

Function

Ingests PDF files in a directory and outputs a dict that can be ingested by section_extractor_factory and a dict of images, and/or the following files:

[pdf filename].txt: Contains OCR text for image-based pages plus directly extracted text, in proper page order.

[pdf filename]-IXML.json: JSON object containing fields from inline XML, if present.

[pdf filename]-Tables.json: JSON object containing "tables" created using the import_page_sections and import_table_processing modules.
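The output file naming described above can be illustrated with a minimal sketch. The helper name expected_output_paths is hypothetical and not part of pdf_extractor, and the sketch assumes that "[pdf filename]" means the file name without its .pdf extension; the module itself may keep the extension.

```python
from pathlib import Path


def expected_output_paths(pdf_path: str, output_dir: str) -> dict[str, Path]:
    """Build the three per-PDF output paths named in the description.

    Hypothetical helper. Assumption: "[pdf filename]" is the file name
    with the ".pdf" extension removed.
    """
    stem = Path(pdf_path).stem
    out = Path(output_dir)
    return {
        "text": out / f"{stem}.txt",                # OCR + directly extracted text
        "inline_xml": out / f"{stem}-IXML.json",    # inline XML fields, if present
        "tables": out / f"{stem}-Tables.json",      # extracted "tables"
    }


paths = expected_output_paths("scans/case_001.pdf", "out")
for kind, path in paths.items():
    print(kind, path)
```

A helper like this can be handy for checking which of the optional files were actually written after a run, since the description says any of them may be absent.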

Functions

def extract_filename_only(pdf_library: dict[str, PDFLibProto], **kwargs) ‑> dict[str, FileContentsEntry]

Pull data from the filename only and output an 'anesthesia record' containing nothing but a Source File MetaData table.

def extract_multi_patient_pdfs(pdf_library: dict[str, PDFLibProto], split_generator: collections.abc.Callable[[PDFLibProto], collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]] = <function split_by_single_space_header>, junk_line_match_ratio: float = 0.8, output_dir: str | None = None, **kwargs) ‑> dict[str, FileContentsEntry]

Extract a consolidated PDF containing data for multiple patients, segregating it by case and returning a dict consumable by section_extractor.section_extractor_factory.
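The split_generator parameter accepts any callable that takes one PDF object and yields (case identifier, pages) pairs. The sketch below shows what a custom generator of that shape might look like. It is not the module's split_by_single_space_header. The "Case ID:" header format, the use of a plain list of page strings in place of PDFLibProto, and the (document index, page index) layout of DocIndexTuple are all assumptions made for illustration.

```python
import re
from collections.abc import Iterator

# Stand-ins for the module's types; both are assumptions for illustration.
DocIndexTuple = tuple[int, int]  # assumed (document index, page index)
PageTexts = list[str]            # stands in for a PDFLibProto object

CASE_HEADER = re.compile(r"^Case ID:\s*(\S+)")  # hypothetical header format


def split_by_case_header(
    pages: PageTexts,
) -> Iterator[tuple[str, dict[DocIndexTuple, str]]]:
    """Yield (case_id, pages) each time a page opens with a new case header."""
    case_id = None
    doc_idx = -1
    current: dict[DocIndexTuple, str] = {}
    for text in pages:
        first_line = text.splitlines()[0] if text else ""
        match = CASE_HEADER.match(first_line)
        if match and match.group(1) != case_id:
            # A new case starts here: emit the previous one, if any.
            if current:
                yield case_id, current
            case_id, doc_idx, current = match.group(1), doc_idx + 1, {}
        if case_id is not None:
            # Pages before the first recognised header are skipped.
            current[(doc_idx, len(current))] = text
    if current:
        yield case_id, current


pages = ["Case ID: A1\nvitals", "Case ID: A1\nmeds", "Case ID: B2\nvitals"]
cases = dict(split_by_case_header(pages))
print(list(cases))
```

With the real library, a generator like this would presumably be passed as split_generator=... to extract_multi_patient_pdfs, provided it reads pages from the actual PDFLibProto object rather than from a list of strings.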

def extract_pdf_text(doc_idx_pages: dict[DocIndexTuple, str], junk_line_match_ratio, **kwargs) ‑> tuple[tuple[int, ...], tuple[str, ...]]

Process text extracted from a PDF into a sequence of lines. Checks for header and footer lines and removes them from the returned lines after extracting any valuable data.
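The module's actual header and footer detection is not documented here. As a rough sketch of how a threshold such as junk_line_match_ratio could work, the example below treats a page's first or last line as junk when it is at least that similar to the same edge line on another page, using difflib.SequenceMatcher from the standard library. The function name strip_repeated_edges, the (document index, page index) layout of DocIndexTuple, and the reading of the first returned tuple as the page index of each kept line are assumptions. Unlike the real function, this sketch does not extract any data from the lines it drops.

```python
from difflib import SequenceMatcher

DocIndexTuple = tuple[int, int]  # assumed (document index, page index)


def is_similar(a: str, b: str, threshold: float) -> bool:
    """True when the two lines match at or above the given ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def strip_repeated_edges(
    doc_idx_pages: dict[DocIndexTuple, str],
    junk_line_match_ratio: float = 0.8,
) -> tuple[tuple[int, ...], tuple[str, ...]]:
    """Drop first/last lines that closely repeat on another page (hypothetical)."""
    pages = [(key, text.splitlines()) for key, text in sorted(doc_idx_pages.items())]
    page_numbers: list[int] = []
    kept: list[str] = []
    for i, (key, page_lines) in enumerate(pages):
        # Non-empty pages other than the current one, used for comparison.
        others = [lines for j, (_, lines) in enumerate(pages) if j != i and lines]
        for pos, line in enumerate(page_lines):
            is_header = pos == 0 and any(
                is_similar(line, other[0], junk_line_match_ratio) for other in others
            )
            is_footer = pos == len(page_lines) - 1 and any(
                is_similar(line, other[-1], junk_line_match_ratio) for other in others
            )
            if is_header or is_footer:
                continue
            page_numbers.append(key[1])
            kept.append(line)
    return tuple(page_numbers), tuple(kept)


sample = {
    (0, 0): "General Hospital Page 1 of 2\nHR 72\nConfidential",
    (0, 1): "General Hospital Page 2 of 2\nBP 120/80\nConfidential",
}
page_numbers, lines = strip_repeated_edges(sample)
print(page_numbers, lines)
```

A fuzzy ratio rather than exact equality lets headers that differ only in a page counter still be recognised as repeats, which is one plausible reason for exposing the threshold as a parameter.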

def extract_single_patient_pdfs(pdf_library: dict[str, PDFLibProto], file_match: collections.abc.Callable[[dict[str, PDFLibProto]], collections.abc.Iterator[tuple[list[str], list[str]]]] = <function no_match>, junk_line_match_ratio=0.8, output_dir=None, **kwargs) ‑> dict[str, FileContentsEntry]

Extract a precompiled list of PDFs in binary form along with any associated metadata. Used by S3 processes.