Module `section_extractor`

Usage

from section_extractor import section_extractor_factory, SectionExtractor

Function

Processes text extracted from a PDF patient record and segments it into logical sections as they appear in the PDF. A TableExtractor is instantiated for each extracted section to segment it into tables and extract contextualized data.

Functions

def section_extractor_factory(files_dict: dict[str, utils.FileContentsEntry], section_spec: sp.SectionSpec, table_spec_getter: Callable[[str], sp.TableSpec], **kwargs) ‑> dict[str, SectExtBase]

Creates a dictionary containing one SectionExtractor instance for each file in the provided files_dict.

Args

files_dict : dict[str, utils.FileContentsEntry]: A dictionary where keys are file identifiers and values are FileContentsEntry objects containing the contents of the files.
section_spec : sp.SectionSpec: The SectionSpec object defining the sections to be extracted.
table_spec_getter : Callable[[str], sp.TableSpec]: A callable that takes a section name and returns the corresponding TableSpec object.
**kwargs: Additional keyword arguments to pass to the SectionExtractor constructor.

Returns

dict[str, SectExtBase]: A dictionary where keys are file identifiers and values are SectionExtractor instances.
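
The factory pattern described above can be sketched as follows. This is a minimal illustration, not the module's implementation: FileContentsEntry, SectionSpec, TableSpec, SketchExtractor and sketch_factory are hypothetical stand-ins defined here only so the example runs on its own, and the real classes in utils and sp will differ.

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-ins: the real FileContentsEntry, SectionSpec and
# TableSpec live in utils and sp and are not shown in this document.
@dataclass
class FileContentsEntry:
    text: str


@dataclass
class SectionSpec:
    headers: list[str]


@dataclass
class TableSpec:
    section: str


@dataclass
class SketchExtractor:
    """Stand-in for SectionExtractor; it only stores its arguments."""
    file_data: FileContentsEntry
    section_spec: SectionSpec
    table_spec_getter: Callable[[str], TableSpec]
    options: dict = field(default_factory=dict)


def sketch_factory(files_dict, section_spec, table_spec_getter, **kwargs):
    # Build one extractor per file, keyed by the same file identifier.
    # Extra keyword arguments are passed through to every extractor.
    return {
        file_id: SketchExtractor(entry, section_spec, table_spec_getter, dict(kwargs))
        for file_id, entry in files_dict.items()
    }


files = {
    "patient_a": FileContentsEntry("..."),
    "patient_b": FileContentsEntry("..."),
}
# TableSpec itself serves as the getter here: it maps a section name to a spec.
extractors = sketch_factory(files, SectionSpec(["Medications"]), TableSpec, verbose=True)
```

The returned dictionary shares its keys with files_dict, so results for a given file can be looked up by the same identifier used on input.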

Classes

class SectionExtractor (file_data: str | utils.FileContentsEntry, section_spec: sp.SectionSpec, table_spec_getter: Callable[[str], sp.TableSpec], **kwargs)

Usage

Intended for internal use by section_extractor_factory.

sect = SectionExtractor(data, 'Epic')

where:

data is either:
1) a directory location containing "OCR.txt" files as created by pdf_extractor.extract_directory(), or
2) a dictionary as output by either pdf_extractor.extract_multi_patient_pdfs() or pdf_extractor.extract_directory().
facility_type : a valid key in section_specs.sect_specs.
out_dir : a valid directory location where JSON files can be saved (optional).

Function

Using the settings defined in sect_specs[facility_type], segment the text extracted from the PDF into logical sections. Once extracted, instantiate a TableExtractor object to further segment each section into logical tables and then field/value data.
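
As a rough illustration of the segmentation step, the sketch below groups lines under the most recent header line. It assumes, purely for the example, that a section begins at a line exactly matching a known header string. The actual matching rules come from the section spec and are not documented here. The function name split_into_sections and the sample lines are made up.

```python
def split_into_sections(lines, headers):
    """Group lines under the most recent header line.

    `headers` is a hypothetical set of exact header strings; the real
    section spec may use different matching rules.
    """
    sections = {}
    unassigned = []  # lines seen before any header
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped in headers:
            current = stripped
            sections.setdefault(current, [])
        elif current is None:
            unassigned.append(line)
        else:
            sections[current].append(line)
    return sections, unassigned


# Made-up sample input, for illustration only.
lines = ["Patient: DOE, JANE", "Medications", "Propofol 200 mg", "Vitals", "HR 72"]
sections, unassigned = split_into_sections(lines, {"Medications", "Vitals"})
```

In this sketch, lines that fall outside every section are returned separately rather than discarded, since create_anesthesia_record is documented as handling lines that remain after section extraction.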

Ancestors

SectExtBase
abc.ABC

Methods

def apply_strippers(self): Apply the DocumentStripper classes assigned in section_specs.py.
def create_anesthesia_record(self): If any lines remain after section extraction, create an 'Anesthesia Record' section and attribute any unassigned source documents as its source.
def create_table_dictionary(self, sep='.', include_residuals=False): Convert raw self.tables into a format that can be ingested by TableTransformer.
def extract_sections(self): Process lines from the PDF into logical sections based on specs from specs/section_specs.py.
def extract_tables(self): Call TableExtractor to identify and extract values from tables for each section.
def line_roll_check(self): Find all lines in a section that start with the specified text. If all of those lines are identical, compare the lines that follow each occurrence. Any following lines that match across all occurrences are appended to the starting line.
def strip_attributions(self): Remove inline attributions from extracted text so they do not appear in table values later.
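
The behaviour described for line_roll_check can be sketched as a standalone function. This is one reading of the description, not the method's actual code: roll_lines is a hypothetical name, the rolled lines are joined with a single space, and the sketch assumes at least two occurrences are needed before anything is rolled (with only one occurrence, every following line would trivially "match").

```python
def roll_lines(lines, prefix):
    """Merge wrapped continuation lines into the line that starts with `prefix`."""
    lines = list(lines)
    starts = [i for i, line in enumerate(lines) if line.startswith(prefix)]
    start_set = set(starts)
    # Assumption: require two or more identical occurrences before rolling.
    if len(starts) < 2 or len({lines[i] for i in starts}) != 1:
        return lines

    # Count how many following lines are identical across every occurrence.
    depth = 0
    while True:
        followers = [i + depth + 1 for i in starts]
        if any(j >= len(lines) or j in start_set for j in followers):
            break
        if len({lines[j] for j in followers}) != 1:
            break
        depth += 1

    # Append the matching followers to each start line and drop them.
    skip = {j for i in starts for j in range(i + 1, i + 1 + depth)}
    rolled = []
    for i, line in enumerate(lines):
        if i in skip:
            continue
        if i in start_set:
            line = " ".join([line] + lines[i + 1:i + 1 + depth])
        rolled.append(line)
    return rolled


# Made-up sample input, for illustration only.
sample = [
    "Comment: dose adjusted per",
    "attending request",
    "HR 72",
    "Comment: dose adjusted per",
    "attending request",
    "HR 75",
]
rolled = roll_lines(sample, "Comment:")
```

Here the second line of each occurrence is the same, so it is folded into the "Comment:" line. The third lines differ ("HR 72" and "HR 75"), so rolling stops there and they are left in place.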

Inherited members

SectExtBase:
- new_from_table_dict