Module table_extractor

Usage

Primary: Internal to section_extractor.py. Optional: from table_extractor import TableExtractor

Function

Break the PDF "sections" extracted by a SectionExtractor instance into individual tables, and extract contextualized data as a list of ordered dictionaries containing field/value pairs.

Classes

class TableExtractor (lines: str | list[str], section_name: str, source: str, table_spec: TableSpec, **kwargs)

Usage

Intended for internal use by section_extractor.py.

    tab_extractor = TableExtractor(
        lines=[list of lines for a section extracted from a pdf],
        section_name=[section heading as it appeared in the pdf],
        facility_type=[top level key for a table_spec dictionary],
        source=[record identifier]
    )

Function

Subdivide a pdf "section" into multiple individual "tables".

Operations

1) Gather raw text lines that correspond to a table.
2) Pass lines to the "interpreter" func defined in self.spec.
3) Perform post processing on the "raw table" returned by the "interpreter" func. See table_utils.clean_up_raw_table() for details on post processing.

Output attributes:

table_content: dictionary of form { "table 1 name": list of raw text lines, "table 2 name": list of raw text lines, … }
raw: dictionary of form { "table 1 name": raw interpreter output, "table 2 name": raw interpreter output, … }
clean: dictionary of form { "table 1 name": field/value data for table, "table 2 name": field/value data for table, … }
residual: list of lines from the section that were NOT incorporated into any of the tables.
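The three operations and the four output attributes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names split_columns, toy_interpreter, run_pipeline and table_starts are hypothetical, tables are assumed to open with a line equal to the table name and to close at a blank line, and the post-processing step is reduced to whitespace stripping. The real behavior is driven by self.spec.

```python
import re


def split_columns(line: str) -> list[str]:
    # Assumption: columns are separated by two or more whitespace characters.
    return re.split(r"\s{2,}", line.strip())


def toy_interpreter(lines: list[str]) -> list[dict[str, str]]:
    # Hypothetical interpreter: the first line holds the field names,
    # every following line is one row of values.
    if not lines:
        return []
    header = split_columns(lines[0])
    return [dict(zip(header, split_columns(line))) for line in lines[1:]]


def run_pipeline(section_lines: list[str], table_starts: dict[str, str]):
    # table_starts maps a table name to the line that opens it (assumed convention).
    table_content: dict[str, list[str]] = {}
    residual: list[str] = []
    current = None
    for line in section_lines:
        if not line.strip():
            current = None  # a blank line closes the current table
            continue
        name = next(
            (n for n, start in table_starts.items() if line.strip() == start),
            None,
        )
        if name is not None:
            current = name
            table_content[current] = []
        elif current is not None:
            table_content[current].append(line)  # step 1: gather table lines
        else:
            residual.append(line)  # not part of any table

    # Step 2: pass each table's lines to the interpreter.
    raw = {name: toy_interpreter(lines) for name, lines in table_content.items()}
    # Step 3 (stand-in for the real post processing): strip whitespace from values.
    clean = {
        name: [{k: v.strip() for k, v in row.items()} for row in rows]
        for name, rows in raw.items()
    }
    return table_content, raw, clean, residual
```

Calling run_pipeline on a section that contains one table plus a few stray lines returns the table's lines under its name in table_content, the parsed rows in raw and clean, and the stray lines in residual.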

Methods

def clean_raw_tables(self)

Call self.clean_up_raw_table for all tables in self.raw. Store results in self.clean for conversion to table_dictionary.
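A rough sketch of that loop, using a hypothetical stand-in class rather than the real TableExtractor. The CleanSketch name and the placeholder post-processing (whitespace stripping) are assumptions; only the raw and clean attribute names and the two method names come from this page.

```python
class CleanSketch:
    """Hypothetical stand-in used only to illustrate the raw -> clean loop."""

    def __init__(self, raw: dict[str, list[dict[str, str]]]):
        self.raw = raw
        self.clean: dict[str, list[dict[str, str]]] = {}

    def clean_up_raw_table(
        self, table_name: str, rows: list[dict[str, str]]
    ) -> list[dict[str, str]]:
        # Placeholder post-processing: strip whitespace from keys and values.
        return [{k.strip(): v.strip() for k, v in row.items()} for row in rows]

    def clean_raw_tables(self) -> None:
        # Clean every raw table and store the result under the same table name.
        for table_name, rows in self.raw.items():
            self.clean[table_name] = self.clean_up_raw_table(table_name, rows)
```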

def clean_up_raw_table(self, table_name: str, rows: list[dict[str, str]]) ‑> list[dict[str, str]]

Performs table post processing as defined by self.spec.

Includes

1) Removing rows where key=value for every key, to account for header lines repeated after page breaks in the original text.
2) Finding and splitting rows where additional column fields were included as rows during the initial extract.
3) Splitting columns/fields as specified by specs["split_table_columns"].
4) Applying the rollup and cascade operations defined in self.spec["rollup_cascade_reference"].

Args

table_name : str
name of the table currently being processed
rows : list[dict[str, str]]
raw table data extracted by interpreter

Returns

cleaned list of dict of form: [ { "field1": "value1", "field2": "value2", ... }, ... ]
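The first post-processing step listed under Includes (dropping repeated header rows) can be sketched as below. The function name drop_repeated_headers is hypothetical, and an exact string comparison between key and value is assumed; the actual method may normalize text before comparing.

```python
def drop_repeated_headers(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    # A row is treated as a repeated header when every value equals its own key,
    # which happens when a header line reappears after a page break.
    # Empty rows are kept, since they carry no evidence of being a header.
    return [
        row
        for row in rows
        if not (row and all(key == value for key, value in row.items()))
    ]
```

For example, a row such as {"ID": "ID", "Size": "Size"} sitting between two data rows is removed, while a row in which only some values match their keys is kept.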

def extract_raw_tables(self)

Identify lines that correspond to tables within a section and pass extracted lines for each table to the interpreter designated for the corresponding section in table_specs.py. Save output to self.raw and all lines not associated with an extracted table to self.residual.