Module utilities.pdf_utils

Utility functions for raw PDF, CSV, and other delimited file processing.

Functions

def as_pdf_reader(pdf: PDFType) ‑> pypdf._reader.PdfReader

Returns a PdfReader if supplied bytes, io.BytesIO, or a PdfReader.

def concat_bytes(bodies: Sequence[bytes], strip_headers: bool = True) ‑> bytes

Concatenate a series of bytes objects, typically from the body property of a csv or text file PDFLibProto object.

Args

bodies : Sequence[bytes]
list of bytes objects collected from a sequence of lu.PDFLibProto objects from a csv or other text document
strip_headers : bool
Optional. Strip headers from all but the first lu.PDFLibEntry object. Defaults to True.

Returns

bytes
the concatenated bytes
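For illustration, the header-stripping behavior can be approximated with a self-contained sketch; `concat_csv_bodies` is a hypothetical stand-in that operates on plain bytes rather than on PDFLibProto bodies:

```python
def concat_csv_bodies(bodies, strip_headers=True):
    """Concatenate CSV bodies, dropping the first line (the header row)
    from every body after the first when strip_headers is True."""
    parts = []
    for i, body in enumerate(bodies):
        if strip_headers and i > 0:
            # Drop everything up to and including the first newline.
            _, _, body = body.partition(b"\n")
        parts.append(body)
    return b"".join(parts)
```

With two CSV bodies sharing the header `a,b`, the second header row is removed so the result parses as a single table.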
def csv_column_values(csv_bytes: bytes, column: int | str, val_sep_override: str | None = None) ‑> list[str]

Extract a list of values from a column in a delimited file.

This function processes the bytes of a CSV or other delimited file and extracts the values from the specified column. The column can be specified either by its index or by its name. An optional value separator can be provided to override the default separator ",".

Args

csv_bytes : bytes
Bytes from a CSV or other delimited file.
column : int | str
The column to extract values from. Can be specified as an index or a string representing the column name.
val_sep_override : str | None, optional
Override the default value separator for CSV files. Defaults to None.

Returns

list[str]
The list of values for the specified column.

Example

>>> csv_bytes = b"Name,Age,Location\nJohn,30,USA\nJane,25,UK"
>>> csv_column_values(csv_bytes, 1)
['30', '25']
>>> csv_column_values(csv_bytes, "Location")
['USA', 'UK']
def csv_rows(csv_bytes: bytes, val_sep_override: str | None = None, has_header_row: bool = True) ‑> list[dict[str, str]]

Split bytes from CSVs and other delimited files into rows.

This function processes the bytes from a CSV or other delimited file and splits them into rows. Each row is represented as a dictionary with column headers as keys and corresponding cell values as values. It handles various delimiters and can optionally override the default value separator ",".

Args

csv_bytes : bytes
Bytes from a CSV file.
val_sep_override : str | None, optional
Override the default value separator for CSV files. Defaults to None.
has_header_row : bool
Optional. Indicates if the source document includes a header row. Defaults to True.

Returns

list[dict[str, str]]
A list of dictionaries representing the rows of the CSV file.

Example

>>> csv_bytes = b"Name,Age,Location\nJohn,30,USA\nJane,25,UK"
>>> csv_rows(csv_bytes)
[{'name': 'John', 'age': '30', 'location': 'USA'}, {'name': 'Jane', 'age': '25', 'location': 'UK'}]
>>> csv_bytes = b"Name|Age|Location\nJohn|30|USA\nJane|25|UK"
>>> csv_rows(csv_bytes, val_sep_override="|")
[{'name': 'John', 'age': '30', 'location': 'USA'}, {'name': 'Jane', 'age': '25', 'location': 'UK'}]
>>> csv_bytes = b"John,30,USA\nJane,25,UK"
>>> csv_rows(csv_bytes, has_header_row=False)
[{'Column1': 'John', 'Column2': '30', 'Column3': 'USA'}, {'Column1': 'Jane', 'Column2': '25', 'Column3': 'UK'}]
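The lowercased-header and generated-column behavior shown above can be approximated with the standard library csv module; this is a sketch of the idea, not the module's actual implementation:

```python
import csv

def rows_from_bytes(csv_bytes, sep=",", has_header_row=True):
    """Split delimited bytes into dicts keyed by lowercased header names,
    or by generated Column1..ColumnN names when there is no header row."""
    lines = csv_bytes.decode("utf-8").splitlines()
    rows = list(csv.reader(lines, delimiter=sep))
    if not rows:
        return []
    if has_header_row:
        headers, data = [h.lower() for h in rows[0]], rows[1:]
    else:
        headers, data = [f"Column{i + 1}" for i in range(len(rows[0]))], rows
    return [dict(zip(headers, row)) for row in data]
```

Using csv.reader rather than a plain `split(sep)` keeps quoted fields containing the delimiter intact.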
def csv_value_to_dos(date_source: lu.PDFLibProto, **kwargs) ‑> str

Convert data from file bytes into a properly formatted date of service.

Args

date_source : lu.PDFLibProto
lu.PDFLibProto object with a csv body

KwArgs

column : int | str
The column to extract values from. Can be specified as an index or a string representing the column name. Defaults to "Date".
day_offset : int
number of days to add to the discovered date. Defaults to -1.
val_sep_override : str | None, optional
Override the default value separator for csv files. Defaults to None.

Returns

str
date of service
def designator_from_anesthesia_summary(lines: list[str]) ‑> str

Find the anesthesia summary table and create a designator from the patient name, MRN, and, if present, the date of service.

Args

lines : list[str]
lines from the first page of a case

Raises

ValueError
if the anesthesia summary table is not found

Returns

str
designator for the case

Example

>>> lines = [
...     "Some other line",
...     "Anesthesia Summary - John N Doe [123456]",
...     " Current as of 2020-12-31 23:59"
... ]
>>> designator_from_anesthesia_summary(lines)
'DOE, JOHN N_123456_2020-12-31'
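A minimal sketch of this parsing, assuming only the line formats shown in the example; the patterns and helper below are illustrative, not the module's implementation:

```python
import re

# Hypothetical patterns assuming exactly the line formats in the example
# above; the module's actual patterns may differ.
SUMMARY_RE = re.compile(r"Anesthesia Summary - (?P<name>.+?) \[(?P<mrn>\d+)\]")
DATE_RE = re.compile(r"Current as of (?P<date>\d{4}-\d{2}-\d{2})")

def summary_designator(lines):
    """Build 'LAST, FIRST MIDDLE_MRN[_DATE]' from an anesthesia summary line."""
    for i, line in enumerate(lines):
        m = SUMMARY_RE.search(line)
        if m is None:
            continue
        parts = m["name"].split()
        fields = [f"{parts[-1]}, {' '.join(parts[:-1])}".upper(), m["mrn"]]
        # The date of service, when present, appears on the following line.
        if i + 1 < len(lines):
            d = DATE_RE.search(lines[i + 1])
            if d:
                fields.append(d["date"])
        return "_".join(fields)
    raise ValueError("anesthesia summary table not found")
```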
def designator_from_demographics(lines: list[str]) ‑> str

Find the demographics table and create a designator from the patient name, MRN, phone number, and DOB.

Args

lines : list[str]
Lines from the first page of a case.

Raises

ValueError
If the demographics table is not found.

Returns

str
Designator for the case.

Example

>>> lines = [
...     "Some other line",
...     "Patient Demographics",
...     "  Name           MRN       Legal Sex    DOB         SSN            Phone",
...     "  Doe, John      123456    Male         1/1/1980    XXX-XX-1234    1(800)213-2131",
...     "   Michael",
... ]
>>> designator_from_demographics(lines)
'DOE, JOHN MICHAEL_1980-01-01_123456_18002132131'
def dos_from_pdf_page(date_source: lu.PDFLibProto, page_indices: Sequence[int] = (0,), date_pattern: re.Pattern = re.compile('Summary\\n\\s+Date: (?P<dos>[\\d/\\-]+)')) ‑> str

Extracts the text from one or more PDF pages and searches for the date of service via regular expression.

Args

date_source
a pdf library entry with a PDF bytes body
page_indices
the indices of the pages to search
date_pattern
Precompiled regex used to search for the date of service. The expression MUST have a named group labeled "dos".

Returns

str
The extracted DOS value, or the last-modified date of the source object minus 1 day if no value is found.
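The named-group requirement can be demonstrated in isolation with the default pattern; the sample page text below is invented:

```python
import re

# Default pattern from the signature above; any override passed as
# date_pattern must likewise define a named group called "dos".
date_pattern = re.compile(r"Summary\n\s+Date: (?P<dos>[\d/\-]+)")

# Invented sample of extracted page text.
page_text = "Anesthesia Summary\n    Date: 01/02/2023\nOther content"
match = date_pattern.search(page_text)
dos = match["dos"] if match else None
```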
def find_junk_lines(page_candidates: dict[DocIndexTuple, str], **kwargs)

Identifies lines that repeat at the top and bottom of all candidate pages.

NOTE: This does NOT handle use cases where 2 or more pages have header/footer set "A" and 2 or more other pages have header/footer set "B". In such instances, page sets "A" and "B" must be assigned to different doc_idxs. This occurs implicitly for sites with multiple PDFs per patient, but it's up to the caller when extracting multi-patient PDFs (see increment_document_idx).

Args

page_candidates : dict[DocIndexTuple, str]
Dictionary with (page index, document index) tuples as keys and extracted PDF page text as values.

Kwargs

min_page_match_ratio : float
Optional. Minimum match ratio across all pages to consider a line a header or footer. Defaults to 0.6.
max_header_lines : int
Optional. Maximum number of header lines. Defaults to 10.
max_footer_lines : int
Optional. Maximum number of footer lines. Defaults to 10.
debug : bool
Optional. Enable debug logging. Defaults to False.

Returns

dict[int, dict[int, str]]
A nested dictionary containing document indexes as top-level keys and header/footer line indexes as second-level keys. The values are the header/footer lines. Footer lines are keyed as negative integers.

Example

>>> page_candidates = {
...     (0, 0): "Header1\nAn arbitrary non header line\nPage 1 of 3",
...     (1, 0): "Header1\nSome other text that is not a footer\nPage 2 of 3",
...     (2, 0): "Header1\nThis one is neither a footer nor a header\nPage 3 of 3"
... }
>>> find_junk_lines(page_candidates, max_header_lines=1, max_footer_lines=1)
{0: {0: 'Header1', -1: 'Page 2 of 3'}}
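A simplified sketch of the idea for a single document index, using exact matching at each edge position instead of the function's ratio-based fuzzy matching (the helper name is hypothetical):

```python
from collections import Counter

def repeated_edge_lines(pages, max_header_lines=1, max_footer_lines=1,
                        min_page_match_ratio=0.6):
    """A line is junk if the same exact text appears at the same position
    (counting from the top, or from the bottom) on at least
    min_page_match_ratio of the pages."""
    split_pages = [p.splitlines() for p in pages]
    positions = list(range(max_header_lines)) + \
        [-(i + 1) for i in range(max_footer_lines)]
    junk = {}
    for idx in positions:
        counts = Counter(
            lines[idx]
            for lines in split_pages
            if (idx >= 0 and len(lines) > idx) or (idx < 0 and len(lines) >= -idx)
        )
        if not counts:
            continue
        line, n = counts.most_common(1)[0]
        if n / len(split_pages) >= min_page_match_ratio:
            junk[idx] = line
    return junk
```

Note that exact matching misses the near-identical "Page X of Y" footers in the example above, which the real ratio-based comparison is designed to catch.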

def header_footer_test_lines(pages_by_doc_idx_tuple: dict[DocIndexTuple, str], max_header_lines: int = 10, max_footer_lines: int = 10) ‑> dict[int, list[list[str]]]

Collect lists of formatted lines of text for each page of each document.

This function processes a dictionary of pages indexed by document and page numbers. It extracts the header and footer lines from each page, formats them, and returns a dictionary with document indexes as keys and lists of formatted lines for each page as values.

Args

pages_by_doc_idx_tuple : dict[DocIndexTuple, str]
A dictionary of the form {(page_index, document_index): [multiline string for one PDF page]}.
max_header_lines : int
Optional. Maximum number of header lines. Defaults to 10.
max_footer_lines : int
Optional. Maximum number of footer lines. Defaults to 10.

Returns

dict[int, list[list[str]]]
A dictionary with document indexes as keys. The values are lists, one list per page in the source PDF. Each inner list contains the lines of text on each page, formatted to always have a length of max_header_lines + max_footer_lines by inserting empty string values between the header and footer lines.

Example

>>> pages_by_doc_idx_tuple = {
...     (0, 0): "Header1\nAn arbitrary non header line\nPage 1 of 3",
...     (1, 0): "Header1\nSome other\ntext but\nnot a footer\nPage 2 of 3",
...     (2, 0): "Header1\nPage 3 of 3"
... }
>>> result = header_footer_test_lines(
...     pages_by_doc_idx_tuple, max_header_lines=1, max_footer_lines=3
... )
>>> result == {
...     0: [
...         ['Header1', 'Header1', 'An arbitrary non header line', 'Page 1 of 3'],
...         ['Header1', 'text but', 'not a footer', 'Page 2 of 3'],
...         ['Header1', '', 'Header1', 'Page 3 of 3']
...     ]
... }
True
def match_by_pdf_reference_designator(pdf_library: dict[str, lu.PDFLibProto] | list[str], match_func: Callable[[str, bytes, str, bytes], bool]) ‑> collections.abc.Iterator[tuple[list[str], list[str]]]

Check each PDFLibEntry instance in pdf_library for a matching PDFLibReference.

Args

pdf_library : dict[str, lu.PDFLibProto] | list[str]
dict of form {filename: lu.PDFLibProto} or list of filenames.
match_func : Callable[[str, bytes, str, bytes], bool]
function to compare lu.PDFLibEntry and lu.PDFLibReference. Must return True if match is found.

Yields

Iterator[tuple[list[str], list[str]]]
matched files, empty list in place of dropped files
def match_n_filename_fields(pdf_library: dict[str, lu.PDFLibProto], n: int = 3, re_split_expr: str = '[^ 0-9a-zA-Z]', alt_match_funcs: list[tuple[list[int], Callable[[list[str], list[str]], bool]]] | None = None, **kwargs) ‑> collections.abc.Iterator[tuple[list[str], list[str]]]

Match pdf filenames by n fields from the list of elements generated by a regex split with re_split_expr.

Args

pdf_library : dict[str, lu.PDFLibProto]
The set of pdfs to be matched and extracted.
n : int
Optional. number of fields that must match. Default is 3.
re_split_expr : str
Optional. regex for splitting filename. Default is r"[^ 0-9a-zA-Z]".
alt_match_funcs : list[tuple[list[int], Callable[[list[str], list[str]], bool]]] | None
list of tuples of form (field_idxs, match_func), where field_idxs is a list of field indexes whose values will be passed to match_func, which must return True for a successful match. Defaults to None.

KwArgs

debug : bool
enable debug logging. Defaults to False.
disable_sort : bool
disable sorting of matched files. Defaults to False.
custom_sort_key : Callable[[str], Any]
custom sort key function. Defaults to None.
desc : bool
Sort descending. Defaults to False.
keep_top : int
Keep top n files after sorting. Defaults to None.

Yields

Iterator[tuple[list[str], list[str]]]
matched files, dropped files

Example

>>> pdf_library = {
...     "file_20210101_facesheet.pdf": None,
...     "file_20210101.pdf": None,
...     "file_20210102.pdf": None
... }
>>> list(match_n_filename_fields(pdf_library, n=2))
[(['file_20210101.pdf', 'file_20210101_facesheet.pdf'], []), (['file_20210102.pdf'], [])]
def matched_filename_sort(filename: str, check_str: str, str_xform: Callable[[str], str] = builtins.str, str_comp: Callable[[str, str], Any] = <built-in function contains>, val_true: Any = 1, val_false: Any = 0) ‑> int

Custom sort function for use with match_n_filename_fields.

Args

filename : str
filename
check_str : str
String to test against filename
str_xform : Callable[[str], str]
Optional. Transform to apply to strings prior to comparison, e.g., str.lower. Defaults to str.
str_comp : Callable[[str, str], Any]
Optional. String comparison function where first arg is filename and second is check_str. Defaults to contains.
val_true : Any
Optional. Value supplied as the primary sort key if str_comp returns True. Defaults to 1.
val_false : Any
Optional. Value supplied as the primary sort key if str_comp returns False. Defaults to 0.

Returns

int
Sort key for list.sort function
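A pure-Python sketch of how such a sort key behaves; `filename_sort_key` is a hypothetical stand-in:

```python
import operator

def filename_sort_key(filename, check_str, str_xform=str,
                      str_comp=operator.contains,
                      val_true=1, val_false=0):
    """Return val_true when str_comp(str_xform(filename), str_xform(check_str))
    holds, else val_false; intended for use as a primary sort key."""
    if str_comp(str_xform(filename), str_xform(check_str)):
        return val_true
    return val_false

# Sorting descending on the key floats matching files to the front.
files = ["a_report.pdf", "a_facesheet.pdf"]
files.sort(key=lambda f: filename_sort_key(f, "facesheet"), reverse=True)
```

operator.contains(a, b) is equivalent to `b in a`, so the default comparison tests whether check_str is a substring of filename.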
def no_match(pdf_library: dict[str, lu.PDFLibProto]) ‑> collections.abc.Iterator[tuple[list[str], list[str]]]

Used with pdf_extractor.extract_single_patient_pdfs() when the desired distinct cases match 1-to-1 with the source PDFs.

Simply yields the filename of each source PDF in turn.

Args

pdf_library : dict[str, lu.PDFLibProto]
dict of lu.PDFLibProto objects

Yields

Iterator[tuple[list[str], list[str]]]
a tuple of (list of source PDFs, empty list)
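The described behavior amounts to the following sketch (`no_match_sketch` is a stand-in name):

```python
def no_match_sketch(pdf_library):
    """Yield each filename as its own single-file match group, paired with
    an empty dropped-files list."""
    for filename in pdf_library:
        yield [filename], []
```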
def numerics_to_dos(date_source: str | lu.PDFLibProto, **kwargs) ‑> str

Convert data from a filename into a proper date of service. The returned value is the extracted datetime minus 1 day by default.

Args

date_source : str | lu.PDFLibProto
If str, process the supplied value directly to produce a candidate datetime, with gvars.TODAY as fallback. If lu.PDFLibProto, process lu.PDFLibProto.filename to produce a candidate datetime, with lu.PDFLibProto.last_modified (cast to datetime) as fallback.

KwArgs

numerics_format : str
date format string. Defaults to "%Y%m%d".
day_offset : int
Number of days to add to the discovered date. Defaults to -1.
numerics_xform : Callable[[list[str]], str]
Given the list of contiguous numerical strings extracted from the filename, return a single string representing the date of service. Defaults to lambda x: "".join(x).

Returns

str
date of service

Example

>>> date_source = "report_20230101.pdf"
>>> numerics_to_dos(date_source)
'2022-12-31'
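A minimal sketch of the default behavior, assuming a plain-string date_source and omitting the fallback handling described above:

```python
import re
from datetime import datetime, timedelta

def numerics_to_dos_sketch(filename, numerics_format="%Y%m%d", day_offset=-1,
                           numerics_xform="".join):
    """Join the runs of digits found in a filename, parse them as a date,
    and shift the result by day_offset days."""
    numerics = re.findall(r"\d+", filename)
    dt = datetime.strptime(numerics_xform(numerics), numerics_format)
    return (dt + timedelta(days=day_offset)).strftime("%Y-%m-%d")
```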
def pdf_text_pages(pdf_binary: PDFType, min_lines_ocr_trigger: int = 6, debug_path: Path | None = None, page_indices: list[int] | None = None, pbar: bool = False) ‑> list[str]

Extract text from PDF pages and return as a list of multiline strings.

Uses PDF code-behind by default. If Azure OCR is enabled and the number of lines of text on a page is less than min_lines_ocr_trigger, uses Azure OCR to extract text from images as well.

Args

pdf_binary : PDFType
PDF to extract text from
min_lines_ocr_trigger : int
Optional. Trigger Azure OCR if it is enabled and the line count of the text extracted from the code-behind is less than this value. Defaults to 6.
debug_path : Path | None, optional
Path to write debug files to. Defaults to None.
page_indices : list[int] | None, optional
Indices of the pages to extract. Defaults to None (all pages).
pbar : bool
Optional. Display a progress bar during extraction. Defaults to False.

Returns

list[str]
One multiline string per extracted page.
def recursive_pdf_split(splits: SplitGroups, source_pdf: lu.PDFLibProto | bytes) ‑> collections.abc.Iterator[tuple[str, bytes]]

Split a multipatient PDF into one PDF per patient.

Reduces file sizes by stripping unrendered images embedded in global resource objects (XObjects). Uses recursive binary search to split the PDF into halves until each half contains only one patient. Critical for avoiding extremely long execution times on very long PDFs.

Args

splits : SplitGroups
dict of form {case identifier: (start_page, end_page)}
source_pdf : lu.PDFLibProto | bytes
the pdf to split as PDFLibProto or bytes.

Yields

Iterator[tuple[str, bytes]]
a tuple of (case identifier, bytes)
def split_by_outline(pdf_entry: lu.PDFLibProto, split_check: Callable[[OutlineItemType], bool] = <function <lambda>>, lines_to_designator: Callable[[list[str]], str] | None = None, increment_document_idx: Callable[[list[str]], bool] = <function <lambda>>) ‑> collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]

Split a PDF according to the entries in its outline (navigation pane).

Args

pdf_entry : lu.PDFLibProto
a library entry representing a PDF file
split_check : Callable[[OutlineItemType], bool]
If the specified function returns True when passed an outline item, the page containing that item is considered the first page of a new case.
lines_to_designator : Callable[[list[str]], str]
A function that returns a designator for a case when passed the list of lines from the first page of the case.
increment_document_idx : Callable[[list[str]], bool]
If the specified function returns True when passed a list of lines from a page, the document index for the current case is incremented. Primarily used to account for use cases where the header and footer structure is inconsistent to ensure proper stripping behavior.

Yields

Iterator[RawPDFDataTuple]
generates tuples of elements expected by pdf_extractor.extract_multi_patient_pdfs for each case, namely:
- a case identifier for use in downstream operations (str)
- the text extracted for each page assigned to the case, keyed by both page and document index (dict[DocIndexTuple, str])
def split_by_single_space_header(pdf_entry: lu.PDFLibProto, line_num: int = 0, min_match_ratio: float = 0.8, match_to_designator_converter: Callable[[str], str] = <function <lambda>>, **kwargs) ‑> collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]

Split a multipatient PDF into individual cases based on a fuzzy match of the first line of each page to the first line of the first page of the current case.

Header lines are reduced to a single space between words prior to matching.

Args

pdf_entry : lu.PDFLibProto
a library entry representing a PDF file
line_num : int
line number to use for matching
min_match_ratio : float
minimum match ratio to consider a match
match_to_designator_converter : Callable[[str], str]
function to convert the test string captured from the first page of the case to its identifier for use in downstream operations.

KwArgs

signature : str
Optional. regex to use for matching. Defaults to "^ *(?:\S+ ){2,}.*Page \d{1,4} of \d{1,4}$", i.e. a few identifying bits of info plus 'Page X of Y'
debug : bool
Optional. enable debug logging. Defaults to False.

Yields

Iterator[RawPDFDataTuple]
generates tuples of elements expected by pdf_extractor.extract_multi_patient_pdfs for each case, namely:
- a case identifier for use in downstream operations (str)
- the text extracted for each page assigned to the case, keyed by both page and document index (dict[DocIndexTuple, str])
def split_by_static_lines_check(pdf_entry: lu.PDFLibProto, split_check: Callable[[list[str]], bool] | None = None, lines_to_designator: Callable[[list[str]], str] | None = None, increment_document_idx: Callable[[list[str]], bool] = <function <lambda>>) ‑> collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]

Splits the text extracted from a multipatient PDF into individual cases based on a static "split_check" function.

Args

pdf_entry : lu.PDFLibProto
a library entry representing a PDF file
split_check : Callable[[list[str]], bool]
If the specified function returns True when passed the lines from a page, this page represents the beginning of a new case. Defaults to a function that checks the first line of each page for the text 'Billing and Compliance Report' (a common Epic PDF case header) if not supplied.
lines_to_designator : Callable[[list[str]], str]
A function that returns a designator for a case when passed the list of lines from the first page of the case. Defaults to designator_from_demographics if not supplied.
increment_document_idx : Callable[[list[str]], bool]
If the specified function returns True when passed a list of lines from a page, the document index for the current case is incremented. Primarily used to account for use cases where the header and footer structure is inconsistent to ensure proper stripping behavior.

Yields

Iterator[RawPDFDataTuple]
generates tuples of elements expected by pdf_extractor.extract_multi_patient_pdfs for each case, namely:
- a case identifier for use in downstream operations (str)
- the text extracted for each page assigned to the case, keyed by both page and document index (dict[DocIndexTuple, str])
def strip_filename(filename: str, strip_extension: bool = True) ‑> str

Strips the path and (optionally) the extension from a file path.
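Equivalent behavior can be sketched with pathlib (a stand-in, not the module's implementation):

```python
from pathlib import Path

def strip_filename_sketch(filename, strip_extension=True):
    """Drop the directory path and, optionally, the file extension."""
    p = Path(filename)
    return p.stem if strip_extension else p.name
```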

def strip_junk(page_text: str, header_footer_lines: dict[int, str], min_match_ratio=0.95) ‑> list[str]

Remove header and footer lines from a page of text.

This function processes the text from a single page of a PDF and removes any lines that match the provided header or footer lines based on a specified minimum match ratio.

Args

page_text : str
Text from a single page of a PDF.
header_footer_lines : dict[int, str]
Dictionary of header/footer lines to remove from page_text. Integer keys correspond to line indices.
min_match_ratio : float
Optional. Minimum match ratio to consider a line a header or footer. Defaults to 0.95.

Returns

list[str]
The lines of text from page_text with header and footer lines removed.

Example

>>> page_text = "Header1\nLine1\nFooter1"
>>> header_footer_lines = {0: "Header1", -1: "Footer1"}
>>> strip_junk(page_text, header_footer_lines)
['Line1']
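A sketch of the same idea using difflib.SequenceMatcher for the fuzzy ratio; the module's actual matching may differ:

```python
from difflib import SequenceMatcher

def strip_junk_sketch(page_text, header_footer_lines, min_match_ratio=0.95):
    """Drop lines that fuzzily match a known header/footer line at the same
    position; negative keys index from the bottom of the page."""
    lines = page_text.splitlines()
    drop = set()
    for idx, junk in header_footer_lines.items():
        pos = idx if idx >= 0 else len(lines) + idx
        if 0 <= pos < len(lines) and \
                SequenceMatcher(None, lines[pos], junk).ratio() >= min_match_ratio:
            drop.add(pos)
    return [line for i, line in enumerate(lines) if i not in drop]
```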
def write_pdf_from_pages(*, pdf: PDFType, pages: list[int] | list[PageObject], filename: str = '', metadata: dict[str, Any] | None = None) ‑> bytes

Clip specific pages from a source pdf into a new pdf document.

Args

pdf : PDFType
The source PDF.
pages : list[int | PageObject]
The pages to include in the output.
filename : str
Optional. An optional filename to write the extracted PDF to disk for inspection. Defaults to "".
metadata : dict[str, Any] | None
An optional set of metadata to add to the output pdf. Defaults to None.

Returns

bytes
the new pdf
def write_pdf_from_readers(pdf_readers: Sequence[PDFType], filename: str = '', metadata: dict[str, Any] | None = None) ‑> bytes

Combine multiple PDFs into a single document.

Args

pdf_readers : Sequence[PDFType]
The PDFs to combine.
filename : str | None, optional
An optional filename to write the combined PDF to disk for inspection. Defaults to "".
metadata : dict[str, Any] | None
optional metadata to embed in the output PDF. Defaults to None.

Returns

bytes
The combined PDF bytes.

Classes

class DocIndexTuple (page_idx: ForwardRef('int'), doc_idx: ForwardRef('int'))

Associates both a page index and a document index with a PDF object.

Used as a dict key in references containing the raw text extracted from each page of a PDF. doc_idx is used to segregate header/footer processing when multiple documents are associated with a single patient or multiple header/footer styles are detected for a single PDF.

Attributes

page_idx : int
page index
doc_idx : int
document index

Ancestors

  • builtins.tuple

Instance variables

var doc_idx : int

Alias for field number 1

var page_idx : int

Alias for field number 0