Module utilities.pdf_utils
Utility functions for raw PDF, CSV, and other delimited file processing.
Functions
def as_pdf_reader(pdf: PDFType) ‑> pypdf._reader.PdfReader
-
Returns a PdfReader if supplied bytes, io.BytesIO, or a PdfReader.
def concat_bytes(bodies: Sequence[bytes], strip_headers: bool = True) ‑> bytes
-
Concatenate a series of bytes objects, typically from the body property of a csv or text file PDFLibProto object.
Args
bodies
:Sequence[bytes]
- list of bytes objects collected from a sequence of lu.PDFLibProto objects from a csv or other text document
strip_headers
:bool
- Optional. Strip headers from all but the first lu.PDFLibEntry object. Defaults to True.
Returns
bytes
- the concatenated bytes
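The header-stripping behavior can be sketched with the stdlib. `concat_bytes_sketch` below is a hypothetical stand-in (not the module's source) that assumes the "header" is the first line of each body:

```python
from collections.abc import Sequence


def concat_bytes_sketch(bodies: Sequence[bytes], strip_headers: bool = True) -> bytes:
    """Join text-file bodies, optionally dropping the first line of all but the first."""
    if not bodies:
        return b""
    if not strip_headers:
        return b"".join(bodies)
    parts = [bodies[0]]
    for body in bodies[1:]:
        # Drop everything up to and including the first newline (the header row).
        _, _, remainder = body.partition(b"\n")
        parts.append(remainder)
    return b"".join(parts)
```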
def csv_column_values(csv_bytes: bytes, column: int | str, val_sep_override: str | None = None) ‑> list[str]
-
Extract a list of values from a column in a delimited file.
This function processes the bytes of a CSV or other delimited file and extracts the values from the specified column. The column can be specified either by its index or by its name. An optional value separator can be provided to override the default separator ",".
Args
csv_bytes
:bytes
- Bytes from a CSV or other delimited file.
column
:int | str
- The column to extract values from. Can be specified as an index or a string representing the column name.
val_sep_override
:str | None
- Optional. Override the default value separator for CSV files. Defaults to None.
Returns
list[str]
- The list of values for the specified column.
Example
>>> csv_bytes = b"Name,Age,Location\nJohn,30,USA\nJane,25,UK"
>>> csv_column_values(csv_bytes, 1)
['30', '25']
>>> csv_column_values(csv_bytes, "Location")
['USA', 'UK']
def csv_rows(csv_bytes: bytes, val_sep_override: str | None = None, has_header_row: bool = True) ‑> list[dict[str, str]]
-
Split bytes from CSVs and other delimited files into rows.
This function processes the bytes from a CSV or other delimited file and splits them into rows. Each row is represented as a dictionary with column headers as keys and corresponding cell values as values. It handles various delimiters and can optionally override the default value separator ",".
Args
csv_bytes
:bytes
- Bytes from a CSV file.
val_sep_override
:str | None
- Optional. Override the default value separator for CSV files. Defaults to None.
has_header_row
:bool
- Optional. Indicates if the source document includes a header row. Defaults to True.
Returns
list[dict[str, str]]
- A list of dictionaries representing the rows of the CSV file.
Example
>>> csv_bytes = b"Name,Age,Location\nJohn,30,USA\nJane,25,UK"
>>> csv_rows(csv_bytes)
[{'name': 'John', 'age': '30', 'location': 'USA'}, {'name': 'Jane', 'age': '25', 'location': 'UK'}]
>>> csv_bytes = b"Name|Age|Location\nJohn|30|USA\nJane|25|UK"
>>> csv_rows(csv_bytes, val_sep_override="|")
[{'name': 'John', 'age': '30', 'location': 'USA'}, {'name': 'Jane', 'age': '25', 'location': 'UK'}]
>>> csv_bytes = b"John,30,USA\nJane,25,UK"
>>> csv_rows(csv_bytes, has_header_row=False)
[{'Column1': 'John', 'Column2': '30', 'Column3': 'USA'}, {'Column1': 'Jane', 'Column2': '25', 'Column3': 'UK'}]
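The lowercased-header and synthetic ColumnN behaviors can be sketched in a few lines; `csv_rows_sketch` is a hypothetical simplification with a trimmed-down signature:

```python
import csv
import io


def csv_rows_sketch(csv_bytes: bytes, val_sep: str = ",", has_header_row: bool = True) -> list[dict[str, str]]:
    """Split delimited bytes into row dicts: lowercase real headers, or synthesize ColumnN keys."""
    rows = list(csv.reader(io.StringIO(csv_bytes.decode()), delimiter=val_sep))
    if has_header_row:
        header = [h.lower() for h in rows[0]]
        data = rows[1:]
    else:
        header = [f"Column{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]
```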
def csv_value_to_dos(date_source: lu.PDFLibProto, **kwargs) ‑> str
-
Convert data from file bytes into a proper date of service.
Args
date_source
:lu.PDFLibProto
- lu.PDFLibProto object with a csv body
KwArgs
column
:int | str
- The column to extract values from. Can be specified as an index or a string representing the column name. Defaults to "Date".
day_offset
:int
- number of days to add to the discovered date. Defaults to -1.
val_sep_override
:str | None
, optional- Override the default value separator for csv files. Defaults to None.
Returns
str
- date of service
def designator_from_anesthesia_summary(lines: list[str]) ‑> str
-
Find the anesthesia summary table and create a designator from the patient name, MRN, and, if present, the date of service.
Args
lines
:list[str]
- lines from the first page of a case
Raises
ValueError
- if the anesthesia summary table is not found
Returns
str
- designator for the case
Example
>>> lines = [
...     "Some other line",
...     "Anesthesia Summary - John N Doe [123456]",
...     " Current as of 2020-12-31 23:59"
... ]
>>> designator_from_anesthesia_summary(lines)
'DOE, JOHN N_123456_2020-12-31'
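The documented output can be reproduced with a short regex-based sketch. `designator_sketch` and both patterns are hypothetical reconstructions from the example, not the module's source:

```python
import re

SUMMARY_RE = re.compile(r"Anesthesia Summary - (?P<name>.+?) \[(?P<mrn>\d+)\]")
DATE_RE = re.compile(r"Current as of (?P<dos>\d{4}-\d{2}-\d{2})")


def designator_sketch(lines: list[str]) -> str:
    """Build 'LAST, GIVEN NAMES_MRN[_DOS]' from the anesthesia summary lines."""
    for i, line in enumerate(lines):
        match = SUMMARY_RE.search(line)
        if not match:
            continue
        *given, family = match["name"].split()
        parts = [f"{family.upper()}, {' '.join(given).upper()}", match["mrn"]]
        # The date of service, when present, appears on a later line.
        for later in lines[i + 1:]:
            date_match = DATE_RE.search(later)
            if date_match:
                parts.append(date_match["dos"])
                break
        return "_".join(parts)
    raise ValueError("anesthesia summary table not found")
```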
def designator_from_demographics(lines: list[str]) ‑> str
-
Find the demographics table and create a designator from the patient name, MRN, phone number, and DOB.
Args
lines
:list[str]
- Lines from the first page of a case.
Raises
ValueError
- If the demographics table is not found.
Returns
str
- Designator for the case.
Example
>>> lines = [
...     "Some other line",
...     "Patient Demographics",
...     " Name MRN Legal Sex DOB SSN Phone",
...     " Doe, John 123456 Male 1/1/1980 XXX-XX-1234 1(800)213-2131",
...     " Michael",
... ]
>>> designator_from_demographics(lines)
'DOE, JOHN MICHAEL_1980-01-01_123456_18002132131'
def dos_from_pdf_page(date_source: lu.PDFLibProto, page_indices: Sequence[int] = (0,), date_pattern: re.Pattern = re.compile('Summary\\n\\s+Date: (?P<dos>[\\d/\\-]+)')) ‑> str
-
Extracts the text from one or more PDF pages and searches for the date of service via regular expression.
Args
date_source
- a pdf library entry with a PDF bytes body
page_indices
- the indices of the pages to search
date_pattern
- precompiled regex used to search for the date of service. The expression MUST have a named group labeled "dos".
Returns
str
- the extracted DOS value, or, if no value is found, the last modified date of the source object minus 1 day.
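The named-group requirement can be demonstrated directly; the page text below is illustrative, but the pattern is the documented default:

```python
import re

# Any replacement pattern MUST expose a named group called "dos".
date_pattern = re.compile(r"Summary\n\s+Date: (?P<dos>[\d/\-]+)")

page_text = "Anesthesia Summary\n    Date: 2023-04-05\nOther content"
match = date_pattern.search(page_text)
dos = match["dos"] if match else None
```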
def find_junk_lines(page_candidates: dict[DocIndexTuple, str], **kwargs)
-
Identifies lines that repeat at the top and bottom of all candidate pages.
NOTE: This does NOT handle use cases where 2 or more pages have header/footer set "A" and 2 or more other pages have header/footer set "B". In such instances, page sets "A" and "B" must be assigned to different doc_idxs. This occurs implicitly for sites with multiple PDFs per patient, but it's up to the caller when extracting multi-patient PDFs (see increment_document_idx).
Args
page_candidates
:dict[DocIndexTuple, str]
- Dictionary with (page index, document index) tuples as keys and extracted PDF page text as values.
Kwargs
min_page_match_ratio
:float
- Optional. Minimum match ratio across all pages to consider a line a header or footer. Defaults to 0.6.
max_header_lines
:int
- Optional. Maximum number of header lines. Defaults to 10.
max_footer_lines
:int
- Optional. Maximum number of footer lines. Defaults to 10.
debug
:bool
- Optional. Enable debug logging. Defaults to False.
Returns
dict[int, dict[int, str]]
- A nested dictionary containing document indexes as top-level keys and header/footer line indexes as second-level keys. The values are the header/footer lines. Footer lines are keyed as negative integers.
Example
>>> page_candidates = {
...     (0, 0): "Header1\nAn arbitrary non header line\nPage 1 of 3",
...     (1, 0): "Header1\nSome other text that is not a footer\nPage 2 of 3",
...     (2, 0): "Header1\nThis one is neither a footer nor a header\nPage 3 of 3"
... }
>>> find_junk_lines(page_candidates, max_header_lines=1, max_footer_lines=1)
{0: {0: 'Header1', -1: 'Page 2 of 3'}}
def header_footer_test_lines(pages_by_doc_idx_tuple: dict[DocIndexTuple, str], max_header_lines: int = 10, max_footer_lines: int = 10) ‑> dict[int, list[list[str]]]
-
Collect lists of formatted lines of text for each page of each document.
This function processes a dictionary of pages indexed by document and page numbers. It extracts the header and footer lines from each page, formats them, and returns a dictionary with document indexes as keys and lists of formatted lines for each page as values.
Args
pages_by_doc_idx_tuple
:dict[DocIndexTuple, str]
- A dictionary of the form {(page_index, document_index): [multiline string for one PDF page]}.
max_header_lines
:int
- Optional. Maximum number of header lines. Defaults to 10.
max_footer_lines
:int
- Optional. Maximum number of footer lines. Defaults to 10.
Returns
dict[int, list[list[str]]]
- A dictionary with document indexes as keys. The values are lists, one list per page in the source PDF. Each inner list contains the lines of text on each page, formatted to always have a length of max_header_lines + max_footer_lines by inserting empty string values between the header and footer lines.
Example
>>> pages_by_doc_idx_tuple = {
...     (0, 0): "Header1\nAn arbitrary non header line\nPage 1 of 3",
...     (1, 0): "Header1\nSome other\ntext but\nnot a footer\nPage 2 of 3",
...     (2, 0): "Header1\nPage 3 of 3"
... }
>>> result = header_footer_test_lines(
...     pages_by_doc_idx_tuple, max_header_lines=1, max_footer_lines=3
... )
>>> result == {
...     0: [
...         ['Header1', 'Header1', 'An arbitrary non header line', 'Page 1 of 3'],
...         ['Header1', 'text but', 'not a footer', 'Page 2 of 3'],
...         ['Header1', '', 'Header1', 'Page 3 of 3']
...     ]
... }
True
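The fixed-length padding behavior shown above can be sketched as follows; `header_footer_test_lines_sketch` is a hypothetical reconstruction from the documented example:

```python
def header_footer_test_lines_sketch(
    pages: dict[tuple[int, int], str],
    max_header_lines: int = 10,
    max_footer_lines: int = 10,
) -> dict[int, list[list[str]]]:
    """Pad each page's first/last lines into fixed-length header+footer slots."""
    by_doc: dict[int, list[list[str]]] = {}
    for (page_idx, doc_idx), text in sorted(pages.items()):
        lines = text.splitlines()
        # Header slots: the first max_header_lines lines, padded on the right.
        header = lines[:max_header_lines]
        header += [""] * (max_header_lines - len(header))
        # Footer slots: the last max_footer_lines lines, right-aligned (padded on the left).
        tail = lines[-max_footer_lines:] if max_footer_lines else []
        footer = [""] * (max_footer_lines - len(tail)) + tail
        by_doc.setdefault(doc_idx, []).append(header + footer)
    return by_doc
```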
def match_by_pdf_reference_designator(pdf_library: dict[str, lu.PDFLibProto] | list[str], match_func: Callable[[str, bytes, str, bytes], bool]) ‑> collections.abc.Iterator[tuple[list[str], list[str]]]
-
Check each PDFLibEntry instance in pdf_library for a matching PDFLibReference.
Args
pdf_library
:dict[str, lu.PDFLibProto] | list[str]
- dict of form {filename: lu.PDFLibProto} or list of filenames.
match_func
:Callable[[str, bytes, str, bytes], bool]
- function to compare lu.PDFLibEntry and lu.PDFLibReference. Must return True if match is found.
Yields
Iterator[tuple[list[str], list[str]]]
- matched files, empty list in place of dropped files
def match_n_filename_fields(pdf_library: dict[str, lu.PDFLibProto], n: int = 3, re_split_expr: str = '[^ 0-9a-zA-Z]', alt_match_funcs: list[tuple[list[int], Callable[[list[str], list[str]], bool]]] | None = None, **kwargs) ‑> collections.abc.Iterator[tuple[list[str], list[str]]]
-
Match pdf filenames by n fields from the list of elements generated by a regex split with re_split_expr.
Args
pdf_library
:dict[str, lu.PDFLibProto]
- The set of pdfs to be matched and extracted.
n
:int
- Optional. number of fields that must match. Default is 3.
re_split_expr
:str
- Optional. Regex for splitting the filename. Defaults to r"[^ 0-9a-zA-Z]".
alt_match_funcs
:list[tuple[list[int], Callable[[list[str], list[str]], bool]]] | None
- Optional. List of tuples of the form (field_idxs, match_func), where field_idxs is a list of field indexes whose values are passed to match_func, which must return True for a successful match. Defaults to None.
KwArgs
debug
:bool
- enable debug logging. Defaults to False.
disable_sort
:bool
- disable sorting of matched files. Defaults to False.
custom_sort_key
:Callable[[str], Any]
- custom sort key function. Defaults to None.
desc
:bool
- Sort descending. Defaults to False.
keep_top
:int
- Keep top n files after sorting. Defaults to None.
Yields
Iterator[tuple[list[str], list[str]]]
- matched files, dropped files
Example
>>> pdf_library = {
...     "file_20210101_facesheet.pdf": None,
...     "file_20210101.pdf": None,
...     "file_20210102.pdf": None
... }
>>> list(match_n_filename_fields(pdf_library, n=2))
[(['file_20210101.pdf', 'file_20210101_facesheet.pdf'], []), (['file_20210102.pdf'], [])]
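The field-matching idea can be sketched as grouping filenames whose first n split fields agree. `match_n_fields_sketch` is a hypothetical simplification that ignores alt_match_funcs and the sorting kwargs:

```python
import re
from collections import defaultdict
from collections.abc import Iterator


def match_n_fields_sketch(
    filenames: list[str],
    n: int = 3,
    re_split_expr: str = r"[^ 0-9a-zA-Z]",
) -> Iterator[tuple[list[str], list[str]]]:
    """Group filenames whose first n regex-split fields are identical."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for name in filenames:
        fields = [f for f in re.split(re_split_expr, name) if f]
        groups[tuple(fields[:n])].append(name)
    for key in sorted(groups):
        # Matched files first; the sketch never drops files, so the second list is empty.
        yield sorted(groups[key]), []
```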
def matched_filename_sort(filename: str, check_str: str, str_xform: Callable[[str], str] = builtins.str, str_comp: Callable[[str, str], Any] = <built-in function contains>, val_true: Any = 1, val_false: Any = 0) ‑> int
-
Custom sort function for use with match_n_filename_fields.
Args
filename
:str
- filename
check_str
:str
- String to test against filename
str_xform
:Callable[[str], str]
- Optional. Transform to apply to strings prior to comparison, e.g., str.lower. Defaults to str.
str_comp
:Callable[[str, str], Any]
- Optional. String comparison function where the first arg is filename and the second is check_str. Defaults to operator.contains.
val_true
:Any
- Optional. Value supplied as the primary sort key if str_comp returns True. Defaults to 1.
val_false
:Any
- Optional. Value supplied as the primary sort key if str_comp returns False. Defaults to 0.
Returns
int
- Sort key for list.sort function
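The sort-key pattern can be sketched with operator.contains; `matched_filename_sort_sketch` is a hypothetical equivalent of the documented behavior:

```python
from operator import contains
from typing import Any, Callable


def matched_filename_sort_sketch(
    filename: str,
    check_str: str,
    str_xform: Callable[[str], str] = str,
    str_comp: Callable[[str, str], Any] = contains,
    val_true: Any = 1,
    val_false: Any = 0,
) -> Any:
    """Return val_true as the sort key when str_comp(filename, check_str) holds."""
    hit = str_comp(str_xform(filename), str_xform(check_str))
    return val_true if hit else val_false
```

As a usage sketch, sorting descending on this key floats filenames containing "facesheet" to the front of a matched group.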
def no_match(pdf_library: dict[str, lu.PDFLibProto]) ‑> collections.abc.Iterator[tuple[list[str], list[str]]]
-
Used with pdf_extractor.extract_single_patient_pdfs() when the desired distinct cases match 1-to-1 with the source PDFs.
Simply yields the filename of each source PDF in turn.
Args
pdf_library
:dict[str, lu.PDFLibProto]
- dict of lu.PDFLibProto objects
Yields
Iterator[tuple[list[str], list[str]]]
- a tuple of (list of source PDFs, empty list)
def numerics_to_dos(date_source: str | lu.PDFLibProto, **kwargs) ‑> str
-
Convert data from a filename into a proper date of service. By default, the returned value is the extracted datetime minus 1 day.
Args
date_source
:str | lu.PDFLibProto
- If str, process supplied value directly to produce candidate datetime w/ gvars.TODAY as fallback. If lu.PDFLibProto, process lu.PDFLibProto.filename to produce candidate datetime with lu.PDFLibProto.last_modified (cast to datetime) as fallback.
KwArgs
numerics_format
:str
- date format string. Defaults to "%Y%m%d".
day_offset
:int
- Number of days to add to the discovered date. Defaults to -1.
numerics_xform
:Callable[[list[str]], str]
- Given the list of contiguous numerical strings extracted from the filename, return a single string representing the date of service. Defaults to lambda x: "".join(x).
Returns
str
- date of service
Example
>>> date_source = "report_20230101.pdf"
>>> numerics_to_dos(date_source)
'2022-12-31'
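The digit-run extraction and day_offset arithmetic can be sketched with the stdlib; `numerics_to_dos_sketch` is a hypothetical simplification that handles only the str case with no fallback date:

```python
import re
from datetime import datetime, timedelta


def numerics_to_dos_sketch(
    filename: str,
    numerics_format: str = "%Y%m%d",
    day_offset: int = -1,
) -> str:
    """Join the digit runs in a filename, parse them as a date, then shift by day_offset."""
    numerics = re.findall(r"\d+", filename)
    parsed = datetime.strptime("".join(numerics), numerics_format)
    return (parsed + timedelta(days=day_offset)).strftime("%Y-%m-%d")
```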
def pdf_text_pages(pdf_binary: PDFType, min_lines_ocr_trigger: int = 6, debug_path: Path | None = None, page_indices: list[int] | None = None, pbar: bool = False) ‑> list[str]
-
Extract text from PDF pages and return as a list of multiline strings.
Uses PDF code-behind by default. If Azure OCR is enabled and the number of lines of text on a page is less than min_lines_ocr_trigger, uses Azure OCR to extract text from images as well.
Args
pdf_binary
:PDFType
- PDF to extract text from
min_lines_ocr_trigger
:int
- Optional. Trigger Azure OCR if enabled and the line count of the text extracted from the code-behind is less than this value. Defaults to 6.
debug_path
:Path | None
- Optional. Path to write debug files to. Defaults to None.
Returns
list[str]
- One multiline string per extracted page.
def recursive_pdf_split(splits: SplitGroups, source_pdf: lu.PDFLibProto | bytes) ‑> collections.abc.Iterator[tuple[str, bytes]]
-
Split a multipatient PDF into 1 PDF per patient (pt).
Reduces file sizes by stripping unrendered images embedded in global resource objects (XObjects). Uses recursive binary search to split the pdf into halves until each half contains only 1 patient. Critical for avoiding extremely long execution times for excessively long PDFs.
Args
splits
:SplitGroups
- dict of form {case identifier: (start_page, end_page)}
source_pdf
:lu.PDFLibProto | bytes
- the pdf to split as PDFLibProto or bytes.
Yields
Iterator[tuple[str, bytes]]
- a tuple of (case identifier, bytes)
def split_by_outline(pdf_entry: lu.PDFLibProto, split_check: Callable[[OutlineItemType], bool] = <function <lambda>>, lines_to_designator: Callable[[list[str]], str] | None = None, increment_document_idx: Callable[[list[str]], bool] = <function <lambda>>) ‑> collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]
-
Split a pdf according to entries in its navigation pane.
Args
pdf_entry
:lu.PDFLibProto
- a library entry representing a PDF file
split_check
:Callable[[OutlineItemType], bool]
- If the specified function returns True when passed an outline item, the page containing that item is considered the first page of a new case.
lines_to_designator
:Callable[[list[str]], str]
- A function that returns a designator for a case when passed the list of lines from the first page of the case.
increment_document_idx
:Callable[[list[str]], bool]
- If the specified function returns True when passed a list of lines from a page, the document index for the current case is incremented. Primarily used to account for use cases where the header and footer structure is inconsistent to ensure proper stripping behavior.
Yields
Iterator[RawPDFDataTuple]
- generates tuples of elements expected by pdf_extractor.extract_multi_patient_pdfs for each case, namely:
- a case identifier for use in downstream operations (str)
- the text extracted for each page assigned to the case keyed by both page and document index (dict[DocIndexTuple, str])
def split_by_single_space_header(pdf_entry: lu.PDFLibProto, line_num: int = 0, min_match_ratio: float = 0.8, match_to_designator_converter: Callable[[str], str] = <function <lambda>>, **kwargs) ‑> collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]
-
Split a multipatient PDF into individual cases based on a fuzzy match of the first line of each page to the first line of the first page of the current case.
Header lines are reduced to a single space between words prior to matching.
Args
pdf_entry
:lu.PDFLibProto
- a library entry representing a PDF file
line_num
:int
- line number to use for matching
min_match_ratio
:float
- minimum match ratio to consider a match
match_to_designator_converter
:Callable[[str], str]
- function to convert the test string captured from the first page of the case to its identifier for use in downstream operations.
KwArgs
signature
:str
- Optional. Regex to use for matching. Defaults to "^ *(?:\S+ ){2,}.*Page \d{1,4} of \d{1,4}$", i.e., a few identifying bits of info plus 'Page X of Y'.
debug
:bool
- Optional. Enable debug logging. Defaults to False.
Yields
Iterator[RawPDFDataTuple]
- generates tuples of elements expected by pdf_extractor.extract_multi_patient_pdfs for each case, namely:
- a case identifier for use in downstream operations (str)
- the text extracted for each page assigned to the case keyed by both page and document index (dict[DocIndexTuple, str])
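The single-space normalization and fuzzy header comparison can be sketched with difflib; both helpers below are hypothetical stand-ins for the matching step, not the module's source:

```python
from difflib import SequenceMatcher


def single_space(line: str) -> str:
    """Collapse runs of whitespace so spacing differences don't break matches."""
    return " ".join(line.split())


def headers_match(line_a: str, line_b: str, min_match_ratio: float = 0.8) -> bool:
    """Fuzzy-compare two header lines after whitespace normalization."""
    ratio = SequenceMatcher(None, single_space(line_a), single_space(line_b)).ratio()
    return ratio >= min_match_ratio
```

Pages whose first line matches the current case's first line stay in the case; a mismatch starts a new case.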
def split_by_static_lines_check(pdf_entry: lu.PDFLibProto, split_check: Callable[[list[str]], bool] | None = None, lines_to_designator: Callable[[list[str]], str] | None = None, increment_document_idx: Callable[[list[str]], bool] = <function <lambda>>) ‑> collections.abc.Iterator[tuple[str, dict[DocIndexTuple, str]]]
-
Splits the text extracted from a multipatient PDF into individual cases based on a static "split_check" function.
Args
pdf_entry
:lu.PDFLibProto
- a library entry representing a PDF file
split_check
:Callable[[list[str]], bool]
- If the specified function returns True when passed the lines from a page, this page represents the beginning of a new case. Defaults to a function that checks the first line of each page for the text 'Billing and Compliance Report' (a common Epic PDF case header) if not supplied.
lines_to_designator
:Callable[[list[str]], str]
- A function that returns a designator for a case when passed the list of lines from the first page of the case. Defaults to designator_from_demographics if not supplied.
increment_document_idx
:Callable[[list[str]], bool]
- If the specified function returns True when passed a list of lines from a page, the document index for the current case is incremented. Primarily used to account for use cases where the header and footer structure is inconsistent to ensure proper stripping behavior.
Yields
Iterator[RawPDFDataTuple]
- generates tuples of elements expected by pdf_extractor.extract_multi_patient_pdfs for each case, namely:
- a case identifier for use in downstream operations (str)
- the text extracted for each page assigned to the case keyed by both page and document index (dict[DocIndexTuple, str])
def strip_filename(filename: str, strip_extension: bool = True) ‑> str
-
Strips the path and (optionally) the extension from a file path.
def strip_junk(page_text: str, header_footer_lines: dict[int, str], min_match_ratio=0.95) ‑> list[str]
-
Remove header and footer lines from a page of text.
This function processes the text from a single page of a PDF and removes any lines that match the provided header or footer lines based on a specified minimum match ratio.
Args
page_text
:str
- Text from a single page of a PDF.
header_footer_lines
:dict[int, str]
- Dictionary of header/footer lines to remove from page_text. Integer keys correspond to line indices.
min_match_ratio
:float
- Optional. Minimum match ratio to consider a line a header or footer. Defaults to 0.95.
Returns
list[str]
- The lines of text from page_text with header and footer lines removed.
Example
>>> page_text = "Header1\nLine1\nFooter1"
>>> header_footer_lines = {0: "Header1", -1: "Footer1"}
>>> strip_junk(page_text, header_footer_lines)
['Line1']
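The positional fuzzy removal can be sketched with difflib; `strip_junk_sketch` is a hypothetical reconstruction from the documented behavior:

```python
from difflib import SequenceMatcher


def strip_junk_sketch(
    page_text: str,
    header_footer_lines: dict[int, str],
    min_match_ratio: float = 0.95,
) -> list[str]:
    """Drop lines whose fuzzy match against the known header/footer at that index is high enough."""
    lines = page_text.splitlines()
    junk_idxs = set()
    for idx, junk in header_footer_lines.items():
        # Negative keys index footer lines from the end of the page.
        pos = idx if idx >= 0 else len(lines) + idx
        if 0 <= pos < len(lines):
            if SequenceMatcher(None, lines[pos], junk).ratio() >= min_match_ratio:
                junk_idxs.add(pos)
    return [line for i, line in enumerate(lines) if i not in junk_idxs]
```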
def write_pdf_from_pages(*, pdf: PDFType, pages: list[int] | list[PageObject], filename: str = '', metadata: dict[str, Any] | None = None) ‑> bytes
-
Clip specific pages from a source pdf into a new pdf document.
Args
pdf
:PDFType
- The source PDF.
pages
:list[int] | list[PageObject]
- The pages to include in the output.
filename
:str
- Optional. Filename to write the extracted PDF to disk for inspection. Defaults to "".
metadata
:dict[str, Any] | None
- An optional set of metadata to add to the output pdf. Defaults to None.
Returns
bytes
- the new pdf
def write_pdf_from_readers(pdf_readers: Sequence[PDFType], filename: str = '', metadata: dict[str, Any] | None = None) ‑> bytes
-
Combine multiple PDFs into a single document.
Args
pdf_readers
:Sequence[PDFType]
- The PDFs to combine.
filename
:str
- Optional. Filename to write the combined PDF to disk for inspection. Defaults to "".
metadata
:dict[str, Any] | None
- optional metadata to embed in the output PDF. Defaults to None.
Returns
bytes
- The combined PDF bytes.
Classes
class DocIndexTuple (page_idx: int, doc_idx: int)
-
Associates both a page index and a document index with a PDF object.
Used as a dict key in references containing the raw text extracted from each page of a PDF. doc_idx is used to segregate header/footer processing when multiple documents are associated with a single patient or multiple header/footer styles are detected for a single PDF.
Attributes
page_idx
:int
- page index
doc_idx
:int
- document index
Ancestors
- builtins.tuple
Instance variables
var doc_idx : int
-
Alias for field number 1
var page_idx : int
-
Alias for field number 0
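Given the tuple ancestry and field aliases above, DocIndexTuple behaves like a typing.NamedTuple; an equivalent definition (a sketch, not the module's source) is:

```python
from typing import NamedTuple


class DocIndexTuple(NamedTuple):
    """(page index, document index) key for per-page extracted PDF text."""

    page_idx: int
    doc_idx: int
```

Because instances are plain tuples, dict lookups work with either `DocIndexTuple(0, 0)` or the bare tuple `(0, 0)`.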