Module `matchops.matcher`

Module containing generic resources for constructing dataframes used in data operations.

Classes

class DataFrameMatcher (left_source: Any = b'', left_constructor: DataFrameConstructor = <factory>, left_static_cols: collections.abc.Mapping[str, str] = <factory>, right_source: Any = b'', right_constructor: DataFrameConstructor = <factory>, right_static_cols: collections.abc.Mapping[str, str] = <factory>, joins: list[JoinSpec] = <factory>, dq_not_match_cols: list[tuple[str, str]] = <factory>, dq_on_match_cols: list[tuple[str, str]] = <factory>, output_index: list[str] = <factory>, output_mapping: collections.abc.Mapping[str, str] = <factory>, sort_ascending: bool = True, keep_unmatched: bool = True, log_unmatched: bool = False, log_duplicates: bool = False, matcher_type: str = 'dataframe_matcher', output_dir: str = '')

Settings and functions required to process a facility schedule and match it to data found in a demographics export and to data extracted from the PDFs.

Args

left_source : io.BytesIO, bytes, or str: source data for the left (primary) dataframe. All types are converted to io.BytesIO and passed to left_constructor to be loaded by pandas.
left_constructor : DataFrameConstructor: DataFrameConstructor instance defining the pandas loading routine and data transformations for left_source.
left_static_cols : Mapping[str, str]: a mapping of column name to static value; each entry is added as a column to the left (primary) dataframe.
right_source : io.BytesIO, bytes, or str: source data for the right (matching) dataframe. All types are converted to io.BytesIO and passed to right_constructor to be loaded by pandas.
right_constructor : DataFrameConstructor: DataFrameConstructor instance defining the pandas loading routine and data transformations for right_source.
right_static_cols : Mapping[str, str]: a mapping of column name to static value; each entry is added as a column to the right (matching) dataframe.
joins : list[JoinSpec]: list of join operations performed in sequence until every row in left has been matched to a row in right or all joins have been performed.
dq_not_match_cols : list[tuple[str, str]]: pairs of (left, right) columns. If both columns in a pair are populated, their values must also match for a join to occur.
dq_on_match_cols : list[tuple[str, str]]: the reverse of dq_not_match_cols, i.e. drop the join if either column is NOT populated or the values match.
output_index : list[str]: column or columns used to construct the index of the output dataframe
output_mapping : Mapping[str, list[str]]: default={}; optional mapping applied when converting the output dataframe to a dictionary (see to_dict() below). Each key is a column name from the output dataframe; each value is the list of keys in the final dict under which that column's value is stored.
sort_ascending : bool: default=True; sort output by index in ascending order. If False, sort in descending order.
keep_unmatched : bool: default=True; if True, merge the remaining records in the left dataframe with the output dataframe and populate NaN values with "". If False, do not include unmatched records in the output.
log_unmatched : bool: default=False; if True, save a dictionary of all unmatched records to the log. Otherwise, only print the index of each unmatched record.
log_duplicates : bool: default=False; if True, save a dictionary of duplicate records to the log. Otherwise, print a list of duplicate indices.
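
The two dq_* rules above can be read as predicates on a single (left value, right value) pair. The following is a minimal sketch of that literal reading, not the library's implementation. The helper names `is_populated`, `dropped_by_dq_not_match`, and `dropped_by_dq_on_match` are hypothetical, and "populated" is assumed to mean "not None and not a blank string" (pandas NaN handling is not covered here).

```python
def is_populated(value) -> bool:
    # Assumption: a value counts as populated when it is not None
    # and is not an empty or whitespace-only string.
    return value is not None and str(value).strip() != ""


def dropped_by_dq_not_match(left_value, right_value) -> bool:
    # dq_not_match_cols, read literally: when both sides are populated,
    # the values must match, so the join is dropped if they differ.
    return (
        is_populated(left_value)
        and is_populated(right_value)
        and left_value != right_value
    )


def dropped_by_dq_on_match(left_value, right_value) -> bool:
    # dq_on_match_cols, read literally: the join is dropped if either
    # side is not populated or if the two values are equal.
    return (
        not is_populated(left_value)
        or not is_populated(right_value)
        or left_value == right_value
    )


# A blank right-hand value does not disqualify under dq_not_match_cols.
print(dropped_by_dq_not_match("1990-01-01", ""))
# Differing populated values do.
print(dropped_by_dq_not_match("1990-01-01", "1985-05-05"))
```

Under this reading, a join survives a dq_on_match_cols pair only when both values are populated and differ from each other.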

Attributes

left : pandas.DataFrame: primary dataframe. When a row in left is matched with a row in right, the result is stored in output and the original row is removed from left. This process continues until all rows in left have been matched or all defined joins have been performed.
right : pandas.DataFrame: matching dataframe. source data that is being matched with entries in left.
duplicates : pandas.DataFrame: dataframe containing row data for duplicated indices in output.
output : pandas.DataFrame: joined output dataframe

Class variables

var dq_not_match_cols : list[tuple[str, str]]
var dq_on_match_cols : list[tuple[str, str]]
var joins : list[JoinSpec]
var keep_unmatched : bool
var left_constructor : DataFrameConstructor
var left_source : Any
var left_static_cols : collections.abc.Mapping[str, str]
var log_duplicates : bool
var log_unmatched : bool
var matcher_type : str
var output_dir : str
var output_index : list[str]
var output_mapping : collections.abc.Mapping[str, str]
var right_constructor : DataFrameConstructor
var right_source : Any
var right_static_cols : collections.abc.Mapping[str, str]
var sort_ascending : bool

Static methods

def as_bytesio(source: _io.BytesIO | bytes | str) ‑> _io.BytesIO: convert any of the supported source types (io.BytesIO, bytes, or str) to io.BytesIO
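
A conversion of this kind might look like the sketch below. It is an illustration only: the function name `as_bytesio_sketch` is hypothetical, and treating a str as text content encoded as UTF-8 is an assumption (the actual method could, for example, treat a str as a file path instead).

```python
from __future__ import annotations

import io


def as_bytesio_sketch(source: io.BytesIO | bytes | str) -> io.BytesIO:
    # An existing buffer is returned unchanged.
    if isinstance(source, io.BytesIO):
        return source
    # Raw bytes are wrapped in a new in-memory buffer.
    if isinstance(source, bytes):
        return io.BytesIO(source)
    # Assumption: a str holds the data itself and is encoded as UTF-8.
    if isinstance(source, str):
        return io.BytesIO(source.encode("utf-8"))
    raise TypeError(f"unsupported source type: {type(source).__name__}")


buffer = as_bytesio_sketch("id,name\n1,Ann\n")
print(buffer.getvalue())
```

The resulting buffer can be handed to any pandas reader that accepts a file-like object.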

Methods

def match_on(self, join_spec: JoinSpec) ‑> pandas.core.frame.DataFrame

Set indices and perform a join on the left and right dataframes, returning a dataframe of successfully matched records and removing those records from the left dataframe.

Args

join_spec : JoinSpec: dataclass object with columns and join type

Returns

pd.DataFrame: joined dataframe
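
The control flow described above (join, collect the matched rows, remove them from left, then try the next join) can be sketched without pandas. The example below uses plain lists of dicts so that it runs with the standard library only. It is a simplified illustration, not the library's code: `match_on_sketch` is a hypothetical name, only an inner join is modelled, the first right-hand row wins for each key, and left-hand values win on column name collisions instead of right_suffix being applied.

```python
def match_on_sketch(left, right, left_columns, right_columns):
    # Index the right rows by their key tuple (first row wins per key).
    right_by_key = {}
    for row in right:
        key = tuple(row[col] for col in right_columns)
        right_by_key.setdefault(key, row)

    matched, remaining = [], []
    for row in left:
        key = tuple(row[col] for col in left_columns)
        if key in right_by_key:
            # Merge the two rows; left values win on name collisions.
            matched.append({**right_by_key[key], **row})
        else:
            remaining.append(row)

    # Remove the matched rows from left in place.
    left[:] = remaining
    return matched


left = [
    {"id": "1", "name": "Ann"},
    {"id": "", "name": "Bob"},
    {"id": "9", "name": "Cy"},
]
right = [
    {"id": "1", "name": "Ann", "dob": "1990-01-01"},
    {"id": "2", "name": "Bob", "dob": "1985-05-05"},
]

# Successive joins: first on id, then on name for whatever is still unmatched.
output = []
for left_cols, right_cols in [(["id"], ["id"]), (["name"], ["name"])]:
    if not left:
        break
    output.extend(match_on_sketch(left, right, left_cols, right_cols))

print(len(output))                 # 2
print([r["name"] for r in left])   # ['Cy']
```

Here Ann is matched by the first join, Bob by the second, and Cy remains in left as an unmatched record.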

def to_dict(self, valid_keys: collections.abc.Sequence[str] | None = None) ‑> dict[str, dict[str, utilities.v_str.vStr]]

convert output dataframe to dictionary.

def to_json(self, valid_keys: str | None = None) ‑> str

convert output dataframe to JSON, saving a copy to path if supplied.

class JoinSpec (left_columns: list[str], right_columns: list[str] = <factory>, join_type: Literal['left', 'right', 'inner', 'outer'] = 'inner', right_suffix: str = '_RIGHT', joins_formatters: list[collections.abc.Callable[[str], str]] = <factory>)

Defines a list of columns from a left dataframe and a list of columns from a right dataframe that are to be used to join the two.

Args

left_columns: list of participating column names in the left dataframe
right_columns : optional: list of participating column names in the right dataframe. If not supplied, defaults to left_columns. If supplied, left_columns and right_columns MUST be of equal length.
join_type: "left", "right", "inner", or "outer". Default is "inner".
right_suffix: text appended to right column name in join result when the original column name exists in both left and right.
joins_formatters: list of callables for reformatting index values prior to performing join operation.

Raises

ValueError: if the length of left_columns != the length of the supplied right_columns OR the length of left_columns != the length of the supplied joins_formatters.
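
The defaulting and length checks described above could be implemented in a dataclass `__post_init__` hook. The sketch below shows one way to do so; it is not the library's source. The class name `JoinSpecSketch` is hypothetical, and it assumes that an empty right_columns means "not supplied" and that an empty joins_formatters list means "no formatting".

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class JoinSpecSketch:
    left_columns: list[str]
    right_columns: list[str] = field(default_factory=list)
    join_type: Literal["left", "right", "inner", "outer"] = "inner"
    right_suffix: str = "_RIGHT"
    joins_formatters: list[Callable[[str], str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: an empty right_columns list means "use left_columns".
        if not self.right_columns:
            self.right_columns = list(self.left_columns)
        if len(self.left_columns) != len(self.right_columns):
            raise ValueError("left_columns and right_columns differ in length")
        # Assumption: formatters are only length-checked when supplied.
        if self.joins_formatters and len(self.joins_formatters) != len(
            self.left_columns
        ):
            raise ValueError("joins_formatters and left_columns differ in length")


spec = JoinSpecSketch(left_columns=["last_name", "dob"])
print(spec.right_columns)  # ['last_name', 'dob']
```

Validating in `__post_init__` means a mismatched specification fails at construction time, before any join is attempted.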

Class variables

var join_type : Literal['left', 'right', 'inner', 'outer']
var joins_formatters : list[collections.abc.Callable[[str], str]]
var left_columns : list[str]
var right_columns : list[str]
var right_suffix : str