Module matchops.matcher

module containing generic resources for constructing dataframes used in data operations.

Classes

class DataFrameMatcher (left_source: Any = b'', left_constructor: DataFrameConstructor = <factory>, left_static_cols: collections.abc.Mapping[str, str] = <factory>, right_source: Any = b'', right_constructor: DataFrameConstructor = <factory>, right_static_cols: collections.abc.Mapping[str, str] = <factory>, joins: list[JoinSpec] = <factory>, dq_not_match_cols: list[tuple[str, str]] = <factory>, dq_on_match_cols: list[tuple[str, str]] = <factory>, output_index: list[str] = <factory>, output_mapping: collections.abc.Mapping[str, str] = <factory>, sort_ascending: bool = True, keep_unmatched: bool = True, log_unmatched: bool = False, log_duplicates: bool = False, matcher_type: str = 'dataframe_matcher', output_dir: str = '')

Settings and functions required to process a facility schedule and match it to data found in a demographics export and the data extracted from the PDFs

Args

left_source : io.BytesIO, bytes, or str
source data for left (primary) dataframe. All types are converted to io.BytesIO and passed to left_constructor to be loaded by pandas
left_constructor : DataFrameConstructor
DataFrameConstructor instance defining pandas loading routine and data transformations for the left_source.
left_static_cols : Mapping[str, str]
a mapping representing a name and static value to be added as a column to left (primary) dataframe
right_source : io.BytesIO, bytes, or str
source data for right (matching) dataframe. All types are converted to io.BytesIO and passed to right_constructor to be loaded by pandas
right_constructor : DataFrameConstructor
DataFrameConstructor instance defining pandas loading routine and data transformations for the right_source.
right_static_cols : Mapping[str, str]
a mapping representing a name and static value to be added as a column to right (matching) dataframe
joins : list[JoinSpec]
list of successive join operations to call in sequence until all rows in left have been matched to a row in right.
dq_not_match_cols : list[tuple[str, str]]
if columns are populated for both left and right, the values must also match for a join to occur.
dq_on_match_cols : list[tuple[str, str]]
reverse of dq_not_match_cols, i.e. drop the join if either column is NOT populated or if the values match.
output_index : list[str]
column or columns used to construct the index of the output dataframe
output_mapping : Mapping[str, list[str]]
default={}, optional mapping applied when converting the dataframe output to a dictionary (see to_dict() below). Entries should be of the form key: column name from the output dataframe; value: list of keys in the final dict in which the column value will be stored.
sort_ascending : bool
default=True; sort output by index in ascending order. sort will be descending if False is passed.
keep_unmatched : bool
default=True; if true, merge the remaining records in left dataframe with the output dataframe and populate nan values with "". If false, do not include unmatched records in output.
log_unmatched : bool
default=False; if true, save a dictionary of all unmatched records to the log. Otherwise, only print the index for all unmatched records.
log_duplicates : bool
default=False; if true, save a dictionary of duplicate records to the log. Otherwise, print a list of duplicate indices.
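
The two disqualification rules (dq_not_match_cols and dq_on_match_cols) can be read as a per-row predicate applied to a candidate join. The following is a minimal sketch of that reading, not the module's implementation. The helper names is_populated and is_disqualified are hypothetical, and treating None, NaN, and "" as "not populated" is an assumption.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any


def is_populated(value: Any) -> bool:
    """Assumption: None, NaN, and the empty string count as not populated."""
    if value is None or value == "":
        return False
    return value == value  # NaN is the usual value that is not equal to itself


def is_disqualified(
    row: Mapping[str, Any],
    dq_not_match_cols: list[tuple[str, str]],
    dq_on_match_cols: list[tuple[str, str]],
) -> bool:
    """Return True if a candidate joined row should be dropped."""
    for left_col, right_col in dq_not_match_cols:
        left, right = row.get(left_col), row.get(right_col)
        # Both sides populated but different: the join is rejected.
        if is_populated(left) and is_populated(right) and left != right:
            return True
    for left_col, right_col in dq_on_match_cols:
        left, right = row.get(left_col), row.get(right_col)
        # Either side unpopulated, or the values agree: the join is rejected.
        if not (is_populated(left) and is_populated(right)) or left == right:
            return True
    return False
```

Under these assumptions, a row with a blank right-hand value passes a dq_not_match_cols check but fails a dq_on_match_cols check.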

Attributes

left : pandas.DataFrame
primary dataframe. when a row in left is matched with a row in right the result is stored in output and the original row is removed from left. this process continues until all rows from left have been matched or all defined joins have been performed.
right : pandas.DataFrame
matching dataframe. source data that is being matched with entries in left.
duplicates : pandas.DataFrame
dataframe containing row data for duplicated indices in output.
output : pandas.DataFrame
joined output dataframe

Class variables

var dq_not_match_cols : list[tuple[str, str]]
var dq_on_match_cols : list[tuple[str, str]]
var joins : list[JoinSpec]
var keep_unmatched : bool
var left_constructor : DataFrameConstructor
var left_source : Any
var left_static_cols : collections.abc.Mapping[str, str]
var log_duplicates : bool
var log_unmatched : bool
var matcher_type : str
var output_dir : str
var output_index : list[str]
var output_mapping : collections.abc.Mapping[str, str]
var right_constructor : DataFrameConstructor
var right_source : Any
var right_static_cols : collections.abc.Mapping[str, str]
var sort_ascending : bool

Static methods

def as_bytesio(source: _io.BytesIO | bytes | str) ‑> _io.BytesIO

convert any of the supported source formats (io.BytesIO, bytes, or str) to io.BytesIO
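
A minimal sketch of this kind of normalization is shown below. It is not the module's code: the function name as_bytesio_sketch is hypothetical, and both the UTF-8 encoding of str input and the rewind of an existing buffer are assumptions.

```python
from __future__ import annotations

import io


def as_bytesio_sketch(source: io.BytesIO | bytes | str) -> io.BytesIO:
    """Normalize BytesIO, bytes, or str input to an io.BytesIO object."""
    if isinstance(source, io.BytesIO):
        # Assumption: rewind so a later reader starts at the beginning.
        source.seek(0)
        return source
    if isinstance(source, bytes):
        return io.BytesIO(source)
    if isinstance(source, str):
        # Assumption: the str holds the data itself (not a path), as UTF-8.
        return io.BytesIO(source.encode("utf-8"))
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

The resulting buffer can be handed to any pandas reader that accepts a file-like object.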

Methods

def match_on(self, join_spec: JoinSpec) ‑> pandas.core.frame.DataFrame

set indices and perform join on left and right dataframes returning a dataframe of successfully matched records, and removing all successfully matched records from the left frame.

Args

join_spec : JoinSpec
dataclass object with columns and join type

Returns

pd.DataFrame
joined dataframe
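
The behaviour described for match_on (join, return the matches, and remove matched rows from left) can be approximated with plain pandas. The sketch below is an illustration under assumptions, not the method itself: match_on_sketch is a hypothetical free function, it always performs an inner join, and it returns the remaining left rows instead of mutating an attribute.

```python
from __future__ import annotations

import pandas as pd


def match_on_sketch(
    left: pd.DataFrame,
    right: pd.DataFrame,
    left_columns: list[str],
    right_columns: list[str],
    right_suffix: str = "_RIGHT",
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Inner-join left and right; return (matched rows, remaining left rows)."""
    matched = left.merge(
        right,
        how="inner",
        left_on=left_columns,
        right_on=right_columns,
        suffixes=("", right_suffix),  # only overlapping right columns are renamed
    )
    # Compare join keys to find the left rows that did not take part in a match.
    left_keys = left.set_index(left_columns).index
    matched_keys = matched.set_index(left_columns).index
    remaining = left[~left_keys.isin(matched_keys)]
    return matched, remaining.reset_index(drop=True)


# Hypothetical data: "id" and "name" are illustrative column names.
left = pd.DataFrame({"id": ["1", "2", "3"], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": ["2", "3", "4"], "name": ["Bob", "Cyrus", "Dee"]})
matched, remaining = match_on_sketch(left, right, ["id"], ["id"])
```

Calling such a function once per entry in joins, feeding remaining back in as the next left, gives the successive matching described for the class.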

def to_dict(self, valid_keys: collections.abc.Sequence[str] | None = None) ‑> dict[str, dict[str, utilities.v_str.vStr]]

convert output dataframe to dictionary.
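
One plausible reading of how output_mapping shapes this dictionary is sketched below, using plain dicts in place of the output dataframe. Everything here is an assumption made for illustration: the function name apply_output_mapping is hypothetical, unmapped columns are assumed to keep their own name, and valid_keys is assumed to act as a filter on the final keys.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def apply_output_mapping(
    records: Mapping[str, Mapping[str, str]],
    output_mapping: Mapping[str, list[str]],
    valid_keys: Sequence[str] | None = None,
) -> dict[str, dict[str, str]]:
    """Copy each column value into every key listed for that column."""
    result: dict[str, dict[str, str]] = {}
    for index, row in records.items():
        entry: dict[str, str] = {}
        for column, value in row.items():
            # Assumption: a column without a mapping entry keeps its own name.
            for key in output_mapping.get(column, [column]):
                # Assumption: valid_keys, when given, filters the final keys.
                if valid_keys is None or key in valid_keys:
                    entry[key] = value
        result[index] = entry
    return result


# Hypothetical record keyed by its output index value.
records = {"row1": {"last": "Smith", "dob": "1990-01-01"}}
mapped = apply_output_mapping(records, {"last": ["surname", "family_name"]})
```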

def to_json(self, valid_keys: str | None = None) ‑> str

convert output dataframe to json saving a copy to path if supplied.

class JoinSpec (left_columns: list[str], right_columns: list[str] = <factory>, join_type: Literal['left', 'right', 'inner', 'outer'] = 'inner', right_suffix: str = '_RIGHT', joins_formatters: list[collections.abc.Callable[[str], str]] = <factory>)

Defines a list of columns from a left dataframe and a list of columns from a right dataframe that are to be used to join the two.

Args

left_columns
list of participating column names in the left dataframe
right_columns : optional
list of participating column names in the right dataframe. If not supplied, defaults to left_columns. If supplied, left_columns and right_columns MUST be of equal length.
join_type
"left", "right", "inner", or "outer". Default is "inner".
right_suffix
text appended to right column name in join result when the original column name exists in both left and right.
joins_formatters
list of callables for reformatting index values prior to performing join operation.

Raises

ValueError
if length of left_columns != length of supplied right_columns OR length of left_columns != length of supplied joins_formatters.
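
The defaulting and validation rules above map naturally onto a dataclass with a __post_init__ hook. The following is a sketch under that assumption, not the class's source: JoinSpecSketch is a hypothetical stand-in, and the error messages are invented for illustration.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class JoinSpecSketch:
    left_columns: list[str]
    right_columns: list[str] = field(default_factory=list)
    join_type: Literal["left", "right", "inner", "outer"] = "inner"
    right_suffix: str = "_RIGHT"
    joins_formatters: list[Callable[[str], str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # right_columns falls back to left_columns when not supplied.
        if not self.right_columns:
            self.right_columns = list(self.left_columns)
        if len(self.left_columns) != len(self.right_columns):
            raise ValueError("left_columns and right_columns must be of equal length")
        # Formatters are optional, but if supplied there must be one per column.
        if self.joins_formatters and len(self.joins_formatters) != len(self.left_columns):
            raise ValueError("joins_formatters must have one entry per left column")


# Hypothetical column names; str.upper and str.strip stand in for formatters.
spec = JoinSpecSketch(["first", "last"], joins_formatters=[str.upper, str.strip])
```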

Class variables

var join_type : Literal['left', 'right', 'inner', 'outer']
var joins_formatters : list[collections.abc.Callable[[str], str]]
var left_columns : list[str]
var right_columns : list[str]
var right_suffix : str