Module matchops.matcher
Module containing generic resources for constructing dataframes used in data operations.
Classes
- class DataFrameMatcher (left_source: Any = b'', left_constructor: DataFrameConstructor = <factory>, left_static_cols: collections.abc.Mapping[str, str] = <factory>, right_source: Any = b'', right_constructor: DataFrameConstructor = <factory>, right_static_cols: collections.abc.Mapping[str, str] = <factory>, joins: list[JoinSpec] = <factory>, dq_not_match_cols: list[tuple[str, str]] = <factory>, dq_on_match_cols: list[tuple[str, str]] = <factory>, output_index: list[str] = <factory>, output_mapping: collections.abc.Mapping[str, str] = <factory>, sort_ascending: bool = True, keep_unmatched: bool = True, log_unmatched: bool = False, log_duplicates: bool = False, matcher_type: str = 'dataframe_matcher', output_dir: str = '')
Settings and functions required to process a facility schedule and match it to data found in a demographics export and to the data extracted from the PDFs.
Args
- left_source: io.BytesIO, bytes, or str
- source data for the left (primary) dataframe. All types are converted to io.BytesIO and passed to left_constructor to be loaded by pandas.
- left_constructor: DataFrameConstructor
- DataFrameConstructor instance defining the pandas loading routine and data transformations for left_source.
- left_static_cols: Mapping[str, str]
- a mapping of column name to static value; each entry is added as a column to the left (primary) dataframe.
- right_source: io.BytesIO, bytes, or str
- source data for the right (matching) dataframe. All types are converted to io.BytesIO and passed to right_constructor to be loaded by pandas.
- right_constructor: DataFrameConstructor
- DataFrameConstructor instance defining the pandas loading routine and data transformations for right_source.
- right_static_cols: Mapping[str, str]
- a mapping of column name to static value; each entry is added as a column to the right (matching) dataframe.
- joins: list[JoinSpec]
- list of successive join operations, called in sequence until all rows in left have been matched to a row in right.
- dq_not_match_cols: list[tuple[str, str]]
- pairs of (left column, right column). If both columns of a pair are populated, their values must also match for a join to occur.
- dq_on_match_cols: list[tuple[str, str]]
- reverse of the above, i.e. drop the join if either column is NOT populated or the values match.
- output_index: list[str]
- column or columns used to construct the index of the output dataframe.
- output_mapping: Mapping[str, list[str]]
- default={}; optional mapping applied when converting the output dataframe to a dictionary (see to_dict() below). Each entry has the form key: column name from the output dataframe, value: list of keys in the final dict under which the column value will be stored.
- sort_ascending: bool
- default=True; sort output by index in ascending order. The sort is descending if False is passed.
- keep_unmatched: bool
- default=True; if true, merge the remaining records in the left dataframe with the output dataframe and fill NaN values with "". If false, do not include unmatched records in the output.
- log_unmatched: bool
- default=False; if true, save a dictionary of all unmatched records to the log. Otherwise, only print the index of each unmatched record.
- log_duplicates: bool
- default=False; if true, save a dictionary of duplicate records to the log. Otherwise, print a list of duplicate indices.
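The dq_not_match_cols rule can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the helper names is_populated and passes_dq_not_match are hypothetical, rows are plain dicts rather than pandas rows, and treating None and blank strings as "not populated" is an assumption.

```python
def is_populated(value):
    """ASSUMPTION: None and empty/whitespace-only strings count as unpopulated."""
    return value is not None and str(value).strip() != ""


def passes_dq_not_match(row, dq_not_match_cols):
    """Return True if a candidate joined row survives the dq_not_match_cols check.

    row: dict holding the merged left and right values for one candidate join.
    dq_not_match_cols: list of (left_column, right_column) pairs.
    """
    for left_col, right_col in dq_not_match_cols:
        left_val, right_val = row.get(left_col), row.get(right_col)
        # Compare only when BOTH sides are populated; otherwise the pair is ignored.
        if is_populated(left_val) and is_populated(right_val) and left_val != right_val:
            return False
    return True


# Hypothetical column names, for illustration only.
pairs = [("dob", "dob_RIGHT")]
same = {"name": "Ann", "dob": "1990-01-01", "dob_RIGHT": "1990-01-01"}
differs = {"name": "Ann", "dob": "1990-01-01", "dob_RIGHT": "1985-05-05"}
missing = {"name": "Ann", "dob": "", "dob_RIGHT": "1985-05-05"}

kept = [passes_dq_not_match(r, pairs) for r in (same, differs, missing)]
```

Here only the row whose two populated values disagree is rejected; a row with one side empty is kept, which matches the wording "if columns are populated for both left and right".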
Attributes
- left: pandas.DataFrame
- primary dataframe. When a row in left is matched with a row in right, the result is stored in output and the original row is removed from left. This process continues until all rows from left have been matched or all defined joins have been performed.
- right: pandas.DataFrame
- matching dataframe. Source data that is matched with entries in left.
- duplicates: pandas.DataFrame
- dataframe containing row data for duplicated indices in output.
- output: pandas.DataFrame
- joined output dataframe
Class variables
- var dq_not_match_cols : list[tuple[str, str]]
- var dq_on_match_cols : list[tuple[str, str]]
- var joins : list[JoinSpec]
- var keep_unmatched : bool
- var left_constructor : DataFrameConstructor
- var left_source : Any
- var left_static_cols : collections.abc.Mapping[str, str]
- var log_duplicates : bool
- var log_unmatched : bool
- var matcher_type : str
- var output_dir : str
- var output_index : list[str]
- var output_mapping : collections.abc.Mapping[str, str]
- var right_constructor : DataFrameConstructor
- var right_source : Any
- var right_static_cols : collections.abc.Mapping[str, str]
- var sort_ascending : bool
Static methods
- def as_bytesio(source: _io.BytesIO | bytes | str) ‑> _io.BytesIO
Convert any of the supported source formats to io.BytesIO.
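A minimal sketch of this kind of normalisation is shown here. The function name as_bytesio_sketch is hypothetical, and the handling of str is an assumption: the sketch treats a string as the data itself and encodes it as UTF-8, whereas the real method may treat it differently (for example as a file path).

```python
import io


def as_bytesio_sketch(source):
    """Normalise io.BytesIO, bytes, or str input to an io.BytesIO object."""
    if isinstance(source, io.BytesIO):
        return source  # already the target type; returned unchanged
    if isinstance(source, bytes):
        return io.BytesIO(source)
    if isinstance(source, str):
        # ASSUMPTION: the string holds the data itself, not a path to a file.
        return io.BytesIO(source.encode("utf-8"))
    raise TypeError(f"unsupported source type: {type(source).__name__}")


from_bytes = as_bytesio_sketch(b"id,name\n1,Ann\n")
from_text = as_bytesio_sketch("id,name\n1,Ann\n")
existing = io.BytesIO(b"id,name\n1,Ann\n")
```

Normalising to a single buffer type means the constructor that loads the data only has to handle one input form.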
Methods
- def match_on(self, join_spec: JoinSpec) ‑> pandas.core.frame.DataFrame
Set indices and perform a join on the left and right dataframes, returning a dataframe of successfully matched records and removing those records from the left frame.
Args
- join_spec: JoinSpec
- dataclass object with columns and join type
Returns
- pd.DataFrame
- joined dataframe
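The behaviour described for match_on, together with the successive joins described for the joins argument, can be sketched without pandas. This is an illustration under stated assumptions, not the library's code: match_on_sketch is a hypothetical name, rows are plain dicts, only an inner join is shown, and every colliding right-hand column (including the join keys) receives the suffix, which is simpler than what a pandas merge does.

```python
def match_on_sketch(left, right, left_columns, right_columns=None, right_suffix="_RIGHT"):
    """Inner-join two lists of row dicts; matched rows are removed from `left` in place."""
    right_columns = right_columns or left_columns
    # Index the right rows by join-key tuple (first occurrence wins in this sketch).
    right_index = {}
    for r_row in right:
        right_index.setdefault(tuple(r_row[c] for c in right_columns), r_row)

    matched, remaining = [], []
    for l_row in left:
        r_row = right_index.get(tuple(l_row[c] for c in left_columns))
        if r_row is None:
            remaining.append(l_row)
            continue
        joined = dict(l_row)
        for col, value in r_row.items():
            # Suffix right-hand names that collide with a left-hand column.
            joined[col + right_suffix if col in l_row else col] = value
        matched.append(joined)
    left[:] = remaining  # drop matched rows from the left frame
    return matched


# Hypothetical data, for illustration only.
left = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
right = [
    {"id": 1, "name": "Ann", "room": "A"},
    {"id": 9, "name": "Bob", "room": "B"},
]

output = []
for columns in (["id"], ["name"]):  # successive joins, tried in order
    if not left:
        break  # every left row has been matched
    output.extend(match_on_sketch(left, right, columns))
```

After the loop, the first row has matched on id, the second on name, and the third remains in left as an unmatched record.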
 
- def to_dict(self, valid_keys: collections.abc.Sequence[str] | None = None) ‑> dict[str, dict[str, utilities.v_str.vStr]]
Convert the output dataframe to a dictionary.
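How output_mapping might be applied to a single output row is sketched here. The interpretation is an assumption: the sketch copies each mapped column value to every key in its list and leaves unmapped columns under their own name. The name row_to_dict_sketch is hypothetical, and the valid_keys parameter is not modelled.

```python
def row_to_dict_sketch(row, output_mapping):
    """Re-key one output row using output_mapping (column name -> list of target keys)."""
    result = {}
    for column, value in row.items():
        # ASSUMPTION: unmapped columns keep their own name; mapped columns are
        # copied to every key in their list.
        for key in output_mapping.get(column, [column]):
            result[key] = value
    return result


# Hypothetical column and key names, for illustration only.
row = {"name": "Ann", "room": "A"}
mapping = {"room": ["location", "ward"]}
converted = row_to_dict_sketch(row, mapping)
```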
- def to_json(self, valid_keys: str | None = None) ‑> str
Convert the output dataframe to JSON, saving a copy to path if supplied.
 
- class JoinSpec (left_columns: list[str], right_columns: list[str] = <factory>, join_type: Literal['left', 'right', 'inner', 'outer'] = 'inner', right_suffix: str = '_RIGHT', joins_formatters: list[collections.abc.Callable[[str], str]] = <factory>)
Defines a list of columns from a left dataframe and a list of columns from a right dataframe that are used to join the two.
Args
- left_columns
- list of participating column names in the left dataframe
- right_columns: optional
- list of participating column names in the right dataframe. If not supplied, defaults to left_columns. If supplied, left_columns and right_columns MUST be of equal length.
- join_type
- "left", "right", "inner", or "outer". Default is "inner".
- right_suffix
- text appended to right column name in join result when the original column name exists in both left and right.
- joins_formatters
- list of callables for reformatting index values prior to performing join operation.
Raises
- ValueError
- if the length of left_columns != the length of the supplied right_columns, or the length of left_columns != the length of the supplied joins_formatters.
Class variables
- var join_type : Literal['left', 'right', 'inner', 'outer']
- var joins_formatters : list[collections.abc.Callable[[str], str]]
- var left_columns : list[str]
- var right_columns : list[str]
- var right_suffix : str
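The defaults and length checks described for JoinSpec could be expressed with a dataclass and __post_init__, roughly as follows. This is a sketch reconstructed from the documented signature and the Raises entry, not the actual source; the class name JoinSpecSketch and the error messages are hypothetical, and skipping the formatter check when no formatters are supplied is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Literal


@dataclass
class JoinSpecSketch:
    left_columns: list[str]
    right_columns: list[str] = field(default_factory=list)
    join_type: Literal["left", "right", "inner", "outer"] = "inner"
    right_suffix: str = "_RIGHT"
    joins_formatters: list[Callable[[str], str]] = field(default_factory=list)

    def __post_init__(self):
        if not self.right_columns:
            # right_columns not supplied: default to a copy of left_columns
            self.right_columns = list(self.left_columns)
        elif len(self.right_columns) != len(self.left_columns):
            raise ValueError("left_columns and right_columns must be of equal length")
        # ASSUMPTION: the formatter check applies only when formatters are supplied.
        if self.joins_formatters and len(self.joins_formatters) != len(self.left_columns):
            raise ValueError("joins_formatters must supply one callable per join column")


# Hypothetical column names, for illustration only.
spec = JoinSpecSketch(left_columns=["last_name", "dob"], joins_formatters=[str.upper, str.strip])

try:
    JoinSpecSketch(left_columns=["a", "b"], right_columns=["x"])
    raised = False
except ValueError:
    raised = True
```

Validating in __post_init__ means a malformed spec fails at construction time rather than partway through a sequence of joins.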