Package `rollup_cascade`

Defines the rollup_cascade subpackage for managing rollup and cascading operations when cleaning raw extracted tabular data.

Usage

>>> import rollup_cascade as rc
>>> rollup_cascade_manager = rc.RollupCascadeManager(
...     rc.RollupCascadeSpec(
...         rc.RollupColumn(
...             column="A", when_not_empty=["C"], default_value="--", post_cascade=False
...         ),
...         rc.CascadeColumn(column="C", when_not_empty=["B"]),
...         tables=["table1", "table2"],
...     ),
...     rc.RollupCascadeSpec(
...         rc.RollupColumn(column="D", when_not_empty=["C"], default_value="?"),
...         rc.CascadeColumn(column="E", when_not_empty=["F"]),
...         tables=["table2", "table3"],
...     ),
... )
>>> table_instance = {
...     "a": ["", "", "a2", "a3", "", ""],
...     "b": ["b1", "b2", "b3", "b4", "", ""],
...     "c": ["c1", "", "", "", "", ""],
...     "d": ["d1", "", "", "d2", "d3", "d4"],
... }
>>> table_name = "table2"
>>> cleaned = rollup_cascade_manager.rollup_cascade_rollup(table_name, table_instance)
>>> cleaned
{'a': ['--', '', 'a2', 'a3'],
'b': ['b1', 'b2', 'b3', 'b4'],
'c': ['c1', 'c1', 'c1', 'c1'],
'd': ['d1', '?', '?', 'd2 d3 d4']}

Sub-modules

rollup_cascade.column: define CascadeColumn (base) and RollupColumn (derived) column classes
rollup_cascade.manager: define RollupCascadeManager class as top level manager for all cascade/rollup operations for all tables across a single section. See …
rollup_cascade.spec: define RollupCascadeSpec for associating a group of cascade/rollup operations with specific table(s)

Classes

class CascadeColumn (*, column: str, when_not_empty: collections.abc.Sequence[str] = (), default_value: str = '', strict: bool = False, custom_trigger: collections.abc.Callable[[numpy.ndarray, numpy.ndarray, collections.abc.Iterable[str]], bool] | None = None)

Specify a column whose value should be cascaded to subsequent rows when one or more reference columns has a value.

KwArgs

column : str: the column whose value should be cascaded
when_not_empty : Sequence[str]: the columns that trigger the cascade when not empty. If this setting itself is empty, the cascade is triggered for ALL rows. Default is ().
default_value : str: the value to cascade when no value has been captured for the cascade column. Default is "".
strict : bool: if True, ALL when_not_empty columns must be populated to trigger a cascade. Default is False.
custom_trigger : Callable | None: a callable accepting the current row value set, the next row value set, and an iterable of all column names, and returning a bool. Default is None.
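As a rough illustration of the custom_trigger contract, the sketch below defines a hypothetical trigger that allows a cascade only while a "group" column stays unchanged or blank on the next row. The function name, the "group" column, and the rule itself are assumptions made for illustration. Rows are shown as plain lists rather than numpy.ndarray to keep the sketch dependency-free, and positional lookup by column name is assumed.

```python
from collections.abc import Iterable, Sequence


def same_group_trigger(
    current_row: Sequence[str],
    next_row: Sequence[str],
    column_names: Iterable[str],
) -> bool:
    """Hypothetical trigger: cascade only while the 'group' column has not changed."""
    names = list(column_names)
    if "group" not in names:
        return False
    idx = names.index("group")
    # A blank group on the next row is treated as "same group" (an assumption).
    return next_row[idx] in ("", current_row[idx])


columns = ["group", "value"]
print(same_group_trigger(["g1", "x"], ["", ""], columns))    # True
print(same_group_trigger(["g1", "x"], ["g2", ""], columns))  # False
```

A callable of this shape would presumably be passed as custom_trigger=same_group_trigger when constructing a CascadeColumn.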

Subclasses

RollupColumn

Class variables

var column : str
var custom_trigger : collections.abc.Callable[[numpy.ndarray, numpy.ndarray, collections.abc.Iterable[str]], bool] | None
var default_value : str
var strict : bool
var when_not_empty : collections.abc.Sequence[str]

Instance variables

prop check_when_not_empty : collections.abc.Callable[[numpy.ndarray], bool]: Returns the callable for testing whether or not a row meets the when_not_empty constraint.

Methods

def cascade(self, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Cascade values in the table for self.column per defined custom_trigger or when_not_empty column status.

Args

table: dict of lists of strings representing table rows

Returns

dict[str, list[str]]: table after cascading self.column.
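The cascading behaviour described here can be pictured as a conditional forward fill. The following is a minimal sketch under stated assumptions, not the package's implementation: cascade_sketch is a hypothetical standalone function, it only models the non-strict when_not_empty trigger, and it ignores custom_trigger.

```python
from collections.abc import Sequence


def cascade_sketch(
    table: dict[str, list[str]],
    column: str,
    when_not_empty: Sequence[str] = (),
    default_value: str = "",
) -> dict[str, list[str]]:
    """Forward-fill `column` on rows where any reference column has a value."""
    # Copy so the caller's table is left untouched.
    out = {name: list(values) for name, values in table.items()}
    last_value = default_value  # used until a real value has been captured
    for i in range(len(out[column])):
        current = out[column][i]
        if current != "":
            last_value = current
            continue
        # An empty when_not_empty means the cascade applies to every row.
        triggered = not when_not_empty or any(
            out[name][i] != "" for name in when_not_empty
        )
        if triggered:
            out[column][i] = last_value
    return out


table = {"b": ["b1", "b2", "b3", ""], "c": ["c1", "", "", ""]}
print(cascade_sketch(table, "c", when_not_empty=["b"]))
# {'b': ['b1', 'b2', 'b3', ''], 'c': ['c1', 'c1', 'c1', '']}
```

The last row is left blank because its reference column "b" is empty, so the trigger does not fire.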

def cascade_it(self, row_a: numpy.ndarray, row_b: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Returns True if value from row_a should be cascaded to row_b.

If self.custom_trigger is defined, call it with the supplied parameters and return the result. Otherwise, return self.set_it(row_b, table).

Args

row_a : np.ndarray: the last row with a good value.
row_b : np.ndarray: the target of the cascade, if triggered
table: dict of lists of strings representing table rows

Returns

bool: True if the last good value should be cascaded to the next row
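The dispatch described above (custom trigger first, otherwise the set_it check on row_b) might look roughly like the sketch below. cascade_it_sketch is a hypothetical free function; it takes the column settings as arguments instead of reading them from self, uses plain sequences instead of numpy.ndarray, and models only the non-strict constraint.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence


def cascade_it_sketch(
    row_a: Sequence[str],
    row_b: Sequence[str],
    column_names: Sequence[str],
    column: str,
    when_not_empty: Sequence[str] = (),
    custom_trigger: Callable[[Sequence[str], Sequence[str], Sequence[str]], bool]
    | None = None,
) -> bool:
    """Decide whether the value captured from row_a should be copied into row_b."""
    if custom_trigger is not None:
        # A custom trigger, when supplied, fully replaces the default check.
        return custom_trigger(row_a, row_b, column_names)
    position = {name: i for i, name in enumerate(column_names)}
    # Fallback mirrors set_it: target is empty and the constraint is met.
    target_empty = row_b[position[column]] == ""
    constraint_met = not when_not_empty or any(
        row_b[position[name]] != "" for name in when_not_empty
    )
    return target_empty and constraint_met


names = ["b", "c"]
print(cascade_it_sketch(["b1", "c1"], ["b2", ""], names, "c", ["b"]))  # True
print(cascade_it_sketch(["b1", "c1"], ["", ""], names, "c", ["b"]))    # False
```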

def column_value(self, row: numpy.ndarray, table: dict[str, list[str]]) ‑> str

Retrieves the value of self.column from the supplied row.

Args

row : np.ndarray: the row to check
table: dict of lists of strings representing table rows

Returns

str: the value of self.column in the supplied row.

def header_check(self, table: dict[str, list[str]]) ‑> bool

Check the columns in self (column + when_not_empty) against those in the header of a real table to determine if this operation is relevant given the columns available.

Args

table: dict of lists of strings representing table rows

Returns

bool: indicating whether this Cascade/RollupColumn is applicable given the columns available in the table instance.

def meets_when_not_empty(self, row: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Tests the when_not_empty constraint on this row of values.

Args

row : np.ndarray: the row to check
table: dict of lists of strings representing table rows

Returns

bool: True if when_not_empty is empty; if strict is False and a value is found at the position of ANY when_not_empty column; or if strict is True and a value is present at the positions of ALL when_not_empty columns.
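The three cases in the return description reduce to an any/all test. A minimal sketch, assuming a row is represented as a dict keyed by column name (the documented method instead takes a numpy.ndarray row plus the table); meets_when_not_empty_sketch is a hypothetical name:

```python
from collections.abc import Sequence


def meets_when_not_empty_sketch(
    row: dict[str, str],
    when_not_empty: Sequence[str] = (),
    strict: bool = False,
) -> bool:
    """Return True when the row satisfies the when_not_empty constraint."""
    if not when_not_empty:
        # No reference columns configured: every row qualifies.
        return True
    populated = [row.get(name, "") != "" for name in when_not_empty]
    # strict requires every reference column; otherwise one is enough.
    return all(populated) if strict else any(populated)


row = {"b": "b1", "c": ""}
print(meets_when_not_empty_sketch(row, ["b", "c"]))               # True
print(meets_when_not_empty_sketch(row, ["b", "c"], strict=True))  # False
print(meets_when_not_empty_sketch(row))                           # True
```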

def set_column(self, value: str, row: numpy.ndarray, table: dict[str, list[str]]) ‑> numpy.ndarray

Set the value of self.column in the supplied row to the supplied value.

Args

value : str: the new value for self.column
row : np.ndarray: the row to check
table: dict of lists of strings representing table rows

Returns

np.ndarray: the updated row

def set_it(self, row: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Returns True if column is empty but the when_not_empty check is True. Used for defaulting and cascading.

Args

row : np.ndarray: the row to check
table: dict of lists of strings representing table rows

Returns

bool: True if this row's value should be set to the default.

class RollupCascadeManager (*args: RollupCascadeSpec, collapse_all: bool = True)

Container for the set of rollup and cascade settings for all of the tables in a given section. rollup_casc_ref_instance[table_name] returns the combined rollup and cascade settings from all specs in which the specified table appears.

Args

*args : tuple[RollupCascadeSpec, …]: RollupCascadeSpec instances shared between a common set of tables (i.e. those of a section)

KwArgs

collapse_all : bool: If True, all tables will have a default rollup operation that rolls up any row with more blanks than the previous row. Default is True. Replaces legacy table_utils.collapse_dict() functionality.

Methods

def cascade(self, table_name: str, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Perform all cascading operations defined for the supplied table_name.

Args

table_name: name of the table currently processing
table: dict of lists of strings representing table rows

Returns

dict[str, list[str]]: dict of lists of strings representing table rows with values cascaded as specified in cascade_columns

def find_cascades(self, table_name: str, table: dict[str, list[str]]) ‑> list[CascadeColumn]

Finds the CascadeColumns defined for this table_name and filters them according to the column names in header. See self._column_check().

Args

table_name: name of the table currently processing
table: dict of lists of strings representing table rows

Returns

list[CascadeColumn]: the cascades to perform for this table instance.

def find_rollups(self, table_name: str, pre_cascade: bool, table: dict[str, list[str]]) ‑> list[RollupColumn]

Finds the RollupColumns defined for this table_name and filters them according to the column names in header (see self._column_check()) and whether or not cascade operations have already occurred.

Args

table_name: name of the table currently processing
pre_cascade: bool indicating whether cascades have occurred.
table: dict of lists of strings representing table rows

Returns

list[RollupColumn]: the rollups to perform for this table instance.

def rollup(self, table_name: str, table: dict[str, list[str]], pre_cascade: bool = True) ‑> dict[str, list[str]]

Perform all rollup operations defined for the supplied table_name and cascade status (pre or post).

Args

table_name: name of the table currently processing
table: dict of lists of strings representing table rows
pre_cascade : bool: indicates whether this operation is occurring before or after its sister operation cascade_column_values()

Returns

dict[str, list[str]]: dict of lists of strings representing table rows with values rolled up as specified in rollup_columns

def rollup_cascade_rollup(self, table_name: str, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Performs pre cascade rollup, cascade, and post cascade rollup in series.

Args

table_name: name of the table currently processing
table: dict of lists of strings representing table rows

Returns

dict[str, list[str]]: dict of lists of strings representing table rows with values rolled up and cascaded as specified by this manager
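The "in series" wording suggests a three-step pipeline built from the rollup() and cascade() methods documented above. The sketch below shows that assumed ordering using a hypothetical stand-in class, RecordingManager, which only records calls; it is not the real RollupCascadeManager, and the real method may do more.

```python
class RecordingManager:
    """Hypothetical stand-in that records the order in which steps are called."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, bool | None]] = []

    def rollup(self, table_name, table, pre_cascade=True):
        self.calls.append(("rollup", pre_cascade))
        return table

    def cascade(self, table_name, table):
        self.calls.append(("cascade", None))
        return table


def rollup_cascade_rollup_sketch(manager, table_name, table):
    """Assumed order: pre-cascade rollup, then cascade, then post-cascade rollup."""
    table = manager.rollup(table_name, table, pre_cascade=True)
    table = manager.cascade(table_name, table)
    return manager.rollup(table_name, table, pre_cascade=False)


manager = RecordingManager()
rollup_cascade_rollup_sketch(manager, "table2", {"a": ["a1"]})
print(manager.calls)
# [('rollup', True), ('cascade', None), ('rollup', False)]
```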

class RollupCascadeSpec (*args: CascadeColumn | RollupColumn, tables: collections.abc.Sequence[str] = ())

Settings for the rollup_column_values and cascade_column_values functions.

Args

tables : Sequence[str]: the names of the tables to which the settings should be applied. If empty, settings are applied to ALL tables.
column_specs : list[CascadeColumn]: list of CascadeColumn or RollupColumn instances (note that RollupColumn inherits from CascadeColumn) to associate with the listed tables, or with all tables if tables is empty.
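The table-matching rule (an empty tables setting applies to ALL tables) can be sketched as follows. spec_applies and the specs mapping are hypothetical and stand in for whatever lookup the package performs when combining specs for a table name.

```python
from collections.abc import Sequence


def spec_applies(tables: Sequence[str], table_name: str) -> bool:
    """A spec with no tables listed applies everywhere; otherwise match by name."""
    return not tables or table_name in tables


# Hypothetical spec names mapped to their `tables` setting.
specs = {
    "spec1": ["table1", "table2"],
    "spec2": ["table2", "table3"],
    "spec_all": [],
}
matching = [name for name, tables in specs.items() if spec_applies(tables, "table2")]
print(matching)  # ['spec1', 'spec2', 'spec_all']
```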

Instance variables

prop cascade_columns : list[CascadeColumn]: Settings for columns whose values should be cascaded to subsequent rows when one or more reference columns has a value.
prop rollup_columns : list[RollupColumn]: Settings for columns whose values should be rolled up into the first row in which one or more reference columns is not empty.

class RollupColumn (*, column: str, when_not_empty: collections.abc.Sequence[str] = (), default_value: str = '', strict: bool = False, custom_trigger: collections.abc.Callable[[numpy.ndarray, numpy.ndarray, collections.abc.Iterable[str]], bool] | None = None, pre_cascade: bool = True, post_cascade: bool = True, join_function: collections.abc.Callable[[tuple[str, str]], str] = <built-in method join of str object>)

Specify a column whose values should be rolled up into the first row in which one or more reference columns is not empty.

KwArgs

column : str: the column whose value should be rolled up
when_not_empty : Sequence[str]: the columns that trigger the rollup when not empty.
default_value : str: the value to use when no value is present in the rollup column. Default is "".
strict : bool: if True, ALL when_not_empty columns must be populated to trigger a rollup. Default is False.
pre_cascade : bool: if True, the rollup is performed before column values have been cascaded. Default is True.
post_cascade : bool: if True, the rollup is performed after column values have been cascaded. Default is True.
join_function : Callable[[tuple[str, str]], str]: method to use to join the values when a rollup is triggered. Defaults to " ".join.
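For illustration, the default join and one possible replacement are shown below. join_skip_blanks is a hypothetical function; the only assumption about the package is the documented call shape, a tuple of two strings in and one string out.

```python
# The documented default: join the two values with a single space.
default_join = " ".join
print(default_join(("d2", "d3")))  # d2 d3


def join_skip_blanks(values: tuple[str, str]) -> str:
    """Hypothetical alternative: join with '; ' and drop empty values."""
    return "; ".join(value for value in values if value)


print(join_skip_blanks(("d2", "d3")))  # d2; d3
print(join_skip_blanks(("d2", "")))    # d2
```

A function like this would presumably be supplied as join_function=join_skip_blanks on a RollupColumn.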

Ancestors

CascadeColumn

Class variables

var post_cascade : bool
var pre_cascade : bool

Methods

def join_function(iterable, /) ‑> collections.abc.Callable[[tuple[str, str]], str]

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'

def roll_it(self, row_a: numpy.ndarray, row_b: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Check two subsequent table rows to see if row_b should be rolled into row_a.

Triggers rolling behavior when:
- row_a meets our when_not_empty constraint, and
- row_b does NOT meet our when_not_empty constraint, and
- row_a has a value in the target column, and
- row_b has a value in the target column,
** OR **
- only the target column is populated in both this and next row

Args

row_a : np.ndarray: the first row to check
row_b : np.ndarray: the second row to check
table: dict of lists of strings representing table rows

Returns

bool: True if row_b should be rolled into row_a
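One way to read the trigger conditions is as a four-part conjunction OR-ed with the "only the target column is populated" case. The sketch below encodes that reading; the grouping of the conditions is an assumption, roll_it_sketch is a hypothetical function, and rows are modelled as dicts keyed by column name rather than numpy.ndarray.

```python
from collections.abc import Sequence


def roll_it_sketch(
    row_a: dict[str, str],
    row_b: dict[str, str],
    column: str,
    when_not_empty: Sequence[str] = (),
    strict: bool = False,
) -> bool:
    """Return True if row_b should be rolled into row_a (assumed reading)."""

    def meets(row: dict[str, str]) -> bool:
        if not when_not_empty:
            return True
        flags = [row.get(name, "") != "" for name in when_not_empty]
        return all(flags) if strict else any(flags)

    def only_target(row: dict[str, str]) -> bool:
        # True when `column` has a value and every other column is blank.
        return column in row and all(
            (value != "") == (name == column) for name, value in row.items()
        )

    both_have_target = row_a.get(column, "") != "" and row_b.get(column, "") != ""
    primary = meets(row_a) and not meets(row_b) and both_have_target
    return primary or (only_target(row_a) and only_target(row_b))


row_a = {"b": "b4", "c": "c1", "d": "d2"}
row_b = {"b": "", "c": "", "d": "d3"}
print(roll_it_sketch(row_a, row_b, "d", ["c"]))  # True
```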

def rollup(self, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Roll up values in the table for self.column per defined custom_trigger or when_not_empty column status.

Args

table: dict of lists of strings representing table rows

Returns

dict[str, list[str]]: table after rolling up self.column.

Inherited members

CascadeColumn:
- cascade
- cascade_it
- check_when_not_empty
- column_value
- header_check
- meets_when_not_empty
- set_column
- set_it