Package rollup_cascade

Defines the rollup_cascade subpackage for managing rollup and cascading operations when cleaning raw extracted tabular data.

Usage

>>> import rollup_cascade as rc
>>> rollup_cascade_manager = rc.RollupCascadeManager(
...     rc.RollupCascadeSpec(
...         rc.RollupColumn(
...             column="A", when_not_empty=["C"], default_value="--", post_cascade=False
...         ),
...         rc.CascadeColumn(column="C", when_not_empty=["B"]),
...         tables=["table1", "table2"],
...     ),
...     rc.RollupCascadeSpec(
...         rc.RollupColumn(column="D", when_not_empty=["C"], default_value="?"),
...         rc.CascadeColumn(column="E", when_not_empty=["F"]),
...         tables=["table2", "table3"],
...     ),
... )
>>> table_instance = {
...     "a": ["", "", "a2", "a3", "", ""],
...     "b": ["b1", "b2", "b3", "b4", "", ""],
...     "c": ["c1", "", "", "", "", ""],
...     "d": ["d1", "", "", "d2", "d3", "d4"],
... }
>>> table_name = "table2"
>>> cleaned = rollup_cascade_manager.rollup_cascade_rollup(table_name, table_instance)
>>> cleaned
{'a': ['--', '', 'a2', 'a3'],
'b': ['b1', 'b2', 'b3', 'b4'],
'c': ['c1', 'c1', 'c1', 'c1'],
'd': ['d1', '?', '?', 'd2 d3 d4']}

Sub-modules

rollup_cascade.column

Defines the CascadeColumn (base) and RollupColumn (derived) column classes.

rollup_cascade.manager

Defines the RollupCascadeManager class, the top-level manager for all cascade/rollup operations for all tables in a single section. See …

rollup_cascade.spec

Defines RollupCascadeSpec for associating a group of cascade/rollup operations with specific table(s).

Classes

class CascadeColumn (*, column: str, when_not_empty: collections.abc.Sequence[str] = (), default_value: str = '', strict: bool = False, custom_trigger: collections.abc.Callable[[numpy.ndarray, numpy.ndarray, collections.abc.Iterable[str]], bool] | None = None)

Specify a column whose value should be cascaded to subsequent rows when one or more reference columns has a value.

KwArgs

column : str
the column whose value should be cascaded
when_not_empty : Sequence[str]
the columns that trigger the cascade when not empty. If this setting itself is empty, the cascade is triggered for ALL rows. Default is ().
default_value : str
the value to cascade when no value has been captured for the cascade column. Default is "".
strict : bool
if True, ALL when_not_empty columns must be populated to trigger a cascade. Default is False.
custom_trigger : Callable | None
a callable accepting the current row value set, the next row value set, and an iterable of all column names, and returning a bool. Default is None.
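
A custom_trigger might look like the following minimal sketch. It assumes that both rows are indexable sequences of strings ordered like the column names (the documented signature uses numpy.ndarray, which indexes the same way). The function name and the "page" column are hypothetical and only serve the illustration.

```python
from collections.abc import Iterable, Sequence


def same_page_trigger(
    current_row: Sequence[str],
    next_row: Sequence[str],
    columns: Iterable[str],
) -> bool:
    """Hypothetical trigger: cascade only while both rows share a 'page' value."""
    names = list(columns)
    if "page" not in names:
        # Without the (assumed) 'page' column there is nothing to compare.
        return False
    position = names.index("page")
    return str(current_row[position]) == str(next_row[position])
```

Such a function would presumably be supplied through the custom_trigger keyword of CascadeColumn or RollupColumn.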

Subclasses

RollupColumn

Class variables

var column : str
var custom_trigger : collections.abc.Callable[[numpy.ndarray, numpy.ndarray, collections.abc.Iterable[str]], bool] | None
var default_value : str
var strict : bool
var when_not_empty : collections.abc.Sequence[str]

Instance variables

prop check_when_not_empty : collections.abc.Callable[[numpy.ndarray], bool]

Returns the callable for testing whether or not a row meets the when_not_empty constraint.

Methods

def cascade(self, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Cascade values in the table for self.column per defined custom_trigger or when_not_empty column status.

Args

table
dict of lists of strings representing table rows

Returns

dict[str, list[str]]
table after cascading self.column.
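
The cascading behavior described for cascade() can be pictured as a forward fill. The sketch below is an illustration under assumptions, not the package's implementation: it handles only the non-strict when_not_empty case, ignores custom_trigger, and the function name cascade_sketch is hypothetical.

```python
def cascade_sketch(table, column, when_not_empty=(), default_value=""):
    """Forward-fill blank cells of `column` in rows that meet the trigger."""
    filled = list(table[column])  # copy so the input table is not mutated
    last_good = default_value  # used until a real value has been captured
    for i, value in enumerate(filled):
        if value:
            last_good = value
            continue
        # An empty when_not_empty triggers for every row; otherwise any
        # populated reference column triggers (non-strict behavior).
        triggered = not when_not_empty or any(
            table[ref][i] for ref in when_not_empty
        )
        if triggered:
            filled[i] = last_good
    return {**table, column: filled}
```
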
def cascade_it(self, row_a: numpy.ndarray, row_b: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Returns True if value from row_a should be cascaded to row_b.

If self.custom_trigger is defined, call it with the supplied parameters and return the result. Otherwise, return self.set_it(row_b, table).

Args

row_a : np.ndarray
the last row with a good value.
row_b : np.ndarray
the target of the cascade, if triggered
table
dict of lists of strings representing table rows

Returns

bool
True if the last good value should be cascaded to the next row
def column_value(self, row: numpy.ndarray, table: dict[str, list[str]]) ‑> str

Retrieves the value of self.column from the supplied row.

Args

row : np.ndarray
the row to check
table
dict of lists of strings representing table rows

Returns

str
the value of self.column in the supplied row.
def header_check(self, table: dict[str, list[str]]) ‑> bool

Check the columns in self (column + when_not_empty) against those in the header of a real table to determine if this operation is relevant given the columns available.

Args

table
dict of lists of strings representing table rows

Returns

bool
True if this Cascade/RollupColumn is applicable given the columns available in the table instance.
def meets_when_not_empty(self, row: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Tests the when_not_empty constraint on this row of values.

Args

row : np.ndarray
the row to check
table
dict of lists of strings representing table rows

Returns

bool
True if when_not_empty is empty; or strict is False and a value is found at the position of any when_not_empty column; or strict is True and a value is present at the positions of ALL when_not_empty columns.
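
The three cases in the return description can be sketched with any/all. This is a minimal illustration that assumes a row is represented as a dict mapping column name to value, whereas the documented method takes a numpy.ndarray plus the table; the function name is hypothetical.

```python
def meets_when_not_empty_sketch(row, when_not_empty=(), strict=False):
    """Return True when the row satisfies the when_not_empty constraint."""
    if not when_not_empty:
        # No reference columns configured: every row qualifies.
        return True
    # strict requires ALL reference columns, otherwise ANY one is enough.
    check = all if strict else any
    return check(bool(row.get(name, "")) for name in when_not_empty)
```
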
def set_column(self, value: str, row: numpy.ndarray, table: dict[str, list[str]]) ‑> numpy.ndarray

Set the value of self.column in the supplied row to the supplied value.

Args

value : str
the new value for self.column
row : np.ndarray
the row to update
table
dict of lists of strings representing table rows

Returns

np.ndarray
the updated row
def set_it(self, row: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Returns True if column is empty but the when_not_empty check is True. Used for defaulting and cascading.

Args

row : np.ndarray
the row to check
table
dict of lists of strings representing table rows

Returns

bool
True if this row's value should be set to the default.
class RollupCascadeManager (*args: RollupCascadeSpec, collapse_all: bool = True)

Container for the set of rollup and cascade settings for all of the tables in a given section. rollup_casc_ref_instance[table_name] returns the combined rollup and cascade settings from all specs in which the specified table appears.

Args

*args : tuple[RollupCascadeSpec, …]
RollupCascadeSpec instances shared between a common set of tables (i.e. those of a section)

KwArgs

collapse_all : bool
If True, all tables will have a default rollup operation that rolls up any row with more blanks than the previous row. Default is True. Replaces legacy table_utils.collapse_dict() functionality.
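
How settings from several specs might be combined for one table can be sketched as follows. SpecSketch and operations_for are hypothetical stand-ins, and operations are represented as plain strings; the only behavior taken from this documentation is that a spec applies to its listed tables, or to ALL tables when its table list is empty.

```python
from dataclasses import dataclass


@dataclass
class SpecSketch:
    """Hypothetical stand-in for RollupCascadeSpec."""

    operations: list[str]
    tables: tuple[str, ...] = ()


def operations_for(specs, table_name):
    """Collect the operations of every spec that applies to table_name."""
    combined = []
    for spec in specs:
        # An empty `tables` means the spec applies to every table.
        if not spec.tables or table_name in spec.tables:
            combined.extend(spec.operations)
    return combined
```
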

Methods

def cascade(self, table_name: str, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Perform all cascading operations defined for the supplied table_name.

Args

table_name
name of the table currently processing
table
dict of lists of strings representing table rows

Returns

dict[str, list[str]]
dict of lists of strings representing table rows with values cascaded as specified in cascade_columns
def find_cascades(self, table_name: str, table: dict[str, list[str]]) ‑> list[CascadeColumn]

Finds the CascadeColumns defined for this table_name and filters them according to the column names in the table's header. See self._column_check().

Args

table_name
name of the table currently processing
table
dict of lists of strings representing table rows

Returns

list[CascadeColumn]
the cascades to perform for this table instance.
def find_rollups(self, table_name: str, pre_cascade: bool, table: dict[str, list[str]]) ‑> list[RollupColumn]

Finds the RollupColumns defined for this table_name and filters them according to the column names in the table's header (see self._column_check()) and whether or not cascade operations have already occurred.

Args

table_name
name of the table currently processing
pre_cascade
bool indicating whether the rollups are needed for the pass before cascades (True) or the pass after them (False).
table
dict of lists of strings representing table rows

Returns

list[RollupColumn]
the rollups to perform for this table instance.
def rollup(self, table_name: str, table: dict[str, list[str]], pre_cascade: bool = True) ‑> dict[str, list[str]]

Perform all rollup operations defined for the supplied table_name and cascade status (pre or post).

Args

table_name
name of the table currently processing
table
dict of lists of strings representing table rows
pre_cascade : bool
indicates whether this operation is occurring before or after its sister operation cascade_column_values()

Returns

dict[str, list[str]]
dict of lists of strings representing table rows with values rolled up as specified in rollup_columns
def rollup_cascade_rollup(self, table_name: str, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Performs pre cascade rollup, cascade, and post cascade rollup in series.

Args

table_name
name of the table currently processing
table
dict of lists of strings representing table rows

Returns

dict[str, list[str]]
dict of lists of strings representing table rows with values rolled up and cascaded as specified in this manager
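
Read literally, the description above amounts to chaining the manager's documented rollup() and cascade() methods. The sketch below shows that reading; the equivalence is an assumption, and RecordingManager is a hypothetical stand-in used only to make the call order visible.

```python
def rollup_cascade_rollup_sketch(manager, table_name, table):
    """Pre-cascade rollup, then cascade, then post-cascade rollup."""
    table = manager.rollup(table_name, table, pre_cascade=True)
    table = manager.cascade(table_name, table)
    return manager.rollup(table_name, table, pre_cascade=False)


class RecordingManager:
    """Hypothetical stub that records calls and returns tables unchanged."""

    def __init__(self):
        self.calls = []

    def rollup(self, table_name, table, pre_cascade=True):
        self.calls.append(("rollup", pre_cascade))
        return table

    def cascade(self, table_name, table):
        self.calls.append(("cascade",))
        return table
```
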
class RollupCascadeSpec (*args: CascadeColumn | RollupColumn, tables: collections.abc.Sequence[str] = ())

Settings for the rollup_column_values and cascade_column_values functions.

Args

tables : Sequence[str]
the names of the tables to which the settings should be applied. If empty, settings are applied to ALL tables.
*args : CascadeColumn | RollupColumn
CascadeColumn or RollupColumn instances (note that RollupColumn inherits from CascadeColumn) to associate with the listed tables, or with all tables if tables is empty.

Instance variables

prop cascade_columns : list[CascadeColumn]

Settings for columns whose values should be cascaded to subsequent rows when one or more reference columns has a value.

prop rollup_columns : list[RollupColumn]

Settings for columns whose values should be rolled up into the first row in which one or more reference columns is not empty.

class RollupColumn (*, column: str, when_not_empty: collections.abc.Sequence[str] = (), default_value: str = '', strict: bool = False, custom_trigger: collections.abc.Callable[[numpy.ndarray, numpy.ndarray, collections.abc.Iterable[str]], bool] | None = None, pre_cascade: bool = True, post_cascade: bool = True, join_function: collections.abc.Callable[[tuple[str, str]], str] = <built-in method join of str object>)

Specify a column whose values should be rolled up into the first row in which one or more reference columns is not empty.

KwArgs

column : str
the column whose value should be rolled up
when_not_empty : Sequence[str]
the columns that trigger the rollup when not empty.
default_value : str
the value to use when no value is present in the rollup column. Default is "".
strict : bool
if True, ALL when_not_empty columns must be populated to trigger a rollup. Default is False.
pre_cascade : bool
if True, the rollup is performed before column values have been cascaded. Default is True.
post_cascade : bool
if True, the rollup is performed after column values have been cascaded. Default is True.
join_function : Callable[[tuple[str, str]], str]
method to use to join the values when a rollup is triggered. Defaults to " ".join.
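
A replacement for the default " ".join could look like this minimal sketch. It assumes the function receives a tuple of strings, as the annotation suggests; the name join_nonblank and the " / " separator are hypothetical.

```python
def join_nonblank(values):
    """Join rolled-up values with ' / ', skipping blank entries."""
    return " / ".join(v.strip() for v in values if v and v.strip())
```

It would presumably be supplied through the join_function keyword of RollupColumn.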

Ancestors

CascadeColumn

Class variables

var post_cascade : bool
var pre_cascade : bool

Methods

def join_function(iterable, /) ‑> collections.abc.Callable[[tuple[str, str]], str]

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'

def roll_it(self, row_a: numpy.ndarray, row_b: numpy.ndarray, table: dict[str, list[str]]) ‑> bool

Check two subsequent table rows to see if row_b should be rolled into row_a.

Triggers rolling behavior when:

- row_a meets our when_not_empty constraint, and
- row_b does NOT meet our when_not_empty constraint, and
- row_a has a value in the target column, and
- row_b has a value in the target column,

** OR **

- only the target column is populated in both this and next row

Args

row_a : np.ndarray
the first row to check
row_b : np.ndarray
the second row to check
table
dict of lists of strings representing table rows

Returns

bool
True if row_b should be rolled into row_a
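
The roll_it conditions can be sketched as a boolean expression. This is one reading of the documented rules, assuming the four "and" conditions form one group and the "only the target column is populated" rule is the alternative. Rows are represented as dicts (the documented method takes numpy.ndarray rows plus the table), and the function name is hypothetical.

```python
def roll_it_sketch(row_a, row_b, column, when_not_empty=(), strict=False):
    """Return True if row_b should be rolled into row_a (assumed reading)."""

    def meets(row):
        # Same rule as the when_not_empty constraint: any column, or all if strict.
        if not when_not_empty:
            return True
        check = all if strict else any
        return check(bool(row.get(name, "")) for name in when_not_empty)

    def only_target(row):
        # The target column has a value and every other column is blank.
        others_blank = not any(v for k, v in row.items() if k != column)
        return bool(row.get(column, "")) and others_blank

    both_have_value = bool(row_a.get(column, "")) and bool(row_b.get(column, ""))
    continuation = meets(row_a) and not meets(row_b) and both_have_value
    return continuation or (only_target(row_a) and only_target(row_b))
```
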
def rollup(self, table: dict[str, list[str]]) ‑> dict[str, list[str]]

Roll up values in the table for self.column per defined custom_trigger or when_not_empty column status.

Args

table
dict of lists of strings representing table rows

Returns

dict[str, list[str]]
table after rolling up self.column.

Inherited members