cascade.meta#

class cascade.meta.Assessor(**kwargs)[source]#

The container for information about the people who were in charge of the labeling process, their experience, and other properties.

This is a dataclass, so any additional fields added will not be recorded. If it needs to be extended, create a new class instead.

__init__(id: str | None = None, position: str | None = None) None#
id: str | None = None#
position: str | None = None#
class cascade.meta.DataCard(**kwargs)[source]#

The container for information on a dataset. The set of fields here is general and can be extended by passing new keyword arguments to __init__.

Example

>>> from cascade.meta import DataCard, Assessor, LabelingInfo, DataRegistrator
>>> person = Assessor(id="0", position="Assessor")
>>> info = LabelingInfo(who=[person], process_desc="Labeling description")
>>> dc = DataCard(
...     name="Dataset",
...     desc="Example dataset",
...     source="Database",
...     goal="Every dataset should have a goal",
...     labeling_info=info,
...     size=100,
...     metrics={"quality": 100},
...     schema={"label": "value"},
...     custom_field="hello")
>>> dr = DataRegistrator('data_log.yml')
>>> dr.register(dc)
__init__(name: str | None = None, desc: str | None = None, source: str | None = None, goal: str | None = None, labeling_info: LabelingInfo | None = None, size: int | Tuple[int] | None = None, metrics: Dict[str, Any] | None = None, schema: Dict[Any, Any] | None = None, **kwargs: Any) None[source]#
Parameters:
  • name (Optional[str]) – The name of dataset

  • desc (Optional[str]) – Short description

  • source (Optional[str]) – The source of data. Can be URL or textual description of source

  • goal (Optional[str]) – The goal of the dataset: what should be achieved using this data?

  • labeling_info (Optional[LabelingInfo]) – An instance of the dataclass describing the labeling process

  • size (Union[int, Tuple[int], None]) – The number of items or the shape of the table; this can usually be determined automatically.

  • metrics (Optional[Dict[str, Any]]) – Dictionary with names and values of metrics. Any quality metrics can be included

  • schema (Optional[Dict[Any, Any]]) – Schema dictionary describing table datasets: their columns, data types, possible values, etc. Pandera schema objects can be used when converted into dictionaries

class cascade.meta.DataRegistrator(**kwargs)[source]#

A tool for tracking the lineage of datasets. It is useful if a dataset is not a static object and some of its properties change over time.

__init__(filepath: str, raise_on_fail: bool = False) None[source]#
Parameters:
  • filepath (str) – Path to the log file for HistoryLogger

  • raise_on_fail (bool, optional) – Whether to raise an exception when the logger fails to read the file; if False, only a warning is issued. By default False

register(card: DataCard) None[source]#

Each time this method is called, a new snapshot of the given card is recorded in the log. Call this method whenever the dataset changes - preferably automatically, for example from a data update script, or manually.

Parameters:

card (DataCard) – Container for all the info on the data; see the DataCard documentation for details.
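
Example

A sketch of repeated registration as the data changes; it reuses the names from the DataCard example above, and the card's fields are illustrative:

>>> dr = DataRegistrator('data_log.yml')
>>> dc = DataCard(name="Dataset", size=120)  # the dataset grew since the last snapshot
>>> dr.register(dc)  # records a new snapshot in the log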

class cascade.meta.DataleakValidator(train_ds: BaseDataset[T], test_ds: BaseDataset[T], hash_fn: Callable[[Any], str] | None = None, **kwargs: Any)[source]#
__init__(train_ds: BaseDataset[T], test_ds: BaseDataset[T], hash_fn: Callable[[Any], str] | None = None, **kwargs: Any) None[source]#

Checks if two datasets have identical items.

Calculates hash_fn to identify items. Uses Python's built-in hash by default, but this can be customized.

Parameters:
  • train_ds (BaseDataset[T]) – Train dataset

  • test_ds (BaseDataset[T]) – Test or evaluation dataset

  • hash_fn (Optional[Callable[[Any], str]]) – Hash function, by default None

Raises:

DataValidationException – If identical items are found

Example

>>> import numpy as np
>>> from cascade.data import Wrapper
>>> from cascade.meta import DataleakValidator, numpy_md5
>>> train_ds = Wrapper(np.zeros((5, 2)))
>>> test_ds = Wrapper(np.zeros((5, 2)))
>>> DataleakValidator(train_ds, test_ds, hash_fn=numpy_md5)
Traceback (most recent call last):
...
cascade.meta.validator.DataValidationException:
Train and test datasets have 25 repeating pairs
Train indices: 0, 0, 0, 0, 0 ... 4, 4, 4, 4, 4
Test indices: 0, 1, 2, 3, 4 ... 0, 1, 2, 3, 4
class cascade.meta.DiffViewer(path: str)[source]#

The dash-based server to view metadata and compare different snapshots using deepdiff.

It can work with Repos, Lines, and files that store version logs and the history of entities such as data registration logs.

__init__(path: str) None[source]#
serve(*args: Any, **kwargs: Any)[source]#
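
Example

A minimal sketch; the path here is hypothetical and may point to a Repo, a Line, or a version log file:

>>> from cascade.meta import DiffViewer
>>> DiffViewer("./repo").serve()  # starts the dash-based server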
class cascade.meta.HistoryViewer(container: Workspace | Repo | ModelLine, last_lines: int | None = None, last_models: int | None = None, update_period_sec: int = 3)[source]#

The tool that allows the user to visualize the training history of model versions. It shows how model metrics changed over time and how models with different hyperparameters depend on each other.

__init__(container: Workspace | Repo | ModelLine, last_lines: int | None = None, last_models: int | None = None, update_period_sec: int = 3) None[source]#
Parameters:
  • container (Union[Workspace, Repo, ModelLine]) – Container of models to be viewed

  • last_lines (int, optional) – Constrains the number of lines back from the last one to view

  • last_models (int, optional) – For each line, constrains the number of models back from the last one to view

  • update_period_sec (int, default is 3) – Update period in seconds

plot(metric: str, show: bool = False) Any[source]#

Plots training history of model versions using plotly.

Parameters:
  • metric (str) – The metric to plot; it should be present in the meta of at least one model in the repo

  • show (bool, optional) – Whether to show the figure in addition to returning it

serve(metric: str | None = None, **kwargs: Any) None[source]#

Runs dash-based server with HistoryViewer, updating plots in real-time.

Parameters:
  • metric (str, optional) – One of the metrics in the repo. May be left None and chosen later in the interface

  • **kwargs – Arguments for app.run_server(), for example port or host

Note

This feature needs dash to be installed.
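
Example

A usage sketch; repo is assumed to be an existing Repo whose models record an "accuracy" metric:

>>> from cascade.meta import HistoryViewer
>>> viewer = HistoryViewer(repo)
>>> fig = viewer.plot("accuracy")  # returns a plotly figure
>>> viewer.serve("accuracy", port=8050)  # starts the dash server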

class cascade.meta.LabelingInfo(**kwargs)[source]#

The container for information on the labeling process: the people involved, a description of the process, and documentation links.

This is a dataclass, so any additional fields added will not be recorded. If it needs to be extended, create a new class instead.

__init__(who: List[Assessor] | None = None, process_desc: str | None = None, docs: str | None = None) None#
docs: str | None = None#
process_desc: str | None = None#
who: List[Assessor] | None = None#
class cascade.meta.MetaValidator(dataset: BaseDataset[T], root: str | None = None, meta_fmt: Literal['.json', '.yml', '.yaml'] = '.json')[source]#

Standard validator that saves the dataset’s meta on the first run and checks if it is consistent on the following runs.

MetaValidator is a Modifier that checks data consistency across several pipeline runs. If the data processing pipeline consists of Cascade Datasets, it uses the meta of the whole pipeline to ensure that the data is unchanged.

The capabilities of this validator are only as powerful as the pipeline's meta, which is defined by extending get_meta methods.

Example

>>> from cascade.data import Modifier, Wrapper
>>> from cascade.meta import MetaValidator
>>> ds = Wrapper([1,2,3,4])  # Define dataset
>>> ds = Modifier(ds)  # Wrap some modifiers
>>> ds = Modifier(ds)
>>> MetaValidator(ds)  # Add validation by passing ds; the result is not assigned since it is not used to access data later

In this example on the first run validator saves meta of this pipeline, which looks something like this:

[{'len': 4, 'name': 'cascade.data.dataset.Modifier'},
 {'len': 4, 'name': 'cascade.data.dataset.Modifier'},
 {'len': 4, 'name': 'cascade.tests.number_dataset.NumberDataset'}]

On the second run of the pipeline it computes the pipeline's meta and then the meta's hash based on the names of the blocks. This is needed to check whether the pipeline structure has changed. If it finds that the pipeline has the same structure, the meta dicts are compared using deepdiff, and if everything matches it returns.

If the structure of the pipeline is different, it saves a new meta file.

__init__(dataset: BaseDataset[T], root: str | None = None, meta_fmt: Literal['.json', '.yml', '.yaml'] = '.json') None[source]#
Parameters:
  • dataset (BaseDataset[T]) – Dataset to validate

  • root (str, optional) – Path to the folder in which to store meta, by default './.cascade'

  • meta_fmt (str, optional) – Format of metadata files

Raises:

cascade.meta.DataValidationException – If the pipeline's meta is inconsistent with the saved one

class cascade.meta.MetaViewer(root: str, filt: Dict[Any, Any] | None = None)[source]#

The class to view all metadata in folders and subfolders.

__getitem__(index: int) List[Dict[Any, Any]][source]#
Returns:

meta – Meta object that was read from file

Return type:

Meta

__init__(root: str, filt: Dict[Any, Any] | None = None) None[source]#
Parameters:
  • root (str) – Path to the folder containing metadata files

  • filt (Dict, optional) – Dictionary that specifies which values should be present in the meta. For example, to find all models use filt={'type': 'model'}

See also

cascade.meta.MetaHandler

read(path: str) List[Dict[Any, Any]][source]#

Loads object from path

write(path: str, obj: Any) None[source]#

Dumps obj to path
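
Example

A minimal sketch; the root path is hypothetical and the filter follows the one described in __init__:

>>> from cascade.meta import MetaViewer
>>> mv = MetaViewer("./repo", filt={"type": "model"})
>>> meta = mv[0]  # meta read from the first matching file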

class cascade.meta.MetricViewer(repo: Repo | ModelLine, scope: int | str | slice | None = None)[source]#

Interface for viewing metrics in model meta files. It uses a Repo to extract the metrics of all models, if any. As metrics it uses data from the metrics field of the models' meta, and as parameters it uses the params field.

__getitem__(key: int | str | slice)[source]#

Sets the scope of the viewer after creation. Effectively creates a new viewer with a different scope.

__init__(repo: Repo | ModelLine, scope: int | str | slice | None = None) None[source]#
Parameters:
  • repo (Union[Repo, ModelLine]) – Repo object to extract metrics from

  • scope (Union[int, str, slice]) – Index or name of the line to view. Can also be set later using __getitem__
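
Example

A sketch of narrowing the scope after creation (repo and the line name are assumptions):

>>> from cascade.meta import MetricViewer
>>> mv = MetricViewer(repo)
>>> line_view = mv["00000"]  # new viewer scoped to a single line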

get_best_by(metric: str, maximize: bool = True) Model[source]#

Loads the best model by the given metric

Parameters:
  • metric (str) – Name of the metric

  • maximize (bool) – The direction of choosing the best model: True if greater is better, False if lower is better

Raises:
  • TypeError – If metric values cannot be sorted. If there is only one model in the repo, it is returned without error since no sorting is involved.
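
Example

A sketch, assuming the repo's models record an "accuracy" metric:

>>> from cascade.meta import MetricViewer
>>> best_model = MetricViewer(repo).get_best_by("accuracy", maximize=True)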

plot_table(show: bool = False)[source]#

Uses plotly to graphically show the table of metrics and parameters.

reload_table() None[source]#
serve(page_size: int = 50, include: List[str] | None = None, exclude: List[str] | None = None, **kwargs: Any) None[source]#

Runs dash-based server with interactive table of metrics and parameters

Parameters:
  • page_size (int, optional) – Size of the table in rows on one page

  • include (List[str], optional) – List of parameters or metrics to be added. Only they will be present in the table along with some defaults

  • exclude (List[str], optional) – List of parameters or metrics to be excluded from the table

  • **kwargs – Arguments of the dash app, for example ip or port
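
Example

A sketch of serving the interactive table (the excluded column name is an assumption):

>>> from cascade.meta import MetricViewer
>>> MetricViewer(repo).serve(page_size=20, exclude=["tag"])  # starts the dash app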

cascade.meta.numpy_md5(x: Any)[source]#

MD5-based hash function for numpy arrays; can be passed as hash_fn to DataleakValidator (see the example above).
class cascade.meta.AggregateValidator(dataset: BaseDataset[T], func: Callable[[BaseDataset[T]], bool], **kwargs)[source]#

This validator accepts an aggregate function that takes a Dataset and returns True or False

Example

>>> from cascade.data import Wrapper
>>> from cascade.meta import AggregateValidator
>>> ds = Wrapper([1, 2, 3, 4, 5])
>>> ds = AggregateValidator(ds, lambda x: len(x) == 5)
__init__(dataset: BaseDataset[T], func: Callable[[BaseDataset[T]], bool], **kwargs) None[source]#

Constructs a Modifier. A Modifier represents a step in a pipeline - some data transformation.

Parameters:

dataset (BaseDataset[T]) – A dataset to modify

class cascade.meta.DataValidationException[source]#

Raised when data validation fails

class cascade.meta.PredicateValidator(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any)[source]#

This validator accepts a function that is applied to each item in a dataset and returns True or False. Calls __getitem__ of all previous datasets in __init__.

Example

>>> from cascade.data import Wrapper
>>> from cascade.meta import PredicateValidator
>>> ds = Wrapper([1, 2, 3, 4, 5])
>>> ds = PredicateValidator(ds, lambda x: x < 6)
__init__(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any) None[source]#

Constructs a Modifier. A Modifier represents a step in a pipeline - some data transformation.

Parameters:

dataset (BaseDataset[T]) – A dataset to modify

class cascade.meta.Validator(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any)[source]#

Base class for validators. Defines basic __init__ structure

__init__(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any) None[source]#

Constructs a Modifier. A Modifier represents a step in a pipeline - some data transformation.

Parameters:

dataset (BaseDataset[T]) – A dataset to modify