cascade.meta#
- class cascade.meta.Assessor(**kwargs)[source]#
The container for info on the people who were in charge of the labeling process, their experience, and other properties.
This is a dataclass, so any additional fields will not be recorded if added. If it needs to be extended, please create a new class instead.
- __init__(id: str | None = None, position: str | None = None) None #
- id: str | None = None#
- position: str | None = None#
- class cascade.meta.DataCard(**kwargs)[source]#
The container for information on a dataset. The set of fields here is general and can be extended by passing new keywords to __init__.
Example
>>> from cascade.meta import DataCard, Assessor, LabelingInfo, DataRegistrator
>>> person = Assessor(id=0, position="Assessor")
>>> info = LabelingInfo(who=[person], process_desc="Labeling description")
>>> dc = DataCard(
...     name="Dataset",
...     desc="Example dataset",
...     source="Database",
...     goal="Every dataset should have a goal",
...     labeling_info=info,
...     size=100,
...     metrics={"quality": 100},
...     schema={"label": "value"},
...     custom_field="hello")
>>> dr = DataRegistrator('data_log.yml')
>>> dr.register(dc)
- __init__(name: str | None = None, desc: str | None = None, source: str | None = None, goal: str | None = None, labeling_info: LabelingInfo | None = None, size: int | Tuple[int] | None = None, metrics: Dict[str, Any] | None = None, schema: Dict[Any, Any] | None = None, **kwargs: Any) None [source]#
- Parameters:
name (Optional[str]) – The name of dataset
desc (Optional[str]) – Short description
source (Optional[str]) – The source of data. Can be URL or textual description of source
goal (Optional[str]) – The goal of the dataset: what should be achieved using this data?
labeling_info (Optional[LabelingInfo]) – Instance of the dataclass describing the labeling process
size (Union[int, Tuple[int], None]) – Number of items or the shape of the table; this can usually be determined automatically
metrics (Optional[Dict[str, Any]]) – Dictionary with names and values of metrics. Any quality metrics can be included
schema (Optional[Dict[Any, Any]]) – Schema dictionary describing table datasets: their columns, data types, possible values, etc. Pandera schema objects can be used when converted into dictionaries
- class cascade.meta.DataRegistrator(**kwargs)[source]#
A tool for tracking the lineage of datasets. It is useful when a dataset is not a static object and some of its properties change over time.
- __init__(filepath: str, raise_on_fail: bool = False) None [source]#
- Parameters:
filepath (str) – Path to the log file for HistoryLogger
raise_on_fail (bool, optional) – Whether to raise an exception (rather than issue a warning) when the logger fails to read the file, by default False
- register(card: DataCard) None [source]#
Each time this method is called, a new snapshot of the given card is written to the log. Call this method whenever the dataset changes: either automatically, for example from a data update script (the preferred way), or manually.
- Parameters:
card (DataCard) – Container for all the info on the data; see the DataCard documentation for details
- class cascade.meta.DataleakValidator(train_ds: BaseDataset[T], test_ds: BaseDataset[T], hash_fn: Callable[[Any], str] | None = None, **kwargs: Any)[source]#
- __init__(train_ds: BaseDataset[T], test_ds: BaseDataset[T], hash_fn: Callable[[Any], str] | None = None, **kwargs: Any) None [source]#
Checks if two datasets have identical items.
Computes hash_fn for each item to identify it. Uses python's built-in hash by default, but this can be customized.
- Parameters:
train_ds (BaseDataset[T]) – Train dataset
test_ds (BaseDataset[T]) – Test dataset
hash_fn (Callable[[Any], str], optional) – Hash function used to identify items; python's hash by default
- Raises:
DataValidationException – If identical items are found
Example
>>> from cascade.meta import DataleakValidator, numpy_md5
>>> from cascade.data import Wrapper
>>> import numpy as np
>>> train_ds = Wrapper(np.zeros((5, 2)))
>>> test_ds = Wrapper(np.zeros((5, 2)))
>>> DataleakValidator(train_ds, test_ds, hash_fn=numpy_md5)
Traceback (most recent call last):
    ...
cascade.meta.validator.DataValidationException: Train and test datasets have 25 repeating pairs
Train indices: 0, 0, 0, 0, 0 ... 4, 4, 4, 4, 4
Test indices: 0, 1, 2, 3, 4 ... 0, 1, 2, 3, 4
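The underlying check can be sketched in plain Python. This is a conceptual sketch of the idea only, not DataleakValidator's actual code; find_leaks is a hypothetical helper:

```python
def find_leaks(train, test, hash_fn=hash):
    # Hash every item, then report (train, test) index pairs whose hashes collide
    train_hashes = [hash_fn(x) for x in train]
    test_hashes = [hash_fn(x) for x in test]
    pairs = [(i, j)
             for i, h in enumerate(train_hashes)
             for j, g in enumerate(test_hashes)
             if h == g]
    return pairs

train = [(0, 0), (1, 1), (2, 2)]
test = [(1, 1), (3, 3)]
print(find_leaks(train, test))  # [(1, 0)]
```

With identical 5x2 all-zeros datasets, as in the doctest above, every train item collides with every test item, which is where the 25 repeating pairs come from.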
- class cascade.meta.DiffViewer(path: str)[source]#
A dash-based server to view metadata and compare different snapshots using deepdiff.
It can work with Repos, Lines, and files that store version logs and the history of entities, such as data registration logs.
- class cascade.meta.HistoryViewer(container: Workspace | Repo | ModelLine, last_lines: int | None = None, last_models: int | None = None, update_period_sec: int = 3)[source]#
The tool which allows the user to visualize the training history of model versions. It shows how the metrics of models changed over time and how models with different hyperparameters depend on each other.
- __init__(container: Workspace | Repo | ModelLine, last_lines: int | None = None, last_models: int | None = None, update_period_sec: int = 3) None [source]#
- Parameters:
container (Union[Workspace, Repo, ModelLine]) – Container of models to be viewed
last_lines (int, optional) – Constrains the number of lines, counting back from the last one, to view
last_models (int, optional) – For each line, constrains the number of models, counting back from the last one, to view
update_period_sec (int, default is 3) – Update period in seconds
- plot(metric: str, show: bool = False) Any [source]#
Plots training history of model versions using plotly.
- Parameters:
metric (str) – Metric to plot; it should be present in the meta of at least one model in the repo
show (bool, optional) – Whether to show the figure in addition to returning it
- serve(metric: str | None = None, **kwargs: Any) None [source]#
Runs dash-based server with HistoryViewer, updating plots in real-time.
- Parameters:
metric (str, optional) – One of the metrics in the repo. May be left None and chosen later in the interface
**kwargs – Arguments for app.run_server(), for example port or host
Note
This feature requires dash to be installed.
- class cascade.meta.LabelingInfo(**kwargs)[source]#
The container for information on the labeling process: the people involved, a description of the process, and links to documentation.
This is a dataclass, so any additional fields will not be recorded if added. If it needs to be extended, please create a new class instead.
- __init__(who: List[Assessor] | None = None, process_desc: str | None = None, docs: str | None = None) None #
- docs: str | None = None#
- process_desc: str | None = None#
- who: List[Assessor] | None = None#
- class cascade.meta.MetaValidator(dataset: BaseDataset[T], root: str | None = None, meta_fmt: Literal['.json', '.yml', '.yaml'] = '.json')[source]#
Standard validator that saves the dataset’s meta on the first run and checks if it is consistent on the following runs.
MetaValidator is a Modifier that checks data consistency across several runs of a pipeline. If the data processing pipeline consists of cascade Datasets, it uses the meta of the whole pipeline to ensure that the data is unchanged. The capabilities of this validator are as powerful as the pipeline's meta, which is defined by extending get_meta methods.
Example
>>> from cascade.data import Modifier, Wrapper
>>> from cascade.meta import MetaValidator
>>> ds = Wrapper([1, 2, 3, 4])  # Define dataset
>>> ds = Modifier(ds)  # Wrap some modifiers
>>> ds = Modifier(ds)
>>> MetaValidator(ds)  # Add validation by passing ds, but with no assigning, to use data later
In this example on the first run validator saves meta of this pipeline, which looks something like this:
>>> [{'len': 4, 'name': 'cascade.data.dataset.Modifier'},
>>>  {'len': 4, 'name': 'cascade.data.dataset.Modifier'},
>>>  {'len': 4, 'name': 'cascade.tests.number_dataset.NumberDataset'}]
On the second run of the pipeline it computes the pipeline's meta and then the meta's hash, based on the names of the blocks. This is needed to check whether the pipeline structure has changed. If it finds that the pipeline has the same structure, the meta dicts are compared using deepdiff and, if everything is OK, it returns. If the structure of the pipeline is different, it saves a new meta file.
- __init__(dataset: BaseDataset[T], root: str | None = None, meta_fmt: Literal['.json', '.yml', '.yaml'] = '.json') None [source]#
- Parameters:
dataset (BaseDataset[T]) – Dataset to validate
root (str, optional) – Path to the folder in which to store meta; default is './.cascade'
meta_fmt (str, optional) – Format of metadata files
- class cascade.meta.MetaViewer(root: str, filt: Dict[Any, Any] | None = None)[source]#
The class to view all metadata in folders and subfolders.
- __getitem__(index: int) List[Dict[Any, Any]] [source]#
- Returns:
meta – Meta object that was read from file
- Return type:
Meta
- __init__(root: str, filt: Dict[Any, Any] | None = None) None [source]#
- Parameters:
root (str) – Path to the folder containing metadata files
filt (Dict, optional) – Dictionary that specifies which values should be present in the meta; for example, to find all models use filt={'type': 'model'}
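The filtering behaviour can be illustrated with a small sketch. This is conceptual only, not MetaViewer's actual code; the match helper below is hypothetical:

```python
def match(meta, filt):
    # A meta dict matches when every key-value pair in filt is present in it
    return all(meta.get(key) == value for key, value in filt.items())

metas = [
    {"type": "model", "name": "linear"},
    {"type": "dataset", "name": "numbers"},
    {"type": "model", "name": "forest"},
]

# filt={'type': 'model'} keeps only the model metas
models = [m for m in metas if match(m, {"type": "model"})]
print([m["name"] for m in models])  # ['linear', 'forest']
```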
See also
cascade.meta.MetaHandler
- class cascade.meta.MetricViewer(repo: Repo | ModelLine, scope: int | str | slice | None = None)[source]#
Interface for viewing metrics in model meta files. Uses Repo to extract the metrics of all models, if any. As metrics it uses data from the metrics field in models' meta, and as parameters it uses the params field.
- __getitem__(key: int | str | slice)[source]#
Sets the scope of the viewer after creation by creating a new viewer with the given scope.
- __init__(repo: Repo | ModelLine, scope: int | str | slice | None = None) None [source]#
- Parameters:
repo (Repo) – Repo object to extract metrics from
scope (Union[int, str, slice], optional) – Index or name of the line to view. Can be set using __getitem__
- get_best_by(metric: str, maximize: bool = True) Model [source]#
Loads the best model by the given metric
- Parameters:
metric (str) – Name of the metric
maximize (bool) – The direction of choosing the best model: True if greater is better and False if less is better
- Raises:
TypeError – If metric values cannot be sorted. If there is only one model in the repo, it is returned without error, since no sorting is involved.
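The selection logic can be sketched as follows. This is a conceptual sketch assuming models' metrics are stored as dicts in their meta, not MetricViewer's actual code; get_best_by here is a standalone function, not the method itself:

```python
def get_best_by(models, metric, maximize=True):
    # Keep only models that report the requested metric
    candidates = [m for m in models if metric in m["metrics"]]
    if len(candidates) == 1:
        # A single candidate is returned as-is: no sorting is involved
        return candidates[0]
    # max/min raise TypeError if the metric values cannot be compared
    choose = max if maximize else min
    return choose(candidates, key=lambda m: m["metrics"][metric])

models = [
    {"name": "a", "metrics": {"acc": 0.7}},
    {"name": "b", "metrics": {"acc": 0.9}},
    {"name": "c", "metrics": {"loss": 0.1}},
]
print(get_best_by(models, "acc")["name"])                   # b
print(get_best_by(models, "loss", maximize=False)["name"])  # c
```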
- plot_table(show: bool = False)[source]#
Uses plotly to graphically show a table of metrics and parameters.
- serve(page_size: int = 50, include: List[str] | None = None, exclude: List[str] | None = None, **kwargs: Any) None [source]#
Runs dash-based server with interactive table of metrics and parameters
- Parameters:
page_size (int, optional) – Size of the table in rows per page
include (List[str], optional) – List of parameters or metrics to include. Only these will be present, along with some defaults
exclude (List[str], optional) – List of parameters or metrics to exclude from the table
**kwargs – Arguments for the dash app, for example ip or port
- class cascade.meta.AggregateValidator(dataset: BaseDataset[T], func: Callable[[BaseDataset[T]], bool], **kwargs)[source]#
This validator accepts an aggregate function that takes a Dataset and returns True or False.
Example
>>> from cascade.data import Wrapper
>>> from cascade.meta import AggregateValidator
>>> ds = Wrapper([1, 2, 3, 4, 5])
>>> ds = AggregateValidator(ds, lambda x: len(x) == 5)
- __init__(dataset: BaseDataset[T], func: Callable[[BaseDataset[T]], bool], **kwargs) None [source]#
Constructs a Modifier. Modifier represents a step in a pipeline - some data transformation
- Parameters:
dataset (BaseDataset[T]) – A dataset to modify
- class cascade.meta.PredicateValidator(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any)[source]#
This validator accepts a function that is applied to each item in a dataset and returns True or False. It calls __getitem__ of all previous datasets in __init__.
Example
>>> from cascade.data import Wrapper
>>> from cascade.meta import PredicateValidator
>>> ds = Wrapper([1, 2, 3, 4, 5])
>>> ds = PredicateValidator(ds, lambda x: x < 6)
- __init__(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any) None [source]#
Constructs a Modifier. Modifier represents a step in a pipeline - some data transformation
- Parameters:
dataset (BaseDataset[T]) – A dataset to modify
- class cascade.meta.Validator(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any)[source]#
Base class for validators. Defines the basic __init__ structure.
- __init__(dataset: BaseDataset[T], func: Callable[[T], bool] | List[Callable[[T], bool]], **kwargs: Any) None [source]#
Constructs a Modifier. Modifier represents a step in a pipeline - some data transformation
- Parameters:
dataset (BaseDataset[T]) – A dataset to modify