xaib.metrics.example_selection¶
- class xaib.metrics.example_selection.covariate_regularity.CovariateRegularity(ds: Dataset, model: Model, *args: Any, **kwargs: Any)[source]¶
Measures how noisy the features in the examples provided by the method are.
Simpler explanations are considered better. This is measured by the average Shannon entropy over batch-normalized explanations (see the sketch below).
The lower the better
Worst case: a constant explainer that gives examples where every feature has the same importance, equal to 1/N, where N is the number of features
Best case: a constant explainer that gives examples with one feature at its maximum value and all others zero
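A minimal sketch of the entropy computation described above, assuming the explainer's examples arrive as a `(batch, features)` NumPy array; the normalization details are assumptions and this is not the library's implementation.

```python
import numpy as np

def average_example_entropy(examples: np.ndarray, eps: float = 1e-12) -> float:
    """Average Shannon entropy of batch-normalized examples (lower is better)."""
    # Batch-normalize every feature column
    normed = (examples - examples.mean(axis=0)) / (examples.std(axis=0) + eps)
    # Treat each example's absolute feature values as a probability distribution
    probs = np.abs(normed) + eps
    probs /= probs.sum(axis=1, keepdims=True)
    # Shannon entropy per example, averaged over the batch
    return float((-(probs * np.log(probs)).sum(axis=1)).mean())
```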
- class xaib.metrics.example_selection.model_randomization_check.ModelRandomizationCheck(ds: Dataset, model: Model, noisy_model: Model, **kwargs: Any)[source]¶
Model randomization check is a sanity check. To ensure that the model actually influences the explanations, the model is changed, and the explanations are expected not to stay the same when the model changes. This check uses random model baselines instead of the same model with randomized internal states. Explanations on the original data are then obtained with the randomized model and compared with the explanations produced by the original model by counting how many examples were the same for the same data points (see the sketch below).
The lower the better
Worst case: the explanations are the same, i.e. a constant explainer
Best case: reached when the explanations are the opposite, i.e. the distance between them is maximized
The problem with this kind of metric is its maximization. It seems redundant to maximize it, because more different explanations on random states do not mean that the explanations are more correct. It is difficult to define a best-case explainer here: the metric has no maximum value.
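The comparison step can be sketched as follows; this assumes each explainer returns the index of the selected example for every input, and the function name is hypothetical rather than part of the xaib API.

```python
import numpy as np

def randomization_agreement(original_idx: np.ndarray, randomized_idx: np.ndarray) -> float:
    """Fraction of data points for which the randomized model still selects
    the same example as the original model (lower is better)."""
    return float(np.mean(original_idx == randomized_idx))
```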
- class xaib.metrics.example_selection.small_noise_check.SmallNoiseCheck(ds: Dataset, noisy_ds: Dataset, model: Model, *args: Any, **kwargs: Any)[source]¶
Since example selection methods are discrete in nature (an example is chosen from a finite set), the use of an RMSE measure may not be appropriate. This means that a metric similar to the one used in the feature importance case is not suitable. A more appropriate metric is the following: take the test dataset and add a small amount of noise to the items, as was done in the feature importance case. Then count the number of item pairs for which the example provided did not change and divide by the total number of items (see the sketch below). This ratio describes how continuous the example generator is: if it provides the same examples for slightly changed inputs, it is continuous.
The lower the better
Worst case: Constant explainer
Best case: Random explainer
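A rough sketch of the described procedure, assuming a hypothetical `select_example(x)` callable that returns the index of the example the explainer provides for an input `x`; this is not the library's actual interface.

```python
import numpy as np

def small_noise_ratio(X: np.ndarray, select_example, noise_scale: float = 0.01) -> float:
    """Share of items whose selected example does not change after adding
    a small amount of Gaussian noise (lower is better)."""
    X_noisy = X + np.random.normal(scale=noise_scale, size=X.shape)
    unchanged = sum(select_example(x) == select_example(xn) for x, xn in zip(X, X_noisy))
    return unchanged / len(X)
```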
- class xaib.metrics.example_selection.target_discriminativeness.TargetDiscriminativeness(ds, model, *args, **kwargs)[source]¶
Given the true labels and explanations in the form of examples, train another model to discriminate between the labels. The performance of this model on the examples describes the quality of the explanations. The performance can be measured by any metric, but it is better to adapt to imbalanced data and use, for example, the F1-measure (see the sketch below).
The greater the better
Best case: examples are descriptive of the labels, so the model reaches its best performance
Worst case: constant or random baseline, giving insufficient information to grasp the labels
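A minimal sketch of the idea using scikit-learn; the choice of classifier and the train/test split are assumptions, and `examples` is assumed to be a `(n_items, n_features)` array of explainer-provided examples with `y` the true labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def target_discriminativeness(examples: np.ndarray, y: np.ndarray) -> float:
    """Macro F1 of a simple discriminator trained on the examples (higher is better)."""
    X_tr, X_te, y_tr, y_te = train_test_split(examples, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Macro-averaged F1 accounts for class imbalance, as suggested above
    return f1_score(y_te, clf.predict(X_te), average="macro")
```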
- class xaib.metrics.example_selection.same_class_check.SameClassCheck(ds: Dataset, model: Model, **kwargs: Any)[source]¶
Counts how many times the class of the input and the class of the example produced are the same
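The counting step can be sketched as below, assuming the predicted class of each input and of its selected example are already available as arrays; the names are hypothetical.

```python
import numpy as np

def same_class_ratio(input_classes: np.ndarray, example_classes: np.ndarray) -> float:
    """Share of inputs whose selected example belongs to the same class."""
    return float(np.mean(input_classes == example_classes))
```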