# Covered Information Disentanglement: Model Transparency via Unbiased Permutation Importance

###### Abstract

Model transparency is a prerequisite in many domains and an increasingly popular area in machine learning research. In the medical domain, for instance, unveiling the mechanisms behind a disease often has higher priority than the diagnosis itself, since it might dictate or guide potential treatments and research directions. One of the most popular approaches to explaining global model predictions is permutation importance, where the performance on permuted data is benchmarked against the baseline. However, this method and other related approaches will undervalue the importance of a feature in the presence of covariates, since these cover part of the information it provides. To address this issue, we propose Covered Information Disentanglement (CID), a method that considers all feature information overlap to correct the values provided by permutation importance. We further show how to compute CID efficiently when coupled with Markov random fields. We demonstrate its efficacy in adjusting permutation importance first on a controlled toy dataset and discuss its effect on real-world medical data.

^{1}Amsterdam University Medical Center, Meibergdreef 9, 1105 AZ, Amsterdam, The Netherlands


## 1 Introduction

Understanding the biological underpinnings of disease is at the core of medical research. Model transparency and feature relevance are thus a top priority to discover new potential treatments or research directions.
One of the currently most popular methods to explain local model predictions is SHAP Lipovetsky and Conklin (2001); Štrumbelj and Kononenko (2014); Lundberg and Lee (2017), a game-theoretic approach that considers the features as “players” and measures their marginal contributions to all possible feature subset combinations. SHAP has also been generalized in SAGE Covert et al. (2020) to compute global feature importance. However, recent work by Kumar et al. (2020) exposes some mathematical issues with SHAP and concludes that this framework is ill-suited as a general solution to quantifying feature importance.
Other local-based methods such as LIME Ribeiro et al. (2016) and its variants (see e.g. Singh et al. (2016); Ribeiro et al. (2018); Guidotti et al. ; Pereira et al. (2019)) build weak yet explainable models on the neighborhood of each instance. While this achieves higher prediction transparency for each data point, in this work, we are mainly concerned with a more holistic view of importance, which may be more appropriate to guide new research directions and unravel disease mechanisms.
Tree-based methods are very commonly selected for this purpose because they compute the impurity or Gini importance Breiman (2001). The impurity importance, however, is biased in favor of variables with many possible split points, i.e. categorical variables with many categories or continuous variables Strobl et al. (2007).
A generally accepted alternative to the Gini importance is the permutation importance Fisher et al. (2018), which benchmarks the baseline performance against permuted data. There is, however, the issue of multicollinearity. When features are highly correlated, feature permutation will underestimate the individual importance of at least one of the features, since a great deal of the information provided by this feature is “covered” by its covariates. One option is to permute correlated features together Toloşi and Lengauer (2011). However, this implies choosing an arbitrary correlation grouping threshold or performing cross-validation to determine the optimal number of groups that yield the best estimator, resulting in slow running times. Most importantly, it forgoes differentiating their individual contributions to the final prediction. Motivated by the idea that there is an information overlap between different features, we develop Covered Information Disentanglement (CID),^{1} an information-theoretic approach to disentangle the shared information and scale the permutation importance values accordingly. We demonstrate how CID can recover the right importance ranking on artificial data and discuss its efficacy on the Cardiovascular Risk Prediction dataset Hoogeveen et al. (2020). ^{1}We make an implementation of CID publicly available at: https://github.com/JBPereira/CID.

## 2 Methodology

#### Notation

We denote matrices, 1-dimensional arrays, and scalars/functions with capital bold, bold, and regular text, respectively (e.g. $\mathbf{X}$, $\mathbf{x}$, and $x$). Given a dataset $\mathcal{D}$, we will denote its random variables by capital regular text with a subscript and the values using lowercase (e.g. $X_i$ and $x_i$), while the joint density/mass will be represented as $p(X_1, \dots, X_d)$. The expected loss of a function $f$, given by $\mathbb{E}\left[L(f(\mathbf{X}), Y)\right]$, will be denoted by $\ell_f$.

### Information Theory background

Information theory (IT) is a useful tool for quantifying relations between random variables. The basic building block in IT is the entropy of an r.v. $X$, which is defined as:

$$H(X) = -\mathbb{E}_{p(x)}\left[\log p(x)\right].$$

The joint entropy between r.v.s $X$ and $Y$ is defined as:

$$H(X, Y) = -\mathbb{E}_{p(x, y)}\left[\log p(x, y)\right].$$

The mutual information between r.v.s $X$ and $Y$ is the relative entropy between the joint distribution $p(x, y)$ and the product distribution $p(x)p(y)$:

$$I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)p(y)}\right].$$
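As a concrete illustration, these quantities can be estimated from empirical frequencies for discrete r.v.s via the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$; a minimal sketch (the helper names are ours, not from the paper):

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Plug-in estimate of H(X) = -sum p(x) log2 p(x), in bits."""
    n = len(samples)
    return -sum((c / n) * np.log2(c / n) for c in Counter(samples).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from aligned samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# A fair coin carries 1 bit of entropy; a copy of it shares the full bit.
x = [0, 1] * 500
print(round(entropy(x), 3))                # 1.0
print(round(mutual_information(x, x), 3))  # 1.0
```

Plug-in estimates like these are biased for small samples; the paper's MRF-based route below avoids estimating high-dimensional joints directly.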

Using the definitions above, one can derive properties that resemble those of set theory, where joint entropy and mutual information are the information-theoretic counterparts to union and intersection, respectively Ting (2008).
For a more thorough exposition of IT, the reader can refer to Cover and Thomas (2012). In order to keep this intuition when generalizing to higher dimensions, one can define the entropy of the union of features as:

###### Definition 2.1.

Multivariate Union Entropy

$$H\left(\bigcup_{i=1}^{d} X_i\right) := H\left(X_1, \dots, X_d\right),$$

and using the Inclusion-Exclusion principle, we can define the intersection as:

###### Definition 2.2.

Multivariate Intersection Entropy

$$H\left(\bigcap_{i=1}^{d} X_i\right) := \sum_{\emptyset \neq S \subseteq \{1, \dots, d\}} (-1)^{|S|+1} H\left(\bigcup_{i \in S} X_i\right) = \mathbb{E}_{p(\mathbf{x})}\left[\sum_{\emptyset \neq S \subseteq \{1, \dots, d\}} (-1)^{|S|+1} h\left(\mathbf{x}_S\right)\right],$$

where $h(\mathbf{x}_S) = -\log p(\mathbf{x}_S)$ is the local entropy.

This definition of multivariate intersection is also called co-information, and it may yield negative values. To see this, consider the case of three r.v.s $X$, $Y$, and $Z$ and suppose there is no correlation between $X$ and $Y$. If we rewrite the mutual information expression into $I(X; Y; Z) = I(X; Y) - I(X; Y \mid Z)$, then this expression may become negative when the information shared by $X$ and $Y$ given a fixed value of $Z$ is higher than that of $X$ and $Y$ alone. This can happen for instance if $X$ has no correlation with $Y$ but knowing $Z$ introduces a correlation between the two (what is commonly known as ‘explaining away’). This motivated Williams and Beer to draw the distinction between redundant and synergistic information and propose partial information decomposition (PID) Williams and Beer (2010). Ince (2017) thoroughly analyzed the multivariate properties of PID applied directly to multivariate entropy and suggested dividing the individual terms in definition 2.2, so that positive local entropy terms correspond to redundant entropy, while the negative ones correspond to synergistic entropy.
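Negative co-information is easy to reproduce numerically. The sketch below (helper names are ours) evaluates the three-variable inclusion-exclusion expression on an XOR relation, where the two inputs are independent and all information about the output is synergistic:

```python
import numpy as np
from collections import Counter
from itertools import product

def H(*vars_):
    """Empirical joint entropy of one or more aligned sample arrays, in bits."""
    joint = list(zip(*vars_))
    n = len(joint)
    return -sum((c / n) * np.log2(c / n) for c in Counter(joint).values())

# X, Y independent fair bits; Z = X XOR Y (an 'explaining away' style synergy).
xy = list(product([0, 1], repeat=2)) * 250
x = [a for a, b in xy]
y = [b for a, b in xy]
z = [a ^ b for a, b in xy]

# Co-information via inclusion-exclusion over all non-empty subsets.
co_info = H(x) + H(y) + H(z) - H(x, y) - H(x, z) - H(y, z) + H(x, y, z)
print(round(co_info, 3))  # -1.0: one full bit of purely synergistic information
```

Here $I(X;Y) = 0$ but $I(X;Y \mid Z) = 1$ bit, so the co-information is exactly $-1$ bit.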

### Permutation Feature Importance

Feature importance is a subjective notion that may vary with application.
Consider a supervised learning task where a model $f$ is trained/tested on dataset $\mathcal{D}$ and its performance is measured by a function $s$.
In this work, we will refer to feature importance as the extent to which a feature affects $s$, on its own and through its interactions with the remaining features.
Permutation importance was first introduced by Breiman (2001) in random forests as a way to understand how the interaction of variables provides the predictive accuracy.

Consider a dataset $\mathbf{X} \in \mathbb{R}^{n \times d}$ and denote the $i$th instance of the $j$th feature by $x_{ij}$.
Suppose the set $\{1, \dots, n\}$ is sampled and denote the subsample by $B$. Consider further a random permutation of this subset, which we denote by $\pi(B)$, and its $i$th element by $\pi(B)_i$. The permutation importance of feature $j$, $PI_j$, is given by:

$$PI_j = \mathbb{E}\left[L\left(f(X_1, \dots, \tilde{X}_j, \dots, X_d), Y\right)\right] - \mathbb{E}\left[L\left(f(X_1, \dots, X_d), Y\right)\right], \tag{1}$$

where $\tilde{X}_j$ denotes the permuted feature, estimated empirically as:

$$\widehat{PI}_j = \frac{1}{|B|} \sum_{i \in B} \left[ L\left(f(x_{i1}, \dots, x_{\pi(B)_i j}, \dots, x_{id}), y_i\right) - L\left(f(x_{i1}, \dots, x_{ij}, \dots, x_{id}), y_i\right) \right]. \tag{2}$$
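The permutation importance described above can be sketched in a few lines of Python; the model, score, and data-generating process below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Feature 0 dominates the target, feature 1 contributes weakly, feature 2 is noise.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(random_state=0).fit(X, y)
baseline = r2_score(y, model.predict(X))

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature-target link
    importances.append(baseline - r2_score(y, model.predict(Xp)))

print([round(v, 2) for v in importances])  # feature 0 dominates, feature 2 near 0
```

With independent features this ranking is reliable; the next section addresses what goes wrong when features are highly correlated.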

### Covered Information Disentanglement

In the presence of covariates, the permutation importance measures the performance dip caused by removing the non-mutual information between the feature and the remaining data. That is:

$$PI_j = TI_j - CPI_j, \tag{3}$$

where $TI_j$ is the expected total importance of feature $j$ under model $f$ (the quantity we are interested in) and $CPI_j$ is the expected performance dip covered by all other variables.
To compute $TI_j$ directly would require applying the Inclusion-Exclusion principle and measuring the performance dip for all possible feature combinations of sizes up to the number of features. Instead, we note that $CPI_j$ intuitively measures the model performance dip when the model is deprived of the information covered by the r.v.s that are correlated with $X_j$. For an intuitive depiction of the problem, see figure 1.

Motivated by the analogy between set theory and information measures, we define the joint information between an r.v. $X_j$ and the target variable $Y$ that is “covered” by the other r.v.s as:

###### Definition 2.3.

Covered Information (CI). Given an r.v. $X_j$ and a set of distinct r.v.s $\mathbf{X}_{-j} = \{X_k\}_{k \neq j}$, the information of $X_j$ w.r.t. $Y$ covered by $\mathbf{X}_{-j}$ is defined as:

$$CI_j := H\left(\left(X_j \cap Y\right) \cap \bigcup_{k \neq j} X_k\right).$$

When it is clear from the context what $Y$ and $\mathbf{X}_{-j}$ are, we will abbreviate the covered information into $CI_j$, denote the mutual information of $X_j$ with $Y$ by $MI_j$, and the respective local entropy terms for the $i$th row in the dataset by $ci_{ij}$ and $mi_{ij}$. We further divide $CI_j$ and $MI_j$ into their redundant and synergistic counterparts, which for a specific sample are given by the positive and negative local entropy terms, respectively.

###### Assumption 2.1.

Permutation importance and entropy terms are related through a map $g$, such that $PI_j = g\left(MI_j - CI_j\right) + \epsilon_j$, where $\epsilon_j$ is an error term.

Thus, if assumption 2.1 holds, we can use the information of $X_j$ w.r.t. $Y$ covered by $\mathbf{X}_{-j}$ and approximate equation 3 with:

$$TI_j \approx PI_j + g\left(MI_j\right) - g\left(MI_j - CI_j\right). \tag{4}$$

This means we can approximate the result of permuting all possible combinations of features by computing only the single-feature permutation loss and the covered information of r.v. $X_j$ by all the others. Here, we are implicitly defining $TI_j := g(MI_j) + \epsilon_j$, and thus the true importance on the performance-difference scale is given by mapping the entropy values when there is no redundant entropy to the space of performance differences.

Since we are predicting the feature importance using a map between entropy terms (which measure model-agnostic importance) and permutation importance values, the end result depends only on how learnable the model behavior is w.r.t. entropy. Moreover, since the entropy values are computed for the different subsample sets $B$, the overall importance variability is also estimated.
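In practice, the adjustment amounts to learning a map from uncovered entropy to permutation importance and then evaluating it at the full mutual information. A toy sketch of this idea using scikit-learn's BayesianRidge as the map; all per-feature numbers are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Hypothetical per-feature quantities (illustrative numbers only):
mi = np.array([0.9, 0.8, 0.3, 0.1])      # mutual information with the target
ci = np.array([0.7, 0.6, 0.0, 0.0])      # information covered by covariates
pi = np.array([0.05, 0.04, 0.20, 0.06])  # measured permutation importance

# Learn the map from uncovered information (mi - ci) to permutation
# importance, then apply it to the full information mi to estimate the
# total importance on the performance-difference scale.
g = BayesianRidge().fit((mi - ci).reshape(-1, 1), pi)
true_importance = g.predict(mi.reshape(-1, 1))
print(np.round(true_importance, 3))
```

In this toy setting, the two heavily covered features (high `ci`) recover a large estimated importance despite their small measured permutation importance.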

There is still the issue of computing $CI_j$, since it involves computing the joint density $p(X_1, \dots, X_d, Y)$. Since directionality is irrelevant for the purpose of computing overlapping information, we suggest modeling $p$ using an undirected graphical model (UGM). Let $G = (V, E)$ denote a graph with $d$ nodes, corresponding to the features, and let $\mathcal{C}$ be a set of cliques (fully-connected subgraphs) of the graph $G$. Denoting a set of clique-potential functions by $\{\phi_c\}_{c \in \mathcal{C}}$, the distribution of a Markov random field (MRF) Koller and Friedman (2009) is given by: $p(\mathbf{x}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(\mathbf{x}_c)$, where $Z$ is the partition function. By the Hammersley-Clifford theorem, any distribution that can be represented in this way satisfies $p(x_i \mid \mathbf{x}_{V \setminus \{i\}}) = p(x_i \mid \mathbf{x}_{N(i)})$ for any $i$, where $N(i)$ is the set of neighbors of node $i$. This allows us to significantly simplify the expression of covered information, yielding the main result of this paper:
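The Hammersley-Clifford factorization can be checked by brute force on a tiny chain-structured MRF; in the sketch below (potential values chosen arbitrarily for illustration), conditioning on the separating middle variable makes the end node independent of the far end:

```python
from itertools import product

# A tiny pairwise MRF over three binary variables on a chain A - B - C.
# The pairwise potential favors agreement between neighbors (assumed values).
def phi(a, b):
    return 2.0 if a == b else 1.0

states = list(product([0, 1], repeat=3))
unnorm = {s: phi(s[0], s[1]) * phi(s[1], s[2]) for s in states}
Z = sum(unnorm.values())                 # partition function
p = {s: v / Z for s, v in unnorm.items()}

# Hammersley-Clifford: p(A | B, C) depends on B only, since B separates A and C.
p_a1_given_b0c0 = p[(1, 0, 0)] / (p[(0, 0, 0)] + p[(1, 0, 0)])
p_a1_given_b0c1 = p[(1, 0, 1)] / (p[(0, 0, 1)] + p[(1, 0, 1)])
print(round(p_a1_given_b0c0, 3), round(p_a1_given_b0c1, 3))  # 0.333 0.333
```

This locality is what lets the covered-information expression below depend only on the potentials of cliques neighboring $X_j$ and $Y$.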

###### Theorem 2.1.

Consider an r.v. $X_j$, a set of r.v.s $\mathbf{X}_{-j}$, and a response r.v. $Y$, as well as the set of r.v.s that are neighbors to both $X_j$ and $Y$: $\mathbf{X}_N = N(X_j) \cap N(Y)$. For a Markov random field, the covered information of $X_j$ by $\mathbf{X}_{-j}$ w.r.t. $Y$ is given by:

where $\mathbf{\Phi}$ is a matrix with the product of joint potential values for the set of cliques containing both $X_j$ and $Y$; $\Phi_{kl}$, $\Phi_{\cdot l}$, and $\Phi_{k \cdot}$ are an entry, column, and row of $\mathbf{\Phi}$, respectively; and $\boldsymbol{\phi}^{x_j}$ and $\boldsymbol{\phi}^{y}$ are arrays with the product of potential values for the sets of cliques containing $X_j$ or $Y$, with $x_j$ and $y$ fixed, respectively.

###### Proof.

Using definitions 2.1, 2.2, and 2.3:

The probability density for Markov random fields is equal to $\frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(\mathbf{x}_c)$, where $Z$ is the partition function and $\mathcal{C}$ is the set of cliques in the Markov network. Define two sets of cliques: $\mathcal{C}_{jy}$, the cliques containing $X_j$ or $Y$, and $\mathcal{C}_r = \mathcal{C} \setminus \mathcal{C}_{jy}$. In that case (ignoring the partition function term because it cancels out):

To compute these terms, fix the values $x_j$ and $y$:

where $\phi_c^{x_j}$ is the function $\phi_c$ for a fixed value $x_j$ of the r.v. $X_j$. Since the set of cliques $\mathcal{C}_{jy}$ involves only neighbors of $X_j$ or $Y$, and denoting by $\boldsymbol{\phi}^{x_j}$ and $\boldsymbol{\phi}^{y}$ the products of the functions $\phi_c$ for fixed values of $x_j$ and $y$, then:

where $\mathbf{x}_N$ is an instance of the set of r.v.s that are neighbors to either $X_j$ or $Y$; $\boldsymbol{\phi}^{x_j}$ and $\boldsymbol{\phi}^{y}$ are column arrays with the different values of the potential products for fixed $x_j$ and fixed $y$; $\mathbf{\Phi}$ is a matrix with all the joint potential values, with varying values of $x_j$ in the rows and of $y$ in the columns; and $\Phi_{k \cdot}$ and $\Phi_{\cdot l}$ are row and column vectors of $\mathbf{\Phi}$ corresponding to fixed $x_j$ and fixed $y$, respectively. This yields the result of the theorem. ∎

#### Considerations and simplifications

If a 2-clique (pairwise) MRF is chosen, then $\mathbf{\Phi}$ depends only on $X_j$ and $Y$, and can be computed before the expectation.

Gaussian MRF:
Learning an MRF’s network structure is expensive. One popular approach is to use graphical lasso Friedman et al. (2008), which learns the entries of a Gaussian precision matrix by finding:

$$\hat{\mathbf{\Theta}} = \operatorname*{arg\,max}_{\mathbf{\Theta} \in \mathbb{S}_{+}^{d}} \; \log \det \mathbf{\Theta} - \operatorname{tr}\left(\mathbf{S} \mathbf{\Theta}\right) - \lambda \lVert \mathbf{\Theta} \rVert_{1},$$

where $\mathbf{\Theta}$ is the precision matrix (constrained to belong to $\mathbb{S}_{+}^{d}$, the set of positive semi-definite matrices), $\mathbf{S}$ is the empirical covariance matrix, and $\lambda$ acts in analogy to Lasso regularization by penalizing a large number of non-zero precision entries. We can then model the potentials using Gaussian Markov random fields, whose pairwise and node potentials are $\phi_{ij}(x_i, x_j) = \exp\left(-x_i \Theta_{ij} x_j\right)$ and $\phi_i(x_i) = \exp\left(-\frac{1}{2} \Theta_{ii} x_i^2\right)$.

## 3 Experimental Section

To test the CID ranking adjustment, we tested it first on a toy dataset where the real importances are known, and then on a real-world medical dataset. We implemented CID in Python using scikit-learn’s graphical lasso Pedregosa et al. (2011). For the toy dataset, we used scikit-learn’s Extremely Randomized Trees and Bayesian Regression implementations, and for the medical dataset we used a Gradient Boosting Survival model Pölsterl (2020).

### Multivariate Generated Data Test

In order to test if CID adjusts the permutation ranking into the correct one, we took samples from a multivariate distribution with the following marginal distributions: , , , , and with and , . We then defined the outcome variable as:

(5) |

where $x_i$ is an observation. The true importances thus follow directly from equation (5). We transformed the data into Gaussian using quantile information and chose Gaussian Markov random fields to pair with CID. The graph was inferred using graphical lasso with grid-search cross-validation to determine the optimal penalization parameter. To test the CID correction, we performed Shuffle Splits with Extremely Randomized Trees and computed the Gini importance for each feature, as well as the permutation importance. We then adjusted the feature importances using the CID algorithm with Bayesian Regression as the map $g$ (see assumption 2.1). The rankings are compared in figure 3. As can be seen from the swarmplot in figure 3, with one exception, permutation importance placed a nearly equal weight on all features, centered around zero, presumably due to the high feature covariance. CID was able to rectify this ranking and ranked the features in the right order. It also placed every feature importance at a non-zero value, with a clear gap between unequally important features and similar importance for the equally important pairs, matching the true importances well. Moreover, the Gini importance underestimated one of the important features, presumably because its covariates offer nearly as good partitions due to their overlap and similarity.
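The Gaussianization step via quantile information corresponds to scikit-learn's QuantileTransformer with a normal output distribution; a minimal sketch on synthetic skewed data (the exponential marginal is an illustrative stand-in):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# A heavily skewed feature as a stand-in for a non-Gaussian marginal.
x = rng.exponential(size=(2000, 1))

# Map empirical quantiles through the inverse standard-normal CDF.
qt = QuantileTransformer(output_distribution="normal", random_state=0)
x_gauss = qt.fit_transform(x)
print(round(float(x_gauss.mean()), 2), round(float(x_gauss.std()), 2))
```

After the transform, each marginal is approximately standard normal, which is what makes the Gaussian MRF a reasonable model for the joint.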