# Connection between information gain and sensitivity indices

Dear colleagues,

I am a UQ guy. I use information gain a lot in UQ, e.g. to identify the locations of the most informative measurements, or to measure the information gained when a prior is replaced by a posterior.
I also learned from Olivier Le Maître and from Bruno how to compute sensitivity indices from variances.

Today I found something curious. I quote:
"Definition: Information gain (IG) measures how much “information” a feature gives us about the class." See more here

This sentence reminds me of “The sensitivity of uncertain_parameter_1 is its contribution to the total variance (total uncertainty)”.

It seems to me that information gain and sensitivity indices are connected. What do you think? Do you agree?

More references
https://www3.nd.edu/~rjohns15/cse40647.sp14/www/content/lectures/23%20-%20Decision%20Trees%202.pdf (IG in decision trees)

https://medium.com/coinmonks/what-is-entropy-and-why-information-gain-is-matter-4e85d46d2f01 (What is Entropy and why Information gain matter in Decision Trees?)

Hi Alexander,
I am not an expert on practices based on Information Gain (IG) myself. However, I think that your intuition regarding the connection between IG and variance-based sensitivity (Sobol’) indices is further supported by the first reference, where it is stated that “Unrelated features should give no information”. This is similar to parameters with zero or very small Sobol’ indices, which typically do not affect the QoI.

Since this reply comes quite a long time after your initial post, have you investigated this connection further? For example, if you rank a model’s input parameters/RVs by their IG scores and by their Sobol’ indices, do both metrics lead to the same ordering?

Dear Alexander,

Thanks for posting this very interesting topic. I hope it is not too late to share some ideas here.

I agree with you that the “information gain” (formally, the mutual information) behaves similarly to the Sobol’ indices. Moreover, I think we can group them into a more general framework.

Let’s start with the classical UQ setup: we have input random variables \boldsymbol{X} \sim P_{\boldsymbol{X}} of dimension M and a deterministic simulator expressed as

Y = {\mathcal{M}}(\boldsymbol{X}).

For the following analysis, we split the input vector into two subsets \boldsymbol{X} = (\boldsymbol{X}_{{\boldsymbol{\rm u}}},\boldsymbol{X}_{{\boldsymbol{\rm v}}}), where {\boldsymbol{\rm v}} = \{1,\ldots,M\} \setminus {\boldsymbol{\rm u}}. Under this decomposition, we define an auxiliary random variable

Y_{\boldsymbol{\rm u}}(\boldsymbol{x}_{{\boldsymbol{\rm u}}}) = {\mathcal{M}}(\boldsymbol{x}_{\boldsymbol{\rm u}},\boldsymbol{X}_{\boldsymbol{\rm v}}),

which is equivalent to Y_{\boldsymbol{\rm u}}(\boldsymbol{x}_{\boldsymbol{\rm u}}) = Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{x}_{\boldsymbol{\rm u}}.

Consider an arbitrary contrast measure d(Y_1,Y_2) that quantifies how much a random variable Y_2 differs from a given random variable Y_1. As a result, d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{x}_{\boldsymbol{\rm u}})\right) indicates how much difference the knowledge of \boldsymbol{X}_{{\boldsymbol{\rm u}}} = \boldsymbol{x}_{{\boldsymbol{\rm u}}} makes. Then, we can define the sensitivity index S_{\boldsymbol{\rm u}} by

S_{\boldsymbol{\rm u}} = \mathbb{E}_{\boldsymbol{X}_{{\boldsymbol{\rm u}}}}\left[d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{X}_{{\boldsymbol{\rm u}}})\right)\right].

This index shows how much knowing \boldsymbol{X}_{{\boldsymbol{\rm u}}} changes the properties of Y in expectation.

Sobol’ indices, entropy-based indices (or information gain), and Borgonovo indices can all be interpreted within this framework with a proper choice of the contrast measure.
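To make the generic definition concrete, here is a minimal double-loop Monte Carlo sketch of S_{\boldsymbol{\rm u}} = \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}[d(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{X}_{\boldsymbol{\rm u}}))] in Python. The toy simulator Y = X_1 + 2X_2, the independent standard-normal inputs, and the names `sensitivity_index`, `variance_contrast` and `toy_model` are assumptions made purely for illustration. The contrast is passed in as a function, so any d(Y_1,Y_2) that can be estimated from two samples could be plugged in; a variance difference is used here only as an example.

```python
import numpy as np


def sensitivity_index(model, u, dim, contrast, rng, n_outer=200, n_inner=2000):
    """Double-loop Monte Carlo estimate of E_{X_u}[ d(Y, Y_u(X_u)) ].

    Assumes independent standard-normal inputs (illustrative choice only).
    """
    # Sample of the unconditional output Y (larger, since it is reused in every contrast).
    y = model(rng.standard_normal((10 * n_inner, dim)))
    values = []
    for _ in range(n_outer):
        x = rng.standard_normal((n_inner, dim))
        # Freeze the components in u at one realisation x_u; X_v stays random.
        x[:, u] = rng.standard_normal(len(u))
        values.append(contrast(y, model(x)))  # d(Y, Y_u(x_u))
    return float(np.mean(values))             # outer expectation over X_u


def variance_contrast(y1, y2):
    """Example contrast: difference of sample variances."""
    return np.var(y1) - np.var(y2)


def toy_model(x):
    """Hypothetical additive simulator Y = X_1 + 2 X_2."""
    return x[:, 0] + 2.0 * x[:, 1]


rng = np.random.default_rng(0)
s1 = sensitivity_index(toy_model, [0], 2, variance_contrast, rng)
s2 = sensitivity_index(toy_model, [1], 2, variance_contrast, rng)
print(f"S_1 ~ {s1:.2f}, S_2 ~ {s2:.2f}")
```

For this additive toy model the exact values are {\rm Var}[Y] - {\rm Var}[2X_2] = 1 and {\rm Var}[Y] - {\rm Var}[X_1] = 4, so the estimates should land close to those numbers up to sampling noise.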

### Sobol’ indices

We can define d(Y_1,Y_2) as the variance difference between the two random variables

d(Y_1,Y_2) = {\rm Var}[Y_1] - {\rm Var}[Y_2].

Then, we obtain

d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{x}_{\boldsymbol{\rm u}})\right) = {\rm Var}[Y] - {\rm Var}[Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{x}_{\boldsymbol{\rm u}}],

which shows how we can reduce the variance of Y by knowing \boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{x}_{\boldsymbol{\rm u}}. Therefore, the associated index S_{\boldsymbol{\rm u}} becomes

\begin{aligned} S_{\boldsymbol{\rm u}} &= \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{X}_{\boldsymbol{\rm u}})\right)\right] = \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[{\rm Var}[Y] - {\rm Var}[Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}}]\right] \\ &= {\rm Var}[Y] - \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[{\rm Var}[Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}}]\right] = {\rm Var}\left[\mathbb{E}\left[Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}}\right]\right]. \end{aligned}

The last equality follows from the law of total variance. Dividing S_{\boldsymbol{\rm u}} by the total variance of Y gives the first-order Sobol’ index of the group \boldsymbol{X}_{\boldsymbol{\rm u}}.
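In practice, {\rm Var}\left[\mathbb{E}\left[Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}}\right]\right] is usually estimated without an explicit double loop. One common option is a pick-freeze estimator, which uses the identity {\rm Var}[\mathbb{E}[Y \mid \boldsymbol{X}_{\boldsymbol{\rm u}}]] = {\rm Cov}(Y, Y'), where Y' shares \boldsymbol{X}_{\boldsymbol{\rm u}} with Y but uses an independent copy of \boldsymbol{X}_{\boldsymbol{\rm v}}. The sketch below is only an illustration under assumed conditions: independent standard-normal inputs and the hypothetical toy model Y = X_1 + 2X_2, for which the normalized indices are exactly 0.2 and 0.8.

```python
import numpy as np


def toy_model(x):
    """Hypothetical additive simulator Y = X_1 + 2 X_2."""
    return x[:, 0] + 2.0 * x[:, 1]


def first_order_sobol(model, u, dim, rng, n=200_000):
    """Pick-freeze estimate of Var[E[Y | X_u]] / Var[Y].

    Assumes independent standard-normal inputs (illustrative choice only).
    """
    x = rng.standard_normal((n, dim))
    x_frozen = rng.standard_normal((n, dim))
    x_frozen[:, u] = x[:, u]  # share X_u ("freeze"), resample X_v ("pick")
    y, y_frozen = model(x), model(x_frozen)
    # Cov(Y, Y') equals Var[E[Y | X_u]] when X_u and X_v are independent.
    cov = np.mean(y * y_frozen) - np.mean(y) * np.mean(y_frozen)
    return float(cov / np.var(y))


rng = np.random.default_rng(1)
sobol_1 = first_order_sobol(toy_model, [0], 2, rng)
sobol_2 = first_order_sobol(toy_model, [1], 2, rng)
print(f"Sobol' index of X_1 ~ {sobol_1:.3f}, of X_2 ~ {sobol_2:.3f}")
```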

### Entropy-based indices

We can define d(Y_1,Y_2) as the difference in terms of the differential entropy between two random variables

d(Y_1,Y_2) = h(Y_1) - h(Y_2).

As a result, we have

d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{x}_{\boldsymbol{\rm u}})\right) = h(Y) - h\left(Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{x}_{\boldsymbol{\rm u}}\right),

which quantifies how we can reduce the differential entropy of Y by knowing \boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{x}_{\boldsymbol{\rm u}}. The associated index can be further derived as

\begin{aligned} S_{\boldsymbol{\rm u}} &= \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{X}_{\boldsymbol{\rm u}})\right)\right] = \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[h(Y) - h\left(Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{X}_{\boldsymbol{\rm u}}\right)\right] \\ &= h(Y) - h\left(Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}}\right) = I(Y;\boldsymbol{X}_{\boldsymbol{\rm u}}). \end{aligned}

It is worth remarking that I use the notation h\left(Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}}= \boldsymbol{X}_{\boldsymbol{\rm u}}\right) instead of h\left(Y \mid\boldsymbol{X}_{\boldsymbol{\rm u}} \right) inside the expectation above, because the latter has a specific meaning in information theory: it denotes the conditional entropy, which already includes the expectation over \boldsymbol{X}_{\boldsymbol{\rm u}}. The last equality corresponds to the definition of the mutual information I(Y;\boldsymbol{X}_{\boldsymbol{\rm u}}). The index S_{\boldsymbol{\rm u}} is consistent with the “information gain” defined for decision trees, e.g. https://www3.nd.edu/~rjohns15/cse40647.sp14/www/content/lectures/23%20-%20Decision%20Trees%202.pdf. Similar to the Sobol’ indices, we can normalize S_{\boldsymbol{\rm u}} by h(Y) to obtain a value in [0,1]. Note that this bound is guaranteed for the discrete (Shannon) entropy used in decision trees, whereas the differential entropy h(Y) can be negative.
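A rough way to estimate I(Y;X_i) from samples is a histogram plug-in estimator, sketched below. This is only one of several possible estimators and is not taken from the references above; the bin count, the independent standard-normal inputs, the toy model Y = X_1 + 2X_2, and the function name `mutual_information` are all illustrative assumptions. For this linear Gaussian case the exact value is -\tfrac{1}{2}\ln(1-\rho^2) with \rho the correlation between X_i and Y, which gives a reference to compare against. Binning loses a little information and finite samples add a small positive bias, so only approximate agreement should be expected.

```python
import numpy as np


def mutual_information(x, y, bins=40):
    """Histogram plug-in estimate of I(X; Y) in nats (approximate, binning-dependent)."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()            # joint cell probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    mask = p_xy > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))


rng = np.random.default_rng(2)
x = rng.standard_normal((200_000, 2))       # assumed independent N(0, 1) inputs
y = x[:, 0] + 2.0 * x[:, 1]                 # hypothetical toy simulator

mi_1 = mutual_information(x[:, 0], y)
mi_2 = mutual_information(x[:, 1], y)

# Closed-form values for the bivariate Gaussian case: rho^2 = 1/5 and 4/5.
exact_1 = -0.5 * np.log(1.0 - 0.2)
exact_2 = -0.5 * np.log(1.0 - 0.8)
print(f"I(Y;X_1): estimate {mi_1:.3f}, exact {exact_1:.3f}")
print(f"I(Y;X_2): estimate {mi_2:.3f}, exact {exact_2:.3f}")
```

In this toy example the entropy-based index ranks the two inputs in the same order as the variance-based one, which is what one would expect for a linear Gaussian model; for nonlinear or non-Gaussian models the two rankings need not coincide.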

### Borgonovo indices

Consider a contrast measure

d(Y_1,Y_2) = \frac{1}{2}\lVert f_1- f_2 \rVert_1,

where f_1 and f_2 are the probability density functions of Y_1 and Y_2, respectively, and \lVert\cdot\rVert_1 denotes the L^1 norm. It is easy to derive the associated index S_{\boldsymbol{\rm u}}:

\begin{aligned} S_{\boldsymbol{\rm u}} &= \mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[d\left(Y,Y_{\boldsymbol{\rm u}}(\boldsymbol{X}_{\boldsymbol{\rm u}})\right)\right] = \frac{1}{2}\mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[ \lVert f_Y - f_{Y\mid \boldsymbol{X}_{\boldsymbol{\rm u}}} \rVert_1\right] \\ &= \frac{1}{2}\mathbb{E}_{\boldsymbol{X}_{\boldsymbol{\rm u}}}\left[\int_{\mathcal{D}_{Y}} \lvert f_Y(y) - f_{Y\mid \boldsymbol{X}_{\boldsymbol{\rm u}}}(y)\rvert {\rm d}y\right], \end{aligned}

which is the definition of the Borgonovo indices.
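The same given-data idea can be sketched for the L^1-based index: partition the samples of X_i into equal-probability classes as a stand-in for conditioning on \boldsymbol{X}_{\boldsymbol{\rm u}} = \boldsymbol{x}_{\boldsymbol{\rm u}}, estimate the unconditional and class-conditional densities of Y with histograms on a common grid, and average half the L^1 distance over the classes. The class and bin counts, the toy model, and the name `borgonovo_delta` are assumptions for illustration. Histogram noise biases the estimate upward, so this is a crude sketch rather than a recommended estimator.

```python
import numpy as np


def borgonovo_delta(x, y, n_classes=20, n_bins=50):
    """Crude histogram estimate of 0.5 * E_X[ || f_Y - f_{Y|X} ||_1 ]."""
    edges = np.histogram_bin_edges(y, bins=n_bins)   # common grid for all densities
    width = np.diff(edges)
    f_y, _ = np.histogram(y, bins=edges, density=True)
    # Equal-probability classes of x approximate conditioning on X = x.
    class_edges = np.quantile(x, np.linspace(0.0, 1.0, n_classes + 1))
    labels = np.clip(np.searchsorted(class_edges, x, side="right") - 1,
                     0, n_classes - 1)
    delta = 0.0
    for k in range(n_classes):
        y_k = y[labels == k]
        f_cond, _ = np.histogram(y_k, bins=edges, density=True)
        l1 = np.sum(np.abs(f_y - f_cond) * width)    # || f_Y - f_{Y|class k} ||_1
        delta += 0.5 * l1 * y_k.size / y.size        # weight by class probability
    return float(delta)


rng = np.random.default_rng(3)
x = rng.standard_normal((200_000, 2))   # assumed independent N(0, 1) inputs
y = x[:, 0] + 2.0 * x[:, 1]             # hypothetical toy simulator

delta_1 = borgonovo_delta(x[:, 0], y)
delta_2 = borgonovo_delta(x[:, 1], y)
print(f"delta_1 ~ {delta_1:.3f}, delta_2 ~ {delta_2:.3f}")
```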

In summary, I think that the Sobol’ indices and the information gain are defined in a similar way but with a different focus: the Sobol’ indices use variance as the uncertainty measure, whereas the information gain relies on entropy.

Great, thanks for your post. I need some time to read and to understand.

Dear Dimitris,