On the importance of accurate models of dependence in UQ: a cautionary tale

torree · May 10, 2019, 2:06pm

[I ended up writing a much longer post than previously wanted. Jump to the concluding summary for the take-home message!]

This post aims at bringing to the UQWorld community the fruitful discussion that stemmed from a paper I recently co-authored (see on Arxiv or on Prob. Eng. Mech.) on enhanced probabilistic input models in UQ.

The Gaussian assumption

We investigated how more flexible models of input dependencies than the commonly used Gaussian copula could be used in combinations with uncertainty quantification (UQ) methods to improve statistical estimation. In our three-step-view of UQ, this shifts the focus from the analysis methods, which I dare say covers 95% of research done in the UQ scientific community, to the input model. Why were we interested in working on more flexible input models?

In short, you can say: reality is (often) non-Gaussian. Inputs to systems studied in UQ are rarely jointly Gaussian. You could say: no worries, I know! My loads are Gumbel distributed, and I informed my algorithm about it! Chances are, though, that you still (explicitly or by relying on defaults) modelled those loads as mutually independent variables, or maybe assigned them a Gaussian dependence structure (copula). Most existing UQ software do not even provide different choices. (UQLab will, very soon!). So, is that bad?

A fictional case study

Let’s considered a particularly evil (or good, for me to make a point; and anyway very common) case study: reliability analysis. (This example is treated more extensively in our paper).

We are interested in the failure probability of a planned new bridge (truss structure) which will be occasionally subject to intense traffic. Specifically, we want to make sure that the probability of failure p_f due to traffic meets the requirement: p_f<10^{-4}.

We can imagine, for simplicity, that the vertical loads due to traffic concentrate on six nodes of the truss, and that these are the only uncertain quantities here (sure, there will be more, like that unmeasurable Young modulus, but let’s keep things simple). Our input is, therefore, a 6-dimensional random vector of loads

\mathbf{X}=(X_1,\ldots,X_6).

Traffic measurements around the area the bridge is planned to be built in allow us to assign the loads a joint CDF (for instance by inference on the data). Following the copula formalism, we represent the joint CDF F_{\mathbf{X}}(\mathbf{x}) in terms of marginal distributions and a copula:

F_{\mathbf{X}}(\mathbf{x}) = C(F_{X_1}(x_1), \ldots, F_{X_6}(x_6))

We find out that the Gumbel distribution is a satisfactory model for the marginals, and we fit its parameters to the traffic data (or some expert will kindly provide us which this information). We then only need to assign the 6-dimensional copula, and then we are good to go with our analysis.

<<Wait, a 6-dimensional copula? Can’t even picture it in my head! It’s anyway probably just a detail, let’s forget about it…>>. We thus model the loads as independent: F_{\mathbf{X}}(\mathbf{x}) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_6}(x_6).
Having F_{\mathbf{X}}(\mathbf{x}), we can run our preferred reliability analysis algorithm (say: FORM/SORM, Subset Simulation, MCMC, …). We obtain an estimated failure probability \hat{p}_f=4\cdot10^{-6}. Great! It’s well below the critical threshold of 10^{-4}. Also, the figure does almost not change across different reliability methods. It must be correct

<<But wait, that’s too good to be true…should I actually consider the load’s dependencies?>>. So, we do our next best: we use a Gaussian copula. We fit its parameters to data (practically any software knows how to do that), rerun our favorite reliability algorithm, and …\hat{p}_f=1 \cdot 10^{-5}.
Hey dude, the estimated p_f more than doubled, but it is still well below 10^{-4}. And again, there is very little difference among different reliability methods

<<What a day! It’s going to be a great bridge indeed…and yet…why can’t I get this copula thing out of my head?>> Devoured by doubt, we finally decide to give a fancy vine copula, one of the many complex models of copulas recently invented by statisticians, a try. Our friends in the UQLab development team give us access to the UQLab development version, which features vine copula models. Three lines of code later, we have our input model consisting of the Gumbel marginals and the vine copula fitted to data. Another three lines of code give us the result of the analysis stemming from this model, which is…\hat{p}_f = 5 \cdot 10^{-4} >10^{-4}.

Can a copula make such a big difference??

If the difference in results mentioned in the example above seems exaggerated, be cautious: it isn’t. Depending on the data and on the quantity of interest, different copulas can bring about a huge difference in the final estimate.

In the example above, which has been simplified and adapted from our publication (see the beginning of this post), the reason for the difference was that loads were upper tail dependent. This means that, if one of them was extremely high, there was a non-zero probability that other loads would take extreme values too. This is a much-expected feature for loads, but one that the Gaussian copula cannot model. Actually, for extreme events the Gaussian copula does not differ much from the independent one: the peak of the Gaussian is well in the center of the probability space and flattens out rapidly when moving away from it. This explains the relatively small difference observed between the two in the example above. And since reliability analysis often pertains very unlikely but highly correlated extreme events, using a Gaussian copula is not a wise choice there.

Remark – The use of the Gaussian copula is also the main factor that led to underappreciating the risk of a few financial markets crashes that eventually happened.

Can a suitably selected copula be more important than a powerful UQ algorithm?

Wrong representation of the dependencies may doom every chance to get decent estimates of the output, irrespective of the UQ method used to solve the specific problem. We hinted at this in our example, mentioning that different methods for reliability analysis led to very similar results.

To see why, let’s consider an ideal scenario: we don’t need any fancy UQ algorithm, because our model is instant fast to evaluate and we can perform a trillion simulations per second (1+1 on any computer takes longer). We can, therefore, run a huge plain Monte Carlo simulation, and estimate with extreme precision the quantity of interest. But the latter still depends on the input distribution! And if the input distribution is misrepresented, the error propagates through the analysis and yields a biased estimate of the output statistic.

In a real case scenario, where we can’t afford this infinitely big Monte-Carlo simulation and need to rely on faster methods, we will be adding an approximation error to the anyway wrong estimate. The approximation error is, however, our least problem: we first need to make sure that the input model is accurate.

In summary

To conclude, we established the following points:

An inaccurate representation of the probabilistic input model yields a biased estimate of the quantity of interest in UQ problems.
This fact is independent of the UQ method used to solve the problem. In fact, even an analytical solution or an infinite-size Monte Carlo simulation would be irremediably affected by the misrepresentation of the input.
The main limitation of probabilistic input models used in UQ in practice is the copula representation. Especially in high dimension (many inputs), finding a suitable copula is complicated. Properly representing and inferring marginals is a comparably simpler and better taken-care-of practice.
Many powerful and flexible copula models of multivariate dependence have been recently proposed in the mathematical literature, such as multivariate archimedean families and vine copulas.
In the future, UQ software will need to incorporate such models of dependence for a better representation of uncertain inputs.
UQLab is planned to feature vine copula representation and inference in its next official release.

damarginal · May 25, 2019, 6:28pm

Welcome to UQWorld, @torree! Thanks for the insight on modeling dependence and copula.

I was previously working on Bayesian calibration problem of a computer model, which was solved using MCMC simulation. Although the parameters were a priori independent, the calibration resulted in a highly correlated sample set.

I was wondering whether it would make sense to represent these high-dimensional correlated sample with copula? At the very least it would be a more compact representation that we can later use or share with others… Is this one of a possible use case of copula representation? What do you think?

torree · July 14, 2019, 1:05pm

Dear @damarginal,

very interesting question. Indeed, modeling the dependence among the parameters of a model - any, therefore also a model obtained by Bayesian methods - makes a lot of sense to me. For instance, it enables a more accurate resampling from that model.

We had a similar case to yours - we were using parametric models of earthquake signals across Europe given historical data, and found strong correlations between some parameters in the model. So we used copulas to model such dependence. We even observed strong tail correlations, hinting at the fact that the parameter models tended to exhibit joint extreme values considerably more often than if they were statistically independent. This, in turn, totally changes the statistics of the signals and the shape of the hazard functions.

From an information theoretical point of view, we could say that these models have some degree of redundancy in their parameters: given the value of some parameters, other parameters are very likely to be confined into a smaller range of likely values, and unlikely to deviate much from those. Unfortunately, I have not found any publication specifically addressing this point. If you find any, please leave a note!

damarginal · July 16, 2019, 7:33am

Yes, actually I’ve seen this in action recently from a presentation by @c.lataniotis on high-dimensional data-driven uncertainty propagation. If I am not wrong, he basically creates a generative model for a high-dimensional input of a simulation given a set of observations from such an input using copula formalism. That’s neat.

Even if for the time being, the math behind vine copulas is still a bit beyond me, on practical level I think I can see how it is useful.