Hi there,
We are integrating UQLab into a civil engineering FEM software package (ZSoil).
Our workflow is as follows (the sampling steps are sketched in code after the list):
- define marginals of the input
- define copula
- get a sample of the input
- evaluate this sample in ZSoil
- define an output
- generate a metamodel on this experimental design (PCE, PCK or Kriging)
- sensitivity analysis
- reliability analysis
- Bayesian analysis
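For illustration, here is a minimal Python sketch of the first three steps, assuming lognormal marginals and a Gaussian copula. The distributions, numbers, and two-dimensional setup are placeholders, not our actual ZSoil inputs, and we use generic numpy/scipy rather than UQLab syntax:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1) Marginals of the input (two lognormal stiffness-like variables, illustrative)
marginals = [stats.lognorm(s=0.3, scale=50.0), stats.lognorm(s=0.3, scale=200.0)]

# 2) Gaussian copula with a high correlation coefficient
rho = 0.9
R = np.array([[1.0, rho], [rho, 1.0]])

# 3) Sample of the input: correlated standard normals -> copula uniforms -> marginals
n = 200
Z = rng.multivariate_normal(np.zeros(2), R, size=n)  # correlated N(0, 1) pair
U = stats.norm.cdf(Z)                                # Gaussian copula sample in (0, 1)^2
X = np.column_stack([m.ppf(U[:, i]) for i, m in enumerate(marginals)])

# X is then evaluated in ZSoil to form the experimental design.
```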
When using high correlation coefficients in the copula, we encountered issues in the Bayesian analysis. We circumvented this issue by generating the experimental design from an uncorrelated sample.
This led to a more general question: how does the accuracy of the metamodel change if the copula is ignored when generating the experimental design?
Some thoughts we had on this topic:
Thanks!
Hey Colette,
Which metamodel are you referring to when you mention the copula? If it is PCE, the effect of dependencies has actually been a subject of study at the Chair in the past. I recommend checking out the paper “Data-driven polynomial chaos expansion for machine learning regression”. In this paper, they discuss two ways of dealing with the problem you are facing:
- Ignoring dependencies: they build a PCE with a polynomial basis that is orthonormal with respect to the input marginals. This approach is referred to as aPCEonX.
- Using the iso-probabilistic transform: the dependent inputs are transformed into independent ones, and the PCE is then built on this independent input set. This approach is referred to as aPCEonZ. (For a Gaussian copula, this transform has a simple closed form; see the sketch right after this list.)
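A minimal numpy/scipy sketch of that Gaussian-copula case, assuming the copula correlation matrix R and the marginal distribution objects are known (generic Python, not UQLab syntax, and only valid for a Gaussian copula):

```python
import numpy as np
from scipy import stats

def to_independent_standard_normal(X, marginals, R):
    """Iso-probabilistic (Nataf-type) transform for a Gaussian copula:
    dependent physical inputs X -> independent standard normals Z."""
    U = np.column_stack([m.cdf(X[:, i]) for i, m in enumerate(marginals)])
    G = stats.norm.ppf(U)              # correlated standard normals
    L = np.linalg.cholesky(R)          # R = L @ L.T
    return np.linalg.solve(L, G.T).T   # decorrelate: Z = L^{-1} G
```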
Interestingly, they conclude that the aPCEonX approach leads to better pointwise predictions than aPCEonZ. The reason is that the iso-probabilistic transform is usually highly nonlinear. However, the aPCEonX approach requires caution: if you use the PCE to compute the moments or Sobol indices of your model, you cannot obtain them directly from the PCE coefficients, as you would in the independent case. To get the moments and Sobol indices, you will need to sample points from your actual input PDF and compute them via Monte Carlo simulation, for instance. So, if you decide to go with this approach, my suggestion is to ignore the dependencies, create the PCE model, and then use it as a proxy for your expensive model when computing your QoI via simulation.
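To make that last suggestion concrete, here is a minimal sketch of the post-processing step. The surrogate function is a hypothetical placeholder for your validated PCE/PCK/Kriging predictor, and the input setup reuses the kind of Gaussian-copula sampling shown in your first post (numbers illustrative):

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder for your validated metamodel predictor (PCE/PCK/Kriging).
def surrogate(X):
    return X[:, 0] * (1.0 + 0.1 * X[:, 1])

# Large Monte Carlo sample from the *actual* dependent input distribution
# (Gaussian-copula recipe, illustrative marginals and correlation).
rng = np.random.default_rng(1)
R = np.array([[1.0, 0.9], [0.9, 1.0]])
marginals = [stats.lognorm(s=0.3, scale=50.0), stats.lognorm(s=0.3, scale=200.0)]
U = stats.norm.cdf(rng.multivariate_normal(np.zeros(2), R, size=10**5))
X_mc = np.column_stack([m.ppf(U[:, i]) for i, m in enumerate(marginals)])

Y_mc = surrogate(X_mc)
mean_Y = Y_mc.mean()       # moments estimated by simulation on the cheap proxy,
var_Y = Y_mc.var(ddof=1)   # not read off the PCE coefficients
```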
Now, Kriging should not be affected by the input PDF, as its construction does not require knowledge of the input distribution. I believe that, in this case, a space-filling experimental design would be best; see the sketch below. Similar to the PCE case, once the metamodel is validated, you can use it as a proxy for your expensive simulation to compute what you need.
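For instance, a Latin hypercube design over a bounding box of the inputs would do. A minimal scipy sketch, where the bounds are illustrative placeholders:

```python
from scipy.stats import qmc

# Space-filling (Latin hypercube) experimental design for Kriging over a
# bounding box of the two inputs; the bounds below are illustrative.
sampler = qmc.LatinHypercube(d=2, seed=0)
ed = qmc.scale(sampler.random(n=100), l_bounds=[10.0, 50.0], u_bounds=[150.0, 600.0])
# ed is then evaluated in ZSoil, and the Kriging model is fit on (ed, model(ed)).
```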
Regarding the Bayesian analysis problem, I was a bit puzzled when you mentioned “using high correlation coefficients in the copula”. What exactly do you mean by that? From my understanding, this coefficient should be inferred from some data you have and is a property of your model, not something we can choose freely. In any case, I believe the solution you have now, using the uncorrelated sample, is fine. Just be careful to sample from the original input distribution whenever computing the QoI.
Hope this helps!
Best regards,
Anderson
Hi Anderson,
Thanks, your answer is really helpful!
We implemented PCE, PCK and Kriging as options.
To answer your question: since we are integrating UQLab into a software package (ZSoil), we test many different cases that a user might set up. One of these tests involved a quite high correlation between different stiffness measures of the same soil.
Did the Chair by any chance investigate how to quantify “regional differences” in metamodel accuracy in terms of the output? E.g., how accurate is the PCE for 0.1 < Y < 0.2 if the model output ranges over 0 < Y < 10? What we have in mind is something like the sketch below.
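A minimal numpy sketch of the kind of quantity we mean, where Y_true and Y_pred are hypothetical arrays of model and metamodel outputs on a validation set:

```python
import numpy as np

def regional_relative_error(Y_true, Y_pred, lo, hi):
    """Relative validation error of the metamodel restricted to lo < Y < hi."""
    mask = (Y_true > lo) & (Y_true < hi)
    if not mask.any():
        return np.nan  # no validation points fall in this output region
    return np.mean((Y_true[mask] - Y_pred[mask]) ** 2) / np.var(Y_true)

# e.g. regional_relative_error(Y_true, Y_pred, 0.1, 0.2) for the band 0.1 < Y < 0.2
```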
Best regards,
Colette