Linear correlation, dependence, copulas, Nataf distribution and all that stuff

bsudret · February 7, 2020, 8:11am

Hello everyone

I started to answer a question about linear correlation for non Gaussian random variables on the forum, and it became rather long. So I decided to move it into this section and I tried to make it comprehensive. The topic is: how to account for linear correlation between non Gaussian variables, what is the point of using copulas, and how is all this implemented in UQLab?

The reason why we adopt the copula formalism in UQLab (as it is the case in all major UQ software) is that it is the only mathematically sound way to define such multivariate distributions. Let me explain.

Case of correlated Gaussian variables

A Gaussian vector is completely defined by the mean and standard deviation of its marginals (which are Gaussian) and a correlation matrix, which happens to be the matrix of linear (Pearson’s) correlation coefficients \rho_{X_i, X_j}.
In terms of copula, a Gaussian vector also has a Gaussian copula, this is actually what defines the latter. Also, for multivariate Gaussian distributions (and only for those), no correlation is equivalent to independence.

With UQLab: To create X_1\sim \mathcal{N}(1,1) and X_2\sim \mathcal{N}(2,0.5) with a correlation coefficient of 0.8, you can use:

  uqlab;
  Inp.Marginals(1).Type = 'Gaussian';
  Input.Marginals(1).Moments= [1 1];
  Input.Marginals(2).Type = 'Gaussian';
  Input.Marginals(2).Moments = [2 0.5];
  Input.Copula.Type = 'Gaussian';
  Input.Copula.Parameters = [ 1 0.8 ; 0.8 1 ];
  myInput = uq_createInput(Input);

Case of correlated non-Gaussian variables

In this case, defining the linear correlation is not the proper way to handle the dependence. As you may know, if you have the joint probability density function (PDF) f_{\boldsymbol{X}} of a random vector \boldsymbol{X} you can compute its marginal distributions and linear correlation matrix [\boldsymbol{\rho}] :

f_{\boldsymbol{X}} \Longrightarrow \{f_{X_i}, i=1,\, \dots,\, d\} and [\boldsymbol{\rho}]

However the reciprocal is not true!! If you only have the marginals and a correlation matrix, there are infinitely many joint PDFs that would comply with this knowledge. The only way to properly define a joint PDF is to define the underlying copula (which describes the dependence) and select its parameters so that the correlation matrix is as expected.

For a long time in engineering (more precisely in structural reliability), the problem was solved by using the so-called Nataf distribution^[1]. However, as shown in that 1986 paper, a pseudo-correlation matrix [\tilde{\boldsymbol{\rho}}] has to be introduced so that the linear correlation eventually obtained is the prescribed one. Unfortunately, this pseudo-correlation depends on the marginals, i.e. it does not only encode dependence: if you change the marginals, you should recompute the pseudo_correlation so as to get the expected linear correlation. Liu & Der Kiureghian even developed approximate equations to make the link between [\tilde{\boldsymbol{\rho}}_{ij}] and [\boldsymbol{\rho}_{ij}] depending on the marginals of X_i and X_j.

In fact, the right measures of dependence to use in such a non-Gaussian case (also called measure of association) are the Spearman rank correlation coefficient \rho_S and the Kendall \tau.

Note that the Nataf distribution with pseudo-correlations has been used for a good 25 years before Lebrun & Dutfoy^[2] realized that it was nothing but a multivariate distribution with arbitrary marginals and … a Gaussian copula.

How to deal with the Gaussian copula (resp. Nataf distribution) in practice ?

As mentioned above, the right way to define dependence is to choose a copula and define its parameters. Unless there are reasons not to do so (see below), the Gaussian copula may be a good choice, and it is rather easy to use.

*** if you don’t have data (“expert probabilistic input model”)

you define your marginals of choice for each input
you choose the Gaussian copula
you define the matrix of rank correlation coefficients [\rho_{S,ij}], from which the parameters matrix [\theta_{ij} ] of the Gaussian copula is computed through:

\begin{equation} \theta_{ij} = 2 \, \sin \displaystyle{\frac{\pi \rho_{S,ij}}{6}} \tag{1} \end{equation}

NB:

the above equation is close to identity, i.e. \theta_{ij} \approx \rho_{S,ij} in practice.
if you don’t have data, you will make an a priori choice about correlations (for instance, based on literature). So there is no reason not to pretend that you choose the rank correlations instead of the linear correlations .

*** If you do have data

you do statistical inference for each parameter separately, to find out the best fitting distribution
you compute the matrix of (empirical) rank correlation coefficients \rho_{S, ij} from the data
you choose a Gaussian copula whose parameters are computed from that matrix using the above equation.

*** Reasons not to choose a Gaussian copula

Beyond the correlation coefficients which refer to the whole distributions of X_i and X_j, the tail dependence may be of interest in certain cases. Broadly speaking, tail dependence coefficients characterize the probability that (X_i,\, X_j) jointly take extreme values. This is a feature of interest when e.g. modelling environmental loadings on structures. As a matter of fact, the Gaussian copula does not have tail dependence (the upper and lower tail dependence coefficients are zero), meaning that even if there is correlation between X_i and X_j, they kind of independently take large values. If such “extreme joint” behaviour is to be captured, Gumbel copula or others may be preferred, see e.g. the UQLab manual of the INPUT module

Gaussian copula using UQLab:

if you don’t have data, you can use the following syntax:

uqlab;
Inp.Marginals(1).Type = 'lognormal';
Input.Marginals(1).Moments= [1 1];
Input.Marginals(2).Type = 'Gamma';
Input.Marginals(2).Moments = [2 0.5];
%
Input.Copula.Type = 'Gaussian';
Input.Copula.Parameters = [ 1 0.7 ; 0.7 1 ];
myInput = uq_createInput(Input);

if you do have data: you’re lucky, because UQLab (thanks to @torree!) does everything for you in two lines:
```
InputOpts.Inference.Data = [Your Data, a N x d array];
InferredInput = uq_createInput(InputOpts);
```

Behind these two lines, UQLab follows exactly the procedure described above, i.e. infer the best marginal for each of the d parameters separately (column-by-column) out of many families of available distributions, then transform the data onto uniform marginals, then infers the best copula among all the ones available in the UQLab (it can take some time for large data sets though!).

Conclusions

Dependence between random variables should be nowadays always cast in terms of copulas. The most straightforward to use / parametrize is the Gaussian copula, which only requires computing rank correlation coefficients between pairs of variables.

Liu, P.-L. & Der Kiureghian, A. (1986) Multivariate distribution models with prescribed marginals and covariances, Prob. Eng. Mech., 1, 105–112. ↩︎
Lebrun, R. and Dutfoy, A. (2009) An innovating analysis of the Nataf transformation from the copula viewpoint, Prob. Eng. Mech., 24, 312–320 ↩︎