Number of inputs for Sobol' indices

Hello all,
I have a FEM model and I want to use PCE to evaluate the Sobol indices for my input parameters. My question is this:
Are there any limitations in the number of inputs when I want to use PCE and calculate Sobol indices based on it? I mean is it OK if I have for example 30 input variables in my study?
One other question is that, are there any relations (or rule of thumbs) between the number of parameters and the number of evaluations of my model? I mean how should I make sure that my PCE and Sobol indices are correct when I have a large number of inputs? Is checking the convergence of PCE enough?
Thanks in advance

Dear @Aep93

Thanks for your question. There is no theoretical limitation in the number of input variables when you want to build a polynomial chaos expansion (PCE) and then post-process it to get Sobol’ sensitivity indices.
Using the 'LARS' option to build the PCE, you create a so-called sparse PCE, which has only a limited number of terms that are relevant to your model. In the past years we used sparse PCE without any issue with up to a few hundreds input variables. So 30 is definitely feasible.

If your computational cost is not a big issue (say, your FEM model takes a few seconds to minutes to run), I recommend to start with N \approx 10\cdot M, that is an experimental deisgn (ED) of size equal to 10 times the dimensionality of your problem. That would mean 300 runs in your case.

However, the nice thing with sparse PCE using LARS is that you decide how much model runs you can afford! With say M=30 inputs you could already start with N=50 or N=100 model runs if the latter are extremely costly. You should use a space-filling design, say an optimized LHS (that’s what UQLab does by default if you don’t specify your ED manually).

Of course, you must make the best use of these, say 100-300, model runs: this means building PCEs with ‘LARS’ and different truncation schemes for the candidate basis. Then UQLab selects the best combination based on the smallest leave-one-out error.
When M>10 it is a good practice to introduce so-called q-norm truncation. You can choose q=0.5 (PCEOpts.TruncOptions.qNorm = 0.5) or a list, say PCEOpts.TruncOptions.qNorm = [0.3 0.5 0.7]. It is also a good practice to start the analysis by imposing low-order interaction terms (i.e. allowing only polynomials that depend on 2 or 3 variables): this drastically reduces the set of candidate polynomials in which the LARS algorithm searches for the sparse basis. You can use e.g. PCEOpts.TruncOptions.MaxInteraction =2.

Now, the quality of the result is highly dependent on the FEM model, more precisely on the effective dimensionality. If there are only 5 important parameters (out of 30) with large Sobol’ indices (and 25 with S_i \approx 0), probably 300 runs is already too much and you could build an accurate PCE with less runs. If, in contrast, many parameters are important, then you need more terms in your sparse PC expansion, thus more model runs.

The quality of your PCE is in any case measured by the so-called leave-one-out error, which is an estimator of the mean-square error that is computed without additional model runs. One you’ve built your PCE, you can get it by typing uq_print(myPCE) or by browsing the PCE object (type myPCE.error). In engineering applications a PCE is already excellent if \varepsilon_{LOO} \le 10^{-3}. For the sake of computing Sobol’ indices, \varepsilon_{LOO} \approx 10^{-2}, or even a few percent, is already good enough.

As a summary, you can use N \approx 10\cdot M as a rule of thumb for the number of FEM runs N, when you have M inputs. And if the resulting PCE has a leave-one-out error smaller than 10^{-2} you can be confident in the resulting Sobol’ indices. To make the best use of your model runs, you should try different truncations schemes when using the ‘LARS’ solver.

Good luck!

Post-Scriptum: why don’t you post your experimental design and UQLab input file on the forum, in the Case studies category. You first upload your UQLab input object as a .m file(you can anonymize the variables by calling them X_1, X_2, etc.). Then you upload separately a Matlab .mat file containing two arrays:

  • XED is a N \times M array containing you N input vectors for your N model runs
  • YED is the vector of corresponding outputs of size N \times 1 if you have a single scalar quantity of interest, or of size N \times N_{QOI} if you have N_{QOI} different output quantities if interest.

Then the UQWorld community may help you to find the best PCE strategy, or more generally the best surrogate modelling approach. Again, everything can be anonymized if your data is confidential, as you upload only matrices of floats, and a list of distributions in the UQLab INPUT object.


Thank you very much @bsudret for your detailed and useful explanation. I will surely consider posting my files in the case studies in the future. Thanks again.