Can any dataset be used to construct a highly accurate PCE model?

Dear nluethen,
@nluethen
First of all, I would like to express my sincere appreciation to you and your team.

There is no doubt that PCE is a very powerful tool for building surrogate models from small samples.
Recently, however, I ran into a major point of confusion while constructing a PCE model from a dataset whose marginal types are unknown.

Based on finite element analysis, I obtained two different data sets, shown below.
The first three columns are the inputs, the fourth column is the output, and the five bolded rows (across both sets) are used as the validation set.
dataset 1

| a | b | c | lc |
|---|---|---|---|
| 29.070 | 6.329 | 0.04997 | 4 |
| 61.220 | 7.504 | 0.04849 | 10 |
| 61.220 | 7.847 | 0.04984 | 10 |
| 97.200 | 9.480 | 0.04953 | 44 |
| 53.470 | 7.376 | 0.04945 | 8 |
| 102.800 | 10.200 | 0.05183 | 52 |
| **43.070** | **7.009** | **0.05012** | **10** |
| **99.090** | **9.729** | **0.05085** | **50** |

dataset 2

| a | b | c | lc |
|---|---|---|---|
| 88.560 | 7.725 | 0.05188 | 18 |
| 94.670 | 6.345 | 0.04388 | 10 |
| 82.360 | 4.940 | 0.03764 | 8 |
| 104.900 | 6.092 | 0.04184 | 56 |
| **119.500** | **7.637** | **0.04544** | **78** |
| **102.700** | **7.369** | **0.04669** | **40** |
| **101.500** | **6.400** | **0.04187** | **24** |

The accuracy of the PCE model constructed from dataset 1 is shown below. However, the model is not reliable because there are too few data points: the PCE predictions differ considerably from the validation values, even though the reported validation error is quite small.

Leave-one-out error:          3.5505929e-02
Modified leave-one-out error: 1.3426940e-01
Validation error:             1.3059044e-02
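For reference, this is how I checked the point-wise deviations by hand (just a minimal sketch; `myPCE` is a placeholder name for the PCE object returned by `uq_createModel`, and `Xval`/`Yval` are the held-out points):

```matlab
% Minimal sketch (myPCE is a placeholder for the object returned by uq_createModel,
% Xval/Yval are the held-out validation points)
YvalPCE = uq_evalModel(myPCE, Xval);               % PCE predictions at the validation points
valErr  = mean((Yval - YvalPCE).^2) / var(Yval);   % normalized MSE; should match the reported validation error
relErr  = abs(Yval - YvalPCE) ./ abs(Yval);        % relative error of each individual point
disp([Yval, YvalPCE, relErr])
```

With the output ranging from 4 to 52 in dataset 1, a small normalized error can still correspond to large relative errors on the small lc values.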

When I constructed the PCE using datasets 1 and 2 together, the accuracy dropped significantly.

Leave-one-out error:          3.3504323e-01
Modified leave-one-out error: 7.9010776e-01
Validation error:             6.1208222e-01

I tried the four combinations of marginal type ('auto' / 'ks') and selection criterion ('KS' / 'BIC') to infer the marginal distributions of the input parameters, but none of the results were good.

InputOpts.Marginals.Type = 'auto';      % marginal type: tried 'auto' and 'ks' (kernel smoothing)
InputOpts.Inference.Criterion = 'BIC';  % selection criterion: tried 'BIC' and 'KS'

InputOpts.Marginals(1).Inference.Data = X(:,1);
InputOpts.Marginals(2).Inference.Data = X(:,2);
InputOpts.Marginals(3).Inference.Data = X(:,3);

InputHat = uq_createInput(InputOpts);
uq_print(InputHat);
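For completeness, here is a sketch of the loop over all four combinations (assuming UQLab has been initialized and `X` holds the pooled input samples):

```matlab
% Sketch of the four inference settings (marginal type x selection criterion);
% assumes UQLab is initialized and X contains the pooled inputs
Types    = {'auto', 'ks'};
Criteria = {'BIC', 'KS'};
for i = 1:numel(Types)
    for j = 1:numel(Criteria)
        clear iOpts
        iOpts.Inference.Criterion = Criteria{j};          % only relevant when Type is 'auto'
        for m = 1:3
            iOpts.Marginals(m).Type = Types{i};           % 'auto' (family selection) or 'ks' (kernel smoothing)
            iOpts.Marginals(m).Inference.Data = X(:,m);   % infer marginal m from the m-th input column
        end
        InputHat = uq_createInput(iOpts);
        fprintf('--- Type = %s, Criterion = %s ---\n', Types{i}, Criteria{j});
        uq_print(InputHat);
    end
end
```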

I cannot figure out why the accuracy of the PCE model drops so much when the sample size is increased.
This has never happened in my previous work, where the accuracy of the PCE model basically improved as the size of the experimental design grew.

From my point of view, the following reasons come to mind:

1. Poor selection of the validation set (see the sketch after this list)
2. The sample size is still insufficient (inferring the marginal types requires more samples than working with known marginal types)
3. The data itself is not suitable for constructing a surrogate model
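Regarding reason 1, the sketch below is what I have in mind to test it: pool all 15 samples into placeholder arrays `Xall`/`Yall`, repeat the 10/5 split at random, and look at the spread of the validation error (it assumes UQLab is initialized and an inferred INPUT object is currently selected).

```matlab
% Sketch: sensitivity of the validation error to the choice of held-out points.
% Assumes UQLab is initialized, an inferred INPUT object is currently selected,
% and Xall/Yall (placeholders) contain all 15 samples from both data sets.
rng(1, 'twister')
nRep = 20;  nVal = 5;  N = size(Xall, 1);
valErr = zeros(nRep, 1);
for r = 1:nRep
    idx  = randperm(N);
    iVal = idx(1:nVal);                    % random validation subset
    iTr  = idx(nVal+1:end);                % remaining points as experimental design
    clear MOpts
    MOpts.Type = 'Metamodel';
    MOpts.MetaType = 'PCE';
    MOpts.Method = 'LARS';                 % sparse regression, more robust with few points
    MOpts.Degree = 1:3;
    MOpts.ExpDesign.X = Xall(iTr, :);
    MOpts.ExpDesign.Y = Yall(iTr);
    myPCEr = uq_createModel(MOpts);
    Yhat   = uq_evalModel(myPCEr, Xall(iVal, :));
    valErr(r) = mean((Yall(iVal) - Yhat).^2) / var(Yall(iVal));
end
fprintf('Validation error over %d random splits: mean %.3g, std %.3g\n', ...
        nRep, mean(valErr), std(valErr));
```

If the spread is large, the validation error of a single fixed split says little about the model.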

Clarifying this would give me a deeper understanding of the PCE model.
I would appreciate any reply!
Many thanks

bests

Dear felix,
A stochastic distribution assumed from discrete data will introduce error. I recommend that you consider a data-driven PCE, namely the arbitrary polynomial chaos expansion (aPC) or the DD-PCE-Corr surrogate model. You can refer to these papers:
- Oladyshkin (2011): A concept for data-driven uncertainty quantification and its application to carbon dioxide storage in geological formations
- Lin (2020): A data-driven polynomial chaos method considering correlated random variables

Best wishes!
Qizhe Li


Dear lqz,
Thank you very much for your valuable advice; I will download and study these two articles.
Have a nice day!

bests

Dear lqz,
@lqz
Following your suggestion, I used the arbitrary polynomial chaos expansion instead of the default polynomial basis in UQLab, but the result is still not much improved.

uqlab   % initialize UQLab

% load the experimental design (10 points) and the validation set (5 held-out points)
FILELOCATION = fullfile(uq_rootPath, 'Examples', 'SimpleDataSets',...
    'Cang');
load(fullfile(FILELOCATION,'Cang_DOE10.mat'), 'X', 'Y');
load(fullfile(FILELOCATION,'Cang_VAL5.mat'), 'Xval', 'Yval');


InputOpts.Marginals.Type = 'auto';      % marginal type: tried 'auto' and 'ks' (kernel smoothing)
InputOpts.Inference.Criterion = 'KS';   % selection criterion: tried 'KS' and 'BIC'
InputOpts.Marginals(1).Inference.Data = X(:,1);
InputOpts.Marginals(2).Inference.Data = X(:,2);
InputOpts.Marginals(3).Inference.Data = X(:,3);
InputHat = uq_createInput(InputOpts);
uq_print(InputHat);


MetaOpts.Type = 'Metamodel';
MetaOpts.MetaType = 'PCE';
MetaOpts.ExpDesign.X = X;
MetaOpts.ExpDesign.Y = Y;
MetaOpts.ValidationSet.X = Xval;
MetaOpts.ValidationSet.Y = Yval;

%MetaOpts.Method = 'LARS' ;
MetaOpts.TruncOptions.qNorm = 0.75;   % hyperbolic truncation of the polynomial basis
MetaOpts.Degree = 1:5;                % degree-adaptive PCE, degrees 1 to 5

MetaOpts.PolyTypes = {'arbitrary','arbitrary','arbitrary'};   % data-driven (arbitrary) polynomials for all three inputs
rng(100,'twister')
myPCE_Arb = uq_createModel(MetaOpts);
uq_print(myPCE_Arb)
uq_display(myPCE_Arb)

The highest accuracy obtained is still not satisfactory:

Leave-one-out error:          3.0736567e-01
Modified leave-one-out error: 7.6811426e-01
Validation error:             1.9377363e-01
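To see where the error comes from, here is a small sketch (using the variables from the script above) that lists the aPC predictions next to the held-out values:

```matlab
% Sketch: point-wise comparison of the aPC predictions with the held-out values
YvalPCE = uq_evalModel(myPCE_Arb, Xval);
T = table(Xval(:,1), Xval(:,2), Xval(:,3), Yval, YvalPCE, ...
          'VariableNames', {'a', 'b', 'c', 'lc_true', 'lc_PCE'});
disp(T)
```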

Is this because there are too few data points, or because the data itself is not well suited to PCE modeling?

Many thanks
bests

Dear @felix

I wonder why you are treating your data as separate data sets. Are both data sets from the same type of simulation and are the inputs expected to follow the same distribution? Looking at the numbers, I have a feeling this is not the case, and that would explain your counterintuitive results.

Best regards
Styfen


Dear Styfen,
@styfen.schaer
Thank you very much for your valuable suggestions.

The two data sets originate from two different models (different intrinsic structural relationships and materials, different load settings), but the core issue of the study is the same.

The inputs a, b, c are the three coefficients of a parabola (obtained by fitting the stress curve from the FEA), and the output lc is a quantity closely related to the final object of study.

So I cannot predetermine the distribution types of the inputs; I can only construct a PCE model by inferring the marginal distributions of the inputs after obtaining the input-output data set.

Similarly, in other studies dozens of data points are generated through a series of physical experiments, with some of the quantities as inputs and the target quantity of interest as the output. In such cases, where the distribution types of the input parameters are unknown, it seems that a PCE model can only be constructed in a data-driven way (DD-PCE).

However, after inferring the marginal distributions and using the arbitrary polynomial basis to construct the DD-PCE, the result is still very unsatisfactory.

bests

Dear @felix

Even though the core issue is the same, as I understand your problem statement, you cannot treat these two data sets as having the same mapping from the inputs a, b, c to the output lc. You can also see that a in the first data set most likely follows a different distribution, or at least has different distribution parameters, than a in the second data set. You will need to create two different surrogates, and to do this accurately, you should increase the size of the experimental design for both models.
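As a quick check (just a sketch, with `X1` and `X2` as placeholders for the input matrices of the two data sets), you could compare the samples of `a` directly:

```matlab
% Sketch: compare input 'a' between the two data sets (X1, X2 are placeholders
% for the two input matrices; kstest2 needs the Statistics and Machine Learning
% Toolbox and has limited power with so few samples)
a1 = X1(:,1);  a2 = X2(:,1);
fprintf('a: mean %.1f vs %.1f, std %.1f vs %.1f\n', mean(a1), mean(a2), std(a1), std(a2));
[h, p] = kstest2(a1, a2);   % two-sample Kolmogorov-Smirnov test
fprintf('kstest2 rejects equal distributions: %d (p = %.3f)\n', h, p);
```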

Best regards
Styfen


Dear Styfen,
Thank you very much for your valuable advice.
I will try to investigate this problem using separate surrogate models for the two data sets.
Many thanks!

bests