Large PCE error for high dimensional input space : how to reduce it?

Dear UQLab team,

I am trying to run a PCE based sensitivity analysis of a hydrogeophysical model (the output of the model is a time series of geophysical signals at Nt = 66 time steps).

Here is a quick description of my problem and the experimental design :

  • Problem dimensionality : 22 parameters
  • Experimental design : Nsamples = 9977 (split into training and validation sets of size 75/25 %)

Despite the large number of samples, I still get large errors (maximum LOO error \approx 0.17) which is too much for reliable Sobol indices computation (I have read in another post that it should be around 10^{-2}).

Do you have an idea on how I could reduce the PCE LOO error in order to make it suitable for reliable Sobol indices computation ?

Thank you in advance for your precious help.
Below is the code I am using to build the PCE, if you want to have a look.
Best regards,
Guillaume

The samples files are too large for me to upload here, but here is the code I am using to build the PCE :

#STEP2 : Retrieve simulation data
simfiles_dir = 'Res_10ksim/'

# first set of simulation results
Y_E2_1 = np.loadtxt(simfiles_dir+'GSA31_E2.csv', delimiter = ',')

X1 = np.loadtxt(simfiles_dir+'GSA31_X.csv', delimiter = ',')


# second set of simulation results
Y_E2_2 = np.loadtxt(simfiles_dir+'GSA32_E2.csv', delimiter = ',')

X2 = np.loadtxt(simfiles_dir+'GSA32_X.csv', delimiter = ',')


Y_E2_3 = np.loadtxt(simfiles_dir+'GSA33_E2.csv', delimiter = ',')

X3 = np.loadtxt(simfiles_dir+'GSA33_X.csv', delimiter = ',')


Y_E2_4 = np.loadtxt(simfiles_dir+'GSA34_E2.csv', delimiter = ',')

X4 = np.loadtxt(simfiles_dir+'GSA34_X.csv', delimiter = ',')


X = np.concatenate((X1,X2,X3,X4))
Y_E2 = np.concatenate((Y_E2_1,Y_E2_2,Y_E2_3,Y_E2_4))

nanidx = np.argwhere(np.isnan(Y_E2[:,0]))

Y_E2 = np.delete(Y_E2, nanidx, 0)
X = np.delete(X, nanidx, 0)

#split into calibration/validation set (here, size : 0.75 / 0.25)

Ncal = int(3*(int(X.shape[0]/4)))

X_cal = X[:Ncal]
Y_E2_cal = Y_E2[:Ncal]


X_val = X[Ncal:]
Y_E2_val = Y_E2[Ncal:]

print(X_cal.shape)
print(Y_E2_cal.shape)

print(X_val.shape)
print(Y_E2_val.shape)

Output :

(7482, 22)
(7482, 66)

(2495, 22)
(2495, 66)

# STEP 3 : probabilistic model input 

InputOpts = {
    "Marginals": [
        # Layer 1
        {"Type": "Uniform",
         "Parameters": [0, 3] # log10Ks1
        },
        {"Type": "Uniform",
         "Parameters": [0.01, 0.04] # tr1
        },
        {"Type": "Uniform",
         "Parameters": [0.30, 0.50] #ts1
        },
        {"Type": "Uniform",
         "Parameters": [0.01, 0.15] # alpha1
        },
        {"Type": "Uniform",
         "Parameters": [1, 1.8] # n1
        },
        # Layer 2
        {"Type": "Uniform",
         "Parameters": [0, 3] # log10Ks2
        },
        {"Type": "Uniform",
         "Parameters": [0.01, 0.04] # tr2
        },
        {"Type": "Uniform",
         "Parameters": [0.30, 0.50] #ts2
        },
        {"Type": "Uniform",
         "Parameters": [0.01, 0.15] # alpha2
        },
        {"Type": "Uniform",
         "Parameters": [1, 1.8] # n2
        },
        # Layer 3
        {"Type": "Uniform",
         "Parameters": [0, 3] # log10Ks3
        },
        {"Type": "Uniform",
         "Parameters": [0.01, 0.04] # tr3
        },
        {"Type": "Uniform",
         "Parameters": [0.30, 0.50] #ts3
        },
        {"Type": "Uniform",
         "Parameters": [0.01, 0.15] # alpha3
        },
        {"Type": "Uniform",
         "Parameters": [1, 1.8] # n3
        },
        # Non simulated zone parameters
        {"Type": "Uniform",
         "Parameters": [0.01, 0.05] # tRMP1
        },
        {"Type": "Uniform",
         "Parameters": [0, 1e-8] # tRMP2
        },
        {"Type": "Uniform",
         "Parameters": [0, 1e-8] # tRMP3
        },
        # Field capacity parameter
        {"Type": "Uniform",
         "Parameters": [0.20, 0.30] # theta_c1
        },
        {"Type": "Uniform",
         "Parameters": [0.20, 0.30] # theta_c2
        },
        {"Type": "Uniform",
         "Parameters": [0.20, 0.30] # theta_c3
        },
        # delta_h piezo
        {"Type": "Uniform",
         "Parameters": [-20.0, 20.0] 
        },
    ]
}

myInput = uq.createInput(InputOpts)

MetaOpts = {
    'Type': 'Metamodel',
    'MetaType': 'PCE'
}

# Use experimental design loaded from the data files:


MetaOpts['ExpDesign'] = {
    'X': X_cal.tolist(),
    'Y': Y_E2_cal.tolist()
}

#Set the maximum polynomial degree to 5:

MetaOpts["Degree"] = np.arange(1,5).tolist()

# Specify the parameters of the PCE truncation scheme

MetaOpts['TruncOptions'] = {
    'qNorm': 0.7,
    'MaxInteraction': 2
}



#Provide the validation data set to get the validation error:

MetaOpts['ValidationSet'] = {
    'X': X_val.tolist(),
    'Y': Y_E2_val.tolist()
}

# Create the PCE metamodel:

myPCE_E2 = uq.createModel(MetaOpts)

# Print a summary of the resulting PCE metamodel:
uq.print(myPCE_E2)

Output :

Warning: The selected PCE has more than 1 output. Only the 1st output will be
printed

In uq_PCE_print (line 12)
In uq_print_uq_metamodel (line 16)
In uq_module/Print (line 52)
In uq_print (line 29)
In uq_web_cli_filebased (line 301)
You can specify the outputs you want to be displayed with the syntax:
uq_display(PCEModule, OUTARRAY)
where OUTARRAY is the index of desired outputs, e.g. 1:3 for the first three

------------ Polynomial chaos output ------------
Number of input variables: 22
Maximal degree: 4
q-norm: 0.70
Size of full basis: 782
Size of sparse basis: 141
Full model evaluations: 7482
Leave-one-out error: 3.4710601e-02
Modified leave-one-out error: 3.6061287e-02
Validation error: 3.6288626e-02
Mean value: 72.7910
Standard deviation: 43.7980
Coef. of variation: 60.169%
--------------------------------------------------

Retrieving maximum LOO error on the trained PCE :

LOO_errors_2 = []
for i in range(len(myPCE_E2['Error'])):
    LOO_errors_2.append(myPCE_E2['Error'][i]['LOO'])

max(LOO_errors_2)

Output : 0.16775668945295738

Hello @GuillaumeGru

Do I understand correctly that you are handling your output signals with 66 PCE models, each responsible for one output time step? If so, how do the LOO errors behave for different outputs? For example, do you observe higher accuracy for early time steps and worse for later ones?

Make sure that the split between training and testing is random. It is not clear from your code whether this is the case. There are Python packages that can do this for you.

Have you tested different PCE bases? E.g. with higher order interactions, higher degree or stricter/softer q-norm? Since you have 22 input variables, you will need to balance these parameters somewhat carefully, but given the number of samples you have, there is some leeway to increase the complexity of your model.

Hope this helps a bit!

Hello @styfen.schaer and thanks for your reply !

Indeed, from what I understood, as my target output Y is of size (Nsamples, Nt = 66), the function uq.createModel creates one PCE for each column of the array Y. In other words, one PCE for each time step of the signal time series, is that right ?
Concerning the behavior of the LOO errors for different outputs, here is a figure of the LOO error as a function of the time step index :
LOO_error_idx

Interestingly, the shape of this curve is very similar to the shape of the signal time series, as an example, here is a figure of the output of the PCE and the target signal time series for the first sample of the test set :
PCE_sig_comp

For the second point, the split between training and testing sets is indeed not random in this version of the code : the limit index is chosen to be 75% of the total number of samples, the training set is made of the first 75% of the samples and the test set is composed of the rest.

I have tried using the train_test_split function from sklearn to make the split random, but this doesn’t improve the results of the PCE. As every row of the X array is the realization of a random vector, I don’t see how shuffling the order of the rows would help the PCE training.
What is the advantage of making the split random versus simply splitting the arrays at a given index ?

I have made some tests on the number of interactions, the maximum degree and the q-norm, but so far every change I made led to very high computation time when calling the createModel function (higher than the timeout limit which I manually set to 10 000 seconds). I will run some more tests and keep this post updated on the results.

Thanks again for your time.
Best regards,
Guillaume

Yes, your understanding is correct. UQLab creates a separate PCE model for each output variable.

Regarding the split between training and testing: if your samples are all randomly generated, the split you made is fine. It just wasn’t obvious from the code. The problem is, if they are not random, the training data may differ significantly from the test data. It would then not be surprising if the training result is much better than the test result.

Training 66 PCE models definitely takes some time, especially when using adaptivity. However, since all models are trained independently of each other, you can concentrate on just one output for the time being. E.g. on time step 11, as this is the worst. Once you have found a good setting for this output, you can look at the full problem again.

Dear @styfen.schaer,

thank you again for your reply and precious advice.

My samples are all generated randomly with LHS (uq.getSample).

That’s a very interesting suggestion to try and reduce the error on one output first and then go back to the full problem. I’ll look into this and keep this post updated.

Guillaume