How to restart at a crashed iteration in UQLink model

Anarionz · April 13, 2020, 6:37pm

Hello, I have a UQLink model on which I am running a Sobol' sensitivity analysis with relatively large sample sizes. This results in approximately 90000 model evaluations and takes about a week to complete. Sometimes during the simulation I get a NaN error or similar, and sometimes the run crashes.

I am wondering if there is a way to restart the run using the same random seed at the iteration of your choice (i.e. where it last failed / crashed).

I’ve looked through the manual, but there doesn’t seem to be an intuitively obvious way to do it.

Any help would be much appreciated. Thank you.

damarginal · April 15, 2020, 11:22am

Hi @Anarionz ,

UQLink does support recovering and resuming calculations, but because of the way Sobol’ indices are estimated by the Monte Carlo procedure in UQLab, there is currently no way to decouple the sample point generation, the model evaluation, and the index estimation. That is, everything is done in one pass.

Using the post-processed .mat file produced by UQLink, you might be able to recover/resume some of your failed calculations to complete the sample points required to estimate the indices. But as far as I know, there’s no straightforward way to use these pre-calculated sample points to estimate Sobol’ indices using UQLab.

For large problems coupled with relatively expensive computational models (with a non-negligible risk of failed calculations), you might want to consider using a metamodel for your sensitivity analysis instead. I think, with some care, you might be able to recycle some of the computations already done for the Sobol’ indices to construct a metamodel (these results are stored in the processed .mat file produced by UQLink).

Or, maybe @moustapha has an alternative approach to this problem?

PS: Crashes in the code can either be just bad luck (say, some bad combination of input values that creates numerical difficulties), or a symptom of a more systematic error (e.g., the code is used outside the range of verifiable inputs). The latter is the more serious problem, so it would be good to be reasonably sure about the cause.

moustapha · April 27, 2020, 10:03am

Hi @Anarionz,

There is not much to add to what @damarginal has said. There are two ways to recover/resume a UQLink analysis:

Y = uq_evalModel(X, 'recover') will re-run the model for the rows of X whose corresponding response is NaN in the saved .mat file. (At each evaluation of a UQLink model, the inputs and outputs are stored in a .mat file.)
Y = uq_evalModel(X, 'resume') will run the analysis again starting from the first row that was not evaluated. This feature can be used if the model evaluation was interrupted externally.
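A minimal sketch of how the two options might be used while keeping the sample fixed between runs. It assumes UQLab has been initialised, that an input object and the UQLink model are already the currently selected ones in the session, and that you generate X yourself (a Sobol' analysis generates its sample internally, so this does not apply to it directly). The seed value, the sample size, and the file name my_sample.mat are placeholders for illustration only.

```matlab
% Assumption: uqlab has been called, and an input object and a UQLink model
% have already been created (so they are the currently selected ones).

% Fix the seed so the same sample can be regenerated if needed
rng(100, 'twister');               % 100 is an arbitrary example seed
X = uq_getSample(1000, 'LHS');     % sample size chosen for illustration
save('my_sample.mat', 'X');        % hypothetical file name; keeps X on disk

% First attempt: some rows may come back as NaN, or the run may be interrupted
Y = uq_evalModel(X);

% Later, with the same X:
load('my_sample.mat', 'X');
Y = uq_evalModel(X, 'recover');    % re-runs only rows whose saved response is NaN
% Y = uq_evalModel(X, 'resume');   % or: continue from the first row not yet evaluated
```

Saving X explicitly avoids relying on the random number generator state alone to reproduce the same rows after a crash.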

In this case, the 'recover' option seems to be the appropriate one. However, as Damar mentioned, it may not be easy to recover all the evaluations of your Sobol’ analysis, as the .mat file is overwritten every time uq_evalModel is run (this is something that we plan to improve in the next release).
In general, with models as expensive as yours, I would also suggest building a surrogate model (say, a PCE) and then using it for the Sobol’ sensitivity analysis.
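The surrogate route could look roughly like the sketch below. It assumes that X and Y hold the inputs and responses of the runs that completed (for example, loaded from the processed .mat file; the variable names here are placeholders, not the names UQLink uses in that file), that Y has a single output column, and that the input object describing the distributions is the currently selected one. The degree range is only an example.

```matlab
% Assumption: X (N-by-M inputs) and Y (N-by-1 responses) are already in the
% workspace, and the probabilistic input object is the currently selected one.

% Drop the failed runs
valid = ~any(isnan(Y), 2);

% Build a PCE metamodel from the existing evaluations
MetaOpts.Type = 'Metamodel';
MetaOpts.MetaType = 'PCE';
MetaOpts.ExpDesign.X = X(valid, :);
MetaOpts.ExpDesign.Y = Y(valid, :);
MetaOpts.Degree = 1:5;             % example range of candidate degrees
myPCE = uq_createModel(MetaOpts);  % becomes the currently selected model

% Sobol' analysis on the currently selected model (now the PCE)
SobolOpts.Type = 'Sensitivity';
SobolOpts.Method = 'Sobol';
SobolOpts.Sobol.Order = 1;
mySobol = uq_createAnalysis(SobolOpts);
uq_print(mySobol);
```

Since the analysis is run on the metamodel rather than on the original code, it no longer depends on every expensive run succeeding. It is still worth checking the accuracy of the PCE (e.g. its error estimates) before trusting the indices.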

Anarionz · May 8, 2020, 4:03pm

Thank you @damarginal @moustapha, I will give the “resume” way a try.

Regards