Probability of failure of 10^{-16} with MCS on an HPC server

Hello everyone. I am planning my research with UQLab V2.0 and MATLAB 2018 installed on an HPC server. In particular, I hope to estimate a very small probability of failure with the Monte Carlo simulation (MCS) method. I am looking for a probability of failure of 10^{-16}, equivalent to a reliability index of 8.22. With a target CoV=0.05, around 4x10^{18} simulations will be required.
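For reference, the numbers above can be reproduced with a short sketch. It assumes the standard crude-MCS relation CoV^2 = (1 - Pf) / (N * Pf) and the generalized reliability index beta = -Phi^{-1}(Pf). The function names are illustrative only and are not part of UQLab.

```python
from statistics import NormalDist


def reliability_index(pf):
    """Generalized reliability index: beta = -Phi^{-1}(pf)."""
    return -NormalDist().inv_cdf(pf)


def required_samples(pf, cov):
    """Crude MCS sample size for a target CoV of the Pf estimator.

    Solves CoV^2 = (1 - pf) / (N * pf) for N.
    """
    return (1.0 - pf) / (pf * cov ** 2)


beta = reliability_index(1e-16)
n_sim = required_samples(1e-16, 0.05)
print(f"beta = {beta:.2f}, N = {n_sim:.1e}")
```

The same two functions can be called with Pf = 6x10^{-7} to check the second case discussed later in the thread.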

  • Can UQLab V2.0 on an HPC server, using the MCS method of the reliability analysis module, do 4x10^{18} simulations to estimate a probability of failure of 10^{-16}?
  • Would you recommend using the HPC dispatcher module?

Thanks in advance for your answers

Dear @CsrCastillo

Estimating such small probabilities using Monte Carlo simulation is not recommended. Even assuming a single simulation takes only 1 ms, it would still require more than 100 million CPU years to run all simulations. I recommend you take a look at the UQLab manual on structural reliability instead:

It will give you a good starting point on how we usually approximate such tiny failure probabilities much more efficiently.
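The CPU-time estimate above is simple arithmetic and can be checked with a minimal sketch. The 1 ms per simulation is the assumption stated in the post, not a measured value, and the helper name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year in seconds


def cpu_years(n_sim, seconds_per_sim):
    """Total sequential compute time, expressed in CPU years."""
    return n_sim * seconds_per_sim / SECONDS_PER_YEAR


# 4x10^18 simulations at an assumed 1 ms each
years = cpu_years(4e18, 1e-3)
print(f"{years:.2e} CPU years")
```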

Best regards
Styfen


Hi there,
if I can add something, the state of the art in low probability estimation is actually based on active learning, for which we even have a dedicated manual and a recent review paper:
User manual: https://www.uqlab.com/active-learning-user-manual
Review paper: link

Best regards,
Stefano


Dear @styfen.schaer

Thank you very much for your response.

My question arises from a 2018 scientific article in which the authors used UQLab V0.92. They compared the relative errors of the reliability indices between MCS (taken as the true value) and the FORM, SORM, MCS-IS, and MCS-LHS methods (taken as measured values). In that study the probability of failure was around 2x10^{-2}, equivalent to a reliability index of 2.05.

[Image: screenshot 2024-04-09 084007]

The big difference in my problem is that my probability of failure is very small, 10^{-16}, so I need an enormous number of simulations, which I thought an HPC cluster could handle. If the MCS cannot be computed because of RAM and CPU limitations on both a personal computer and an HPC cluster, should I state in my document that it could not be calculated for these reasons?

I also have another problem where the probability of failure is around 6x10^{-7}, equivalent to a reliability index of about 4.9. For a CoV=0.05, I need around 6.7x10^{8} simulations, which my computer cannot perform due to RAM and CPU limitations. Considering that I have already used the other reliability analysis methods: can UQLab V2.0 on an HPC server, using the MCS method of the reliability analysis module, perform 6.7x10^{8} simulations to estimate a probability of failure of 6x10^{-7}?
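On the RAM limitation specifically: a crude Monte Carlo estimate does not need all samples in memory at once, because only the failure count has to be kept. The sketch below illustrates the idea with batches. It is a generic illustration, not UQLab code; the limit state g = R - S with R ~ N(5, 1) and S ~ N(2, 1) is a hypothetical toy problem, and the sample size is kept small so that it runs quickly.

```python
import math
import random


def limit_state(r, s):
    # Hypothetical toy limit state: failure when g = r - s <= 0
    return r - s


def batched_mcs(n_total, batch_size, seed=0):
    """Crude MCS in batches; memory use is bounded by batch_size."""
    rng = random.Random(seed)
    n_fail = 0
    n_done = 0
    while n_done < n_total:
        n_batch = min(batch_size, n_total - n_done)
        # Only the current batch is held in memory
        samples = [(rng.gauss(5.0, 1.0), rng.gauss(2.0, 1.0))
                   for _ in range(n_batch)]
        n_fail += sum(1 for r, s in samples if limit_state(r, s) <= 0.0)
        n_done += n_batch
    pf = n_fail / n_total
    # CoV of the estimator; undefined if no failure was observed
    cov = math.sqrt((1.0 - pf) / (n_total * pf)) if n_fail else float("inf")
    return pf, cov


pf, cov = batched_mcs(n_total=200_000, batch_size=10_000)
print(f"Pf = {pf:.4f}, CoV = {cov:.3f}")
```

Batching removes the memory bottleneck but not the CPU cost: the total number of model evaluations stays the same.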

Thanks in advance for your answers

Thank you very much for your reply @ste

I’m going to review the material.

Best regards.

Hi @CsrCastillo

Especially for methodological papers, it is indeed useful to have an unbiased, accurate reference value that has been determined using Monte Carlo simulation. However, your failure probability is simply too low. You can do the math and you’ll find that the cost of the electricity required to run these simulations is, at best, millions of dollars. So I think it is indeed a valid reason not to calculate this MC reference value. Whether this is an acceptable justification in your document, I cannot say. But out of curiosity, how did you come up with the number 10^{-16}?

Your second case with a failure probability of 6 \cdot 10^{-7} might be feasible, but it really depends on your problem and the hardware you have available. UQLab provides a very convenient interface to distribute your MCS runs across multiple machines, but the total work done remains about the same. So the question is, how long does a single run of your model take? And how many resources do you have available?
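To make the trade-off concrete, here is a rough sketch of the ideal wall-clock time when the runs are spread over several workers. It assumes perfect parallel scaling with no communication overhead, and the 1 ms per run is only a placeholder until the real model has been timed. The function name and worker counts are illustrative.

```python
def wall_clock_hours(n_sim, seconds_per_sim, n_workers):
    """Ideal wall-clock time in hours, assuming perfect parallel scaling."""
    return n_sim * seconds_per_sim / n_workers / 3600.0


# 6.7x10^8 runs at an assumed 1 ms each, for a few worker counts
for workers in (1, 16, 256):
    hours = wall_clock_hours(6.7e8, 1e-3, workers)
    print(f"{workers:4d} workers: {hours:8.2f} h")
```

The product of wall-clock time and worker count is constant, which is the sense in which the total work remains the same.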

Hi @styfen.schaer

Thanks for your reply. Now I understand the great computational cost involved in calculating a failure probability of 10^{-16} with Monte Carlo simulation on an HPC cluster. I arrived at this value after obtaining failure probabilities of 1.74⋅10^{-13}, 6.34⋅10^{-14}, 6.08⋅10^{-14} and 6.98⋅10^{-15} with FORM, SORM, MCS-IS and MCS-SS, respectively, using UQLab V2.0 on my computer. I wanted to report the relative error of each method with respect to an MCS reference value. My probabilistic model includes 10 random variables.

Regarding the second question, the problem is similar: the probabilistic model has 10 random parameters and 11 deterministic parameters, the only difference being that the value of one deterministic parameter is larger. I am checking which resources (CPUs and RAM) are available, and I will reply with the details once I have an answer. On the other hand, how do I calculate how long a single run of my model takes? Does this mean the time it takes to run 1 simulation, where the 10 random parameters are sampled once and the model output is 1 value?

Thanks in advance for your answers
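As a note on the timing question: "a single run" is usually understood as one model evaluation for one realization of the 10 random inputs, returning one output value. A minimal way to measure it is to average over many repeated evaluations. In the sketch below, `model` is a hypothetical stand-in for the real model, and the input values are arbitrary.

```python
import time


def model(x):
    # Hypothetical stand-in for the real model: 10 inputs -> 1 output
    return sum(xi ** 2 for xi in x)


def seconds_per_run(model_fn, x, n_repeat=1000):
    """Average time of one model evaluation over n_repeat calls."""
    start = time.perf_counter()
    for _ in range(n_repeat):
        model_fn(x)
    return (time.perf_counter() - start) / n_repeat


t_run = seconds_per_run(model, [0.1] * 10)
print(f"{t_run:.2e} s per run")
```

Multiplying the measured time per run by the required number of simulations gives the total sequential compute time.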