Problems when dispatching reliability analyses to a cluster

Hi everyone,
I’m currently trying to use UQLab for reliability analyses on ETH’s Euler cluster.
I tried to follow the UQLab user manual for the HPC Dispatcher module as well as I could, but now I could use some help :slight_smile:

My framework uses a wrapper, which is defined as the model with “modelOpts.mFile = myWrapper”. myWrapper is a function that takes X as input and calls myAnalysis, a MATLAB script that in turn calls several other scripts and functions. Finally, myWrapper returns Y by picking the right variable out of my framework. My entire framework sits in my directory on Euler and is added to the MATLAB path there via “myDispatcher.AddToPath”.

In my local UQLab sessions I can use this framework for e.g. MCS or Subset Simulations and everything works fine. When I use the dispatcher module I can make it run, but in the end I can’t fetch the results since an error occurs.

I set HPC = true directly at the beginning of the UQLab function uq_evalLimitState.m, since this is where uq_evalModel() is called. Is there a better way to do this? In the manual, uq_evalModel is called directly, but in my case I do not see another way…

When I start the dispatch from my local session, the “Dispatcher object” information is printed, and after several “Checking the status of the remote execution…” messages I get
“Job Status: ‘failed’.
something went wrong while trying to dispatch the desired operation
You may find additional information in the following exception message:
Can’t fetch results; Job is failed!
Something went wrong while performing the analysis
You may find additional information in the following exception message:
Can’t fetch results; Job is failed!”

When I check the new folder in my Euler directory, the file ending in .stdout lets me follow all the calculations I ran, e.g. for 100 MCS. All 100 simulations ran through and I even see the Y for each run. Ergo, the simulation itself was completed. After the last run there is a message
“PS: Read file <03Mar2021_at_16385306.stderr> for stderr output of this job.”
In this .stderr file I see “matlab(19):ERROR:105: Unable to locate a modulefile for ‘matlab/R2020a’”

I don’t know what to do with this message. MATLAB obviously was running since I got back my 100 simulation results.

Right now, I have no clue why I can’t fetch the results: whether it has to do with my way of setting the HPC flag, with this error message, or with something else entirely.
Thank you for some hints!

Hello @Stephan_Schilling,

Welcome to UQWorld!

The Dispatcher module is indeed currently rather sensitive to anything written to the standard error stream: if something is produced there, the module assumes the Job is not running correctly and returns the status 'failed'.
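To see whether this is what happened to a given Job, one can check on Euler whether its .stderr file contains anything at all. The following is only a minimal sketch: the helper name stderr_is_clean is hypothetical (it is not part of UQLab), and the file name is the one from the message quoted above.

```shell
# Hypothetical helper, not part of UQLab: inspect a job's .stderr file.
# Returns 0 if the file exists and is empty, 1 if it has content,
# and 2 if the file does not exist.
stderr_is_clean() {
    if [ ! -e "$1" ]; then
        return 2
    fi
    # -s is true when the file exists and has a size greater than zero
    if [ -s "$1" ]; then
        return 1
    fi
    return 0
}

# Example call, run from inside the job folder on the cluster
if stderr_is_clean "03Mar2021_at_16385306.stderr"; then
    echo "stderr is empty"
else
    echo "stderr has content or is missing; have a look at it"
fi
```

Any non-empty .stderr file is worth reading line by line, even when the .stdout file shows that the computations themselves went through.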

It seems that you tried to load a module that does not exist on Euler, or at least did not use the correct syntax. How do you usually load your choice of MATLAB version in your Euler session, and did you put anything in the .PrevCommands option? Is it module load matlab/R2020a (I think this does not work with the environment modules and will give you that error message) or module load new matlab/R2020a?
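One way to compare the two variants before dispatching is to run each of them in an interactive Euler session and see whether it writes anything to standard error. This is just a sketch under assumptions: writes_to_stderr is a hypothetical helper, and whether a given module command stays silent depends on the cluster's module setup.

```shell
# Hypothetical helper, not part of UQLab: returns 0 if the given
# command string writes anything to stderr, 1 otherwise.
writes_to_stderr() {
    # The command runs in a subshell (command substitution), so a
    # successful "module load" does not change the current session.
    # Redirect order matters: stderr is captured, stdout is discarded.
    err=$(eval "$1" 2>&1 >/dev/null) || true
    [ -n "$err" ]
}

# The two variants discussed above
for cmd in 'module load matlab/R2020a' 'module load new matlab/R2020a'; do
    if writes_to_stderr "$cmd"; then
        echo "writes to stderr: $cmd"
    else
        echo "silent: $cmd"
    fi
done
```

A command that writes to stderr here would presumably also leave output in the Job's .stderr file when used in the profile.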

I do find it curious that although the MATLAB module you requested failed to load, you can still run the command matlab on Euler (are there any other commands that load MATLAB?).

NOTE: We haven’t exhaustively tested the Dispatcher module with the reliability analysis module, although it is possible to combine them through the flag you mentioned. Be sure to double-check the remote results (once you’ve got it working) against the local ones (e.g., using a smaller problem)! :smiley:

Hi @damarginal ,

thanks a lot for the fast reply! :smiley: You were right: I used module load matlab/R2020a instead of module load new matlab/R2020a. :see_no_evil:

Also thanks for the NOTE!

Meanwhile I tried to compute a cheap model within my framework to see if everything is working before replying. Unfortunately, in several cases I got errors again. Further, there are some things I do not understand within specific methods.

I could successfully dispatch crude MCS, both with a single process and with, e.g., numProcs set to 12.

However, I could not successfully dispatch Importance Sampling or Subset Simulations. Neither with one, nor with several processes.
E.g. in the case of Subset Simulations I get the following:

Starting Subset Simulation Analysis…
Current subset: 1
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
Current subset: 2
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
something went wrong while trying to dispatch the desired operation
You may find additional information in the following exception message:
date: /cluster/home/schsteph/MATLAB/Sim/04Mar2021_at_13054228/04Mar2021_at_13054228.stdout: No such file or directory

This file is available in the stated directory though…

As can be seen, the first subset finished. The second one has several completed jobs, but the analysis never gets to Subset 3. In another run the same happened, and after several of these jobs it reached Subset 3 but failed later on anyway. The only difference in the error message was the file it could not find. This file, named .uq_job_started, is indeed missing from the directory when I check afterwards…

something went wrong while trying to dispatch the desired operation
You may find additional information in the following exception message:
date: /cluster/home/schsteph/MATLAB/Sim/04Mar2021_at_14195297/.uq_job_started: No such file or directory

In between, I changed one small thing in my framework: I used to save and load some variables, which caused problems with multiple processes, so I removed this save-and-load step, which was basically unnecessary anyway.

Could you please explain why there are sometimes multiple jobs within one subset? The same also happened, by the way, with Importance Sampling. Is there some built-in criterion that might not be reached? Should I increase the batch size in that case?
Do you have any idea what causes the error message about this missing file?

Further, I don’t properly understand the workflow of the dispatcher. Why does my local session need to send several dispatches when I have installed MATLAB and UQLab on Euler? I specified the following in myHPCProfile:

MATLABCommand = '/cluster/apps/matlab/R2020a/bin/matlab';
RemoteUQLabPath = '/cluster/home/schsteph/UQLabCore_Rel1.4.0';
EnvSetup = {'module load open_mpi'};
PrevCommands = {'module load new matlab/R2020a'};

Within my model I also specified modelOpts.RemoteMATLAB = true. The manual, however, recommends this for a UQLink model, which I do not have, since everything runs in MATLAB through my own wrapper. So I’m not sure whether this option applies in my case.

If I understand it correctly, in the current workflow I’m stuck in the scheduler’s queue several times, right? When I later want to apply the framework to heavier models, I will need to specify the necessary computation time, right? Does the specified time then relate to one dispatch only or to the entire analysis?

I know these were many questions; I hope they are still understandable :sweat_smile: Many thanks in advance for any further hints or explanations :smiley:

Hi Stephan,

Sorry for the really long silence. Did you continue using the dispatcher? Were you able to solve the problem? How? :slight_smile:

“If I understand it correctly, in the current workflow I’m stuck in the scheduler’s queue several times, right?”

Yes, that’s right. For your other questions I would need to have a closer look at the Dispatcher module myself, or maybe @damarginal is still around and knows some answers off the top of his head? :slight_smile:

Hi Nora,
No, sadly I did not manage to get it up and running. Meanwhile I’m using several computers for my investigations :see_no_evil: If we continue with this kind of investigation after I have handed in my thesis, we will likely get back to it, though.

Oh, then all the best for the writing of your thesis! :relaxed: And let us know when you get back to it, would be nice to make it work for your problem.

Best,
Nora