Hi @damarginal ,
thanks a lot for the fast reply! You were right: I had used module load matlab/R2020a instead of module load new matlab/R2020a.
Also thanks for the NOTE!
Before replying, I tried running a cheap model within my framework to see whether everything works. Unfortunately, in several cases I got errors again. There are also some things I do not understand about specific methods.
I could successfully dispatch crude MCS, both with a single process and with several (e.g. 12 numProcs).
However, I could not successfully dispatch Importance Sampling or Subset Simulation, neither with one process nor with several.
For example, with Subset Simulation I get the following output:
Starting Subset Simulation Analysis…
Current subset: 1
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
Current subset: 2
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
Job Status: ‘complete’ reached.
[DISPATCHER] Send the dispatch package…(OK)
[DISPATCHER] Give permission to execute: mpifile.sh…(OK)
[DISPATCHER] Give permission to execute: qfile.sh…(OK)
[DISPATCHER] Submit the Job to the remote machine…(OK)
[DISPATCHER] Get the remote Job ID…(OK)
Checking the status of the remote execution…
something went wrong while trying to dispatch the desired operation
You may find additional information in the following exception message:
date: /cluster/home/schsteph/MATLAB/Sim/04Mar2021_at_13054228/04Mar2021_at_13054228.stdout: No such file or directory
However, this file does exist in the stated directory…
As can be seen, the first subset finishes. The second one has several completed jobs, but it never gets to subset 3. In another run I saw the same behaviour: after several of these jobs it reached subset 3 but failed later on anyway. The only difference in the error message was the file that could not be found. That file, .uq_job_started, is indeed missing from the directory when I check afterwards:
something went wrong while trying to dispatch the desired operation
You may find additional information in the following exception message:
date: /cluster/home/schsteph/MATLAB/Sim/04Mar2021_at_14195297/.uq_job_started: No such file or directory
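To tell apart "the file is never written" from "the file shows up a little too late", something like the following could be run on the cluster right after a job is submitted. This is only a minimal sketch: the helper wait_for_file is hypothetical and not part of UQLab, and the idea that timing plays a role here is purely my assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of UQLab): poll until a file exists
# or a timeout (in seconds) expires.
# Usage: wait_for_file <path> <timeout_seconds>
wait_for_file() {
    file_path=$1
    timeout=$2
    elapsed=0
    while [ ! -e "$file_path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "missing after ${timeout}s: $file_path"
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "found after ${elapsed}s: $file_path"
    return 0
}
```

For instance, calling wait_for_file with the path /cluster/home/schsteph/MATLAB/Sim/04Mar2021_at_14195297/.uq_job_started and a timeout of 30 would report whether, and after how many seconds, the marker file appears.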
In between these runs I changed one small thing in my framework: previously I saved and loaded some variables, which caused problems with multiple processes, so I removed this save-and-load part, which was basically unnecessary anyway.
Could you please explain why there are sometimes multiple jobs within one subset? The same happened, by the way, with Importance Sampling. Is there some built-in criterion that might not be reached? Should I increase the batch size in that case?
Do you have any idea what causes the error message concerning this missing file?
Further, I don’t properly understand the workflow of the dispatcher. Why does my local session need to send several dispatches when I have installed MATLAB and UQLab on Euler? I specified the following in myHPCProfile:
MATLABCommand = '/cluster/apps/matlab/R2020a/bin/matlab';
RemoteUQLabPath = '/cluster/home/schsteph/UQLabCore_Rel1.4.0';
EnvSetup = {'module load open_mpi'};
PrevCommands = {'module load new matlab/R2020a'};
Within my model I also specified modelOpts.RemoteMATLAB = true. In your manual, however, this option is described for the case of a UQLink model, which I do not have, since everything I use is MATLAB with my own wrapper. So I’m not sure whether it is available in my case.
If I understand it correctly, in the current workflow I end up waiting in the scheduler queue several times, right? When I later want to apply the framework to heavier models, I will need to specify the necessary computation time, right? Does the specified time then relate to a single dispatch only or to the entire analysis?
I know these were many questions; I hope they are still understandable. Many thanks in advance for any further hints or explanations.