UQ[py]Lab v0.95 - trouble connecting to a UQ cloud session

I’m having some trouble connecting to a UQ cloud session. I get a timeout error:

Traceback (most recent call last):

  File /opt/anaconda3/envs/uqlab/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/Documents/Postdoc_IQF/python/uqlab/uq_sens_analysis_Hgmodel_loc_iodine.py:516
    mySession = sessions.cloud()

  File /opt/anaconda3/envs/uqlab/lib/python3.10/site-packages/uqpylab/sessions.py:23 in __call__
    cls._instances[cls] = super(_Singleton, cls).__call__(*args, **kwargs)

  File /opt/anaconda3/envs/uqlab/lib/python3.10/site-packages/uqpylab/sessions.py:421 in __init__
    self.new()

  File /opt/anaconda3/envs/uqlab/lib/python3.10/site-packages/uqpylab/sessions.py:435 in new
    raise RuntimeError(resp['Message'])

RuntimeError: Timeout reached

Could the issue be related to a user limit imposed on cloud usage? Also, are there plans to release UQ[py]Lab with local computation capabilities in the future?

Hello everyone,
I have the same problem as Aryeh, I was able to connect to the cloud this morning and perform analysis but now I get the “Timeout reached” error when I try to create a UQ cloud session.
Is there a server problem?

Best regards,
Guillaume Gru

Hello everyone,

Thanks a lot for letting us know! We discovered that one process was consuming almost all of the resources. We have terminated it, and everything should be working properly again.

Best,
Adela

Dear @ari_f and @GuillaumeGru,
thank you for reporting this.

I checked the main server, and indeed one worker was using over 100GB of RAM, hence causing slowdowns. I have manually killed the process, and I will investigate why this was not caught by the automatic resource control (we are updating the backend continuously).

@ari_f: yes, we plan to provide a completely offline version of UQPyLab towards the end of this year or early next year, which will replace the current centrally hosted system.
While the client won’t be affected (so current scripts will continue to work), responsibility for hosting the server will move to users or their institutions, and we will stop hosting free instances.
Details about this process are still being finalized.

Best regards,
Stefano

Dear @Adela and @ste,

Thank you for your reply and the fix.
It still seems quite unstable: I managed to connect for a minute and run an analysis, but now I’m back to the “Timeout reached” error.

Best regards,
Guillaume Gru

Dear @GuillaumeGru ,

If your analysis takes more than 3 minutes, the session is closed automatically (this minimizes dangling sessions). You can set the timeout yourself from the client if you need longer wait times; it is specified in seconds. For example, to set it to 1000 s, use the following line:

mySession.timeout = 1000

Sometimes, the calculations cannot be gracefully closed even after the timeout is reached, and the worker keeps running until the operation ends.

In these cases, you can use the following command to force a worker to restart:

mySession = sessions.cloud(host=myInstance, token=myToken, force_restart=True)

Let me know if this helps!

Best regards
Adela

Dear @Adela,

thank you for your reply.

To give you a little bit of context on the analysis I am trying to make :

I am trying to do a PCE-based sensitivity analysis (first-order Sobol’ indices).
The analysis worked for a first, simpler model with 8 parameters, and now I am trying to run it with a more complex model (18 parameters).
I am training the PCE on samples that were previously computed (I did that for the first analysis as well).
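For reference, the setup looks roughly like the sketch below. The option names follow the published UQ[py]Lab PCE examples, but the data shapes are placeholders, not my actual model:

```python
# Sketch of PCE metamodel options trained on precomputed samples.
# X and Y are placeholders standing in for the real experimental design
# (N samples of the 18 input parameters and the corresponding model outputs).
X = [[0.1] * 18, [0.2] * 18, [0.3] * 18]   # N x M input samples (placeholder)
Y = [1.0, 1.1, 0.9]                        # N model evaluations (placeholder)

MetaOpts = {
    'Type': 'Metamodel',
    'MetaType': 'PCE',
    'Method': 'LARS',                 # sparse regression solver
    'Degree': list(range(1, 6)),      # candidate degrees; the best is selected
    'ExpDesign': {'X': X, 'Y': Y},    # reuse existing samples, no new model runs
}
# With an active cloud session, the metamodel is then created via:
# myPCE = uq.createModel(MetaOpts)
```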

The mySession.timeout trick worked and I no longer get the timeout error, but when I run the
myPCE = uq.createModel(MetaOpts)
command, the analysis doesn’t seem to end (it has currently been running for 13 minutes). I am surprised, because there is an example of a PCE-based Sobol analysis with more than 100 parameters on the UQ[py]Lab website.

Do you have any idea where this issue might come from?

Best regards,
Guillaume Gru

Dear @GuillaumeGru,

Could you please provide me with a minimal working example so that I can pinpoint the issue?

Thank you very much!

Best regards,
Adela

Actually, there was a detail that I forgot to take from the multi-dimensional example: the part where the PCE truncation scheme is specified.

PCEOpts['TruncOptions'] = {
    'qNorm': 0.7,
    'MaxInteraction': 2
}

With this fix, the PCE training works.
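To see why this truncation matters so much with 18 inputs, here is a rough pure-Python count of candidate PCE basis terms with and without the hyperbolic (q-norm) truncation. The degree p = 3 below is chosen only for illustration; the actual degree depends on the analysis settings:

```python
from math import comb

def count_full_basis(M, p):
    """Number of multi-indices of M inputs with total degree <= p: C(M+p, p)."""
    return comb(M + p, p)

def count_truncated_basis(M, p, q, max_interaction):
    """Count multi-indices alpha with hyperbolic norm (sum alpha_i^q)^(1/q) <= p
    and at most max_interaction nonzero entries (i.e. interacting inputs)."""
    total = 1  # the constant term (all-zero multi-index)
    for k in range(1, max_interaction + 1):
        patterns = comb(M, k)  # choices of which k inputs are active
        admissible = 0

        def rec(prefix):
            nonlocal admissible
            # prune branches whose partial q-norm already exceeds p
            if sum(a ** q for a in prefix) ** (1.0 / q) > p + 1e-9:
                return
            if len(prefix) == k:
                admissible += 1
                return
            for a in range(1, p + 1):
                rec(prefix + [a])

        rec([])
        total += patterns * admissible
    return total

M, p = 18, 3  # 18 inputs; degree 3 chosen only for illustration
print(count_full_basis(M, p))               # 1330 candidate terms
print(count_truncated_basis(M, p, 0.7, 2))  # 208 terms with qNorm=0.7, MaxInteraction=2
```

With q-norm 0.7 and interactions limited to pairs, the candidate basis shrinks from 1330 to 208 terms at this degree, and the gap widens quickly at higher degrees, which is consistent with the untruncated training taking much longer.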

Thank you for your help, best regards,
Guillaume


Thanks @Adela and @ste! Good to know about these plans. I’m interested in running sensitivity analyses for many cases, so I wasn’t sure whether this is appropriate for the cloud system.

Dear @GuillaumeGru ,

I am glad that it works for you now!

Best regards,
Adela

Hi, I didn’t want to start a new topic, so I’m writing here.

I’m encountering the same issue with UQ[py]Lab 1.0.2: RuntimeError: Timeout reached. I don’t think it’s a problem with the session time limit, because the error message appears before the session’s timeout value is reached. Also, the error appears when I try to start the uqpylab session.

I would be very grateful for help in solving this problem,
Jacek.

Dear @Jacek_Swiegoda,

Everything seems to be fine on our side. If the timeout is not a problem, the session might still be active and not properly terminated. In this case, you can use the following command to force a worker to restart:

mySession = sessions.cloud(host=myInstance, token=myToken, force_restart=True)

Could you try it and let me know if it resolves the issue?

Best regards
Adela


To add to @Adela’s comment: when you kill a job on the client, e.g. through Ctrl-C, the remote job is not killed immediately, and for some specific jobs it is not killed until completion (e.g. long Bayesian analyses). In these cases, connecting to the server does not provide a response until your worker is free, resulting in a timeout even during the connection.

The “force_restart” flag instructs the remote API to force-kill the worker and start a fresh session, without waiting for worker availability.

I hope this makes sense!


Dear @Adela and @ste

The error was actually on my side due to manual interruption of the python script. A force restart allowed the session to be restarted. Thank you very much for such quick help and comprehensive explanation of the problem!

Best regards,
Jacek


Dear @Jacek_Swiegoda,

I’m glad we resolved the issue!

Happy UQ[py]Lab-ing! :)

Best regards
Adela
