[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-09-20 Thread Xaver Stiensmeier via slurm-users
Hey Nate, we actually fixed our underlying issue that caused the NOT_RESPONDING flag - on failures we terminated the node manually ourselves instead of letting Slurm call the terminate script. That led to Slurm believing the node should still be there when it had already been terminated. Therefore,
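For context, the mechanism being described is Slurm's power saving layer: slurmctld is supposed to tear cloud nodes down itself via the configured SuspendProgram, so its internal state stays consistent. A minimal sketch of the relevant slurm.conf lines (the script path and value below are placeholders, not taken from the thread):

    # slurm.conf (sketch; path and timeout are examples)
    SuspendProgram=/opt/slurm/bin/terminate_node.sh   # Slurm calls this to power nodes down
    SuspendTime=300                                    # seconds idle before Slurm powers a node down

If the instance is instead deleted outside of Slurm, slurmctld still believes the node exists and eventually flags it NOT_RESPONDING, which matches the behaviour described above.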

[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Xaver Stiensmeier via slurm-users
Thanks Steffen, that makes a lot of sense. I will just not start slurmd in the master ansible role when the master is not to be used for computing. Best regards, Xaver On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote: On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote: Dear Slu

[slurm-users] How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Xaver Stiensmeier via slurm-users
Dear Slurm users, in our project we exclude the master from computing before starting slurmctld. We used to exclude the master by simply not mentioning it in the configuration, i.e. just not having PartitionName=SomePartition Nodes=master or something similar. Apparently, thi
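As a rough illustration of the approach described (all node and partition names below are made up), the master stays out of scheduling if it is simply not listed in any partition's node list:

    # slurm.conf (illustrative names only)
    NodeName=master CPUs=4 State=UNKNOWN
    NodeName=worker[1-4] CPUs=8 State=UNKNOWN
    # master is defined but not listed in the partition, so no jobs land on it
    PartitionName=SomePartition Nodes=worker[1-4] Default=YES MaxTime=INFINITE State=UP

The alternative hinted at in the subject line would be to keep the master in a partition but drain it at runtime, e.g. scontrol update NodeName=master State=DRAIN Reason="no compute on master".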

[slurm-users] Slurm.conf and workers

2024-04-15 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, as far as I understand it, slurm.conf needs to be present on the master and on the workers at its default path (if no other path is set via SLURM_CONF). However, I noticed that when adding a partition only in the master's slurm.conf, all workers were able to "correctly" show t
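A small sketch of the SLURM_CONF point (the path below is an assumption, not quoted from the post): slurmd and the client commands look for slurm.conf at the compiled-in default unless the environment variable overrides it:

    # point slurmd / client commands at a non-default config (example path)
    export SLURM_CONF=/opt/slurm/etc/slurm.conf
    slurmd -D -vv    # run slurmd in the foreground using that config

One likely explanation for the observation in the thread is that sinfo and scontrol query slurmctld over RPC, so they reflect the controller's configuration rather than whatever is in the local file on each worker.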

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-09 Thread Xaver Stiensmeier via slurm-users
any hot-fix/updates from the base image or changes. By running it from the node, it would alleviate any CPU spikes on the Slurm head node. Just a possible path to look at. Brian Andrus On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote: Dear slurm user list, we make use of elast

[slurm-users] Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Xaver Stiensmeier via slurm-users
Dear slurm user list, we make use of elastic cloud computing, i.e. node instances are created on demand and destroyed when they have not been used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass th
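For readers unfamiliar with the mechanism under discussion: when Slurm powers up cloud nodes, it invokes the configured ResumeProgram with the node names as a single hostlist expression, which the script can expand itself. A minimal sketch of such a script (log path and the provisioning step are placeholders):

    #!/bin/bash
    # ResumeProgram sketch: $1 is a hostlist expression, e.g. "cloud[1-3]"
    for node in $(scontrol show hostnames "$1"); do
        echo "$(date) powering up $node" >> /var/log/slurm/power_save.log
        # provisioning call for $node (e.g. Ansible) would go here - placeholder
    done

How many nodes end up in a single invocation depends on when the requests reach slurmctld, which is exactly the grouping question raised in the thread.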

[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-29 Thread Xaver Stiensmeier via slurm-users
I am wondering why my question (below) didn't catch anyone's attention - just as feedback for me: is it unclear where my problem lies, or is it clear but no solution is known? I looked through the documentation and now searched the Slurm repository, but am still unable to clearly identify how to

[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-23 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely, it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked down, because the instance behind that node is
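As orientation for the state string in the subject line (node name below is a placeholder; whether this combination of flags ever clears on its own is the open question of the thread):

    # slurm.conf: a DOWN node becomes usable again once it registers with a valid configuration
    ReturnToService=2

    # inspect the flags currently set on the affected cloud node
    scontrol show node cloud1 | grep -i state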

[slurm-users] Slurm Power Saving Guide: Why doesn't Slurm mark the node as failed when ResumeProgram returns != 0

2024-02-19 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I had cases where our ResumeProgram failed due to temporary cloud timeouts. In that case, the ResumeProgram returns a value != 0. Why does Slurm still wait until ResumeTimeout instead of just accepting the startup as failed, which should then lead to a rescheduling of the job
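One pattern people use as a workaround (not taken from the thread, and only a sketch) is to have the resume script itself mark the affected nodes down when provisioning fails, rather than relying on the exit code, since slurmctld otherwise keeps waiting for the nodes to register until ResumeTimeout expires. The provisioning call below is a placeholder:

    #!/bin/bash
    # ResumeProgram wrapper sketch; $1 is the hostlist of nodes to start
    if ! /opt/cloud/start_instances.sh "$1"; then          # placeholder provisioning call
        # tell slurmctld right away that these nodes failed instead of waiting for ResumeTimeout
        for node in $(scontrol show hostnames "$1"); do
            scontrol update NodeName="$node" State=DOWN Reason="cloud start failed"
        done
        exit 1
    fi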

[slurm-users] Re: Errors upgrading to 23.11.0 -- jwt-secret.key

2024-02-08 Thread Xaver Stiensmeier via slurm-users
Thank you for your response. I have found out why there was no error in the log: I had been looking at the wrong log. The error didn't occur on the master, but on our vpn-gateway (it is a hybrid cloud setup) - but you can think of it as just another worker in the same network. The error I get
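For orientation on the jwt-secret.key part of the subject (the key path below is an example, not quoted from the thread): JWT authentication is enabled as an alternative auth type in slurm.conf, and the named key file has to exist and be readable by the daemon that loads it on that host.

    # slurm.conf (sketch; key path is an example)
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key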

[slurm-users] Errors upgrading to 23.11.0

2024-02-07 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I got this error: Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code. See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details. but in slurmctld.service I see nothi
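When the systemd unit only reports a generic failure like this, running the daemon in the foreground with verbose logging usually surfaces the real error. A typical sequence (assuming SlurmUser=slurm; the log path depends on SlurmctldLogFile):

    journalctl -xeu slurmctld.service           # systemd's view of the failure
    sudo -u slurm slurmctld -D -vvvv            # run in the foreground with verbose logging
    tail -n 100 /var/log/slurm/slurmctld.log    # example path; check SlurmctldLogFile in slurm.conf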