Dear slurm-user list,
as far as I understand it, the slurm.conf needs to be present on the
master and on the workers at the same location (if no other path is set via
SLURM_CONF). However, I noticed that when adding a partition only in the
master's slurm.conf, all workers were able to "correctly" show
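As an aside, a quick way to confirm which configuration each node actually reads, and to push an edited slurm.conf to running daemons, might look like this (a hedged sketch; it assumes scontrol is available on the node):
```
# show the config path and version the local daemons report
scontrol show config | grep -E 'SLURM_CONF|SLURM_VERSION'
# after copying the edited slurm.conf to all nodes, reread it without restarting
sudo scontrol reconfigure
```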
hot-fix/updates
from the base image or changes. By running it from the node, it would
alleviate any CPU spikes on the Slurm head node.
Just a possible path to look at.
Brian Andrus
On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote:
Dear slurm user list,
we make use of elastic cloud computing, i.e. node instances are created
on demand and are destroyed when they are not used for a certain amount
of time. Created instances are set up via Ansible. If more than one
instance is requested at the exact same time, Slurm will pass
to handle "NOT_RESPONDING".
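For reference, nodes stuck in a not-responding state can usually be inspected like this (a hedged sketch; the node name is a placeholder):
```
# list nodes that are down/drained/failing together with the recorded reason
sinfo -R
# inspect a single cloud node (placeholder name)
scontrol show node cloud-node-001 | grep -iE 'State=|Reason='
```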
I would really like to improve my question if necessary.
Best regards,
Xaver
On 23.02.24 18:55, Xaver Stiensmeier wrote:
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely it can
happen that slurm's resumeTimeout is reached and the node is therefore
powered down. We have set ReturnToService=2 in order to avoid the node
being marked down, because the instance behind that node is
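For context, a hedged slurm.conf excerpt for this kind of cloud power-saving setup; the paths and timeouts below are illustrative assumptions, not the reporter's actual values:
```
ReturnToService=2                    # a DOWN node returns to service once it registers with a valid configuration
ResumeProgram=/opt/slurm/resume.sh   # script that starts the cloud instance (path assumed)
SuspendProgram=/opt/slurm/suspend.sh # script that terminates the instance (path assumed)
ResumeTimeout=600                    # seconds before a resume attempt counts as failed
SuspendTime=300                      # idle seconds before a node is powered down
```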
Dear slurm-user list,
I had cases where our resumeProgram failed due to temporary cloud
timeouts. In that case the resumeProgram returns a non-zero value. Why does
Slurm still wait until resumeTimeout instead of just accepting the
startup as failed, which should then lead to a rescheduling of the
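A minimal sketch of a resumeProgram wrapper that reports such failures with a non-zero exit code; `start_cloud_instance` is a hypothetical helper standing in for the actual cloud/Ansible call:
```
#!/bin/bash
# Slurm invokes the ResumeProgram with the hostlist of nodes to power up as $1.
set -euo pipefail
NODES="$1"

# hypothetical helper that creates and provisions the instances
if ! start_cloud_instance "$NODES"; then
    echo "resume failed for $NODES" >&2
    exit 1   # non-zero exit marks this resume attempt as failed
fi
```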
Thank you for your response.
I have found out why there was no error in the log: I've been
looking at the wrong log. The error didn't occur on the master, but on
our VPN gateway (it is a hybrid cloud setup) - but you can think of it as
just another worker in the same network. The error I
Dear slurm-user list,
I got this error:
Unable to start service slurmctld: Job for slurmctld.service failed
because the control process exited with error code.\nSee \"systemctl
status slurmctld.service\" and \"journalctl -xeu slurmctld.service\" for
details.
but in slurmctld.service I see
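The usual first steps for a failure like this are the commands named in the error message, plus running the controller in the foreground with verbose logging (a hedged sketch; adjust the user to your SlurmUser):
```
systemctl status slurmctld.service
journalctl -xeu slurmctld.service
# stop the unit first, then run the daemon in the foreground with verbose output
sudo -u slurm slurmctld -D -vvv
```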
. You can run 'df
-h' and see some info that would get you started.
Brian Andrus
On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
Dear slurm-user list,
during a larger cluster run (the same one I mentioned earlier, 242 nodes), I
got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is
hat Slurmd is placing in this dir that fills
up the space. Do you have any ideas? Due to the workflow used, we have a
hard time reconstructing the exact scenario that caused this error. I
guess the "fix" is to just pick a slightly larger disk, but I am unsure
whether Slurm behaves normally here.
or not, but it's worth a try.
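A hedged way to see where the space goes; the path below is only the common default, so check your own SlurmdSpoolDir first:
```
scontrol show config | grep -i SlurmdSpoolDir
df -h /var/spool/slurmd
du -xh --max-depth=1 /var/spool/slurmd | sort -h
```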
Best regards
Xaver
On 06.12.23 12:03, Ole Holm Nielsen wrote:
On 12/6/23 11:51, Xaver Stiensmeier wrote:
Good idea. Here's our current version:
```
sinfo -V
slurm 22.05.7
```
Quick googling told me that the latest version is 23.11. Does the
upgrade change anything
matter for your power saving experience. Do
you run an updated version?
/Ole
On 12/6/23 10:54, Xaver Stiensmeier wrote:
Hi Ole,
I will double check, but I am very sure that giving a reason is possible
as it has been done at least 20 other times without error during that
exact run. It might
Hi Xavier,
On 12/6/23 09:28, Xaver Stiensmeier wrote:
using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in
slurm_update error: Invalid node state specified
when we called:
scontrol update NodeName="$1" state=
Maybe someone has a great idea how to tackle this problem.
Best regards
Xaver Stiensmeier
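For reference, a node-state update in a power-saving script typically looks like the sketch below; the state and reason are illustrative assumptions, not the exact call from the report:
```
# mark a cloud node for power-down and record a reason (values assumed)
scontrol update NodeName="$1" State=POWER_DOWN Reason="cloud instance gone"
```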
art slurmd*
# master
run without any issues afterwards.
Thank you for all your help!
Best regards,
Xaver
On 19.07.23 17:05, Xaver Stiensmeier wrote:
Hi Hermann,
count doesn't make a difference, but I noticed that when I reconfigure
slurm and do reloads afterwards, the error "gpu c
the "Count=..." part in gres.conf
It should read
NodeName=NName Name=gpu File=/dev/tty0 Count=1
in your case.
Regards,
Hermann
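For completeness, a gres.conf entry like the one above normally needs a matching definition in slurm.conf; a hedged sketch reusing the node name from the example (everything else assumed):
```
GresTypes=gpu
NodeName=NName Gres=gpu:1
```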
On 7/19/23 14:19, Xaver Stiensmeier wrote:
Okay,
thanks to S. Zhang I was able to figure out why nothing changed.
While I did restart systemctld at the begi
her. I am thankful for any ideas in that regard.
Best regards,
Xaver
On 19.07.23 10:23, Xaver Stiensmeier wrote:
Alright,
I tried a few more things, but I still wasn't able to get past: srun:
error: Unable to allocate resources: Invalid generic resource (gres)
specification.
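For reference, a hedged example of the kind of request being tested; the GRES name and count are assumptions:
```
srun --gres=gpu:1 hostname
```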
I should ment
----
*From:* slurm-users on behalf
of Xaver Stiensmeier
*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,
Good idea, but we are already using `SelectType=select/cons_tres`. After
setting everything up ag
for testing purposes. Could this be the issue?
Best regards,
Xaver Stiensmeier
On 17.07.23 14:11, Hermann Schwärzler wrote:
Hi Xaver,
what kind of SelectType are you using in your slurm.conf?
Per https://slurm.schedmd.com/gres.html you have to consider:
"As for the --gpu* option, these op
) and using one of
those didn't work in my case.
Obviously, I am misunderstanding something, but I am unsure where to look.
Best regards,
Xaver Stiensmeier
:
Allowing all nodes to be powered up, but without automatic suspending
for some nodes except when triggering power down manually.
---
I tried using negative times for SuspendTime, but that didn't seem to
work as no nodes are powered up then.
Best regards,
Xaver Stiensmeier
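One hedged alternative to negative SuspendTime values is excluding the nodes in question from automatic suspension; the node names and timeout below are assumptions:
```
SuspendExcNodes=node[01-04]   # these nodes are never suspended automatically
SuspendTime=300               # other idle nodes suspend after 300 seconds
# excluded nodes can still be powered down manually, e.g.:
#   scontrol update NodeName=node01 State=POWER_DOWN
```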
partitions and allocates
all 8 nodes.
Best regards,
Xaver Stiensmeier
question asks
how to have multiple default partitions which could include having
others that are not default.
Best regards,
Xaver Stiensmeier
On 17.04.23 11:12, Xaver Stiensmeier wrote:
Dear slurm-users list,
is it possible to somehow have two default partitions? In the best case
in a way that Slurm schedules to partition1 by default and only to
partition2 when partition1 can't handle the job right now.
Best regards,
Xaver Stiensmeier
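A hedged partial workaround: jobs can be submitted to a comma-separated list of partitions, and Slurm uses the one that can start the job first (partition names and script are placeholders):
```
sbatch -p partition1,partition2 job.sh
```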
Were larger instances started than needed? ...
I know that this question is currently very open, but I am still trying
to narrow down where I have to look. The final goal is of course to use
this evaluation to pick better timeout values and improve cloud scheduling.
Best regards,
Xaver Stiensmeier
.
So I am basically looking for custom requirements.
Best regards,
Xaver Stiensmeier
ition" in `JobSubmitPlugins`
and this might be the solution. However, I think this is something so
basic that it probably shouldn't need a plugin, so I am unsure.
Can anyone point me towards how setting the default partition is done?
Best regards,
Xaver Stiensmeier
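For reference, the default partition itself is normally set with Default=YES on exactly one PartitionName line in slurm.conf; a hedged sketch with assumed names and node lists:
```
PartitionName=partition1 Nodes=node[01-04] Default=YES State=UP
PartitionName=partition2 Nodes=node[05-08] State=UP
```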
e
maximum explicit.
Best regards,
Xaver Stiensmeier
PS: This is the first time I am using the slurm-user list, and I hope I am not
violating any rules with this question. Please let me know if I do.