[slurm-users] Re: [ext] scrontab question

2024-05-07 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hm, strange. I don't see a problem with the time specs, although I
would use
*/5 * * * *
to run something every 5 minutes. In my scrontab I also specify a
partition, etc. But I don't think that is necessary.
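For example, a minimal entry of that shape (the partition name, time limit
and script path are just placeholders) would be:

#SCRON --partition=short --time=00:10:00
*/5 * * * * /path/to/crontest.sh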
regards
magnus

On Di, 2024-05-07 at 12:06 -0500, Sandor via slurm-users wrote:
> I am working out the details of scrontab. My initial testing has
> raised a question I cannot solve.
> Within the scrontab editor I have the following example from the Slurm
> documentation:
> 
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /directory/subdirectory/crontest.sh
> 
> When I save it, scrontab marks the line with #BAD, and I do not
> understand why. The only difference from the documentation example is
> the directory structure.
> 
> Is there an underlying assumption that traditional Linux crontab is
> available to the general user?
> 




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] scrontab question

2024-05-07 Thread Sandor via slurm-users
I am working out the details of scrontab. My initial testing has raised a
question I cannot solve.
Within the scrontab editor I have the following example from the Slurm
documentation:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /directory/subdirectory/crontest.sh

When I save it, scrontab marks the line with #BAD, and I do not understand why.
The only difference from the documentation example is the directory structure.

Is there an underlying assumption that traditional Linux crontab is
available to the general user?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users

On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
Over the past few days I grabbed some time on the nodes and ran for a few
hours.  It looks like I *can* still hit the issue with cgroups disabled.  The
incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude
or so.  I'm guessing that exonerates cgroups as the cause; they may just be a
good way to tickle the real issue.  Over the next few days, I'll try to roll
everything back to RHEL 8.9 and see how that goes.


My 2 cents: RHEL/AlmaLinux/RockyLinux 9.4 is out now, maybe it's worth a 
try to update to 9.4?


/Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Henderson, Brent via slurm-users
Over the past few days I grabbed some time on the nodes and ran for a few 
hours.  It looks like I *can* still hit the issue with cgroups disabled.  The 
incident rate was 8 out of >11k jobs, so it dropped by an order of magnitude or 
so.  I'm guessing that exonerates cgroups as the cause; they may just be a good 
way to tickle the real issue.  Over the next few days, I'll try to roll 
everything back to RHEL 8.9 and see how that goes.

Brent


From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Thursday, May 2, 2024 11:32 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: srun launched mpi job occasionally core dumps

Re-tested with Slurm 23.02.7 (had to also disable slurmdbd and run the 
controller with the '-i' option) but still reproduced the issue fairly quickly. 
It feels like the issue might be some interaction between RHEL 9.3 cgroups and 
Slurm.  Not sure what to try next - hoping for some suggestions.

Thanks,

Brent


From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Wednesday, May 1, 2024 11:21 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] srun launched mpi job occasionally core dumps

Greetings Slurm gurus --

I've been having an issue where, very occasionally, an srun-launched OpenMPI 
job will die during startup within MPI_Init().  E.g. srun -N 8 
--ntasks-per-node=1 ./hello_world_mpi.  The same binary launched with mpirun does 
not experience the issue.  E.g. mpirun -n 64 -H cn01,... ./hello_world_mpi.  
The failure rate seems to be in the 0.5% - 1.0% range when using srun for 
launch.

The HW and self-built SW stack:

* Dual socket AMD nodes

* RHEL 9.3 base system + tools

* Single 100 Gb card per host

* hwloc 2.9.3

* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)

* slurm 23.11.6 (started with 23.11.5 - update did not change the 
behavior)

* openmpi 5.0.3

The MPI code is a simple hello_world_mpi.c, but the exact code does not seem to 
matter - anything that goes through MPI startup via srun can hit it.  The 
application core dump looks like the following regardless of the test being run:

[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] 
/share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] 
/share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] 
/lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] 
/share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] 
/share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] 
/share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] 
/share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***

More than one rank can die with the same stacktrace on a node when this happens 
- I've seen as many as 6.  One other interesting note is that if I change my 
srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace 
 ./hello_world_mpi) the issue appears to go away: 0 failures 
in ~2500 runs.  Another thing that seems to help is disabling cgroups in the 
slurm.conf.  After that change, I saw 0 failures in >6100 hello_world_mpi runs.

The changes in the slurm.conf were - original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup

Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux

My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95
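
As a sanity check, something like the following (standard scontrol usage, 
nothing cluster-specific) confirms which plugin settings the running slurmctld 
actually picked up:

scontrol show config | grep -E 'ProctrackType|TaskPlugin|JobAcctGatherType'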

Curious if anyone has any thoughts on next steps to help figure out what might 
be going on and how to resolve it.  Currently, I'm planning to back down to the 
23.02.7 release and see how that goes, but I'm open to other suggestions.

Thanks,

Brent



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Davide DelVento via slurm-users
Are you seeking something simple rather than sophisticated? If so, you can
use the controller's local disk for StateSaveLocation and place a cron job
(on the same node or somewhere else) to copy that data out via e.g. rsync
and put it where you need it (NFS?) for the backup control node to use
if/when needed. That obviously introduces a time delay, which may or may not
be a problem depending on what kind of failures you are trying to protect
against and what level of guarantee you want the HA to provide: you will not
be protected in every possible scenario. On the other hand, given the size
of your cluster that might be adequate, and it's basically zero effort, so
it might be "good enough" for you.
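
As a rough sketch of that idea (the interval, paths and hostname are just
placeholders; the source path should match your StateSaveLocation), the cron
entry on the primary controller could be something like:

*/5 * * * * rsync -a --delete /var/spool/slurmctld/ backupctl:/var/spool/slurmctld/

The shorter the interval, the less job state you can lose on a failover.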

On Tue, May 7, 2024 at 4:44 AM Pierre Abele via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi all,
>
> I am looking for a clean way to set up Slurm's native high availability
> feature. I am managing a Slurm cluster with one control node (hosting
> both slurmctld and slurmdbd), one login node and a few dozen compute
> nodes. I have a virtual machine that I want to set up as a backup
> control node.
>
> The Slurm documentation says the following about the StateSaveLocation
> directory:
>
> > The directory used should be on a low-latency local disk to prevent file
> system delays from affecting Slurm performance. If using a backup host, the
> StateSaveLocation should reside on a file system shared by the two hosts.
> We do not recommend using NFS to make the directory accessible to both
> hosts, but do recommend a shared mount that is accessible to the two
> controllers and allows low-latency reads and writes to the disk. If a
> controller comes up without access to the state information, queued and
> running jobs will be cancelled. [1]
>
> My question: How do I implement the shared file system for the
> StateSaveLocation?
>
> I do not want to introduce a single point of failure by having a single
> node that hosts the StateSaveLocation, neither do I want to put that
> directory on the cluster's NFS storage since outages/downtime of the
> storage system will happen at some point and I do not want that to cause
> an outage of the Slurm controller.
>
> Any help or ideas would be appreciated.
>
> Best,
> Pierre
>
>
> [1] https://slurm.schedmd.com/quickstart_admin.html#Config
>
> --
> Pierre Abele, M.Sc.
>
> HPC Administrator
> Max-Planck-Institute for Evolutionary Anthropology
> Department of Primate Behavior and Evolution
>
> Deutscher Platz 6
> 04103 Leipzig
>
> Room: U2.80
> E-Mail: pierre_ab...@eva.mpg.de
> Phone: +49 (0) 341 3550 245
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Fabio Ranalli via slurm-users

You can try DRBD
https://linbit.com/drbd/

or a shared-disk (clustered) FS like GFS2, OCFS2, etc

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/configuring_gfs2_file_systems/index
https://docs.oracle.com/en/operating-systems/oracle-linux/9/shareadmin/shareadmin-ManagingtheOracleClusterFileSystemVersion2inOracleLinux.html

--

*Fabio Ranalli* | Principal Systems Administrator

Schrödinger, Inc. 

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] "token expired" errors with auth/slurm

2024-05-07 Thread Fabio Ranalli via slurm-users

Hi there,

We've updated to 23.11.6 and replaced MUNGE with SACK.

Performance and stability have both been pretty good, but we're 
occasionally seeing this in the slurmctld.log:


[2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769
[2024-05-07T03:50:16.638] error: cred_p_unpack: decode_jwt() failed
[2024-05-07T03:50:16.638] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received
[2024-05-07T03:50:16.641] error: slurm_receive_msg_and_forward: 
[[headnode.internal]:58286] failed: Header lengths are longer than data 
received
[2024-05-07T03:50:16.648] error: service_connection: slurm_receive_msg: 
Header lengths are longer than data received


It seems to impact a subset of nodes: jobs get killed and no new ones 
are allocated.
Full functionality can be restored by simply restarting slurmctld first, 
and then slurmd.
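
Concretely, the recovery amounts to something like this (stock systemd unit
names assumed; the pdsh node list is just a placeholder):

# on the head node first
systemctl restart slurmctld
# then on the affected compute nodes
pdsh -w 'cn[001-050]' systemctl restart slurmd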


Is the token expected to actually expire? I didn't see this possibility 
mentioned in the docs.


The problem occurs on an R cloud cluster based on EL9, with a pretty 
"flat" setup.

_headnode_: configless slurmctld, slurmdbd, mariadb, nfsd
_elastic compute nodes_: autofs, slurmd

/etc/slurm/slurm.conf:
AuthType=auth/slurm
AuthInfo=use_client_ids
CredType=cred/slurm

/etc/slurm/slurmdbd.conf:
AuthType=auth/slurm
AuthInfo=use_client_ids


Has anyone else encountered the same error?

Thanks,
Fabio

--

*Fabio Ranalli* | Principal Systems Administrator

Schrödinger, Inc. 

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] StateSaveLocation and Slurm HA

2024-05-07 Thread Pierre Abele via slurm-users

Hi all,

I am looking for a clean way to set up Slurm's native high availability 
feature. I am managing a Slurm cluster with one control node (hosting 
both slurmctld and slurmdbd), one login node and a few dozen compute 
nodes. I have a virtual machine that I want to set up as a backup 
control node.


The Slurm documentation says the following about the StateSaveLocation 
directory:



The directory used should be on a low-latency local disk to prevent file system 
delays from affecting Slurm performance. If using a backup host, the 
StateSaveLocation should reside on a file system shared by the two hosts. We do 
not recommend using NFS to make the directory accessible to both hosts, but do 
recommend a shared mount that is accessible to the two controllers and allows 
low-latency reads and writes to the disk. If a controller comes up without 
access to the state information, queued and running jobs will be cancelled. [1]


My question: How do I implement the shared file system for the 
StateSaveLocation?
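
For reference, the native HA setup I mean is just the pair of controller
entries in slurm.conf plus the shared state directory, e.g. (hostnames and
path are placeholders, not my actual config):

SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup
StateSaveLocation=/shared/slurm/statesave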


I do not want to introduce a single point of failure by having a single 
node that hosts the StateSaveLocation, neither do I want to put that 
directory on the cluster's NFS storage since outages/downtime of the 
storage system will happen at some point and I do not want that to cause 
an outage of the Slurm controller.


Any help or ideas would be appreciated.

Best,
Pierre


[1] https://slurm.schedmd.com/quickstart_admin.html#Config

--
Pierre Abele, M.Sc.

HPC Administrator
Max-Planck-Institute for Evolutionary Anthropology
Department of Primate Behavior and Evolution

Deutscher Platz 6
04103 Leipzig

Room: U2.80
E-Mail: pierre_ab...@eva.mpg.de
Phone: +49 (0) 341 3550 245



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-07 Thread Bjørn-Helge Mevik via slurm-users
Tim Wickberg via slurm-users  writes:

> [1] Slinky is not an acronym (neither is Slurm [2]), but loosely
> stands for "Slurm in Kubernetes".

And not at all inspired by Slinky Dog in Toy Story, I guess. :D

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com