[slurm-users] Re: Mailing list upgrade - slurm-users list paused

2024-01-30 Thread Tim Wickberg via slurm-users

Welcome to the updated list. Posting is re-enabled now.

- Tim

On 1/30/24 11:56, Tim Wickberg wrote:

Hey folks -

The mailing list will be offline for about an hour as we upgrade the 
host, upgrade the mailing list software, and change the mail 
configuration around.


As part of these changes, the "From: " field will no longer show the 
original sender, but instead the mailing list ID itself. This is to 
comply with DMARC sender policies, and allows us to start DKIM-signing 
messages to ensure deliverability once Google and Yahoo impose their new 
policy changes in February.


This is the last post on the current (mailman2) list. I'll send a 
welcome message on the upgraded (mailman3) list once finished, and when 
the list is open to new traffic again.


- Tim



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Mailing list upgrade - slurm-users list paused

2024-01-30 Thread Tim Wickberg

Hey folks -

The mailing list will be offline for about an hour as we upgrade the 
host, upgrade the mailing list software, and change the mail 
configuration around.


As part of these changes, the "From: " field will no longer show the 
original sender, but instead the mailing list ID itself. This is to 
comply with DMARC sender policies, and allows us to start DKIM-signing 
messages to ensure deliverability once Google and Yahoo impose their new 
policy changes in February.


This is the last post on the current (mailman2) list. I'll send a 
welcome message on the upgraded (mailman3) list once finished, and when 
the list is open to new traffic again.


- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support



Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines



This is definitely an NVML thing crashing slurmstepd.  Here is what I find
doing an strace of the slurmstepd: [3681401.0] process at the point the
crash happens:


[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 1132920] openat(AT_FDCWD, "/proc/1132950/status", O_RDONLY) = 12
[pid 1132920] read(12, "Name:\tbash\nUmask:\t0002\nState:\tS "..., 4095) = 1431
[pid 1132920] close(12) = 0
[pid 1132920] close(10) = 0
[pid 1132920] getpid()  = 1132919
[pid 1132920] getpid()  = 1132919
[pid 1132920] getpid()  = 1132919
[pid 1132920] getpid()  = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] getpid()  = 1132919
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] ioctl(16, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x14dc6db683f0) = 0
[pid 1132920] writev(2, [{iov_base="free(): invalid next size (fast)", iov_len=32}, {iov_base="\n", iov_len=1}], 2) = 33
[pid 1132924] <... poll resumed>)   = 1 ([{fd=13, revents=POLLIN}])
[pid 1132920] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0
[pid 1132924] read(13, "free(): invalid next size (fast)"..., 3850) = 34
[pid 1132920] <... mmap resumed>)   = 0x14dc6da6d000
[pid 1132920] rt_sigprocmask(SIG_UNBLOCK, [ABRT],


From "ls -l /proc/1132919/fd" before the crash happens I can tell
that file descriptor 16 is /dev/nvidiactl.  So it is doing an ioctl
write to /dev/nvidiactl right before the crash.  In some cases the error
is "free(): invalid next size (fast)" and sometimes it is a malloc() error.
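For reference, a quick way to map an fd from the strace back to its
target, using the stepd PID above:

  readlink /proc/1132919/fd/16   # -> /dev/nvidiactl in this case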

Job submitted as:

srun -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G 
--time=1-10:00:00  --cpus-per-task=2 --gpus=8 --nodelist=rtx-02 --pty 
/bin/bash


Command run in interactive shell is:  gpuburn 30

In this case I am not getting the defunct slurmd process left behind,
but there is a strange 'sleep 1' process left behind that I have to kill:

[root@rtx-02 ~]# find $(find /sys/fs/cgroup/ -name job_3681401 ) -name cgroup.procs -exec cat {} \; | sort | uniq
1132916
[root@rtx-02 ~]# ps -f -p 1132916
UID  PIDPPID  C STIME TTY  TIME CMD
root 1132916   1  0 11:13 ?00:00:00 sleep 1
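
A minimal cleanup sketch for that stray process, double-checking it
really belongs to the job's cgroup before killing it:

  cat /proc/1132916/cgroup   # should show the job_3681401 path
  kill 1132916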


The NVML library I have installed is

$ rpm -qa | grep NVML
nvidia-driver-NVML-535.54.03-1.el8.x86_64

on both the box where the SLURM binaries were built and on this box
where slurmstepd is crashing.  /usr/local/cuda contains CUDA 11.6.
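
A hedged way to confirm which libnvidia-ml a live slurmstepd has actually
mapped (rather than just what is installed), using the stepd PID from the
strace above:

  grep nvidia-ml /proc/1132919/maps | awk '{print $6}' | sort -u
  ldconfig -p | grep nvidia-ml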

Okay, I just upgraded the NVIDIA driver on rtx-02 with

  dnf --enablerepo=cuda module reset nvidia-driver
  dnf --enablerepo=cuda module install nvidia-driver:535-dkms

I restarted everything and, with my initial couple of tests, the problem
appears to have gone away.  I am going to need to have real users test
with real jobs.
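
A minimal verification sketch before handing the node back to users
(assuming the dkms build finished and the new kernel module is loaded,
which may require a reboot or module reload):

  nvidia-smi --query-gpu=driver_version --format=csv,noheader
  rpm -qa | grep -E 'nvidia-driver|NVML'
  systemctl restart slurmd && scontrol show node rtx-02 | grep -i state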

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 30 Jan 2024 9:01am, Paul Raines wrote:



I built 23.02.7 and tried that and had the same problems.

BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes with 
NVIDIA 535.54.03 proprietary drivers installed).


The behavior I was seeing: a user would start a GPU job. It was fine at first 
but at some point the slurmstepd processes for it would crash/die. Sometimes 
the user process would die too, sometimes not.  In an interactive job you 
would sometimes see a final line about "malloc" or "invalid value", and the 
terminal would hang until the job was 'scancel'ed.  'ps' would show a defunct 
'slurmd' process that was unkillable (killing the main slurmd process would 
get rid of it).

How the slurm controller saw the job seemed random.  Sometimes it saw it as a 
crashed job and reported it like that in the system. Sometimes it was stuck 
as a permanently CG (completing) job. Sometimes it did not notice anything 
wrong and the job just stayed as a seemingly perfectly running job according 
to slurmctld (I never waited for the TimeLimit to hit to see what happened 
then, but did scancel).


When I scancelled a "failed" job in the CG or R state, it would not actually
kill the user processes on the node, but it would clear the job from
squeue.

Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (Slurm 23.11.3 
built separately on those boxes) have all been working fine so far.  Another 
difference is that the Ubuntu GPU boxes are still using NVIDIA 470 drivers.


I tried downgrading a Rocky 8 GPU box to NVIDIA 470 and rebuilding
slurm 23.11.3 there and installing it to see if that worked
to fix things.  It did not.

I then tried installing old 22.05.6 RPMs I had built on my Rocky 8
box back in Nov 2022 on all Rocky 8 GPU boxes.  This seems to
have fixed the problem and jobs are no longer showing the issue.
Not an ideal solution but good enough for now.

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Heckes, Frank
This is scary news. I just updated to 23.11.1, but so far I couldn't reproduce 
the problems described. I'll do some more extensive and intensive tests.
In case of disaster: does anyone know how to roll back the DB, since some new 
DB object attributes were introduced in 23.11.1? I never had the chance to do 
this before :-0
As we have a support contract I would open a ticket.
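
Not a rollback as such, but a minimal safety-net sketch before upgrading
slurmdbd (assuming MariaDB/MySQL and the default slurm_acct_db database
name); since downgrading the dbd schema is not supported, restoring such a
dump together with the old slurmdbd packages is the usual fallback:

  systemctl stop slurmdbd     # quiesce writes during the dump
  mysqldump --single-transaction slurm_acct_db > slurm_acct_db-pre-23.11.sql
  systemctl start slurmdbd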

> -Original Message-
> From: slurm-users  On Behalf Of
> Ole Holm Nielsen
> Sent: Tuesday, 30 January 2024 10:04
> To: slurm-users@lists.schedmd.com
> Subject: Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in 
> completion
> state
>
> On 1/30/24 09:36, Fokke Dijkstra wrote:
> > We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck
> > in a completing state and slurmd daemons can't be killed because they
> > are left in a CLOSE-WAIT state. See my previous mail to the mailing
> > list for the details. And also
> > https://bugs.schedmd.com/show_bug.cgi?id=18561
> >  for another site
> > having issues.
>
> Bug 18561 was submitted by a user with no support contract, so it's unlikely
> that SchedMD will look into it.
>
> I guess many sites are considering the upgrade to 23.11, and if there is an
> issue as reported, a site with a valid support contract needs to open a
> support case.  I'm very interested in hearing about any progress with 23.11!
>
> Thanks,
> Ole





Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines



I built 23.02.7 and tried that and had the same problems.

BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes 
with NVIDIA 535.54.03 proprietary drivers installed).


The behavior I was seeing: a user would start a GPU job. It was fine at 
first but at some point the slurmstepd processes for it would crash/die. 
Sometimes the user process would die too, sometimes not.  In an 
interactive job you would sometimes see a final line about "malloc" or 
"invalid value", and the terminal would hang until the job was 
'scancel'ed.  'ps' would show a defunct 'slurmd' process that was 
unkillable (killing the main slurmd process would get rid of it).

How the slurm controller saw the job seemed random.  Sometimes it saw it 
as a crashed job and reported it like that in the system. Sometimes it was 
stuck as a permanently CG (completing) job. Sometimes it did not notice 
anything wrong and the job just stayed as a seemingly perfectly running 
job according to slurmctld (I never waited for the TimeLimit to hit to 
see what happened then, but did scancel).


When I scancelled a "failed" job in the CG or R state, it would not actually
kill the user processes on the node, but it would clear the job from
squeue.
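
For those leftover user processes, a hedged manual cleanup sketch via the
job's cgroup (the job ID here is one of the stuck jobs from this thread;
adjust as needed):

  find $(find /sys/fs/cgroup/ -name job_3679903) -name cgroup.procs -exec cat {} \; | sort -u | xargs -r kill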

Jobs on my non-GPU Rocky 8 nodes and on my Ubuntu GPU nodes (Slurm 23.11.3 
built separately on those boxes) have all been working fine so far.  Another 
difference is that the Ubuntu GPU boxes are still using NVIDIA 470 drivers.


I tried downgrading a Rocky 8 GPU box to NVIDIA 470 and rebuilding
slurm 23.11.3 there and installing it to see if that worked
to fix things.  It did not.

I then tried installing old 22.05.6 RPMs I had built on my Rocky 8
box back in Nov 2022 on all Rocky 8 GPU boxes.  This seems to
have fixed the problem and jobs are no longer showing the issue.
Not an ideal solution but good enough for now.

Both those 22.05.6 RPMs and the 23.11.3 RPMs are built on the
same Rocky 8 GPU box.  So the differences are:

   1) slurm version, obviously
   2) built using different gcc/lib versions, most likely
  due to OS updates between Nov 2022 and now
   3) built with a different NVIDIA driver/CUDA installed
  between then and now, but I am not sure what I had
  in Nov 2022

I highly suspect #2 or #3 as the underlying issue here, and I wonder if the 
NVML library present at build time is the key (though like I said, I tried 
rebuilding with NVIDIA 470 and that still had the issue).
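
For comparing the two environments, a quick sanity-check sketch to run on
both the build host and the compute node (package naming assumed to match
the CUDA repo packages used above):

  rpm -qa | grep -E 'nvidia-driver|NVML'   # driver / NVML userspace versions
  gcc --version | head -1                  # toolchain used for the RPM build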


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote:



We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
a completing state and slurmd daemons can't be killed because they are left
in a CLOSE-WAIT state. See my previous mail to the mailing list for the
details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
another site having issues.
We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which
gets rid of most issues. If possible, I would try to also downgrade
slurmctld to an earlier release, but this requires getting rid of all
running and queued jobs.

Kind regards,

Fokke

On Mon, 29 Jan 2024 at 01:00, Paul Raines wrote:


Some more info on what I am seeing after the 23.11.3 upgrade.

Here is a case where a job is cancelled but seems permanently
stuck in 'CG' state in squeue

[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
#CPUs=4 Partition=rtx8000
[2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file
`/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
[2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB
JobId=3679903 uid 5875902
[2024-01-28T17:42:27.725] debug:  email msg to sg1526: Slurm
Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED,
ExitCode 0
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
JobId=3679903 action:normal
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
removed JobId=3679903 from part rtx8000 row 0
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903
successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06
[2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903
Nodelist=rtx-06

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Ole Holm Nielsen

On 1/30/24 09:36, Fokke Dijkstra wrote:
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in 
a completing state and slurmd daemons can't be killed because they are 
left in a CLOSE-WAIT state. See my previous mail to the mailing list for 
the details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for 
another site having issues.


Bug 18561 was submitted by a user with no support contract, so it's 
unlikely that SchedMD will look into it.


I guess many sites are considering the upgrade to 23.11, and if there is 
an issue as reported, a site with a valid support contract needs to open a 
support case.  I'm very interested in hearing about any progress with 23.11!


Thanks,
Ole



Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Fokke Dijkstra
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
a completing state and slurmd daemons can't be killed because they are left
in a CLOSE-WAIT state. See my previous mail to the mailing list for the
details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
another site having issues.
We've now downgraded the clients (slurmd and login nodes) to 23.02.7 which
gets rid of most issues. If possible, I would try to also downgrade
slurmctld to an earlier release, but this requires getting rid of all
running and queued jobs.
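
If anyone does go the full-downgrade route, a rough sketch of quiescing the
cluster first (the partition name is just an example from this thread, and
the scancel step is obviously destructive):

  scontrol update PartitionName=rtx8000 State=DOWN   # repeat per partition
  squeue -h -o %A | xargs -r scancel                 # cancel all remaining jobs
  # once squeue is empty, stop slurmctld and install the older release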

Kind regards,

Fokke

On Mon, 29 Jan 2024 at 01:00, Paul Raines wrote:

> Some more info on what I am seeing after the 23.11.3 upgrade.
>
> Here is a case where a job is cancelled but seems permanently
> stuck in 'CG' state in squeue
>
> [2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
> [2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
> #CPUs=4 Partition=rtx8000
> [2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file
> `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
> [2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=3679903 uid 5875902
> [2024-01-28T17:42:27.725] debug:  email msg to sg1526: Slurm
> Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED,
> ExitCode 0
> [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
> JobId=3679903 action:normal
> [2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job:
> removed JobId=3679903 from part rtx8000 row 0
> [2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903
> successful 0x8004
> [2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903
> Nodelist=rtx-06
> [2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903
> Nodelist=rtx-06
> [2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903
> Nodelist=rtx-06
> [2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903
> Nodelist=rtx-06
> [2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903
> Nodelist=rtx-06
>
>
> So at 17:42 the user must have done an scancel.  In the slurmd log on the
> node I see:
>
> [2024-01-28T17:42:27.727] debug:  _rpc_terminate_job: uid = 1150
> JobId=3679903
> [2024-01-28T17:42:27.728] debug:  credential for job 3679903 revoked
> [2024-01-28T17:42:27.728] debug:  _rpc_terminate_job: sent SUCCESS for
> 3679903, waiting for prolog to finish
> [2024-01-28T17:42:27.728] debug:  Waiting for job 3679903's prolog to
> complete
> [2024-01-28T17:43:19.002] debug:  _rpc_terminate_job: uid = 1150
> _JobId=3679903
> [2024-01-28T17:44:20.001] debug:  _rpc_terminate_job: uid = 1150
> JobId=3679903
> [2024-01-28T17:45:20.002] debug:  _rpc_terminate_job: uid = 1150
> JobId=3679903
> [2024-01-28T17:46:20.001] debug:  _rpc_terminate_job: uid = 1150
> JobId=3679903
> [2024-01-28T17:47:20.002] debug:  _rpc_terminate_job: uid = 1150
> JobId=3679903
>
> Strange that a prolog is being called on job cancel
>
> slurmd seems to be getting the repeated calls to terminate the job
> from slurmctld but it is not happening.  Also the process table has
>
> [root@rtx-06 ~]# ps auxw | grep slurmd
> root  161784  0.0  0.0 436748 21720 ?Ssl  13:44   0:00
> /usr/sbin/slurmd --systemd
> root  190494  0.0  0.0  0 0 ?Zs   17:34   0:00
> [slurmd] <defunct>
>
> where there is now a zombie slurmd process I cannot kill even with kill -9
>
> If I do a 'systemctl stop slurmd' it takes a long time but eventually stops
> slurmd and gets rid of the zombie process, but it kills the "good" running
> jobs too with NODE_FAIL.
>
> Another case is where a job will be cancelled and SLURM acts like it
> is cancelled, with it not showing up in squeue, but the processes keep
> running on the box.
>
> # pstree -u sg1526 -p | grep ^slurm
> slurm_script(185763)---python(185796)-+-{python}(185797)
> # strings /proc/185763/environ | grep JOB_ID
> SLURM_JOB_ID=3679888
> # squeue -j 3679888
> slurm_load_jobs error: Invalid job id specified
>
> sacct shows that job being cancelled.  In the slurmd log we see
>
> [2024-01-28T17:33:58.757] debug:  _rpc_terminate_job: uid = 1150
> JobId=3679888
> [2024-01-28T17:33:58.757] debug:  credential for job 3679888 revoked
> [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for
> /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
> [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket
> /var/slurm/spool/d/rtx-06_3679888.4294967292
> [2024-01-28T17:33:58.757] debug:  signal for nonexistent
> StepId=3679888.extern stepd_connect failed: Connection refused
> [2024-01-28T17:33:58.757] debug:  _step_connect: connect() failed for
> /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
> [2024-01-28T17:33:58.757] debug:  Cleaned up stray socket
> /var/slurm/spool/d/rtx-06_3679888.4294967291
> [2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job
> script /var/slurm/spool/d/job3679888/slurm_script