Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Juergen Salk
Hi Alan and Paul, I can't claim to be a Lustre guru, but my understanding is that Lustre failover does not imply umount/mount of the file system on the client side. On the client side, the OSTs just stall until they are back. So open file handles should actually be kept during that process.

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Paul Edmon
I think it depends on the filesystem type. Lustre generally fails over nicely and handles reconnections without much of a problem. We've done this before without any hitches, even with the jobs being live. Generally the jobs just hang and then resolve once the filesystem comes back.  On a

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Alan Orth
Dear Jurgen and Paul, This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job processes. The processes remain in memory, but are paused. What happens to open file handles, since the underlying filesystem goes
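
To illustrate the SIGSTOP behaviour described above, here is a minimal sketch in Python that stops and continues a child process holding an open file. It only shows that a stopped process keeps its file descriptors on a local file system; it says nothing about what Lustre does while its servers are unavailable. The helper name `stop_and_continue_demo` and the child script are made up for this example, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical child: opens a file, writes once, waits for the parent,
# then writes again through the SAME file handle.
CHILD = r"""
import sys
f = open(sys.argv[1], "w")
f.write("before\n"); f.flush()
print("ready", flush=True)
sys.stdin.readline()
f.write("after\n"); f.flush()
f.close()
"""

def stop_and_continue_demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with subprocess.Popen(
            [sys.executable, "-c", CHILD, path],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        ) as child:
            child.stdout.readline()             # child has opened the file
            os.kill(child.pid, signal.SIGSTOP)  # pause the process
            _, status = os.waitpid(child.pid, os.WUNTRACED)
            was_stopped = os.WIFSTOPPED(status)
            os.kill(child.pid, signal.SIGCONT)  # let it run again
            child.stdin.write("go\n")           # tell it to write once more
            child.stdin.flush()
        # Leaving the with-block closes the pipes and waits for the child.
        with open(path) as f:
            lines = f.read().split()
        return was_stopped, lines
    finally:
        os.unlink(path)
```

If the second write arrives in the file, the handle opened before SIGSTOP was still usable after SIGCONT.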

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-22 Thread Juergen Salk
Thanks, Paul, for confirming our planned approach. We did it that way and it worked very well. I have to admit that my fingers were a bit wet when suspending thousands of running jobs, but it worked without any problems. I just didn't dare to resume all suspended jobs at once, but did that in a
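
The message is cut off here, so the exact resume procedure is not known. As one possible way to avoid resuming everything at once, the following Python sketch resumes suspended jobs in batches with a pause in between. The function names, the batch size, and the pause length are assumptions for illustration, not values from the thread; it relies on `squeue -h -t SUSPENDED -o %i` to list job IDs and on `scontrol resume` accepting a comma-separated job list.

```python
import subprocess
import time

def list_suspended_jobs():
    """Return the IDs of all currently suspended jobs as strings."""
    result = subprocess.run(
        ["squeue", "-h", "-t", "SUSPENDED", "-o", "%i"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()

def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def resume_in_batches(job_ids, batch_size=200, pause_seconds=60,
                      run=subprocess.run, sleep=time.sleep):
    """Resume jobs batch by batch; returns the number of batches issued.

    batch_size and pause_seconds are arbitrary example values.
    `run` and `sleep` are injectable so the logic can be dry-run.
    """
    batches = list(chunks(job_ids, batch_size))
    for index, batch in enumerate(batches):
        run(["scontrol", "resume", ",".join(batch)], check=True)
        if index < len(batches) - 1:
            sleep(pause_seconds)  # give the file system time to settle
    return len(batches)
```

On a cluster this would be called as `resume_in_batches(list_suspended_jobs())` by a user with operator or administrator rights. Passing a stub for `run` allows a dry run that only records the commands.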

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-19 Thread Paul Edmon
Yup, we follow the same process when we do Slurm upgrades; this looks analogous to our process. -Paul Edmon-

[slurm-users] Suspending jobs for file system maintenance

2021-10-19 Thread Juergen Salk
Dear all, we are planning to perform some maintenance work on our Lustre file system which may or may not harm running jobs. Although failover functionality is enabled on the Lustre servers we'd like to minimize risk for running jobs in case something goes wrong. Therefore, we thought about