Re: [slurm-users] PreemptExemptTime
On 3/7/23 6:46 am, Groner, Rob wrote:
> Our global settings are PreemptMode=SUSPEND,GANG and
> PreemptType=preempt/partition_prio. We have a high priority partition that
> nothing should ever preempt, and an open partition that is always
> preemptable. In between is a burst partition. It can be preempted if the
> high priority partition needs the resources. That's the partition we'd
> like to guarantee a 1 hour run time on. Looking at the sacctmgr man page,
> it gives this info on QOS

Just a quick comment: here you're talking about both partitions and QOSes
with respect to preemption. I think for this you need to pick just one of
those options and only use those configs. For instance, we just use QOSes
for preemption, and our exempt time works in that case.

Hope this helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
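A minimal sketch of a QOS-only preemption setup along the lines Chris
describes (the QOS names "high" and "burst" are hypothetical, not taken
from the thread):

    # slurm.conf -- base preemption decisions on QOS, not partition priority
    PreemptType=preempt/qos
    PreemptMode=SUSPEND,GANG

    # Create the QOSes, let the high QOS preempt the burst QOS,
    # and exempt burst jobs for their first hour
    sacctmgr add qos high
    sacctmgr add qos burst
    sacctmgr modify qos high set Preempt=burst
    sacctmgr modify qos burst set PreemptExemptTime=01:00:00

The idea is that with preempt/qos, the Preempt= list on a QOS controls what
it may preempt, and PreemptExemptTime on the lower QOS protects its jobs
for the configured time.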
Re: [slurm-users] Unable to delete account
Thank you Kilian. We are running Slurm 20.11.8. Looks like this is exactly
what's going on.

Simon

On Mon, Mar 6, 2023 at 3:00 PM Kilian Cavalotti wrote:
>
> Hi Simon,
>
> On Mon, Mar 6, 2023 at 1:34 PM Simon Gao wrote:
> > We are experiencing an issue with deleting any Slurm account.
> >
> > When running a command like: sacctmgr delete account ,
> > the following errors are returned and the command fails.
> >
> > # sacctmgr delete account
> > Database is busy or waiting for lock from other user.
> > sacctmgr: error: Getting response to message type: DBD_REMOVE_ACCOUNTS
> > sacctmgr: error: DBD_REMOVE_ACCTS failure: No error
> > Error with request: No error
>
> You haven't specified the version you're using, but that looks like
> https://bugs.schedmd.com/show_bug.cgi?id=11742, fixed in version
> 21.08.
>
> Cheers,
> --
> Kilian
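For reference, a quick way to check which versions are in play before and
after an upgrade (the account name below is just a placeholder):

    # client and daemon versions (the fix landed in 21.08)
    sacctmgr --version
    slurmdbd -V

    # once slurmdbd is on >= 21.08, the delete should go through
    sacctmgr delete account <account_name>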
Re: [slurm-users] [ext] Re: Cleanup of job_container/tmpfs
That was exactly the bit I was missing. Thank you very much, Magnus!

Best
Niels Carl

On 3/7/23 3:13 PM, Hagdorn, Magnus Karl Moritz wrote:
> I just upgraded slurm to 23.02 on our test cluster to try out the new
> job_container/tmpfs stuff. I can confirm it works with autofs (hurrah!)
> but you need to set the Shared=true option in the job_container.conf file.
> Cheers
> magnus
>
> On Tue, 2023-03-07 at 09:19 +0100, Ole Holm Nielsen wrote:
> > Hi Brian,
> >
> > Presumably the users' home directory is NFS automounted using autofs,
> > and therefore it doesn't exist when the job starts.
> >
> > The job_container/tmpfs plugin ought to work correctly with autofs,
> > but maybe this is still broken in 23.02?
> >
> > /Ole
> >
> > On 3/6/23 21:06, Brian Andrus wrote:
> > > That looks like the users' home directory doesn't exist on the node.
> > >
> > > If you are not using a shared home for the nodes, your onboarding
> > > process should be looked at to ensure it can handle any issues that
> > > may arise.
> > >
> > > If you are using a shared home, you should do the above and have the
> > > node ensure the shared filesystems are mounted before allowing jobs.
> > >
> > > -Brian Andrus
> > >
> > > On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:
> > > > Hi all
> > > >
> > > > Seems there still are some issues with the autofs -
> > > > job_container/tmpfs functionality in Slurm 23.02.
> > > > If the required directories aren't mounted on the allocated node(s)
> > > > before job start, we get:
> > > >
> > > > slurmstepd: error: couldn't chdir to `/users/lutest': No such file
> > > > or directory: going to /tmp instead
> > > > slurmstepd: error: couldn't chdir to `/users/lutest': No such file
> > > > or directory: going to /tmp instead
> > > >
> > > > An easy workaround, however, is to include this line in the slurm
> > > > prolog on the slurmd nodes:
> > > >
> > > > /usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true
> > > >
> > > > -but there might exist a better way to solve the problem?
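For anyone landing on this thread later, a minimal job_container.conf along
the lines Magnus describes might look like this (the BasePath shown is just
an example path, not from the thread):

    # job_container.conf
    AutoBasePath=true
    BasePath=/var/spool/slurmd/containers
    Shared=true    # per the note above, required for autofs home dirs to work

together with JobContainerType=job_container/tmpfs in slurm.conf.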
[slurm-users] PreemptExemptTime
I found a thread about this topic that's a year old and at that time seemed
to give no hope; I'm just wondering if the situation has changed. My testing
so far isn't encouraging.

The thread (here: https://groups.google.com/g/slurm-users/c/yhnSVBoohik)
talks about wanting to give lower priority jobs some amount of guaranteed
run time. That's what we're trying to do.

Our global settings are PreemptMode=SUSPEND,GANG and
PreemptType=preempt/partition_prio. We have a high priority partition that
nothing should ever preempt, and an open partition that is always
preemptable. In between is a burst partition. It can be preempted if the
high priority partition needs the resources. That's the partition we'd like
to guarantee a 1 hour run time on.

Looking at the sacctmgr man page, it gives this info on QOS:

    PreemptExemptTime
        Specifies a minimum run time for jobs of this QOS before they are
        considered for preemption. This QOS option takes precedence over
        the global PreemptExemptTime. This is only honored for
        PreemptMode=REQUEUE and PreemptMode=CANCEL.

This sounds like exactly what we want. So I went into the burst QOS we have
available on the burst partition, set a PreemptExemptTime of 30 seconds and
a PreemptMode of cancel, and tested. Whenever something of a higher priority
came along, my job was immediately cancelled; no exempt time was utilized.

Am I not understanding how this is supposed to work, or am I asking for an
impossible slurm configuration?

Thanks,
Rob
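For concreteness, the kind of configuration described above might look
roughly like this (partition names, priority values, and the QOS name are
illustrative, not taken from the post):

    # slurm.conf -- partition-priority preemption as described
    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG

    PartitionName=high  PriorityTier=100 PreemptMode=off
    PartitionName=burst PriorityTier=50  PreemptMode=cancel
    PartitionName=open  PriorityTier=10  PreemptMode=suspend

    # QOS attached to the burst partition, as tested
    sacctmgr modify qos burst set PreemptExemptTime=00:00:30 PreemptMode=cancel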
Re: [slurm-users] [ext] Re: Cleanup of job_container/tmpfs
I just upgraded slurm to 23.02 on our test cluster to try out the new
job_container/tmpfs stuff. I can confirm it works with autofs (hurrah!)
but you need to set the Shared=true option in the job_container.conf file.
Cheers
magnus

On Tue, 2023-03-07 at 09:19 +0100, Ole Holm Nielsen wrote:
> Hi Brian,
>
> Presumably the users' home directory is NFS automounted using autofs,
> and therefore it doesn't exist when the job starts.
>
> The job_container/tmpfs plugin ought to work correctly with autofs,
> but maybe this is still broken in 23.02?
>
> /Ole
>
> On 3/6/23 21:06, Brian Andrus wrote:
> > That looks like the users' home directory doesn't exist on the node.
> >
> > If you are not using a shared home for the nodes, your onboarding
> > process should be looked at to ensure it can handle any issues that
> > may arise.
> >
> > If you are using a shared home, you should do the above and have
> > the node ensure the shared filesystems are mounted before allowing
> > jobs.
> >
> > -Brian Andrus
> >
> > On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:
> > > Hi all
> > >
> > > Seems there still are some issues with the autofs -
> > > job_container/tmpfs functionality in Slurm 23.02.
> > > If the required directories aren't mounted on the allocated
> > > node(s) before job start, we get:
> > >
> > > slurmstepd: error: couldn't chdir to `/users/lutest': No such
> > > file or directory: going to /tmp instead
> > > slurmstepd: error: couldn't chdir to `/users/lutest': No such
> > > file or directory: going to /tmp instead
> > >
> > > An easy workaround, however, is to include this line in the slurm
> > > prolog on the slurmd nodes:
> > >
> > > /usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true
> > >
> > > -but there might exist a better way to solve the problem?

--
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
magnus.hagd...@charite.de
https://www.charite.de

HPC Helpdesk: sc-hpc-helpd...@charite.de
Re: [slurm-users] Cleanup of job_container/tmpfs
Hi Brian,

Presumably the users' home directory is NFS automounted using autofs, and
therefore it doesn't exist when the job starts.

The job_container/tmpfs plugin ought to work correctly with autofs, but
maybe this is still broken in 23.02?

/Ole

On 3/6/23 21:06, Brian Andrus wrote:
> That looks like the users' home directory doesn't exist on the node.
>
> If you are not using a shared home for the nodes, your onboarding process
> should be looked at to ensure it can handle any issues that may arise.
>
> If you are using a shared home, you should do the above and have the node
> ensure the shared filesystems are mounted before allowing jobs.
>
> -Brian Andrus
>
> On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:
> > Hi all
> >
> > Seems there still are some issues with the autofs - job_container/tmpfs
> > functionality in Slurm 23.02.
> > If the required directories aren't mounted on the allocated node(s)
> > before job start, we get:
> >
> > slurmstepd: error: couldn't chdir to `/users/lutest': No such file or
> > directory: going to /tmp instead
> > slurmstepd: error: couldn't chdir to `/users/lutest': No such file or
> > directory: going to /tmp instead
> >
> > An easy workaround, however, is to include this line in the slurm prolog
> > on the slurmd nodes:
> >
> > /usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true
> >
> > -but there might exist a better way to solve the problem?
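A rough sketch of how the prolog workaround quoted above could be wired in
(the script path is just an example, not from the thread):

    # slurm.conf
    Prolog=/etc/slurm/prolog.sh

    #!/bin/bash
    # /etc/slurm/prolog.sh -- log in briefly as the job user so autofs
    # mounts the home directory before slurmstepd tries to chdir there
    /usr/bin/su - "$SLURM_JOB_USER" -c /usr/bin/true
    exit 0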