Re: [slurm-users] PreemptExemptTime

2023-03-07 Thread Christopher Samuel

On 3/7/23 6:46 am, Groner, Rob wrote:

Our global settings are PreemptMode=SUSPEND,GANG and 
PreemptType=preempt/partition_prio.  We have a high priority partition 
that nothing should ever preempt, and an open partition that is always 
preemptable.  In between is a burst partition.  It can be preempted if 
the high priority partition needs the resources.  That's the partition 
we'd like to guarantee a 1 hour run time on.  Looking at the sacctmgr 
man page, it gives this info on QOS


Just a quick comment: here you're talking about both partitions and QOS's 
with respect to preemption. I think for this you need to pick just one of 
those approaches and only use those configs. For instance, we just use 
QOS's for preemption, and our exempt time works in that case.
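
In very rough terms, that kind of QOS-only setup looks something like the 
sketch below (a sketch only -- the QOS names and values are placeholders, 
not our actual config):

  # slurm.conf
  PreemptType=preempt/qos
  PreemptMode=SUSPEND,GANG

  # sacctmgr: a preemptable QOS with a guaranteed minimum run time,
  # plus a high-priority QOS that is allowed to preempt it
  sacctmgr add qos burst
  sacctmgr modify qos burst set PreemptMode=cancel PreemptExemptTime=01:00:00
  sacctmgr add qos high
  sacctmgr modify qos high set Preempt=burst

With preempt/qos, the per-QOS PreemptMode and PreemptExemptTime should take 
precedence over the cluster-wide values for jobs running under that QOS.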


Hope this helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Unable to delete account

2023-03-07 Thread Simon Gao
Thank you Kilian.

We are running Slurm 20.11.8.  Looks like this is exactly what's going on.
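
(In case it helps anyone else hitting this: a quick way to double-check 
what the tools and the controller report is something like

  sacctmgr --version
  scontrol show config | grep -i SLURM_VERSION

though the exact output format can vary between releases.)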

Simon


On Mon, Mar 6, 2023 at 3:00 PM Kilian Cavalotti
 wrote:
>
> Hi Simon,
>
> On Mon, Mar 6, 2023 at 1:34 PM Simon Gao  wrote:
> > We are experiencing an issue with deleting any Slurm account.
> >
> > When running a command like: sacctmgr delete account ,
> > the following errors are returned and the command fails.
> >
> > # sacctmgr delete account 
> >  Database is busy or waiting for lock from other user.
> > sacctmgr: error: Getting response to message type: DBD_REMOVE_ACCOUNTS
> > sacctmgr: error: DBD_REMOVE_ACCTS failure: No error
> >  Error with request: No error
>
> You haven't specified the version you're using, but that looks like
> https://bugs.schedmd.com/show_bug.cgi?id=11742, fixed in version
> 21.08.
>
> Cheers,
> --
> Kilian
>



Re: [slurm-users] [ext] Re: Cleanup of job_container/tmpfs

2023-03-07 Thread Niels Carl W. Hansen

That was exactly the bit I was missing. Thank you very much, Magnus!

Best
Niels Carl



On 3/7/23 3:13 PM, Hagdorn, Magnus Karl Moritz wrote:

I just upgraded Slurm to 23.02 on our test cluster to try out the new
job_container/tmpfs stuff. I can confirm it works with autofs (hurrah!)
but you need to set the Shared=true option in the job_container.conf
file.
Cheers
magnus

On Tue, 2023-03-07 at 09:19 +0100, Ole Holm Nielsen wrote:

Hi Brian,

Presumably the users' home directory is NFS automounted using autofs, and 
therefore it doesn't exist when the job starts.

The job_container/tmpfs plugin ought to work correctly with autofs, but 
maybe this is still broken in 23.02?

/Ole


On 3/6/23 21:06, Brian Andrus wrote:

That looks like the users' home directory doesn't exist on the node.

If you are not using a shared home for the nodes, your onboarding process 
should be looked at to ensure it can handle any issues that may arise.

If you are using a shared home, you should do the above and have the node 
ensure the shared filesystems are mounted before allowing jobs.

-Brian Andrus

On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:

Hi all

Seems there still are some issues with the autofs - job_container/tmpfs 
functionality in Slurm 23.02.
If the required directories aren't mounted on the allocated node(s) 
before jobstart, we get:

slurmstepd: error: couldn't chdir to `/users/lutest': No such file or 
directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/users/lutest': No such file or 
directory: going to /tmp instead

An easy workaround, however, is to include this line in the slurm prolog 
on the slurmd nodes:

/usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true

-but there might exist a better way to solve the problem?
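
For anyone who still needs the prolog workaround on setups without the 
Shared=true option, a slightly more defensive sketch might look like this 
(the /users path is just our example, and the mountpoint check is an 
untested addition, so adjust for your site):

  # trigger the automount of the job owner's home, then verify the share
  /usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true
  if ! mountpoint -q /users; then
      echo "shared filesystem /users is not mounted" >&2
      # a non-zero exit from the prolog should drain the node and
      # requeue the job rather than letting it run without its home
      exit 1
  fi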





[slurm-users] PreemptExemptTime

2023-03-07 Thread Groner, Rob
I found a thread about this topic that's a year old, and at that time it 
seemed to offer no hope. I'm just wondering if the situation has changed.  
My testing so far isn't encouraging.

In the thread (here: https://groups.google.com/g/slurm-users/c/yhnSVBoohik) it 
talks about wanting to give lower priority jobs some amount of guaranteed run 
time.  That's what we're trying to do.

Our global settings are PreemptMode=SUSPEND,GANG and 
PreemptType=preempt/partition_prio.  We have a high priority partition that 
nothing should ever preempt, and an open partition that is always preemptable.  
In between is a burst partition.  It can be preempted if the high priority 
partition needs the resources.  That's the partition we'd like to guarantee a 1 
hour run time on.  Looking at the sacctmgr man page, it gives this info on QOS:

PreemptExemptTime
  Specifies a minimum run time for jobs of this QOS before they are 
  considered for preemption. This QOS option takes precedence over the 
  global PreemptExemptTime. This is only honored for PreemptMode=REQUEUE 
  and PreemptMode=CANCEL.

This sounds like exactly what we want.  So I went into the burst QOS we have 
available on the burst partition, set a PreemptExemptTime of 30 seconds and a 
PreemptMode of cancel, and tested.  Whenever something of a higher priority 
came along, my job was immediately cancelled; no exempt time was utilized.
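
For reference, the QOS change was along these lines (reconstructed after the 
fact, so treat the exact names and values as approximate):

  sacctmgr modify qos burst set PreemptMode=cancel PreemptExemptTime=00:00:30

and the test job was submitted to the burst partition under that QOS.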

Am I not understanding how this is supposed to work, or am I asking for an 
impossible slurm configuration?

Thanks,

Rob





Re: [slurm-users] [ext] Re: Cleanup of job_container/tmpfs

2023-03-07 Thread Hagdorn, Magnus Karl Moritz
I just upgraded Slurm to 23.02 on our test cluster to try out the new
job_container/tmpfs stuff. I can confirm it works with autofs (hurrah!)
but you need to set the Shared=true option in the job_container.conf
file.
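
For reference, a minimal job_container.conf along those lines might look 
like this (Shared=true is the bit that matters here; BasePath is 
site-specific and only a placeholder):

  AutoBasePath=true
  BasePath=/var/spool/slurmd/jobcontainer
  Shared=true
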
Cheers
magnus

On Tue, 2023-03-07 at 09:19 +0100, Ole Holm Nielsen wrote:
> Hi Brian,
> 
> Presumably the users' home directory is NFS automounted using autofs,
> and 
> therefore it doesn't exist when the job starts.
> 
> The job_container/tmpfs plugin ought to work correctly with autofs,
> but 
> maybe this is still broken in 23.02?
> 
> /Ole
> 
> 
> On 3/6/23 21:06, Brian Andrus wrote:
> > That looks like the users' home directory doesn't exist on the
> > node.
> > 
> > If you are not using a shared home for the nodes, your onboarding
> > process 
> > should be looked at to ensure it can handle any issues that may
> > arise.
> > 
> > If you are using a shared home, you should do the above and have
> > the node 
> > ensure the shared filesystems are mounted before allowing jobs.
> > 
> > -Brian Andrus
> > 
> > On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:
> > > Hi all
> > > 
> > > Seems there still are some issues with the autofs -
> > > job_container/tmpfs 
> > > functionality in Slurm 23.02.
> > > If the required directories aren't mounted on the allocated
> > > node(s) 
> > > before jobstart, we get:
> > > 
> > > slurmstepd: error: couldn't chdir to `/users/lutest': No such
> > > file or 
> > > directory: going to /tmp instead
> > > slurmstepd: error: couldn't chdir to `/users/lutest': No such
> > > file or 
> > > directory: going to /tmp instead
> > > 
> > > An easy workaround, however, is to include this line in the slurm
> > > prolog on the slurmd nodes:
> > > 
> > > /usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true
> > > 
> > > -but there might exist a better way to solve the problem?
> 

-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de




Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-07 Thread Ole Holm Nielsen

Hi Brian,

Presumably the users' home directory is NFS automounted using autofs, and 
therefore it doesn't exist when the job starts.


The job_container/tmpfs plugin ought to work correctly with autofs, but 
maybe this is still broken in 23.02?


/Ole


On 3/6/23 21:06, Brian Andrus wrote:

That looks like the users' home directory doesn't exist on the node.

If you are not using a shared home for the nodes, your onboarding process 
should be looked at to ensure it can handle any issues that may arise.


If you are using a shared home, you should do the above and have the node 
ensure the shared filesystems are mounted before allowing jobs.


-Brian Andrus

On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:

Hi all

Seems there still are some issues with the autofs - job_container/tmpfs 
functionality in Slurm 23.02.
If the required directories aren't mounted on the allocated node(s) 
before jobstart, we get:


slurmstepd: error: couldn't chdir to `/users/lutest': No such file or 
directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/users/lutest': No such file or 
directory: going to /tmp instead


An easy workaround, however, is to include this line in the slurm prolog 
on the slurmd nodes:


/usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true

-but there might exist a better way to solve the problem?