[slurm-users] Inaccurate Preemption Notification?

2023-04-24 Thread Jason Simms
Hello all,

A user received an email from Slurm indicating that one of his jobs had been preempted.
Normally, when a job is preempted, the logs show something like this:

[2023-03-30T08:19:16.535] [25538.batch] error: *** JOB 25538 ON node07
CANCELLED AT 2023-03-30T08:19:16 DUE TO PREEMPTION ***
[2023-03-30T08:19:16.573] [25538.1] error: *** STEP 25538.1 ON node07
CANCELLED AT 2023-03-30T08:19:16 DUE TO PREEMPTION ***

There was no such entry for this job; instead, the log contained this:

[2023-04-24T17:06:24.105] [26446.batch] error: *** JOB 26446 ON node07
CANCELLED AT 2023-04-24T17:06:24 ***
[2023-04-24T17:06:24.105] [26446.1] error: *** STEP 26446.1 ON node07
CANCELLED AT 2023-04-24T17:06:24 ***
[2023-04-24T17:06:24.155] [26446.extern] done with job
[2023-04-24T17:06:25.161] [26446.batch] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2023-04-24T17:06:25.163] [26446.batch] done with job
[2023-04-24T17:06:27.462] [26446.1] error: Failed to send
MESSAGE_TASK_EXIT: Connection refused
[2023-04-24T17:06:27.464] [26446.1] done with job

It's unclear to me whether this job was actually preempted; perhaps Slurm logs
preemption differently for MPI jobs. I do not, however, believe that it was
preempted, because the job was running on a partition that only his account is
permitted to use, and that partition has the highest partition priority in any
case. Moreover, the job immediately restarted (after a requeue, with a new job
ID) on the same partition.
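
For reference, a rough way to cross-check this from the accounting side (assuming
slurmdbd accounting is enabled; the job ID below is the one from the log and the
partition name is a placeholder):

# sacct normally records PREEMPTED (or a requeue) in the State field if
# slurmctld actually preempted the job
sacct -j 26446 --format=JobID,JobName,Partition,State,ExitCode,Elapsed

# confirm the partition's priority tier and the cluster-wide preemption setup
scontrol show partition <partition_name> | grep -E 'PriorityTier|PreemptMode'
scontrol show config | grep -i preempt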

Any thoughts as to whether this job was actually preempted, and if not, why
the email notification would say it was?

Warmest regards,
Jason

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms


Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-24 Thread Hoot Thompson
See below…

> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen wrote:
> 
> On 24-04-2023 18:33, Hoot Thompson wrote:
>> In my reading of the Slurm documentation, it seems that exceeding the limits 
>> set in GrpTRESMins should result in terminating a running job. However, in 
>> testing this, the ‘current value’ of GrpTRESMins only updates upon job 
>> completion and is not updated as the job progresses. Therefore jobs aren’t 
>> being stopped. On the positive side, no new jobs are started if the limit is 
>> exceeded. Here’s the documentation that is confusing me:
> 
> I think the job's resource usage will only be added to the Slurm database upon 
> job completion.  I believe that Slurm doesn't update the resource usage 
> continually as you seem to expect.
> 
>> If any limit is reached, all running jobs with that TRES in this group will 
>> be killed, and no new jobs will be allowed to run.
>> Perhaps there is a setting or misconfiguration on my part.
> 
> The sacctmgr manual page states:
> 
>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>> The total number of TRES minutes that can possibly be used by past, present 
>> and future jobs running from this association and its children.  To clear a 
>> previously set value use the modify command with a new value of -1 for each 
>> TRES id.
>> NOTE: This limit is not enforced if set on the root association of a 
>> cluster.  So even though it may appear in sacctmgr output, it will not be 
>> enforced.
>> ALSO NOTE: This limit only applies when using the Priority Multifactor 
>> plugin.  The time is decayed using the value of PriorityDecayHalfLife or 
>> PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is 
>> reached all associated jobs running will be killed and all future jobs 
>> submitted with associations in the group will be delayed until they are able 
>> to run inside the limit.
> 
> Can you please confirm that you have configured the "Priority Multifactor" 
> plugin?
Here are the relevant items from slurm.conf:


# Activate the Multifactor Job Priority Plugin with decay
PriorityType=priority/multifactor
 
# apply no decay
PriorityDecayHalfLife=0
 
# reset usage after 1 month
PriorityUsageResetPeriod=MONTHLY
 
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
 
# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0
 
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=1
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
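
The running slurmctld's view of these settings can also be confirmed with
something like the following (and, since association limits are only enforced
when AccountingStorageEnforce includes "limits", that value is probably worth
checking at the same time):

scontrol show config | grep -E 'PriorityType|PriorityDecay|PriorityUsageReset|AccountingStorageEnforce'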


> 
> Your jobs should not be able to start if the user's GrpTRESMins has been 
> exceeded.  Hence they won't be killed!

Yes, this works fine
> 
> Can you explain step by step what you observe?  It may be that the above 
> documentation of killing jobs is in error, in which case we should make a bug 
> report to SchedMD.

I set the GrpTRESMins limit to a very small number and then ran a sleep job 
that exceeded the limit. The job continued to run past the limits until I 
killed it. It was the only job in the queue. And if it makes any difference, 
this testing is being done in AWS on a parallel cluster.
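
A minimal sketch of the kind of test described above (the account name and limit
value here are placeholders, not the exact ones used):

# give the association a tiny limit, e.g. 1 CPU-minute
sacctmgr -i modify account name=test_acct set GrpTRESMins=cpu=1

# run a job that will go well past that limit
sbatch --account=test_acct --wrap="sleep 7200"

# watch whether the recorded usage moves while the job is still running
sshare -l -A test_acct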
> 
> /Ole



Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-24 Thread Ole Holm Nielsen

On 24-04-2023 18:33, Hoot Thompson wrote:
In my reading of the Slurm documentation, it seems that exceeding the 
limits set in GrpTRESMins should result in terminating a running job. 
However, in testing this, the ‘current value’ of GrpTRESMins only 
updates upon job completion and is not updated as the job progresses. 
Therefore jobs aren’t being stopped. On the positive side, no new jobs 
are started if the limit is exceeded. Here’s the documentation that is 
confusing me:


I think the job's resource usage will only be added to the Slurm database 
upon job completion.  I believe that Slurm doesn't update the resource 
usage continually as you seem to expect.


If any limit is reached, all running jobs with that TRES in this group 
will be killed, and no new jobs will be allowed to run.


Perhaps there is a setting or misconfiguration on my part.


The sacctmgr manual page states:


GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
The total number of TRES minutes that can possibly be used by past, present and 
future jobs running from this association and its children.  To clear a 
previously set value use the modify command with a new value of -1 for each 
TRES id.

NOTE: This limit is not enforced if set on the root association of a cluster.  
So even though it may appear in sacctmgr output, it will not be enforced.

ALSO NOTE: This limit only applies when using the Priority Multifactor plugin.  
The time is decayed using the value of PriorityDecayHalfLife or 
PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is reached 
all associated jobs running will be killed and all future jobs submitted with 
associations in the group will be delayed until they are able to run inside the 
limit.


Can you please confirm that you have configured the "Priority 
Multifactor" plugin?


Your jobs should not be able to start if the user's GrpTRESMins has been 
exceeded.  Hence they won't be killed!


Can you explain step by step what you observe?  It may be that the above 
documentation of killing jobs is in error, in which case we should make 
a bug report to SchedMD.


/Ole






[slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-24 Thread Hoot Thompson
In my reading of the Slurm documentation, it seems that exceeding the limits 
set in GrpTRESMins should result in terminating a running job. However, in 
testing this, the ‘current value’ of GrpTRESMins only updates upon job 
completion and is not updated as the job progresses. Therefore jobs aren’t 
being stopped. On the positive side, no new jobs are started if the limit is 
exceeded. Here’s the documentation that is confusing me:


If any limit is reached, all running jobs with that TRES in this group will be 
killed, and no new jobs will be allowed to run. 

Perhaps there is a setting or misconfiguration on my part.

Thanks in advance!
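
(For reference, one way to inspect the 'current value' that slurmctld itself is
enforcing against, as opposed to what the database reports after the fact, may
be the in-memory association view; the account name below is a placeholder and
the syntax is as I read it from the scontrol man page:)

# limits should be listed together with their current in-memory usage
scontrol show assoc_mgr accounts=test_acct flags=assoc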

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Ole Holm Nielsen

On 4/24/23 08:56, Purvesh Parmar wrote:
Thank you, will try this and get back. Are any other steps being missed here 
for the migration?


I don't know if any steps are missing, because I have never tried moving a 
cluster the way you want to.


/Ole

On Mon, 24 Apr 2023 at 12:08, Ole Holm Nielsen wrote:


On 4/24/23 08:09, Purvesh Parmar wrote:
 > thank you, however, because this is a change of data center, the names
 > of the servers contain the datacenter name in their hostnames and FQDNs
 > as well, hence I have to change both hostnames and IP addresses,
 > compulsorily, to the given hostnames as per the new DC names.

Could your data center be persuaded to introduce DNS CNAME aliases for the
old names to point to the new DC names?

If you're forced to use new DNS names only, then it's simple to change DNS
names of compute nodes and partitions in slurm.conf:

NodeName=...
PartitionName=xxx Nodes=...

as well as the slurmdb server name:

AccountingStorageHost=...

What I have never tried before is to change the DNS name of the slurmctld
host:

ControlMachine=...

The critical aspect here is that you need to stop all batch jobs, plus
slurmdbd and slurmctld.  Then you can backup (tar-ball) and transfer the
Slurm state directories:

StateSaveLocation=/var/spool/slurmctld

However, I don't know if the name of the ControlMachine is hard-coded in
the StateSaveLocation files?

I strongly suggest that you try to make a test migration of the cluster to
the new DC to find out if it works or not.  Then you can always make
multiple attempts without breaking anything.

Best regards,
Ole


 > On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
 >
 >     On 4/24/23 06:58, Purvesh Parmar wrote:
 >      > thank you, but it's a change of hostnames as well, apart from IP
 >      > addresses, of the slurm server, database server name and slurmd
 >      > compute nodes.
 >
 >     I suggest that you talk to your networking people and request that the
 >     old DNS names be created in the new network's DNS for your Slurm cluster.
 >     Then Ryan's solution will work.  Changing DNS names is a very simple
 >     matter!
 >
 >     My 2 cents,
 >     Ole
 >
 >
 >      > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski <novos...@rutgers.edu> wrote:
 >      >     I think it’s easier than all of this. Are you actually changing
 >      >     names of all of these things, or just IP addresses? If they all
 >      >     resolve to an IP now and you can bring everything down and change
 >      >     the hosts files or DNS, it seems to me that if the names aren’t
 >      >     changing, that’s that. I know that “scontrol show cluster” will
 >      >     show the wrong IP address but I think that updates itself.
 >      >
 >      >     The names of the servers are in slurm.conf, but again, if the
 >      >     names don’t change, that won’t matter. If you have IPs there, you
 >      >     will need to change them.
 >      >
 >      >     Sent from my iPhone
 >      >
 >      >      > On Apr 23, 2023, at 14:01, Purvesh Parmar <purveshp0...@gmail.com> wrote:
 >      >      > Hello,
 >      >      >
 >      >      > We have slurm 21.08 on ubuntu 20. We have a cluster of 8 nodes.
 >      >      > Entire slurm communication happens over 192.168.5.x network (LAN).
 >      >      > However as per requirement, now we are migrating the cluster to other
 >      >      > premises and there we have 172.16.1.x (LAN). I have to migrate the
 >      >      > entire network including SLURMDBD (mariadb), SLURMCTLD, SLURMD. Also
 >      >      > the cluster network is also changing from 192.168.5.x to 172.16.1.x
 >      >      > and each node will be assigned the ip address from the 172.16.1.x
 >      >      > network.
 >      >      > The cluster has been running for the last 3 months and it is
 >      >      > required to maintain the old usage stats 

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Purvesh Parmar
Thank you, will try this and get back. Are any other steps being missed here
for the migration?


Thank you,


Purvesh

On Mon, 24 Apr 2023 at 12:08, Ole Holm Nielsen wrote:

> On 4/24/23 08:09, Purvesh Parmar wrote:
> > thank you, however, because this is a change of data center, the names
> > of the servers contain the datacenter name in their hostnames and FQDNs
> > as well, hence I have to change both hostnames and IP addresses,
> > compulsorily, to the given hostnames as per the new DC names.
>
> Could your data center be persuaded to introduce DNS CNAME aliases for the
> old names to point to the new DC names?
>
> If you're forced to use new DNS names only, then it's simple to change DNS
> names of compute nodes and partitions in slurm.conf:
>
> NodeName=...
> PartitionName=xxx Nodes=...
>
> as well as the slurmdb server name:
>
> AccountingStorageHost=...
>
> What I have never tried before is to change the DNS name of the slurmctld
> host:
>
> ControlMachine=...
>
> The critical aspect here is that you need to stop all batch jobs, plus
> slurmdbd and slurmctld.  Then you can backup (tar-ball) and transfer the
> Slurm state directories:
>
> StateSaveLocation=/var/spool/slurmctld
>
> However, I don't know if the name of the ControlMachine is hard-coded in
> the StateSaveLocation files?
>
> I strongly suggest that you try to make a test migration of the cluster to
> the new DC to find out if it works or not.  Then you can always make
> multiple attempts without breaking anything.
>
> Best regards,
> Ole
>
>
> > On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
> >
> > On 4/24/23 06:58, Purvesh Parmar wrote:
> >  > thank you, but it's a change of hostnames as well, apart from IP
> >  > addresses, of the slurm server, database server name and slurmd
> >  > compute nodes.
> >
> > I suggest that you talk to your networking people and request that the
> > old DNS names be created in the new network's DNS for your Slurm cluster.
> > Then Ryan's solution will work.  Changing DNS names is a very simple
> > matter!
> >
> > My 2 cents,
> > Ole
> >
> >
> >  > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski <novos...@rutgers.edu> wrote:
> >  >
> >  > I think it’s easier than all of this. Are you actually changing names
> >  > of all of these things, or just IP addresses? If they all resolve to
> >  > an IP now and you can bring everything down and change the hosts files
> >  > or DNS, it seems to me that if the names aren’t changing, that’s that.
> >  > I know that “scontrol show cluster” will show the wrong IP address but
> >  > I think that updates itself.
> >  >
> >  > The names of the servers are in slurm.conf, but again, if the names
> >  > don’t change, that won’t matter. If you have IPs there, you will need
> >  > to change them.
> >  >
> >  > Sent from my iPhone
> >  >
> >  >  > On Apr 23, 2023, at 14:01, Purvesh Parmar <purveshp0...@gmail.com> wrote:
> >  >  > Hello,
> >  >  >
> >  >  > We have slurm 21.08 on ubuntu 20. We have a cluster of 8 nodes.
> >  >  > Entire slurm communication happens over 192.168.5.x network (LAN).
> >  >  > However as per requirement, now we are migrating the cluster to other
> >  >  > premises and there we have 172.16.1.x (LAN). I have to migrate the
> >  >  > entire network including SLURMDBD (mariadb), SLURMCTLD, SLURMD. Also
> >  >  > the cluster network is also changing from 192.168.5.x to 172.16.1.x
> >  >  > and each node will be assigned the ip address from the 172.16.1.x
> >  >  > network.
> >  >  > The cluster has been running for the last 3 months and it is
> >  >  > required to maintain the old usage stats as well.
> >  >  >
> >  >  >
> >  >  >  Is the procedure correct as below :
> >  >  >
> >  >  > 1) Stop slurm
> >  >  > 2) suspend all the queued jobs
> >  >  > 3) backup slurm database
> >  >  > 4) change the slurm & munge configuration i.e. munge conf, mariadb
> >  >  > conf, slurmdbd.conf, slurmctld.conf, slurmd.conf (on compute nodes),
> >  >  > gres.conf, service file
> >  >  > 5) Later, do the update in the slurm database by executing below
> >  >  > command
> >  >  > sacctmgr modify node where node=old_name set name=new_name
> >  >  > for all the nodes.
> >  >  > Also, I think, slurm server name and slurmdbd 

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Ole Holm Nielsen

On 4/24/23 08:09, Purvesh Parmar wrote:
thank you, however, because this is a change of data center, the names 
of the servers contain the datacenter name in their hostnames and FQDNs 
as well, hence I have to change both hostnames and IP addresses, 
compulsorily, to the given hostnames as per the new DC names.


Could your data center be persuaded to introduce DNS CNAME aliases for the 
old names to point to the new DC names?


If you're forced to use new DNS names only, then it's simple to change DNS 
names of compute nodes and partitions in slurm.conf:


NodeName=...
PartitionName=xxx Nodes=...

as well as the slurmdb server name:

AccountingStorageHost=...

What I have never tried before is to change the DNS name of the slurmctld 
host:


ControlMachine=...
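
Purely as an illustration (all hostnames below are made up), the renames in
slurm.conf would end up looking something like this:

# old names commented out, new-DC names substituted
#ControlMachine=ctl.olddc.example.org
ControlMachine=ctl.newdc.example.org
#AccountingStorageHost=db.olddc.example.org
AccountingStorageHost=db.newdc.example.org
#NodeName=olddc-node[01-08] ...
NodeName=newdc-node[01-08] ...
#PartitionName=normal Nodes=olddc-node[01-08]
PartitionName=normal Nodes=newdc-node[01-08]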

The critical aspect here is that you need to stop all batch jobs, plus 
slurmdbd and slurmctld.  Then you can backup (tar-ball) and transfer the 
Slurm state directories:


StateSaveLocation=/var/spool/slurmctld

However, I don't know if the name of the ControlMachine is hard-coded in 
the StateSaveLocation files?
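
A crude way to answer that question on a copy of the state directory, before
relying on anything, might be something like this (the path and hostname are
placeholders):

# look for the old controller hostname inside the saved state files
grep -rl 'ctl.olddc.example.org' /var/spool/slurmctld/
strings /var/spool/slurmctld/job_state | grep -i olddc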


I strongly suggest that you try to make a test migration of the cluster to 
the new DC to find out if it works or not.  Then you can always make 
multiple attempts without breaking anything.


Best regards,
Ole


On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen wrote:


On 4/24/23 06:58, Purvesh Parmar wrote:
 > thank you, but it's a change of hostnames as well, apart from IP
 > addresses, of the slurm server, database server name and slurmd
 > compute nodes.

I suggest that you talk to your networking people and request that the old
DNS names be created in the new network's DNS for your Slurm cluster.
Then Ryan's solution will work.  Changing DNS names is a very simple
matter!

My 2 cents,
Ole


 > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski <novos...@rutgers.edu> wrote:
 >
 >     I think it’s easier than all of this. Are you actually changing names
 >     of all of these things, or just IP addresses? If they all resolve to
 >     an IP now and you can bring everything down and change the hosts files
 >     or DNS, it seems to me that if the names aren’t changing, that’s that.
 >     I know that “scontrol show cluster” will show the wrong IP address but
 >     I think that updates itself.
 >
 >     The names of the servers are in slurm.conf, but again, if the names
 >     don’t change, that won’t matter. If you have IPs there, you will need
 >     to change them.
 >
 >     Sent from my iPhone
 >
 >      > On Apr 23, 2023, at 14:01, Purvesh Parmar <purveshp0...@gmail.com> wrote:
 >      > Hello,
 >      >
 >      > We have slurm 21.08 on ubuntu 20. We have a cluster of 8 nodes.
 >      > Entire slurm communication happens over 192.168.5.x network (LAN).
 >      > However as per requirement, now we are migrating the cluster to other
 >      > premises and there we have 172.16.1.x (LAN). I have to migrate the
 >      > entire network including SLURMDBD (mariadb), SLURMCTLD, SLURMD. Also
 >      > the cluster network is also changing from 192.168.5.x to 172.16.1.x
 >      > and each node will be assigned the ip address from the 172.16.1.x
 >      > network.
 >      > The cluster has been running for the last 3 months and it is
 >      > required to maintain the old usage stats as well.
 >      >
 >      >
 >      >  Is the procedure correct as below :
 >      >
 >      > 1) Stop slurm
 >      > 2) suspend all the queued jobs
 >      > 3) backup slurm database
 >      > 4) change the slurm & munge configuration i.e. munge conf, mariadb
 >      > conf, slurmdbd.conf, slurmctld.conf, slurmd.conf (on compute nodes),
 >      > gres.conf, service file
 >      > 5) Later, do the update in the slurm database by executing below
 >      > command
 >      > sacctmgr modify node where node=old_name set name=new_name
 >      > for all the nodes.
 >      > Also, I think, slurm server name and slurmdbd server names are also
 >      > required to be updated. How to do it, still checking
 >      > 6) Finally, start slurmdbd, slurmctld on server and slurmd on
 >      > compute nodes
 >      >
 >      > Please help and guide for above.
 >      >
 >      > Regards,
 >      >
 >      > Purvesh Parmar
 >      > INHAIT
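
As a side note on step 2 of the quoted procedure, a rough (untested) sketch for
holding everything that is still pending so nothing starts mid-migration, and
releasing it again afterwards:

# hold all pending jobs before the move
for j in $(squeue --noheader --state=PENDING --format=%i); do scontrol hold "$j"; done

# release them once the cluster is back up in the new DC
for j in $(squeue --noheader --state=PENDING --format=%i); do scontrol release "$j"; done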