Re: [slurm-users] Array jobs vs Fairshare

2020-10-21 Thread Riebs, Andy
Thanks for the additional information, Stephan!

At this point, I’ll have to ask for anyone with more job array experience than 
I have (because I have none!) to speak up.

Remember that we’re all in this together(*), so any help that anyone can offer 
will be good!

Andy

(*) Well, actually, I’m retiring at the end of the week, so I’m not sure that 
I’ll have a lot of Slurm in my life, going forward ☺

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Stephan Schott
Sent: Wednesday, October 21, 2020 9:40 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Array jobs vs Fairshare

And I forgot to mention: things are running on a Qlustar cluster based on 
Ubuntu 18.04.4 LTS (Bionic). 

On Wed, 21 Oct 2020 at 15:38, Stephan Schott (<schot...@hhu.de>) wrote:
Oh, sure, sorry.
We are using Slurm 18.08.8 with a backfill scheduler. The jobs are being 
assigned to the same partition, which limits GPUs and CPUs to 1 via QOS. Here 
are some of the main flags:

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty 
--preserve-env --mpi=none $SHELL"
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Sched
MinJobAge=300
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PreemptType=preempt/qos
PreemptMode=requeue
PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE
PriorityFavorSmall=YES
FairShareDampeningFactor=5
PriorityWeightAge=1000
PriorityWeightFairshare=5000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=5000
PriorityWeightTRES=gres/gpu=1000
AccountingStorageEnforce=limits,qos,nosteps
AccountingStorageTRES=gres/gpu
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup

Any ideas?

Cheers,

On Wed, 21 Oct 2020 at 15:17, Riebs, Andy (<andy.ri...@hpe.com>) wrote:
Also, of course, any of the information that you can provide about how the 
system is configured: scheduler choices, QOS options, and the like, would also 
help in answering your question.
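
For example, the output of something like this would capture most of the 
relevant settings (just a sketch; pick whatever applies to your site):

$ scontrol show config | grep -E -i 'SchedulerType|SelectType|PriorityType|PreemptType|Fairshare'
$ sacctmgr show qos
$ sshare -a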

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Riebs, Andy
Sent: Wednesday, October 21, 2020 9:02 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Array jobs vs Fairshare

Stephan (et al.),

There are probably 6 versions of Slurm in common use today, across multiple 
versions each of Debian/Ubuntu, SuSE/SLES, and RedHat/CentOS/Fedora. You are 
more likely to get a good answer if you offer some hints about what you are 
running!

Regards,
Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Stephan Schott
Sent: Wednesday, October 21, 2020 8:37 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Array jobs vs Fairshare

Hi everyone,
I am having doubts regarding array jobs. To me it seems that the 
JobArrayTaskLimit has precedence over the Fairshare, as users with a way lower 
priority seem to get constant allocations for their array jobs, compared to 
users with "normal" jobs. Can someone confirm this?
Cheers,

--
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany


--
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany


--
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany


Re: [slurm-users] Array jobs vs Fairshare

2020-10-21 Thread Riebs, Andy
Also, of course, any of the information that you can provide about how the 
system is configured: scheduler choices, QOS options, and the like, would also 
help in answering your question.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Riebs, Andy
Sent: Wednesday, October 21, 2020 9:02 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Array jobs vs Fairshare

Stephan (et al.),

There are probably 6 versions of Slurm in common use today, across multiple 
versions each of Debian/Ubuntu, SuSE/SLES, and RedHat/CentOS/Fedora. You are 
more likely to get a good answer if you offer some hints about what you are 
running!

Regards,
Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Stephan Schott
Sent: Wednesday, October 21, 2020 8:37 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Array jobs vs Fairshare

Hi everyone,
I am having doubts regarding array jobs. To me it seems that the 
JobArrayTaskLimit has precedence over the Fairshare, as users with a way lower 
priority seem to get constant allocations for their array jobs, compared to 
users with "normal" jobs. Can someone confirm this?
Cheers,

--
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany


Re: [slurm-users] Array jobs vs Fairshare

2020-10-21 Thread Riebs, Andy
Stephan (et al.),

There are probably 6 versions of Slurm in common use today, across multiple 
versions each of Debian/Ubuntu, SuSE/SLES, and RedHat/CentOS/Fedora. You are 
more likely to get a good answer if you offer some hints about what you are 
running!

Regards,
Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Stephan Schott
Sent: Wednesday, October 21, 2020 8:37 AM
To: Slurm User Community List 
Subject: [slurm-users] Array jobs vs Fairshare

Hi everyone,
I am having doubts regarding array jobs. To me it seems that the 
JobArrayTaskLimit has precedence over the Fairshare, as users with a way lower 
priority seem to get constant allocations for their array jobs, compared to 
users with "normal" jobs. Can someone confirm this?
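
A couple of commands that might show what the scheduler actually sees (a 
sketch; substitute real user names and job IDs):

$ sprio -l -u userA,userB                    # per-job priority factors, including fair-share
$ sshare -u userA,userB                      # fair-share usage and shares
$ squeue -r -u userA -o "%.12i %.10Q %.20r"  # per-task priority and pending reason (e.g. JobArrayTaskLimit)
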
Cheers,

--
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany


Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-06 Thread Riebs, Andy
> The problem is with a single, specific, node: str957-bl0-03 . The same
> job script works if being allocated to another node, even with more
> ranks (tested up to 224/4 on mtx-* nodes).

Ahhh... here's where the details help. So it appears that the problem is on a 
single node, and probably not a general configuration or system problem. I 
suggest starting with something like this to help figure out why node bl0-03 
is different:

$ sudo ssh str957-bl0-02 lscpu
$ sudo ssh str957-bl0-03 lscpu

Andy

-Original Message-
From: Diego Zuccato [mailto:diego.zucc...@unibo.it] 
Sent: Tuesday, October 6, 2020 3:13 AM
To: Riebs, Andy ; Slurm User Community List 

Subject: Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

On 05/10/20 14:18, Riebs, Andy wrote:

Tks for considering my query.

> You need to provide some hints! What we know so far:
> 1. What we see here is a backtrace from (what looks like) an Open MPI/PMI-x 
> backtrace.
Correct.

> 2. Your decision to address this to the Slurm mailing list suggests that you 
> think that Slurm might be involved.
At least I couldn't replicate launching manually (it always says "no
slots available" unless I use mpirun -np 16 ...). I'm no MPI expert
(actually less than a noob!) so I can't rule out it's unrelated to
Slurm. I mostly hope that on this list I can find someone with enough
experience with both Slurm and MPI.

> 3. You have something (a job? a program?) that segfaults when you go from 30 
> to 32 processes.
Multiple programs, actually.

> a. What operating system?
Debian 10.5 . Only extension is PBIS-open to authenticate users from AD.

> b. Are you seeing this while running Slurm? What version?
18.04, Debian packages

> c. What version of Open MPI?
openmpi-bin/stable,now 3.1.3-11 amd64

> d. Are you building your own PMI-x, or are you using what's provided by Open 
> MPI and Slurm?
Using Debian packages

> e. What does your hardware configuration look like -- particularly, what cpu 
> type(s), and how many cores/node?
The node uses dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz for a total
of 32 threads (hyperthreading is enabled: 2 sockets, 8 cores per socket,
2 threads per core).

> f. What does you Slurm configuration look like (assuming you're seeing this 
> with Slurm)? I suggest purging your configuration files of node names and IP 
> addresses, and including them with your query.
Here it is:
-8<--
SlurmCtldHost=str957-cluster(*.*.*.*)
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
#DisableRootJobs=NO
EnforcePartLimits=YES
JobSubmitPlugins=lua
MpiDefault=none
MpiParams=ports=12000-12999
ReturnToService=2
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
TmpFS=/mnt/local_data/
UsePAM=1
GetEnvTimeout=20
InactiveLimit=0
KillWait=120
MinJobAge=300
SlurmctldTimeout=20
SlurmdTimeout=30
FastSchedule=0
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityFlags=MAX_TRES
PriorityType=priority/multifactor
PreemptMode=CANCEL
PreemptType=preempt/partition_prio
AccountingStorageEnforce=safe,qos
AccountingStorageHost=str957-cluster
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=6819
#AccountingStorageTRES=
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
AcctGatherNodeFreq=300
ClusterName=oph
JobCompLoc=/var/spool/slurm/jobscompleted.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=DEFAULT Sockets=2 ThreadsPerCore=2 State=UNKNOWN
NodeName=str957-bl0-0[1-2] CoresPerSocket=6 Feature=ib,blade,intel
NodeName=str957-bl0-0[3-5] CoresPerSocket=8 Feature=ib,blade,intel
NodeName=str957-bl0-[15-16] CoresPerSocket=4 Feature=ib,nonblade,intel
NodeName=str957-bl0-[17-18] CoresPerSocket=6 ThreadsPerCore=1
Feature=nonblade,amd
NodeName=str957-bl0-[19-20] Sockets=4 CoresPerSocket=8 ThreadsPerCore=1
Feature=nonblade,amd
NodeName=str957-mtx-[00-15] CoresPerSocket=14 Feature=ib,nonblade,intel
-8<--

> g. What does your command line look like? Especially, are you trying to run 
> 32 processes on a single node? Spreading them out across 2 or more nodes?
The problem is with a single, specific, node: str957-bl0-03 . The same
job script works if being allocated to another node, even with more
ranks (tested up to 224/4 on mtx-* nodes).

> h. Can you reproduce the problem if you substitute `hostname` or `true` for 
> the program in the command line? What about a simple MPI-enabled "hello 
> world?"I'll try ASAP w/ a simple 'hostname'. But I expect it 

Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-05 Thread Riebs, Andy
You need to provide some hints! What we know so far:

1. What we see here is a backtrace from (what looks like) an Open MPI/PMI-x 
backtrace.
2. Your decision to address this to the Slurm mailing list suggests that you 
think that Slurm might be involved.
3. You have something (a job? a program?) that segfaults when you go from 30 to 
32 processes.

At a minimum, it would help your readers' understanding, and ability to help, 
to know:

a. What operating system?
b. Are you seeing this while running Slurm? What version?
c. What version of Open MPI?
d. Are you building your own PMI-x, or are you using what's provided by Open 
MPI and Slurm?
e. What does your hardware configuration look like -- particularly, what cpu 
type(s), and how many cores/node?
f. What does you Slurm configuration look like (assuming you're seeing this 
with Slurm)? I suggest purging your configuration files of node names and IP 
addresses, and including them with your query.
g. What does your command line look like? Especially, are you trying to run 32 
processes on a single node? Spreading them out across 2 or more nodes?
h. Can you reproduce the problem if you substitute `hostname` or `true` for the 
program in the command line? What about a simple MPI-enabled "hello world?"

Andy

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Monday, October 5, 2020 7:05 AM
To: Slurm User Community List 
Subject: [slurm-users] Segfault with 32 processes, OK with 30 ???

Hello all.

I'm seeing (again) this weird issue.
The same executable, launched with 32 processes crashes immediately,
while it runs flawlessy with only 30 processes.

The reported error is:
[str957-bl0-03:05271] *** Process received signal ***
[str957-bl0-03:05271] Signal: Segmentation fault (11)
[str957-bl0-03:05271] Signal code: Address not mapped (1)
[str957-bl0-03:05271] Failing at address: 0x7f3826fb4008
[str957-bl0-03:05271] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f3825df6730]
[str957-bl0-03:05271] [ 1]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f3824553936]
[str957-bl0-03:05271] [ 2]
/usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f382452a733]
[str957-bl0-03:05271] [ 3]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f38245535b4]
[str957-bl0-03:05271] [ 4]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f382467946e]
[str957-bl0-03:05271] [ 5]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f382463188d]
[str957-bl0-03:05271] [ 6]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f38245edd7c]
[str957-bl0-03:05271] [ 7]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f38246e9fe4]
[str957-bl0-03:05271] [ 8]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f3826fb9656]
[str957-bl0-03:05271] [ 9]
/usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f3825b8011a]
[str957-bl0-03:05271] [10]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f3825e50e62]
[str957-bl0-03:05271] [11]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f3825e7f17e]
[str957-bl0-03:05271] [12] ./C-GenIC(+0x23b9)[0x55bf9fa8e3b9]
[str957-bl0-03:05271] [13]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f3825c4709b]
[str957-bl0-03:05271] [14] ./C-GenIC(+0x251a)[0x55bf9fa8e51a]
[str957-bl0-03:05271] *** End of error message ***


In the past, just installing gdb to try to debug it made the problem
disappear: obviously it was not a solution...

Any hint?

TIA

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786




Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Riebs, Andy
Relu,

There are a number of ways to run an open source project. In the case of Slurm, 
the code is managed by SchedMD. As a rule, one presumes that they have plenty 
on their plate, and little time to respond to the mailing list. Hence the 
suggestion that one get a support contract to get their attention. I’m not 
complaining, it’s just the way it works.

This mailing list is handled 99% by users like you and me. If you’ve got a 
great idea, particularly if you have an implementation, one of the best ways to 
handle it is to describe your innovation here, asking for feedback if you 
choose, and then offer the patch here on the mailing list or, as Ryan suggests, 
post it in the Bugzilla.

Andy


From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Ryan Novosielski
Sent: Wednesday, September 30, 2020 11:35 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] How to contact slurm developers

I’ve previously seen code contributed back in that way. See bug 1611 as an 
example (happened to have looked at that just yesterday).
--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 11:29, Relu Patrascu <r...@cs.toronto.edu> wrote:

Thanks Ryan, I'll try the bugs site. And indeed, one person in our organization 
has already said "let's pay for support, maybe they'll listen." :) It's a 
little bit funny to me that we don't actually need support, but get it hoping 
that they might consider adding a feature which we think would benefit everyone.

We have actually modified the code on both v 19 and 20 to do what we would 
like, preemption within the same QOS, but we think that the community would 
benefit from this feature, hence our request to have it in the release version.
Relu

On 2020-09-30 11:02, Ryan Novosielski wrote:
Depends on the issue I think, but the bugs site is often a way to request 
enhancements, etc. Of course, requests coming from an entity with a support 
contact carry more weight.
--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


On Sep 30, 2020, at 10:57, Relu Patrascu wrote:
Hi all,

I posted recently on this mailing list a feature request and got no reply from 
the developers. Is there a better way to contact the slurm developers or we 
should just accept that they are not interested in community feedback?

Regards,

Relu



Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Riebs, Andy
Check for Ethernet problems. This happens often enough that I have the 
following definition in my .bashrc file to help track these down:

alias flaky_eth='su -c "ssh slurmctld-node grep responding 
/var/log/slurm/slurmctld.log"'

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
???
Sent: Tuesday, July 21, 2020 8:41 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] lots of job failed due to node failure

Hi all,
We run Slurm 19.05 on a cluster of about 1k nodes. Recently we found lots of 
jobs failed due to node failure; checking slurmctld.log, we found nodes are 
set to the down state and then resumed quickly.
some log info:
[2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
[2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
[2020-07-20T00:26:23.725] error: Nodes j1802 not responding
[2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
[2020-07-20T00:26:46.602] Node j1608 now responding
[2020-07-20T00:26:49.449] Node j1802 now responding

Anyone hit this issue before?
Any suggestions will help.

Regards.


Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Riebs, Andy
Frankly, it's hard to tell what you might be doing wrong if you don't tell us 
what you're doing!

That notwithstanding, the "--uid" message suggests that something in your 
process is trying to submit a job with the "--uid" option, but you don't have 
sufficient privs to use it.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Sidhu, Khushwant
Sent: Monday, July 20, 2020 10:50 AM
To: Slurm User Community List 
Subject: [slurm-users] slurm & rstudio

Hi,

I'm trying to use rstudio & slurm to submit jobs but am getting the attached 
error.
Has anyone come across it or know what I'm doing wrong ?

Thanks

Regards

Khush

Disclaimer: This email and any attachments are sent in strictest confidence for 
the sole use of the addressee and may contain legally privileged, confidential, 
and proprietary data. If you are not the intended recipient, please advise the 
sender by replying promptly to this email and then delete and destroy this 
email and any attachments without any further use, copying or forwarding.


Re: [slurm-users] Meaning of "defunct" in description of Slurm parameters

2020-07-20 Thread Riebs, Andy
Ummm... unless I'm missing something obvious, though "defunct" is not the term 
I would have chosen (I would have expected "deprecated"), it seems quite clear 
that the new "SlurmctldHost" parameter has subsumed the 4 that you've listed. 
I wasn't privy to the discussion about adding the new parameter, so the value 
isn't enormously clear to me, except for the option of adding a second backup 
host (for a total of 3 "ControlMachine" candidates).

Andy

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Kevin Buckley
Sent: Sunday, July 19, 2020 11:50 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Meaning of "defunct" in description of Slurm parameters

At https://slurm.schedmd.com/slurm.conf.html 

we read

BackupAddr
 Defunct option, see SlurmctldHost.

BackupController
 Defunct option, see SlurmctldHost. ...

ControlAddr
 Defunct option, see SlurmctldHost.

ControlMachine
 Defunct option, see SlurmctldHost.


but what does "Defunct" actually mean there?


A top-ranked internet search result suggests that it means


  "no longer existing or functioning."


but if that's the case, then if, say, your only definition of
a "ControlMachine" in your 20.02's slurm.conf was in the
defunct ControlMachine parameter, how would your Slurm
instance know what name the "ControlMachine" had?


Kevin
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre



Re: [slurm-users] How to exclude nodes in sbatch/srun?

2020-06-22 Thread Riebs, Andy
In fairness to our friends at SchedMD, this was filed as an enhancement 
request, not a bug.

Since this is an open source project, there are 2 good ways to make it happen:


1.   Fund someone, like SchedMD, to make the change.

2.   Make the changes yourself, and submit the changes.

Alternatively, I like the “NOxxx features” approach a lot!

/s/ Speaking for myself

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Paul Edmon
Sent: Monday, June 22, 2020 9:33 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] [External] How to exclude nodes in sbatch/srun?


For the record we filed a bug on this years ago: 
https://bugs.schedmd.com/show_bug.cgi?id=3875
  Hasn't been fixed yet though everyone seems to agree it is a good idea.

Florian's suggestion is probably the best stopgap until this feature is 
implemented.

-Paul Edmon-
On 6/22/2020 7:11 AM, Florian Zillner wrote:
Durai,

To overcome this, we use noXXX features like below. Users can then request 
“8268” to select nodes with 8268s on EDR without GPUs for example.

# scontrol show node node5000 |grep AvailableFeatures
   
AvailableFeatures=192GB,2933MHz,SD530,Platinum,8268,rack25,EDR,sb7890_0416,enc2514,24C,SNCoff,noGPU,DISK,SSD

Cheers,
Florian
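
A minimal sketch of how that can be wired up (hypothetical node and feature 
names; the constraint string ANDs the features):

# slurm.conf (node definition excerpt)
NodeName=node50[00-49] ... Feature=192GB,2933MHz,8268,EDR,noGPU

# job submission
$ sbatch --constraint="8268&EDR&noGPU" job.sh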


From: slurm-users On Behalf Of Durai Arasan
Sent: Montag, 22. Juni 2020 11:02
To: Slurm User Community List 

Subject: [External] [slurm-users] How to exclude nodes in sbatch/srun?

Hi,

The sbatch/srun commands have the "--constraint" option to select nodes with 
certain features. With this you can specify AND, OR, matching OR operators. But 
there is no NOT operator. How do you exclude nodes with a certain feature in 
the "--constraint" option? Or is there another option that can do it?

Thanks,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen



Re: [slurm-users] Slurm and shared file systems

2020-06-19 Thread Riebs, Andy
David,

I've been using Slurm for nearly 20 years, and while I can imagine some clever 
work-arounds, like staging your job in /var/tmp on all of the nodes before 
trying to run it, it's hard to imagine a cluster serving a useful purpose 
without a shared user file system, whether or not Slurm is involved.
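
To illustrate the sort of staging I have in mind, a batch script might do 
something roughly like this (a sketch only; paths and program names are made 
up):

#!/bin/bash
#SBATCH -N 4
# copy the executable to node-local storage on every node in the allocation
sbcast ./my_app /var/tmp/my_app
# run from the node-local copies
srun /var/tmp/my_app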

Having said that, I hope that someone comes up with a real use case to help me 
see something that I don't currently see!

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
David Baker
Sent: Friday, June 19, 2020 8:05 AM
To: Slurm User Community List 
Subject: [slurm-users] Slurm and shared file systems

Hello,

We are currently helping a research group to set up their own Slurm cluster. 
They have asked a very interesting question about Slurm and file systems. That 
is, they are posing the question -- do you need a shared user file store on a 
Slurm cluster?

So, in the extreme case where there is no shared file store for users, can 
Slurm operate properly over a cluster? I have seen commands like sbcast to 
move a file from the submission node to a compute node, however that command 
can only transfer one file at a time. Furthermore, what would happen to the 
standard output files? I'm going to guess that there must be a shared file 
system, however it would be good if someone could please confirm this.

Best regards,
David




Re: [slurm-users] unable to start slurmd process.

2020-06-13 Thread Riebs, Andy
Navin, thanks for the update, and congrats on finding the problem!

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Saturday, June 13, 2020 1:21 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

Hi Team,

After my analysis I found that the user had used the qdel command, which is a 
plugin with Slurm, and the job was not killed properly, which left the 
slurmstepd process in a kind of hung state. That is why slurmd would not 
start; after killing those processes, slurmd started without any issues.

Regards
Navin.




On Thu, Jun 11, 2020 at 9:23 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
Short of getting on the system and kicking the tires myself, I’m fresh out of 
ideas. Does “sinfo -R” offer any hints?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 11:31 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

i am able to get the output scontrol show node oled3
also the oled3 is pinging fine

and scontrol ping output showing like

Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN

so all looks ok to me.

Regards
Navin.



On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
So there seems to be a failure to communicate between slurmctld and the oled3 
slurmd.

From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.

From the head node, try “scontrol show node oled3”, and then ping the address 
that is shown for “NodeAddr=”

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 10:40 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

i collected the log from slurmctld and it says below

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 
RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeu

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Short of getting on the system and kicking the tires myself, I’m fresh out of 
ideas. Does “sinfo -R” offer any hints?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 11:31 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

i am able to get the output scontrol show node oled3
also the oled3 is pinging fine

and scontrol ping output showing like

Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN

so all looks ok to me.

Regards
Navin.



On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
So there seems to be a failure to communicate between slurmctld and the oled3 
slurmd.

From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.

From the head node, try “scontrol show node oled3”, and then ping the address 
that is shown for “NodeAddr=”

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 10:40 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

i collected the log from slurmctld and it says below

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 
RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN

sinfo says

OLED*   up   infinite  1 drain* oled3

while checking the node i feel node is healthy.

Regards
Navin

On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to 
interpret it not reporting anything but the “log file” and “munge” messages. 
When you have it running attached to your window, is there any chance that 
sinfo or scontrol suggest that the node is actually all right? Perhaps 
something in /etc/sysconfig/slurm or the like is messed up?

If that’s not the case, I think my next step would be to follow up on so

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to 
interpret it not reporting anything but the “log file” and “munge” messages. 
When you have it running attached to your window, is there any chance that 
sinfo or scontrol suggest that the node is actually all right? Perhaps 
something in /etc/sysconfig/slurm or the like is messed up?

If that’s not the case, I think my next step would be to follow up on someone 
else’s suggestion, and scan the slurmctld.log file for the problem node name.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:26 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

Sorry Andy, I missed adding this.
First I tried slurmd -Dvvv and it did not write anything beyond:
slurmd: debug:  Log file re-opened
slurmd: debug:  Munge authentication plugin loaded

After that I waited for 10-20 minutes, but there was no further output, and 
finally I pressed Ctrl-C.

My doubt is in slurm.conf file:

ControlMachine=deda1x1466
ControlAddr=192.168.150.253

deda1x1466 has a different interface with a different IP: the compute node is 
unable to ping the hostname, but the IP is pingable. Could that be one of the 
reasons?

But other nodes have the same config and there I am able to start slurmd, so 
I am a bit confused.

Regards
Navin.














On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
If you omitted the “-D” that I suggested, then the daemon would have detached 
and logged nothing on the screen. In this case, you can still go to the slurmd 
log (use "scontrol show config | grep -i log" if you're not sure where the logs 
are stored).

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:01 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] unable to start slurmd process.

I tried by executing the debug mode but there also it is not writing anything.

i waited for about 5-10 minutes

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on terminal.

The OS is SLES12-SP4 . All firewall services are disabled.

The recent change is the hostnames: earlier we used local hostnames (node1, 
node2, etc.), but we have moved to DNS-based hostnames (deda*):

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] 
Sockets=2 CoresPerSocket=10 State=UNKNOWN

Other than this it is fine; since that change I have started the slurmd 
process on the node several times and it worked fine, but now I am seeing 
this issue today.

Regards
Navin.









On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you're done, if necessary. Of course, if it doesn't fail when 
you run it this way, it’s time to look elsewhere.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered 
failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with 
result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any changes 
but today when i am starting it is giving this error message.
Any idea what could be wrong here.

Regards
Navin.






Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
If you omitted the “-D” that I suggested, then the daemon would have detached 
and logged nothing on the screen. In this case, you can still go to the slurmd 
log (use "scontrol show config | grep -i log" if you're not sure where the logs 
are stored).

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:01 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

I tried by executing the debug mode but there also it is not writing anything.

i waited for about 5-10 minutes

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on terminal.

The OS is SLES12-SP4 . All firewall services are disabled.

The recent change is the hostnames: earlier we used local hostnames (node1, 
node2, etc.), but we have moved to DNS-based hostnames (deda*):

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] 
Sockets=2 CoresPerSocket=10 State=UNKNOWN

Other than this it is fine; since that change I have started the slurmd 
process on the node several times and it worked fine, but now I am seeing 
this issue today.

Regards
Navin.









On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you're done, if necessary. Of course, if it doesn't fail when 
you run it this way, it’s time to look elsewhere.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered 
failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with 
result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any changes 
but today when i am starting it is giving this error message.
Any idea what could be wrong here.

Regards
Navin.






Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you're done, if necessary. Of course, if it doesn't fail when 
you run it this way, it’s time to look elsewhere.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List 
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered 
failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with 
result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any changes 
but today when i am starting it is giving this error message.
Any idea what could be wrong here.

Regards
Navin.






Re: [slurm-users] Intermittent problem at 32 CPUs

2020-06-05 Thread Riebs, Andy
Diego,

I'm *guessing* that you are tripping over the use of "--tasks 32" on a 
heterogeneous cluster, though your comment about the node without InfiniBand 
troubles me. If you drain that node, or exclude it in your command line, that 
might correct the problem. I wonder if OMPI and PMIx have decided that IB is 
the way to go, and are failing when they try to set up on the node without IB.

If that's not it, I'd try
0. Check sacct for the node lists for the successful and unsuccessful runs -- a 
problem node might jump out.
1. Running your job with explicit node lists. Again, you may find a problem 
node this way.
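
For example (a sketch; substitute real job IDs and node names):

$ sacct -j 12345,12346 -o JobID,NodeList%40,State,ExitCode
$ sbatch --nodelist=str957-bl0-[17-20] job.sh    # or: --exclude=str957-bl0-19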

HTH!
Andy

p.s. If this doesn't fix it, please include the Slurm and OMPI versions, and a 
copy of your slurm.conf file (with identifying information like node names 
removed) in your next note to this list.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Friday, June 5, 2020 9:08 AM
To: Slurm User Community List 
Subject: [slurm-users] Intermittent problem at 32 CPUs

Hello all.

I already tried for some weeks to debug this problem, but it seems I'm
still missing something.
I have a small, (very) heterogeneous cluster. After upgrading to Debian
10 and packaged versions of Slurm and IB drivers/tools, I noticed that
*sometimes* jobs requesting 32 or more threads fail with an error like:
-8<--
[str957-bl0-19:30411] *** Process received signal ***
[str957-bl0-19:30411] Signal: Segmentation fault (11)
[str957-bl0-19:30411] Signal code: Address not mapped (1)
[str957-bl0-19:30411] Failing at address: 0x7fb206380008
[str957-bl0-19:30411] [ 0]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7fb205eb7840]
[str957-bl0-19:30411] [ 1]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7fb200ac2936]
[str957-bl0-19:30411] [ 2]
/usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7fb200a92733]
[str957-bl0-19:30411] [ 3]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7fb200ac25b4]
[str957-bl0-19:30411] [ 4]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7fb200bba46e]
[str957-bl0-19:30411] [ 5]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7fb200b7288d]
[str957-bl0-19:30411] [ 6]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7fb200b2ed7c]
[str957-bl0-19:30411] [ 7]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7fb200c35fe4]
[str957-bl0-19:30411] [ 8]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7fb201462656]
[str957-bl0-19:30411] [ 9]
/usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7fb202a9211a]
[str957-bl0-19:30411] [10]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7fb203f23e62]
[str957-bl0-19:30411] [11]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Init_thread+0x55)[0x7fb203f522d5]
-8<--
Just changing --ntasks=32 to --ntasks=30 (or less) lets it run w/o problems.
*Sometimes* it works even with --ntasks=32.
But the most absurd thing I've seen is this (just changing the step in
the batch job):
-8<--
mpirun ./mpitest => KO
gdb -batch -n -ex 'set pagination off' -ex run -ex bt -ex 'bt full' -ex
'thread apply all bt full' --args mpirun --mca btl openib --mca mtl psm2
./mpitest-debug => OK
mpirun --mca btl openib --mca mtl psm2 ./mpitest-debug => OK
mpirun --mca mtl psm2 ./mpitest-debug => OK
mpirun ./mpitest-debug => OK
mpirun ./mpitest => OK?!?!?!?!
-8<--

At the end, *the same* command that consistently failed, started to run.
The currently problematic node is one w/o InfiniBand, so that can
probably be ruled out.

Any hints?

TIA.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786




Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Riebs, Andy
Geoffrey,

A lot depends on what you mean by “failure on the current machine”. If it’s a 
failure that Slurm recognizes as a failure, Slurm can be configured to remove 
the node from the partition, and you can follow Rodrigo’s suggestions for the 
requeue options.

If the user job simply decides it’s unhappy with the node, but Slurm doesn’t 
see a problem, they could have the job resubmit itself via sbatch, with the 
problem node(s) excluded. (Or, a horrible option, you could grant sudo 
privileges for the user to run scontrol so that the user could drain the 
problem nodes -- but the situations where this would be a good solution are 
rare!)
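
A rough sketch of the resubmit-with-exclusions idea (hypothetical and 
untested; file and program names are made up):

#!/bin/bash
#SBATCH -J retry_example
BAD_NODES_FILE=bad_nodes.list
if ! ./my_app; then
    # record the node we just failed on, then resubmit excluding all known-bad nodes
    echo "$SLURMD_NODENAME" >> "$BAD_NODES_FILE"
    sbatch --exclude="$(paste -sd, "$BAD_NODES_FILE")" "$0"
fi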

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Rodrigo Santibáñez
Sent: Thursday, June 4, 2020 4:16 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Change ExcNodeList on a running job

Hello,

Jobs can be requeue if something wrong happens, and the node with failure 
excluded by the controller.

--requeue
Specifies that the batch job should eligible to being requeue. The job may be 
requeued explicitly by a system administrator, after node failure, or upon 
preemption by a higher priority job. When a job is requeued, the batch script 
is initiated from its beginning. Also see the --no-requeue option. The 
JobRequeue configuration parameter controls the default behavior on the cluster.

Also, jobs can be run selecting a specific node or excluding nodes

-w, --nodelist=
Request a specific list of hosts. The job will contain all of these hosts and 
possibly additional hosts as needed to satisfy resource requirements. The list 
may be specified as a comma-separated list of hosts, a range of hosts 
(host[1-5,7,...] for example), or a filename. The host list will be assumed to 
be a filename if it contains a "/" character. If you specify a minimum node or 
processor count larger than can be satisfied by the supplied host list, 
additional resources will be allocated on other nodes as needed. Duplicate node 
names in the list will be ignored. The order of the node names in the list is 
not important; the node names will be sorted by Slurm.

-x, --exclude=
Explicitly exclude certain nodes from the resources granted to the job.

does this help?
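
On the requeue-counter part of the question: Slurm sets SLURM_RESTART_COUNT in 
the environment of batch jobs that have been requeued, so a sketch along these 
lines (untested) may do what you want:

#!/bin/bash
#SBATCH --requeue
MAX_TRIES=3
if [ "${SLURM_RESTART_COUNT:-0}" -ge "$MAX_TRIES" ]; then
    echo "Giving up after $MAX_TRIES requeues" >&2
    exit 1
fi
# if the application fails, requeue this job
./my_app || scontrol requeue "$SLURM_JOB_ID"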

On Thu, 4 Jun 2020 at 16:03, Ransom, Geoffrey M. 
(<geoffrey.ran...@jhuapl.edu>) wrote:

Hello
   We are moving from Univa (SGE) to Slurm, and one of our users has jobs that, 
if they detect a failure on the current machine, add that machine to their 
exclude list and requeue themselves. The user wants to emulate that behavior 
in Slurm.

It seems like “scontrol update job ${SLURM_JOB_ID} ExcNodeList $NEWExcNodeList” 
won’t work on a running job, but it does work on a job pending in the queue. 
This means the job can’t do this step and requeue itself to avoid running on 
the same host as before.

Our user wants his jobs to be able to exclude the current node and requeue 
itself.
Is there some way to accomplish this in slurm?
Is there a requeue counter of some sort so a job can see if it has requeued 
itself more than X times and give up?

Thanks.


Re: [slurm-users] Node suspend / Power saving - for *idle* nodes only?

2020-05-15 Thread Riebs, Andy
And if you're willing to buy a support contract with SchedMD, and/or provide a 
fix, it will be fixed. Otherwise, you'll have to accept that you've got a large 
group of users, just like you, who are willing to share their expertise and 
experience, even if it's not our "day job" -- or even our "night job" :)

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Florian Zillner
Sent: Friday, May 15, 2020 9:27 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] [External] Re: Node suspend / Power saving - for 
*idle* nodes only?

FWIW this is a known bug: 
https://bugs.schedmd.com/show_bug.cgi?id=5348
(Bug 5348: "Suspending Nodes which are not in IDLE mode")



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Florian Zillner <fzill...@lenovo.com>
Sent: Thursday, 14 May 2020 15:43
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [External] Re: Node suspend / Power saving - for 
*idle* nodes only?

Well, the documentation is rather clear on this: "SuspendTime: Nodes becomes 
eligible for power saving mode after being idle or down for this number of 
seconds."
A drained node is neither idle nor down in my mind.
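
As a stopgap while debugging a node, the suspend-exclusion knobs might help (a 
sketch; names and values are examples only):

# slurm.conf (excerpt)
SuspendTime=1800                # idle/down nodes become eligible for suspend after 30 min
SuspendExcNodes=node0[01-04]    # never power these nodes down (e.g. nodes being debugged)
SuspendExcParts=debug           # or exclude an entire partition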

Thanks,
Florian


From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Steffen Grunewald <steffen.grunew...@aei.mpg.de>
Sent: Thursday, 14 May 2020 15:34
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [External] Re: [slurm-users] Node suspend / Power saving - for *idle* 
nodes only?

On Thu, 2020-05-14 at 13:10:04 +, Florian Zillner wrote:
> Hi,
>
> I'm experimenting with slurm's power saving feature and shutdown of "idle" 
> nodes works in general, also the power up works when "idle~" nodes are 
> requested.
> So far so good, but slurm is also shutting down nodes that are not explicitly 
> "idle". Previously I drained a node to debug something on it and slurm shut 
> it down when the SuspendTimeout was reached.

Perhaps you should have put that node in maint mode?

- S


Re: [slurm-users] Do not upgrade mysql to 5.7.30!

2020-05-07 Thread Riebs, Andy
Alternatively, you could switch to MariaDB; I've been using that for years.

Andy

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Marcus Wagner
Sent: Thursday, May 7, 2020 8:55 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Do not upgrade mysql to 5.7.30!

Definitively not up to now, just checked the sources of 20.02.2, the 
same problem there.

Seems, someone with a contract needs to open a ticket.


Best
Marcus

Am 07.05.2020 um 10:50 schrieb Bill Broadley:
> On 5/6/20 11:30 AM, Dustin Lang wrote:
>> Hi,
>>
>> Ubuntu has made mysql 5.7.30 the default version.  At least with 
>> Ubuntu 16.04, this causes severe problems with Slurm dbd (v 17.x, 
>> 18.x, and 19.x; not sure about 20). 
> 
> I can confirm that kills slurmdbd on ubuntu 18.04 as well.  I had 
> compiled slurm from source using version 19.05.3-2.
> 
> Is any released version of slurm known to work with mysql 5.7.30?
> 
> 



Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Riebs, Andy
A couple of quick checks to see if the problem is munge:

1.   On the problem node, try
$ echo foo | munge | unmunge

2.   If (1) works, try this from the node running slurmctld to the problem 
node
slurm-node$ echo foo | ssh node munge | unmunge

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Dean Schulze
Sent: Friday, April 17, 2020 3:40 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Munge decode failing on new node

There is no ntp service running on any of my nodes, and all but this one is 
working.  I haven't heard that ntp is a requirement for slurm, just that the 
time be synchronized across the cluster.  And it is.

On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy 
mailto:mini...@gmail.com>> wrote:
I’d check ntp as your encoding time seems odd to me

On Wed, 15 Apr 2020 at 19:59, Dean Schulze <dean.w.schu...@gmail.com> wrote:
I've installed two new nodes onto my slurm cluster.  One node works, but the 
other one complains about an invalid credential for munge.  I've verified that 
the munge.key is the same as on all other nodes with

sudo cksum /etc/munge/munge.key

I recopied a munge.key from a node that works.  I've verified that munge uid 
and gid are the same on the nodes.  The time is in sync on all nodes.

Here is what is in the slurmd.log:

 error: Unable to register: Unable to contact slurm controller (connect failure)
 error: Munge decode failed: Invalid credential
 ENCODED: Wed Dec 31 17:00:00 1969
 DECODED: Wed Dec 31 17:00:00 1969
 error: authentication: Invalid authentication credential
 error: slurm_receive_msg_and_forward: Protocol authentication error
 error: service_connection: slurm_receive_msg: Protocol authentication error
 error: Unable to register: Unable to contact slurm controller (connect failure)

I've checked in the munged.log and all it says is

Invalid credential

Thanks for your help
--
--
Carles Fenoy


Re: [slurm-users] Munge decode failing on new node

2020-04-15 Thread Riebs, Andy
Two trivial things to check:

1.   Permissions on /etc/munge and /etc/munge/munge.key

2.   Is munged running on the problem node?
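
For example (a quick sketch; the service name and key permissions may vary by distro):

$ ls -ld /etc/munge; ls -l /etc/munge/munge.key   # key should be owned by the munge user, typically mode 0400
$ systemctl status munge                          # or: pgrep -a munged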

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Dean Schulze
Sent: Wednesday, April 15, 2020 1:57 PM
To: Slurm User Community List 
Subject: [slurm-users] Munge decode failing on new node

I've installed two new nodes onto my slurm cluster.  One node works, but the 
other one complains about an invalid credential for munge.  I've verified that 
the munge.key is the same as on all other nodes with

sudo cksum /etc/munge/munge.key

I recopied a munge.key from a node that works.  I've verified that munge uid 
and gid are the same on the nodes.  The time is in sync on all nodes.

Here is what is in the slurmd.log:

 error: Unable to register: Unable to contact slurm controller (connect failure)
 error: Munge decode failed: Invalid credential
 ENCODED: Wed Dec 31 17:00:00 1969
 DECODED: Wed Dec 31 17:00:00 1969
 error: authentication: Invalid authentication credential
 error: slurm_receive_msg_and_forward: Protocol authentication error
 error: service_connection: slurm_receive_msg: Protocol authentication error
 error: Unable to register: Unable to contact slurm controller (connect failure)

I've checked in the munged.log and all it says is

Invalid credential

Thanks for your help


Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Riebs, Andy
When you say “distinct compute nodes,” are they at least on the same network 
fabric?

If so, the first thing I’d try would be to create a new partition that 
encompasses all of the nodes of the other two partitions.
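
For example, a slurm.conf sketch (the partition and node names are placeholders for your own):

PartitionName=combined Nodes=partA[01-16],partB[01-16] Default=NO MaxTime=INFINITE State=UP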

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of CB
Sent: Monday, March 23, 2020 11:32 AM
To: Slurm User Community List 
Subject: [slurm-users] Running an MPI job across two partitions

Hi,

I'm running Slurm 19.05 version.

Is there any way to launch an MPI job on a group of distributed  nodes from two 
or more partitions, where each partition has distinct compute nodes?

I've looked at the heterogeneous job support, but it creates two separate jobs.

If there is no such capability with the current Slurm, I'd like to hear any 
recommendations or suggestions.

Thanks,
Chansup


Re: [slurm-users] SLURM with OpenMPI

2019-12-15 Thread Riebs, Andy
Agreed -- I do this frequently. (Be sure you've exported those variables, 
though!)
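
A minimal sketch (the MCA parameter and its value are only an illustration, and ./mpi_app is a placeholder):

$ export OMPI_MCA_btl=self,vader,tcp
$ srun -N2 --ntasks-per-node=4 ./mpi_app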

Andy

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Paul Edmon
Sent: Sunday, December 15, 2019 2:05 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] SLURM with OpenMPI

Yes they should be.

-Paul Edmon-

On 12/15/2019 10:28 AM, Raymond Muno wrote:
> We are new to SLURM, migrating over from SGE.
>
> When launching OpenMPI jobs (version 4.0.2 in this case) via srun, are 
> the MCA parameters followed when they are set via environmental 
> variables, e.g. OMPI_MCA_param?
>
> -Ray Muno
>



Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Riebs, Andy
At the risk of stating the obvious… these seem like the sort of questions that 
could be answered with a 2 minute test. Better yet, not just answered, but with 
answers specific to your configuration ☺
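
For example, a sketch of such a test (paths are placeholders): point Epilog in slurm.conf at a script that just logs, then submit one job that hits its time limit and scancel another.

# slurm.conf:  Epilog=/etc/slurm/epilog.sh
# /etc/slurm/epilog.sh:
#!/bin/bash
echo "$(date) epilog ran for job $SLURM_JOB_ID" >> /tmp/epilog-test.log

$ sbatch -t 00:01:00 --wrap 'sleep 300'   # will exceed its time limit
$ sbatch -t 10:00 --wrap 'sleep 300'      # then scancel this job id by hand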

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Alex Chekholko
Sent: Monday, December 9, 2019 12:53 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Timeout and Epilogue

Hi,

I had asked a similar question recently (maybe a year ago) and also got 
crickets.  I think in our case we were not able to ensure that the epilog 
always ran for different types of job failures, so we just had the users add 
some more cleanup code to the end of their jobs _and_ also run separate cleanup 
jobs.

Regards,
Alex


On Wed, Dec 4, 2019 at 7:29 PM Brian Andrus <toomuc...@gmail.com> wrote:
Quick question:

Is the epilogue script run if a job exceeds its time limits and is being
canceled?

What about just cancelled?

I need to be able to clean up some job-specific files regardless of how
the job ends and I'm not sure epilogue is sufficient.

Brian Andrus



Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-23 Thread Riebs, Andy
Excellent points raised here!  Two other things to do when you see "kill task 
failed":

1. Check "dmesg -T" on the suspect node to look for significant system events, 
like file system problems, communication problems, etc., around the time that 
the problem was logged
2. Check /var/log/slurm (or whatever is appropriate on your system) for core 
files that correspond to the time reported for "kill task failed"
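
For example (log paths vary by site):

$ dmesg -T | egrep -i 'nfs|lustre|hung task|oom|error' | tail -50
$ ls -lt /var/log/slurm/core* 2>/dev/null | head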

Andy

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Marcus Boden
Sent: Wednesday, October 23, 2019 2:34 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Nodes going into drain because of "Kill task failed"

you can also use the UnkillableStepProgram to debug things:

> UnkillableStepProgram
> If the processes in a job step are determined to be unkillable for a 
> period of time specified by the UnkillableStepTimeout variable, the program 
> specified by UnkillableStepProgram will be executed. This program can be used 
> to take special actions to clean up the unkillable processes and/or notify 
> computer administrators. The program will be run SlurmdUser (usually "root") 
> on the compute node. By default no program is run.
> UnkillableStepTimeout
> The length of time, in seconds, that Slurm will wait before deciding that 
> processes in a job step are unkillable (after they have been signaled with 
> SIGKILL) and execute UnkillableStepProgram as described above. The default 
> timeout value is 60 seconds. If exceeded, the compute node will be drained to 
> prevent future jobs from being scheduled on the node.

This lets you find out what is causing the problem at the moment it occurs.
You could, for example, use lsof to see whether any files are open on a
hanging filesystem and mail the output to yourself.
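
A minimal sketch of such a program (the path, filesystem, and mail recipient are placeholders):

# slurm.conf:
#   UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh
#   UnkillableStepTimeout=180

#!/bin/bash
# /usr/local/sbin/unkillable_report.sh
{
    echo "Unkillable step on $(hostname) at $(date)"
    ps auxfww
    lsof /scratch 2>/dev/null        # adjust to the filesystem you suspect
} | mail -s "unkillable step on $(hostname)" root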

Best,
Marcus


On 19-10-22 20:49, Paul Edmon wrote:
> It can also happen if you have a stalled out filesystem or stuck processes. 
> I've gotten in the habit of doing a daily patrol for them to clean them up. 
> Most of the time you can just reopen the node, but sometimes this indicates
> something is wedged.
> 
> -Paul Edmon-
> 
> On 10/22/2019 5:22 PM, Riebs, Andy wrote:
> >   A common reason for seeing this is if a process is dropping core -- the 
> > kernel will ignore job kill requests until that is complete, so the job 
> > isn't being killed as quickly as Slurm would like. I typically recommend 
> > increasing UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to 
> > avoid this.
> > 
> > Andy
> > 
> > -Original Message-
> > From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf 
> > Of Will Dennis
> > Sent: Tuesday, October 22, 2019 4:59 PM
> > To: slurm-users@lists.schedmd.com
> > Subject: [slurm-users] Nodes going into drain because of "Kill task failed"
> > 
> > Hi all,
> > 
> > I have a number of nodes on one of my 17.11.7 clusters in drain mode on 
> > account of reason: "Kill task failed”
> > 
> > I see the following in slurmd.log —
> > 
> > [2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 
> > CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
> > [2019-10-17T20:06:43.029] [34443.0] Sent signal 15 to 34443.0
> > [2019-10-17T20:06:43.029] Job 34443: timeout: sent SIGTERM to 1 active steps
> > [2019-10-17T20:06:43.031] [34443.0] Sent signal 18 to 34443.0
> > [2019-10-17T20:06:43.032] [34443.0] Sent signal 15 to 34443.0
> > [2019-10-17T20:06:43.036] [34443.0] task 0 (8741) exited. Killed by signal 
> > 15.
> > [2019-10-17T20:06:43.036] [34443.0] Step 34443.0 hit memory limit at least 
> > once during execution. This may or may not result in some failure.
> > [2019-10-17T20:07:13.048] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:15.051] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:16.053] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:17.055] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:18.057] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:19.059] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:20.061] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:21.063] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:22.065] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:23.066] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:24.069] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:34.071] [34443.0] Sent SIGKILL signal to 34443.0
> > [2019-10-17T20:07:44.000] [34443.0] error: *** 

[slurm-users] Using the OpenSHMEM reference implementation with Slurm

2019-09-09 Thread Riebs, Andy
Has anyone tried to use the Open SHMEM 1.4 reference implementation (see 
https://github.com/openshmem-org/osss-ucx) with Slurm?  It appears to me that 
the Slurm PMIx implementation needs a few more calls ("publish" and "lookup"), 
but I'd be delighted to be proven wrong!

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024



Re: [slurm-users] sacct thinks slurmctld is not up

2019-07-18 Thread Riebs, Andy
Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to 
hear if there’s a proper fix.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Brian Andrus
Sent: Thursday, July 18, 2019 11:01 AM
To: Slurm User Community List 
Subject: [slurm-users] sacct thinks slurmctld is not up

All,

I have slurmdbd running and everything is (mostly) happy. It's been working 
well for months, but fairly regularly, when I do 'sacctmgr show runaway jobs', 
I get:

sacctmgr: error: Slurmctld running on cluster orion is not up, can't check 
running jobs

if I do 'sacctmgr show cluster', it lists the cluster but has no IP in the 
ControlHost field.

slurmctld is most definitely running (on the same system even), but the only 
fix I find is to restart slurmctld. Then I can check and there is an IP in the 
ControlHost field and I am able to check for runawayjobs.

Is this a known issue? Is there a better fix than restarting slurmctld?

Brian Andrus


Re: [slurm-users] Counting total number of cores specified in the sbatch file

2019-06-08 Thread Riebs, Andy
A quick & easy way to see what your options might be for Slurm environment 
variables is to try a job like this:

$ srun --nodes 2 --ntasks-per-node 6 --pty env | grep SLURM

Or, perhaps, use the “env | grep SLURM” in your batch script.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Brian Andrus
Sent: Saturday, June 8, 2019 1:29 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Counting total number of cores specified in the 
sbatch file


If you are using MPI, the task count should be picked up automatically if 
everything was compiled with Slurm support (e.g. when launching with mpirun).

If you are looking to just get the total tasks, $SLURM_NTASKS is probably what 
you are looking for
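
A sketch of the batch script ("genetic_program" is a placeholder; the fallback covers the case where SLURM_NTASKS is not exported for this combination of directives):

#!/bin/bash
#SBATCH --ntasks-per-node=6
#SBATCH --nodes=2
#SBATCH --mem-per-cpu=2G

# 2 nodes * 6 tasks per node = 12
NTHREADS=${SLURM_NTASKS:-$((SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE))}

genetic_program -num_threads $NTHREADS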



Brian Andrus


On 6/8/2019 2:46 AM, Mahmood Naderan wrote:
Hi,
A genetic program uses -num_threads in command line for parallel run. I use the 
following directives in slurm batch file

#SBATCH --ntasks-per-node=6
#SBATCH --nodes=2
#SBATCH --mem-per-cpu=2G

for 12 processes and 24GB of memory. Is there any slurm variable that counts 
all threads from the directives? So, I can use

-num_threads $SLURM_COUNT

where SLURM_COUNT is 12. Any idea?

Regards,
Mahmood



Re: [slurm-users] final stages of cloud infrastructure set up

2019-05-19 Thread Riebs, Andy
Just looking at this quickly, have you tried specifying “hint=multithread” as 
an sbatch parameter?
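
For example, in the batch script (a sketch):

#SBATCH --hint=multithread

or, equivalently, --hint=multithread on the sbatch command line.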

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
nathan norton
Sent: Saturday, May 18, 2019 6:03 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] final stages of cloud infrastructure set up

Hi,
I am in the process of setting up Slurm using Amazon cloud infrastructure. All 
is going well, I can elastically start and stop nodes when jobs run.  I am 
running into a few small teething issues, that are probably due to me not 
understanding some of the terminology here. At a high level all the nodes given 
to end users in the cloud are hyper threaded, so I want to use my nodes as 
hyper threaded nodes.  All nodes are running centos7 latest. I would also like 
the jobs to be run in a cgroup and not migrate around after they start. As I 
said before, I think most of it is working except for the few issues below.

My use case is, I have an in house built binary application that is single 
threaded and does no message passing or anything like that. The application is 
not memory bound; it is only compute bound.

So on a node I would like to be able to run 16 instances in parallel. As can be 
seen below, if I launch the single app via srun it runs on every hardware thread. 
However, if I run it via the sbatch command it only runs on CPUs 0-7 
instead of CPUs 0-15.

Another question: what would be the best way to retry failed jobs? I can 
rerun the whole batch again, but I only want to rerun a single step in the batch.

Please see below for the output of various commands as well as my slurm.conf 
file as well.

Many thanks
Nathan.

__
btuser@bt_slurm_login001[domain ]% slurmd -V
slurm 18.08.6-2
__
btuser@bt_slurm_login001[domain ]% cat nathan.batch.sh
#!/bin/bash
#SBATCH --job-name=nathan_test
#SBATCH --ntasks=1
#SBATCH --array=1-32
#SBATCH --ntasks-per-core=2
hostname
srun --hint=multithread -n1  --exclusive   --cpu_bind=threads cat 
/proc/self/status | grep -i cpus_allowed_list


btuser@bt_slurm_login001[domain ]%
btuser@bt_slurm_login001[domain ]% sbatch Nathan.batch.sh
Submitted batch job 106491
btuser@bt_slurm_login001[domain ]% cat slurm-106491_*
btuser@bt_slurm_login001[domain ]% cat slurm-106491_*
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  1
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  2
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  3
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  4
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  5
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  6
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  7
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  1
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  3
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  4
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  5
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  6
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:  7
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  0
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  2
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  3
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  5
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  6
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:  7
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  3
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  5
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  6
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:  7
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:  0
__
btuser@bt_slurm_login001[domain ]% srun -n32  --exclusive   --cpu_bind=threads 
cat /proc/self/status | grep -i cpus_allowed_list
Cpus_allowed_list:  12
Cpus_allowed_list:  13
Cpus_allowed_list:  15
Cpus_allowed_list:  0
Cpus_allowed_list:  8
Cpus_allowed_list:  1
Cpus_allowed_list:  9
Cpus_allowed_list:  2
Cpus_allowed_list:  10
Cpus_allowed_list:  11
Cpus_allowed_list:  4
Cpus_allowed_list:  5
Cpus_allowed_list:  6
Cpus_allowed_list:  14
Cpus_allowed_list:  7
Cpus_allowed_list:  3
Cpus_allowed_list:  0
Cpus_allowed_list:  8
Cpus_allowed_list:  1
Cpus_allowed_list:  9
Cpus_allowed_list:  2
Cpus_allowed_list:  10
Cpus_allowed_list:  3
Cpus_allowed_list:  11
Cpus_allowed_list:  4
Cpus_allowed_list:  12
Cpus_allowed_list:  5

Re: [slurm-users] job startup timeouts?

2019-05-02 Thread Riebs, Andy
This proved to be a scaling problem in PMIx; thanks to Artem Polyakov for 
tracking this down (and submitting a fix: 
https://bugs.schedmd.com/show_bug.cgi?id=6932).

Thanks for all the suggestions folks!

Andy

From: Riebs, Andy
Sent: Friday, April 26, 2019 11:24 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] job startup timeouts?

Hi John,

> It's a DNS problem, isn't it?   Seriously though - how long does srun 
> hostname take for a single system?

We're running nscd on all nodes, with an extremely stable list of 
users/accounts, so I think we should be good here.

"time srun hostname" reports on the order of 0.2 seconds, so at least single 
node requests are handled expediently!

Andy

From: John Hearns <hear...@googlemail.com>
Sent: Friday, April 26, 2019 10:56 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Cc:
Subject: Re: [slurm-users] job startup timeouts?
It's a DNS problem, isn't it?   Seriously though - how long does srun hostname 
take for a single system?


On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:
We have 12,000 nodes in our system, 9,600 of which are KNL.  We can
start a parallel application within a few seconds in most cases (when
the machine is dedicated to this task), even at full scale.  So I
don't think there is anything intrinsic to Slurm that would
necessarily be limiting you, though we have seen cases in the past
where arbitrary task distribution has caused controller slow-down
issues as the detailed scheme was parsed.

Do you know if all the slurmstepd's are starting quickly on the
compute nodes?  How is the OS/Slurm/executable delivered to the node?

Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)______


On Fri, Apr 26, 2019 at 7:40 AM Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> Thanks for the quick response Doug!
>
> Unfortunately, I can't be specific about the cluster size, other than to say 
> it's got more than a thousand nodes.
>
> In a separate test that I had missed, even "srun hostname" took 5 minutes to 
> run. So there was no remote file system or MPI involvement.
>
> Andy
>
> -Original Message-
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Douglas Jacobsen
> Sent: Friday, April 26, 2019 9:24 AM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] job startup timeouts?
>
> How large is very large?  Where is the executable being started?  In
> the parallel filesystem/NFS?  If that is the case you may be able to
> trim start times by using sbcast to transfer the executable (and its
> dependencies if dynamically linked) into a node-local resource, such
> as /tmp or /dev/shm depending on your local configuration.
> 
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> Acting Group Lead, Computational Systems Group
> National Energy Research Scientific Computing Center
> dmjacob...@lbl.gov
>
> - __o
> -- _ '\<,_
> --(_)/  (_)__
>
>
> On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.ri...@hpe.com> wrote:
> >
> > Hi All,
> >
> > We've got a very large x86_64 cluster with lots of cores on each node, and 
> > hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on 
> > CentOS 7.6.
> >
> > We have a job that reports
> >
> > srun: error: timeout waiting for task launch, started 0 of xx tasks
> > srun: Job step 291963.0 aborted before step completely launched.
> >
> > when we try to run it at large scale. We anticipate that it could take as 
> > long as 15 minutes for the job to launch, based on our experience with 
> > smaller numbers of nodes.
> >
> > Is there a timeout setting that we're missing that can be changed to 
> > accommodate a lengthy startup time like this?
> >
> > Andy
> >
> > --
> >
> > Andy Riebs
> > andy.ri...@hpe.com
> > Hewlett-Packard Enterprise
> > High Performance Computing Software Engineering
> > +1 404 648 9024
> > My opinions are not necessarily those of HPE
> > May the source be with you!
>



Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Riebs, Andy
Thanks for the quick response Doug!

Unfortunately, I can't be specific about the cluster size, other than to say 
it's got more than a thousand nodes.

In a separate test that I had missed, even "srun hostname" took 5 minutes to 
run. So there was no remote file system or MPI involvement.

Andy

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Douglas Jacobsen
Sent: Friday, April 26, 2019 9:24 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] job startup timeouts?

How large is very large?  Where is the executable being started?  In
the parallel filesystem/NFS?  If that is the case you may be able to
trim start times by using sbcast to transfer the executable (and its
dependencies if dynamically linked) into a node-local resource, such
as /tmp or /dev/shm depending on your local configuration.
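
For example, in the batch script (./my_app is a placeholder):

sbcast ./my_app /tmp/my_app
srun /tmp/my_app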

Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)__


On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs  wrote:
>
> Hi All,
>
> We've got a very large x86_64 cluster with lots of cores on each node, and 
> hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on 
> CentOS 7.6.
>
> We have a job that reports
>
> srun: error: timeout waiting for task launch, started 0 of xx tasks
> srun: Job step 291963.0 aborted before step completely launched.
>
> when we try to run it at large scale. We anticipate that it could take as 
> long as 15 minutes for the job to launch, based on our experience with 
> smaller numbers of nodes.
>
> Is there a timeout setting that we're missing that can be changed to 
> accommodate a lengthy startup time like this?
>
> Andy
>
> --
>
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!



Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-02-01 Thread Riebs, Andy
Given the extreme amount of output that will be generated for potentially a 
couple hundred job runs, I was hoping that someone would say “Seen it, here’s 
how to fix it.” Guess I’ll have to go with the “high output” route.

Thanks Doug!

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Doug Meyer
Sent: Thursday, January 31, 2019 8:46 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

Perhaps launch with srun -vvv to get maximum verbose messages as srun works 
through the job.
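
For example (./my_app is a placeholder):

$ srun -vvv -N2 -n16 ./my_app 2> srun-debug.log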

Doug

On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs <andy.ri...@hpe.com> wrote:
Hi All,

Just checking to see if this sounds familiar to anyone.

Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)

We typically run about 100 tests/night, selected from a handful of favorites. 
For roughly 1 in 300 test runs, we see one of two mysterious failures:

1. The 5 minute cancellation

A job will be rolling along, generating its expected output, and then this 
message appears:
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 
***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
sacct reports
JobID    Start                End                  ExitCode  State
-------  -------------------  -------------------  --------  ---------
3418     2019-01-29T05:54:07  2019-01-29T05:59:16  0:9       FAILED
These failures consistently happen at just about 5 minutes into the run when 
they happen.

2. The random cancellation

As above, a job will be generating the expected output, and then we see
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 
***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
But this time, sacct reports
JobID    Start                End                  ExitCode  State
-------  -------------------  -------------------  --------  ---------
3531     2019-01-30T07:21:25  2019-01-30T07:35:50  0:0       COMPLETED
3531.0   2019-01-30T07:21:27  2019-01-30T07:35:56  0:15      CANCELLED
I think we've seen these cancellations pop up as soon as a minute or two into 
the test run, up to perhaps 20 minutes into the run.

The only thing slightly unusual in our job submissions is that we use srun's 
"--immediate=120" so that the scripts can respond appropriately if a node goes 
down.

With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the 
slurmctld or slurmd logs.

Any thoughts on what might be happening, or what I might try next?

Andy



--

Andy Riebs

andy.ri...@hpe.com

Hewlett-Packard Enterprise

High Performance Computing Software Engineering

+1 404 648 9024

My opinions are not necessarily those of HPE

May the source be with you!


Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Riebs, Andy
The /etc/munge/munge.key is different on the systems. Try
md5sum /etc/munge/munge.key on both systems to see if they are the same...

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
+1 404 648 9024


From: slurm-users  on behalf of Eric F. 
Alemany 
Sent: Monday, May 7, 2018 7:40:53 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Nodes are down after 2-3 minutes.

Hi Chris,

I followed the link as well as the instructions on “Securing the installation” 
and “Testing the installation”.

The only thing that I am not able to do is:  Check if a credential can be 
remotely decoded


eric@radoncmaster:/etc/munge$ munge -n | ssh e...@radonc01.stanford.edu unmunge
unmunge: Error: Invalid credential


Did I do something wrong somewhere?

Thank you for your help
_

Eric F.  Alemany
System Administrator for Research

Division of Radiation & Cancer  Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel:1-650-498-7969  No Texting
Fax:1-650-723-7382



On May 7, 2018, at 3:42 PM, Chris Samuel wrote:

On Tuesday, 8 May 2018 8:38:47 AM AEST Eric F. Alemany wrote:

I thought I did, but I will do it again

If that doesn't work then check the "Securing the Installation" and "Testing
the Installation" parts of the munge docs here (ignore the installation part):

https://github.com/dun/munge/wiki/Installation-Guide

Good luck!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC