Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-13 Thread Ole Holm Nielsen

Hi Janna,

If you're running an old Slurm version, you may be hitting bugs that have 
already been resolved in later versions.  You can search for bugs with 
ReqNodeNotAvail in the title:

https://bugs.schedmd.com/buglist.cgi?quicksearch=ReqNodeNotAvail

For example, this one might be relevant:
https://bugs.schedmd.com/show_bug.cgi?id=9257

Upgrading to Slurm 20.02 is highly recommended.

/Ole





Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-12 Thread Ole Holm Nielsen
In case your ARP cache is the problem, there is some advice on the Wiki 
page:

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

I think there are other causes for ReqNodeNotAvail, for example, the 
node being allocated to other jobs.  The "scontrol show node" and 
"scontrol show job" commands should reveal more details.
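
A minimal sketch of how the two scontrol queries could be collected side by side for one affected job and one of its nodes. It assumes scontrol is on the PATH and that the output consists of whitespace-separated Key=Value tokens; the helper names (parse_fields, show) are hypothetical, and the field names printed (JobState, Reason, ReqNodeList, State) should be checked against the output of your own Slurm version. Values that contain spaces are cut at the first space, so read the raw output when a Reason looks truncated.

```python
import subprocess
import sys


def parse_fields(text):
    """Split scontrol-style output into a dict of Key=Value tokens.

    Only the first '=' in a token separates key from value, so a value
    such as 'cpu=1,mem=4G' is kept whole. The first occurrence of a key wins.
    """
    fields = {}
    for token in text.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields.setdefault(key, value)
    return fields


def show(kind, name):
    """Run 'scontrol show <kind> <name>' and return the parsed fields."""
    result = subprocess.run(
        ["scontrol", "show", kind, name],
        capture_output=True, text=True, check=True,
    )
    return parse_fields(result.stdout)


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_job.py <jobid> <nodename>
    job = show("job", sys.argv[1])
    node = show("node", sys.argv[2])
    for key in ("JobState", "Reason", "ReqNodeList"):
        print(f"job  {key}: {job.get(key, '(not present)')}")
    for key in ("State", "Reason"):
        print(f"node {key}: {node.get(key, '(not present)')}")
```

Comparing the job's Reason and requested node list with the node's State is usually enough to see whether the scheduler and sinfo disagree about the node.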


/Ole






Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-10 Thread mercan

Hi Janna;

It sounds like an ARP cache table problem to me. If your Slurm head node 
can reach ~1000 or more network devices (all connected network cards, 
switches, etc., even if they are reachable through different ports of 
the server), you need to increase some network settings on the head node 
and on any servers that can reach the same number of network devices:


http://docs.adaptivecomputing.com/torque/5-0-3/Content/topics/torque/12-appendices/otherConsiderations.htm
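
A rough way to check whether a host is near its ARP limits, as a sketch only. It assumes a Linux host, and it assumes the "network settings" in question are the kernel neighbour-table thresholds gc_thresh2 (soft limit) and gc_thresh3 (hard limit) under /proc/sys/net/ipv4/neigh/default; the linked pages are the authority on which values to change. The function names are hypothetical.

```python
from pathlib import Path

ARP_TABLE = Path("/proc/net/arp")
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(arp_table_text):
    """Count entries in /proc/net/arp text; the first line is a header."""
    lines = [line for line in arp_table_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def classify(entries, gc_thresh2, gc_thresh3):
    """Compare the entry count with the soft and hard thresholds."""
    if entries >= gc_thresh3:
        return "hard-limit"   # the kernel refuses to grow the table further
    if entries >= gc_thresh2:
        return "soft-limit"   # garbage collection becomes aggressive
    return "ok"


def read_threshold(name):
    """Read one integer threshold, e.g. 'gc_thresh3', from /proc/sys."""
    return int((NEIGH_DIR / name).read_text().strip())


if __name__ == "__main__":
    entries = count_arp_entries(ARP_TABLE.read_text())
    soft = read_threshold("gc_thresh2")
    hard = read_threshold("gc_thresh3")
    print(f"ARP entries: {entries}  gc_thresh2: {soft}  gc_thresh3: {hard}")
    print("status:", classify(entries, soft, hard))
```

Run it on the head node while the problem is occurring; a count close to the thresholds would support the ARP explanation, while a low count points elsewhere.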

There is also some advice for big clusters in the Slurm documentation:

https://slurm.schedmd.com/big_sys.html

Regards,

Ahmet M.







Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-10 Thread Chris Samuel
On Friday, 10 July 2020 3:34:44 PM PDT Janna Ore Nugent wrote:

> I’ve got an intermittent situation with gpu nodes that sinfo says are
> available and idle, but squeue reports as “ReqNodeNotAvail”.  We’ve cycled
> the nodes to restart services but it hasn’t helped.  Any suggestions for
> resolving this or digging into it more deeply?

What does "scontrol show job $JOB" say for an affected job, and what does 
"scontrol show node $NODE" look like for one of these nodes?

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-10 Thread Janna Ore Nugent
Hi All,

I’ve got an intermittent situation with gpu nodes that sinfo says are available 
and idle, but squeue reports as “ReqNodeNotAvail”.  We’ve cycled the nodes to 
restart services but it hasn’t helped.  Any suggestions for resolving this or 
digging into it more deeply?

Thanks,
Janna

Janna Nugent, MS
Sr. Computational Genomics Specialist
Research Computing Services
Northwestern University
www.it.northwestern.edu/research/
janna.nug...@northwestern.edu