[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Uwe Sauter


> Yes, this is possible, but I would say it's discouraged to do so.
> With RHEL/CentOS 7 you really should be using firewalld, and forget about the 
> old iptables.  Here's a nice introduction:
> https://www.certdepot.net/rhel7-get-started-firewalld/
> 
> Having worked with firewalld for a while now, I find it more flexible to use. 
> Admittedly, there is a bit of a learning
> curve.

I disagree. Firewalld might be better in dynamic environments where you need to 
reconfigure your firewall automatically, but for static services I find iptables 
much easier.

But this discussion is as moot as the question of whether SysVInit or systemd is 
better. Only with iptables vs.
firewalld do you still have a choice.

>> The crucial part is to ensure that either firewalld *or* iptables is running 
>> but not both. Or you could run without
>> firewall at
>> all *if* you trust your network…
> 
> Agreed.  The compute node network *has to be* trusted in order for Slurm to 
> work.
> 
> /Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


On 07/06/2017 04:31 PM, Uwe Sauter wrote:


Alternatively you can

   systemctl disable firewalld.service

   systemctl mask firewalld.service

   yum install iptables-services

   systemctl enable iptables.service ip6tables.service

and configure iptables in /etc/sysconfig/iptables and 
/etc/sysconfig/ip6tables, then

   systemctl start iptables.service ip6tables.service


Yes, this is possible, but I would say it's discouraged to do so.
With RHEL/CentOS 7 you really should be using firewalld, and forget 
about the old iptables.  Here's a nice introduction: 
https://www.certdepot.net/rhel7-get-started-firewalld/


Having worked with firewalld for a while now, I find it more flexible to 
use. Admittedly, there is a bit of a learning curve.
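
For what it's worth, a minimal firewalld sketch for Slurm (assuming the default 
slurmctld/slurmd ports 6817/6818 and using 10.1.0.0/24 as a placeholder for the 
cluster subnet; adjust both to your slurm.conf and network):

   firewall-cmd --permanent --add-port=6817/tcp    # slurmctld, on the controller
   firewall-cmd --permanent --add-port=6818/tcp    # slurmd, on the compute nodes
   # or simply trust the whole cluster subnet, which also covers the ephemeral
   # ports that srun listens on
   firewall-cmd --permanent --zone=trusted --add-source=10.1.0.0/24
   firewall-cmd --reload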



The crucial part is to ensure that either firewalld *or* iptables is running 
but not both. Or you could run without firewall at
all *if* you trust your network…


Agreed.  The compute node network *has to be* trusted in order for Slurm 
to work.


/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Uwe Sauter

Alternatively you can

  systemctl disable firewalld.service

  systemctl mask firewalld.service

  yum install iptables-services

  systemctl enable iptables.service ip6tables.service

and configure iptables in /etc/sysconfig/iptables and 
/etc/sysconfig/ip6tables, then

  systemctl start iptables.service ip6tables.service
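
As a concrete illustration, /etc/sysconfig/iptables could look roughly like this 
(a sketch only: 10.1.0.0/24 stands in for your cluster subnet, and the ruleset 
simply trusts that whole subnet, in line with the "trusted network" point below):

   *filter
   :INPUT ACCEPT [0:0]
   :FORWARD ACCEPT [0:0]
   :OUTPUT ACCEPT [0:0]
   -A INPUT -i lo -j ACCEPT
   -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
   # trust the cluster network (slurmctld, slurmd and srun traffic)
   -A INPUT -s 10.1.0.0/24 -j ACCEPT
   # keep SSH reachable
   -A INPUT -p tcp --dport 22 -j ACCEPT
   -A INPUT -j REJECT --reject-with icmp-host-prohibited
   COMMIT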



The crucial part is to ensure that either firewalld *or* iptables is running 
but not both. Or you could run without firewall at
all *if* you trust your network…
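
A quick way to verify that only one of them is active (run on every node; a sketch):

   systemctl is-active firewalld iptables ip6tables
   systemctl is-enabled firewalld iptables ip6tables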




On 06.07.2017 at 14:12, Ole Holm Nielsen wrote:
> 
> Firewall problems, like I suggested initially!  Nmap is a great tool for 
> probing open ports!
> 
> The iptables *must not* be configured on CentOS 7, you *must* use firewalld.  
> See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>  for Slurm configurations.
> 
> /Ole
> 
> On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
>> John and Others,
>>
>>
>> Thank you very much for your support. The problem is finally solved.
>>
>>
>> After installing nmap, I realized that some ports were blocked even 
>> with the firewall daemon stopped and disabled. It turned out
>> that iptables was on and enabled. After stopping iptables everything works 
>> just fine.
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>>
>> ----------------
>> *From:* John Hearns <hear...@googlemail.com>
>> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Said, you are not out of ideas.
>>
>> I would suggest 'nmap' as a good tool to start with.   Install nmap on your 
>> compute node and see which ports are open on the
>> controller node
>>
>> Also do we have a DNS name resolution problem here?
>> I always remember Sun Grid Engine as being notoriously sensitive to name 
>> resolution, and that was my first question when any SGE
>> problem was reported.
>> So a couple of questions:
>>
>> On the controller node and on the compute node run this:
>> hostname
>> hostname -f
>>
>> Do the cluster controller node or the compute nodes have more than one 
>> network interface.
>> I bet the cluster controller node does!   From the compute node, do an 
>> nslookup or a dig  and see what the COMPUTE NODE thinks
>> are the names of both of those interfaces.
>>
>> Also as Rajul says - how are you making sure that both controller and 
>> compute nodes have the same slurm.conf file?
>> Actually if the slurm.conf files are different this will be logged when the 
>> compute node starts up, but let us check everything.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp 
>> <mailto:said.moha...@oist.jp>> wrote:
>>
>> Even after reinstalling everything from the beginning the problem is
>> still there. Right now I am out of Ideas.
>>
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>>
>> 
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Thank you all for your suggestions, the only thing I can do for now
>> is to uninstall and install from the beginning and I will use the
>> most recent version of slurm on both nodes.
>>
>> For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that can solve the issue.
>> 
>> *From:* Rajul Kumar <kumar.r...@husky.neu.edu
>> <mailto:kumar.r...@husky.neu.edu>>
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Sorry for the typo
>> It's generally when one of the controller or compute can reach the
>> other one but it's *not* happening vice-versa.
>>
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
>> <kumar.r...@husky.neu.edu <mailto:kumar.r...@husky.neu.edu>> wrote:
>>
>> I came across the same problem sometime back. It's generally
>> when one of the controller or compute can reach to other one but
>> it's happening vice-versa.
>>

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Said Mohamed Said
Yes Nielsen, my biggest mistake was assuming that iptables wouldn't be used in 
CentOS 7.
But I am relieved that I found the solution with the help of all of you. I 
really appreciate it.

Best Regards,
Said.

From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
Sent: Thursday, July 6, 2017 9:11:54 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


Firewall problems, like I suggested initially!  Nmap is a great tool for
probing open ports!

The iptables *must not* be configured on CentOS 7, you *must* use
firewalld.  See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
for Slurm configurations.

/Ole

On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
> John and Others,
>
>
> Thank you very much for your support. The problem is finally solved.
>
>
> After installing nmap, I realized that some ports were blocked
> even with the firewall daemon stopped and disabled. It turned out that iptables
> was on and enabled. After stopping iptables everything works just fine.
>
>
>
> Best Regards,
>
>
> Said.
>
> 
> *From:* John Hearns <hear...@googlemail.com>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with.   Install nmap on
> your compute node and see which ports are open on the controller node
>
> Also do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface.
> I bet the cluster controller node does!   From the compute node, do an
> nslookup or a dig  and see what the COMPUTE NODE thinks are the names of
> both of those interfaces.
>
> Also as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
>
>
>
>
>
>
>
>
>
> On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp
> <mailto:said.moha...@oist.jp>> wrote:
>
> Even after reinstalling everything from the beginning the problem is
> still there. Right now I am out of Ideas.
>
>
>
>
> Best Regards,
>
>
> Said.
>
>     
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Thank you all for your suggestions, the only thing I can do for now
> is to uninstall and install from the beginning and I will use the
> most recent version of slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that can solve the issue.
> ----
> *From:* Rajul Kumar <kumar.r...@husky.neu.edu
> <mailto:kumar.r...@husky.neu.edu>>
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
> Sorry for the typo
> It's generally when one of the controller or compute can reach the
> other one but it's *not* happening vice-versa.
>
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
> <kumar.r...@husky.neu.edu <mailto:kumar.r...@husky.neu.edu>> wrote:
>
> I came across the same problem sometime back. It's generally
> when one of the controller or compute can reach to other one but
> it's happening vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping to each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the
> controller says it's down
> - TCP connections are not being dropped
> - Ports are accessible that are to be used for communication,
> specifically response ports
> - Check the routing rules if any
> - Clocks are synced across
> - Hope there isn't any version mis

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


Firewall problems, like I suggested initially!  Nmap is a great tool for 
probing open ports!


The iptables *must not* be configured on CentOS 7, you *must* use 
firewalld.  See 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons 
for Slurm configurations.


/Ole

On 07/06/2017 01:22 PM, Said Mohamed Said wrote:

John and Others,


Thank you very much for your support. The problem is finally solved.


After installing nmap, I realized that some ports were blocked 
even with the firewall daemon stopped and disabled. It turned out that iptables 
was on and enabled. After stopping iptables everything works just fine.




Best Regards,


Said.


*From:* John Hearns <hear...@googlemail.com>
*Sent:* Thursday, July 6, 2017 6:47:48 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with.   Install nmap on 
your compute node and see which ports are open on the controller node


Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name 
resolution, and that was my first question when any SGE problem was 
reported.

So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one 
network interface.
I bet the cluster controller node does!   From the compute node, do an 
nslookup or a dig  and see what the COMPUTE NODE thinks are the names of 
both of those interfaces.


Also as Rajul says - how are you making sure that both controller and 
compute nodes have the same slurm.conf file?
Actually if the slurm.conf files are different this will be logged when 
the compute node starts up, but let us check everything.










On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp 
<mailto:said.moha...@oist.jp>> wrote:


Even after reinstalling everything from the beginning the problem is
still there. Right now I am out of Ideas.




Best Regards,


Said.


*From:* Said Mohamed Said
*Sent:* Thursday, July 6, 2017 2:23:05 PM
*To:* slurm-dev
    *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP

Thank you all for your suggestions, the only thing I can do for now
is to uninstall and install from the beginning and I will use the
most recent version of slurm on both nodes.

For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

*From:* Rajul Kumar <kumar.r...@husky.neu.edu
<mailto:kumar.r...@husky.neu.edu>>
*Sent:* Thursday, July 6, 2017 12:41:51 AM
*To:* slurm-dev
    *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Sorry for the typo
It's generally when one of the controller or compute can reach the
other one but it's *not* happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
<kumar.r...@husky.neu.edu <mailto:kumar.r...@husky.neu.edu>> wrote:

I came across the same problem sometime back. It's generally
when one of the controller or compute can reach to other one but
it's happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the
controller says it's down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication,
specifically response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look
(doesn't recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
<hear...@googlemail.com <mailto:hear...@googlemail.com>> wrote:

Said,
a problem like this always has a simple cause. We share
your frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your
situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages
closely
b) start at the lowest level, ie here the network and do NOT
assume that any component is working
c) look at all the log files closely
d) start daemon processes 

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
Said,
thank you for letting us know.
I'm going to blame this one on systemd.  Just because I can.


On 6 July 2017 at 13:22, Said Mohamed Said <said.moha...@oist.jp> wrote:

> John and Others,
>
>
> Thank you very much for your support. The problem is finally solved.
>
>
> After installing nmap, I realized that some ports were blocked even
> with the firewall daemon stopped and disabled. It turned out that iptables was on
> and enabled. After stopping iptables everything works just fine.
>
>
>
> Best Regards,
>
>
> Said.
> --
> *From:* John Hearns <hear...@googlemail.com>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with.   Install nmap on
> your compute node and see which ports are open on the controller node
>
> Also do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface.
> I bet the cluster controller node does!   From the compute node, do an
> nslookup or a dig  and see what the COMPUTE NODE thinks are the names of
> both of those interfaces.
>
> Also as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
>
>
>
>
>
>
>
>
>
> On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp> wrote:
>
>> Even after reinstalling everything from the beginning the problem is
>> still there. Right now I am out of Ideas.
>>
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>> --
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>
>> Thank you all for your suggestions, the only thing I can do for now is to
>> uninstall and install from the beginning and I will use the most recent
>> version of slurm on both nodes.
>>
>> For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that can solve the issue.
>> --
>> *From:* Rajul Kumar <kumar.r...@husky.neu.edu>
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Sorry for the typo
>> It's generally when one of the controller or compute can reach the other
>> one but it's *not* happening vice-versa.
>>
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu>
>> wrote:
>>
>>> I came across the same problem sometime back. It's generally when one of
>>> the controller or compute can reach to other one but it's happening
>>> vice-versa.
>>>
>>> Have a look at the following points:
>>> - controller and compute can ping to each other
>>> - both share the same slurm.conf
>>> - slurm.conf has the location of both controller and compute
>>> - slurm services are running on the compute node when the controller
>>> says it's down
>>> - TCP connections are not being dropped
>>> - Ports are accessible that are to be used for communication,
>>> specifically response ports
>>> - Check the routing rules if any
>>> - Clocks are synced across
>>> - Hope there isn't any version mismatch but still have a look (doesn't
>>> recognize the nodes for major version differences)
>>>
>>> Hope this helps.
>>>
>>> Best,
>>> Rajul
>>>
>>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com>
>>> wrote:
>>>
>>>> Said,
>>>>a problem like this always has a simple cause. We share your
>>>> frustration, and several people here have offered help.
>>>> So please do not get discouraged. We have all been in your situation!
>>>>
>>>> The only way to handle problems like this is
>>>> a) start at the beginning and read the manuals and web

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Said Mohamed Said
John and Others,


Thank you very much for your support. The problem is finally solved.


After installing nmap, I realized that some ports were blocked even with 
the firewall daemon stopped and disabled. It turned out that iptables was on and 
enabled. After stopping iptables everything works just fine.



Best Regards,


Said.


From: John Hearns <hear...@googlemail.com>
Sent: Thursday, July 6, 2017 6:47:48 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with.   Install nmap on your 
compute node and see which ports are open on the controller node

Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name 
resolution, and that was my first question when any SGE problem was reported.
So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one network 
interface.
I bet the cluster controller node does!   From the compute node, do an nslookup 
or a dig  and see what the COMPUTE NODE thinks are the names of both of those 
interfaces.

Also as Rajul says - how are you making sure that both controller and compute 
nodes have the same slurm.conf file?
Actually if the slurm.conf files are different this will be logged when the 
compute node starts up, but let us check everything.









On 6 July 2017 at 11:37, Said Mohamed Said 
<said.moha...@oist.jp<mailto:said.moha...@oist.jp>> wrote:

Even after reinstalling everything from the beginning the problem is still 
there. Right now I am out of Ideas.




Best Regards,


Said.


From: Said Mohamed Said
Sent: Thursday, July 6, 2017 2:23:05 PM
To: slurm-dev
Subject: Re: [slurm-dev] Re: SLURM ERROR! NEED HELP


Thank you all for your suggestions, the only thing I can do for now is to 
uninstall and install from the beginning and I will use the most recent version 
of slurm on both nodes.

For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

From: Rajul Kumar <kumar.r...@husky.neu.edu<mailto:kumar.r...@husky.neu.edu>>
Sent: Thursday, July 6, 2017 12:41:51 AM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Sorry for the typo
It's generally when one of the controller or compute can reach the other one 
but it's not happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar 
<kumar.r...@husky.neu.edu<mailto:kumar.r...@husky.neu.edu>> wrote:
I came across the same problem sometime back. It's generally when one of the 
controller or compute can reach to other one but it's happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller says it's 
down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication, specifically 
response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look (doesn't 
recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns 
<hear...@googlemail.com<mailto:hear...@googlemail.com>> wrote:
Said,
   a problem like this always has a simple cause. We share your frustration, 
and several people here have offered help.
So please do not get discouraged. We have all been in your situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT assume that any 
component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network 
adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam Huffman's suggestion 
of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also 'ntpq 
-p'
Are you SURE the master node and the node OBU-N6   are both connecting to an 
NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said 
<said.moha...@oist.jp<mailto:said.moha...@oist.jp>> wrote:
Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said

From: Feli

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with.   Install nmap on your
compute node and see which ports are open on the controller node.
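
Something along these lines (a sketch; 6817/6818 are the default slurmctld/slurmd
ports and the hostname is only an example, adjust to your slurm.conf):

   yum install -y nmap
   nmap -p 6817,6818 controller.example.org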

Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name
resolution, and that was my first question when any SGE problem was
reported.
So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one
network interface.
I bet the cluster controller node does!   From the compute node, do an
nslookup or a dig  and see what the COMPUTE NODE thinks are the names of
both of those interfaces.
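
For example (a sketch; the hostname and address are placeholders):

   dig +short controller.example.org
   dig +short -x 10.1.0.1
   nslookup controller.example.org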

Also as Rajul says - how are you making sure that both controller and
compute nodes have the same slurm.conf file?
Actually if the slurm.conf files are different this will be logged when the
compute node starts up, but let us check everything.
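
A quick way to compare them (a sketch; adjust the path and node name to your setup):

   md5sum /etc/slurm/slurm.conf
   ssh obu-n6 md5sum /etc/slurm/slurm.conf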









On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp> wrote:

> Even after reinstalling everything from the beginning the problem is still
> there. Right now I am out of Ideas.
>
>
>
>
> Best Regards,
>
>
> Said.
> --
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> Thank you all for your suggestions, the only thing I can do for now is to
> uninstall and install from the beginning and I will use the most recent
> version of slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that can solve the issue.
> --
> *From:* Rajul Kumar <kumar.r...@husky.neu.edu>
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Sorry for the typo
> It's generally when one of the controller or compute can reach the other
> one but it's *not* happening vice-versa.
>
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu>
> wrote:
>
>> I came across the same problem sometime back. It's generally when one of
>> the controller or compute can reach to other one but it's happening
>> vice-versa.
>>
>> Have a look at the following points:
>> - controller and compute can ping to each other
>> - both share the same slurm.conf
>> - slurm.conf has the location of both controller and compute
>> - slurm services are running on the compute node when the controller says
>> it's down
>> - TCP connections are not being dropped
>> - Ports are accessible that are to be used for communication,
>> specifically response ports
>> - Check the routing rules if any
>> - Clocks are synced across
>> - Hope there isn't any version mismatch but still have a look (doesn't
>> recognize the nodes for major version differences)
>>
>> Hope this helps.
>>
>> Best,
>> Rajul
>>
>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com>
>> wrote:
>>
>>> Said,
>>>a problem like this always has a simple cause. We share your
>>> frustration, and several people here have offered help.
>>> So please do not get discouraged. We have all been in your situation!
>>>
>>> The only way to handle problems like this is
>>> a) start at the beginning and read the manuals and webpages closely
>>> b) start at the lowest level, ie here the network and do NOT assume that
>>> any component is working
>>> c) look at all the log files closely
>>> d) start daemon processes in a terminal with any 'verbose' flags set
>>> e) then start on more low-level diagnostics, such as tcpdump of network
>>> adapters and straces of the processes and gstacks
>>>
>>>
>>> you have been doing steps a b and c very well
>>> I suggest staying with these - I myself am going for Adam Huffman's
>>> suggestion of the NTP clock times.
>>> Are you SURE that on all nodes you have run the 'date' command and also
>>> 'ntpq -p'
>>> Are you SURE the master node and the node OBU-N6   are both connecting
>>> to an NTP server?   ntpq -p will tell you that
>>>
>>>
>>> And do not lose heart.  This is how we all learn.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.j

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


Have you followed my Wiki for installing Slurm on CentOS 7?
This has worked for us: https://wiki.fysik.dtu.dk/niflheim/SLURM

If your problems are caused by your network setup, then it's almost 
impossible for external people to help you...


/Ole

On 07/06/2017 11:38 AM, Said Mohamed Said wrote:
Even after reinstalling everything from the beginning the problem is 
still there. Right now I am out of Ideas.





Best Regards,


Said.


*From:* Said Mohamed Said
*Sent:* Thursday, July 6, 2017 2:23:05 PM
*To:* slurm-dev
*Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP

Thank you all for your suggestions, the only thing I can do for now is 
to uninstall and install from the beginning and I will use the most 
recent version of slurm on both nodes.


For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

*From:* Rajul Kumar <kumar.r...@husky.neu.edu>
*Sent:* Thursday, July 6, 2017 12:41:51 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Sorry for the typo
It's generally when one of the controller or compute can reach the other 
one but it's *not* happening vice-versa.



On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu 
<mailto:kumar.r...@husky.neu.edu>> wrote:


I came across the same problem sometime back. It's generally when
one of the controller or compute can reach to other one but it's
happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller
says it's down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication,
specifically response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look
(doesn't recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com
<mailto:hear...@googlemail.com>> wrote:

Said,
a problem like this always has a simple cause. We share your
frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your
situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT
assume that any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose'
flags set
e) then start on more low-level diagnostics, such as tcpdump of
network adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam
Huffman's suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command
and also 'ntpq -p'
Are you SURE the master node and the node OBU-N6   are both
connecting to an NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.jp
<mailto:said.moha...@oist.jp>> wrote:

Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said


*From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de
<mailto:felix.willenb...@uni-oldenburg.de>>
*Sent:* Wednesday, July 5, 2017 9:07:05 PM

        *To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
When the nodes change to the down state, what is 'sinfo -R'
saying? Sometimes it gives you a reason for that.

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:

Thank you Adam, For NTP I did that as well before posting
but didn't fix the issue.

Regards,
Said


*From:* Adam Huffman <adam.huff...@gmail.com>
<mailto:adam.huff...@gmail.com>
*Sent:* Wednesday, July 

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Said Mohamed Said
Even after reinstalling everything from the beginning the problem is still 
there. Right now I am out of Ideas.




Best Regards,


Said.


From: Said Mohamed Said
Sent: Thursday, July 6, 2017 2:23:05 PM
To: slurm-dev
Subject: Re: [slurm-dev] Re: SLURM ERROR! NEED HELP


Thank you all for your suggestions, the only thing I can do for now is to 
uninstall and install from the beginning and I will use the most recent version 
of slurm on both nodes.

For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

From: Rajul Kumar <kumar.r...@husky.neu.edu>
Sent: Thursday, July 6, 2017 12:41:51 AM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Sorry for the typo
It's generally when one of the controller or compute can reach the other one 
but it's not happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar 
<kumar.r...@husky.neu.edu<mailto:kumar.r...@husky.neu.edu>> wrote:
I came across the same problem sometime back. It's generally when one of the 
controller or compute can reach to other one but it's happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller says it's 
down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication, specifically 
response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look (doesn't 
recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns 
<hear...@googlemail.com<mailto:hear...@googlemail.com>> wrote:
Said,
   a problem like this always has a simple cause. We share your frustration, 
and several people here have offered help.
So please do not get discouraged. We have all been in your situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT assume that any 
component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network 
adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam Huffman's suggestion 
of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also 'ntpq 
-p'
Are you SURE the master node and the node OBU-N6   are both connecting to an 
NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said 
<said.moha...@oist.jp<mailto:said.moha...@oist.jp>> wrote:
Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said

From: Felix Willenborg 
<felix.willenb...@uni-oldenburg.de<mailto:felix.willenb...@uni-oldenburg.de>>
Sent: Wednesday, July 5, 2017 9:07:05 PM

To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

When the nodes change to the down state, what is 'sinfo -R' saying? Sometimes 
it gives you a reason for that.

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:
Thank you Adam, For NTP I did that as well before posting but didn't fix the 
issue.

Regards,
Said

From: Adam Huffman <adam.huff...@gmail.com><mailto:adam.huff...@gmail.com>
Sent: Wednesday, July 5, 2017 8:11:03 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


I've seen something similar when node clocks were skewed.

Worth checking that NTP is running and they're all synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said 
<said.moha...@oist.jp><mailto:said.moha...@oist.jp> wrote:
> Thank you all for suggestions. I turned off firewall on both machines but
> still no luck. I can confirm that No managed switch is preventing the nodes
> from communicating. If you check the log file, there is communication for
> about 4mins and then the node state goes down.
> Any other idea?
> 
> From: Ole Holm Nielsen 
> <ole.h.niel...@fysik.dtu.dk><mailto:ole.h.niel...@fysik.dtu.dk>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
&

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Rajul Kumar
Sorry for the typo
It's generally when one of the controller or compute can reach the other
one but it's *not* happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu>
wrote:

> I came across the same problem sometime back. It's generally when one of
> the controller or compute can reach to other one but it's happening
> vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping to each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the controller says
> it's down
> - TCP connections are not being dropped
> - Ports are accessible that are to be used for communication, specifically
> response ports
> - Check the routing rules if any
> - Clocks are synced across
> - Hope there isn't any version mismatch but still have a look (doesn't
> recognize the nodes for major version differences)
>
> Hope this helps.
>
> Best,
> Rajul
>
> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com>
> wrote:
>
>> Said,
>>a problem like this always has a simple cause. We share your
>> frustration, and several people here have offered help.
>> So please do not get discouraged. We have all been in your situation!
>>
>> The only way to handle problems like this is
>> a) start at the beginning and read the manuals and webpages closely
>> b) start at the lowest level, ie here the network and do NOT assume that
>> any component is working
>> c) look at all the log files closely
>> d) start daemon processes in a terminal with any 'verbose' flags set
>> e) then start on more low-level diagnostics, such as tcpdump of network
>> adapters and straces of the processes and gstacks
>>
>>
>> you have been doing steps a b and c very well
>> I suggest staying with these - I myself am going for Adam Huffman's
>> suggestion of the NTP clock times.
>> Are you SURE that on all nodes you have run the 'date' command and also
>> 'ntpq -p'
>> Are you SURE the master node and the node OBU-N6   are both connecting to
>> an NTP server?   ntpq -p will tell you that
>>
>>
>> And do not lose heart.  This is how we all learn.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.jp> wrote:
>>
>>> Sinfo -R gives "NODE IS NOT RESPONDING"
>>> ping gives successful results from both nodes
>>>
>>> I really can not figure out what is causing the problem.
>>>
>>> Regards,
>>> Said
>>> --
>>> *From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de>
>>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>>
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>> When the nodes change to the down state, what is 'sinfo -R' saying?
>>> Sometimes it gives you a reason for that.
>>>
>>> Best,
>>> Felix
>>>
>>> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>>
>>> Thank you Adam, For NTP I did that as well before posting but didn't fix
>>> the issue.
>>>
>>> Regards,
>>> Said
>>> --
>>> *From:* Adam Huffman <adam.huff...@gmail.com> <adam.huff...@gmail.com>
>>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>>> *To:* slurm-dev
>>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>>
>>>
>>> I've seen something similar when node clocks were skewed.
>>>
>>> Worth checking that NTP is running and they're all synchronised.
>>>
>>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>>> <said.moha...@oist.jp> <said.moha...@oist.jp> wrote:
>>> > Thank you all for suggestions. I turned off firewall on both machines
>>> but
>>> > still no luck. I can confirm that No managed switch is preventing the
>>> nodes
>>> > from communicating. If you check the log file, there is communication
>>> for
>>> > about 4mins and then the node state goes down.
>>> > Any other idea?
>>> > 
>>> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>>> <ole.h.niel...@fysik.dtu.dk>
>>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>>> >

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Rajul Kumar
I came across the same problem sometime back. It's generally when one of
the controller or compute can reach to other one but it's happening
vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller says
it's down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication, specifically
response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look (doesn't
recognize the nodes for major version differences)
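
A few of these can be spot-checked quickly from the compute node (a sketch; the
node names are examples):

   ping -c1 controller && ping -c1 obu-n6
   scontrol ping                               # can this node reach slurmctld?
   scontrol show config | grep -i SLURM_VERSION
   sinfo --version
   date; ntpq -p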

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com> wrote:

> Said,
>a problem like this always has a simple cause. We share your
> frustration, and several people here have offered help.
> So please do not get discouraged. We have all been in your situation!
>
> The only way to handle problems like this is
> a) start at the beginning and read the manuals and webpages closely
> b) start at the lowest level, ie here the network and do NOT assume that
> any component is working
> c) look at all the log files closely
> d) start daemon processes in a terminal with any 'verbose' flags set
> e) then start on more low-level diagnostics, such as tcpdump of network
> adapters and straces of the processes and gstacks
>
>
> you have been doing steps a b and c very well
> I suggest staying with these - I myself am going for Adam Huffman's
> suggestion of the NTP clock times.
> Are you SURE that on all nodes you have run the 'date' command and also
> 'ntpq -p'
> Are you SURE the master node and the node OBU-N6   are both connecting to
> an NTP server?   ntpq -p will tell you that
>
>
> And do not lose heart.  This is how we all learn.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.jp> wrote:
>
>> Sinfo -R gives "NODE IS NOT RESPONDING"
>> ping gives successful results from both nodes
>>
>> I really can not figure out what is causing the problem.
>>
>> Regards,
>> Said
>> ------
>> *From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de>
>> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>>
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> When the nodes change to the down state, what is 'sinfo -R' saying?
>> Sometimes it gives you a reason for that.
>>
>> Best,
>> Felix
>>
>> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>>
>> Thank you Adam, For NTP I did that as well before posting but didn't fix
>> the issue.
>>
>> Regards,
>> Said
>> --
>> *From:* Adam Huffman <adam.huff...@gmail.com> <adam.huff...@gmail.com>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all synchronised.
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <said.moha...@oist.jp>
>> <said.moha...@oist.jp> wrote:
>> > Thank you all for suggestions. I turned off firewall on both machines
>> but
>> > still no luck. I can confirm that No managed switch is preventing the
>> nodes
>> > from communicating. If you check the log file, there is communication
>> for
>> > about 4mins and then the node state goes down.
>> > Any other idea?
>> > 
>> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>> <ole.h.niel...@fysik.dtu.dk>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> in my network I encountered that managed switches were preventing
>> >> necessary network communication between the nodes, on which SLURM
>> >> relies. You should check if you're using managed switches to connect
>> >> nodes to the network and if so, if they're blocking communication on
>> >> slurm ports.
>> >
>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>> > switches!  We only have managed Ethernet switches, and they work without
>> > problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some firewall
>> > functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host.  See
>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> >
>> > /Ole
>>
>>
>>
>


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread John Hearns
Said,
   a problem like this always has a simple cause. We share your
frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT assume that
any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network
adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam Huffman's
suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also
'ntpq -p'
Are you SURE the master node and the node OBU-N6   are both connecting to
an NTP server?   ntpq -p will tell you that
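
For instance (a sketch; note that on CentOS 7 the default time daemon is chronyd,
so check that as well if ntpq is not installed):

   date
   ntpq -p
   chronyc sources -v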


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.jp> wrote:

> Sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really can not figure out what is causing the problem.
>
> Regards,
> Said
> --
> *From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de>
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> When the nodes change to the down state, what is 'sinfo -R' saying?
> Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>
> Thank you Adam, For NTP I did that as well before posting but didn't fix
> the issue.
>
> Regards,
> Said
> --
> *From:* Adam Huffman <adam.huff...@gmail.com> <adam.huff...@gmail.com>
> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> I've seen something similar when node clocks were skewed.
>
> Worth checking that NTP is running and they're all synchronised.
>
> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <said.moha...@oist.jp>
> <said.moha...@oist.jp> wrote:
> > Thank you all for suggestions. I turned off firewall on both machines but
> > still no luck. I can confirm that No managed switch is preventing the
> nodes
> > from communicating. If you check the log file, there is communication for
> > about 4mins and then the node state goes down.
> > Any other idea?
> > ________________
> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> <ole.h.niel...@fysik.dtu.dk>
> > Sent: Wednesday, July 5, 2017 7:07:15 PM
> > To: slurm-dev
> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
> >
> >
> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> >> in my network I encountered that managed switches were preventing
> >> necessary network communication between the nodes, on which SLURM
> >> relies. You should check if you're using managed switches to connect
> >> nodes to the network and if so, if they're blocking communication on
> >> slurm ports.
> >
> > Managed switches should permit IP layer 2 traffic just like unmanaged
> > switches!  We only have managed Ethernet switches, and they work without
> > problems.
> >
> > Perhaps you meant that Ethernet switches may perform some firewall
> > functions by themselves?
> >
> > Firewalls must be off between Slurm compute nodes as well as the
> > controller host.  See
> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
> >
> > /Ole
>
>
>


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Felix Willenborg
Which OS are you using on both nodes, and how exactly did you turn
off the firewall?

Best,
Felix

On 05.07.2017 at 16:23, Said Mohamed Said wrote:
> Sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really can not figure out what is causing the problem.
>
> Regards,
> Said
> 
> *From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de>
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>  
> When the nodes change to the down state, what is 'sinfo -R' saying?
> Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>> Thank you Adam, For NTP I did that as well before posting but didn't
>> fix the issue.
>>
>> Regards,
>> Said
>> 
>> *From:* Adam Huffman <adam.huff...@gmail.com>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>  
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all synchronised.
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>> <said.moha...@oist.jp> wrote:
>> > Thank you all for suggestions. I turned off firewall on both
>> machines but
>> > still no luck. I can confirm that No managed switch is preventing
>> the nodes
>> > from communicating. If you check the log file, there is
>> communication for
>> > about 4mins and then the node state goes down.
>> > Any other idea?
>> > 
>> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> in my network I encountered that managed switches were preventing
>> >> necessary network communication between the nodes, on which SLURM
>> >> relies. You should check if you're using managed switches to connect
>> >> nodes to the network and if so, if they're blocking communication on
>> >> slurm ports.
>> >
>> > Managed switches should permit IP layer 2 traffic just like unmanaged
>> > switches!  We only have managed Ethernet switches, and they work
>> without
>> > problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some firewall
>> > functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host.  See
>> >
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> >
>> > /Ole
>



[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Said Mohamed Said
Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said

From: Felix Willenborg <felix.willenb...@uni-oldenburg.de>
Sent: Wednesday, July 5, 2017 9:07:05 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

When the nodes change to the down state, what is 'sinfo -R' saying? Sometimes 
it gives you a reason for that.

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:
Thank you Adam, For NTP I did that as well before posting but didn't fix the 
issue.

Regards,
Said

From: Adam Huffman <adam.huff...@gmail.com><mailto:adam.huff...@gmail.com>
Sent: Wednesday, July 5, 2017 8:11:03 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


I've seen something similar when node clocks were skewed.

Worth checking that NTP is running and they're all synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said 
<said.moha...@oist.jp><mailto:said.moha...@oist.jp> wrote:
> Thank you all for suggestions. I turned off firewall on both machines but
> still no luck. I can confirm that No managed switch is preventing the nodes
> from communicating. If you check the log file, there is communication for
> about 4mins and then the node state goes down.
> Any other idea?
> 
> From: Ole Holm Nielsen 
> <ole.h.niel...@fysik.dtu.dk><mailto:ole.h.niel...@fysik.dtu.dk>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
>> necessary network communication between the nodes, on which SLURM
>> relies. You should check if you're using managed switches to connect
>> nodes to the network and if so, if they're blocking communication on
>> slurm ports.
>
> Managed switches should permit IP layer 2 traffic just like unmanaged
> switches!  We only have managed Ethernet switches, and they work without
> problems.
>
> Perhaps you meant that Ethernet switches may perform some firewall
> functions by themselves?
>
> Firewalls must be off between Slurm compute nodes as well as the
> controller host.  See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> /Ole



[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Mehdi Denou
Did you try to ping the compute node from the controller node and the 
other way around?



On 07/05/2017 01:07 PM, Said Mohamed Said wrote:
Thank you all for suggestions. I turned off firewall on both machines 
but still no luck. I can confirm that No managed switch is preventing 
the nodes from communicating. If you check the log file, there is 
communication for about 4mins and then the node state goes down.

Any other idea?

*From:* Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
*Sent:* Wednesday, July 5, 2017 7:07:15 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP

On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> in my network I encountered that managed switches were preventing
> necessary network communication between the nodes, on which SLURM
> relies. You should check if you're using managed switches to connect
> nodes to the network and if so, if they're blocking communication on
> slurm ports.

Managed switches should permit IP layer 2 traffic just like unmanaged
switches!  We only have managed Ethernet switches, and they work without
problems.

Perhaps you meant that Ethernet switches may perform some firewall
functions by themselves?

Firewalls must be off between Slurm compute nodes as well as the
controller host.  See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

/Ole


--
---
Mehdi Denou
Bull/Atos international HPC support
+336 45 57 66 56



[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Felix Willenborg
When the nodes change to the down state, what is 'sinfo -R' saying?
Sometimes it gives you a reason for that.
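
For example (a sketch; the node name is a placeholder):

   sinfo -R
   scontrol show node obu-n6 | grep -i reason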

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:
> Thank you Adam, For NTP I did that as well before posting but didn't
> fix the issue.
>
> Regards,
> Said
> 
> *From:* Adam Huffman <adam.huff...@gmail.com>
> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>  
>
> I've seen something similar when node clocks were skewed.
>
> Worth checking that NTP is running and they're all synchronised.
>
> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
> <said.moha...@oist.jp> wrote:
> > Thank you all for suggestions. I turned off firewall on both
> machines but
> > still no luck. I can confirm that No managed switch is preventing
> the nodes
> > from communicating. If you check the log file, there is
> communication for
> > about 4mins and then the node state goes down.
> > Any other idea?
> > 
> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> > Sent: Wednesday, July 5, 2017 7:07:15 PM
> > To: slurm-dev
> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
> >
> >
> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> >> in my network I encountered that managed switches were preventing
> >> necessary network communication between the nodes, on which SLURM
> >> relies. You should check if you're using managed switches to connect
> >> nodes to the network and if so, if they're blocking communication on
> >> slurm ports.
> >
> > Managed switches should permit IP layer 2 traffic just like unmanaged
> > switches!  We only have managed Ethernet switches, and they work without
> > problems.
> >
> > Perhaps you meant that Ethernet switches may perform some firewall
> > functions by themselves?
> >
> > Firewalls must be off between Slurm compute nodes as well as the
> > controller host.  See
> >
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
> >
> > /Ole



[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Said Mohamed Said
Thank you, Adam. I checked NTP as well before posting, but that didn't fix the 
issue.
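
For reference, the synchronisation status on CentOS 7 can be checked along
these lines (which of chronyd or ntpd applies depends on the setup):

  timedatectl              # look for "NTP synchronized: yes"
  chronyc sources -v       # if chronyd is the time daemon
  ntpq -p                  # if classic ntpd is used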

Regards,
Said

From: Adam Huffman <adam.huff...@gmail.com>
Sent: Wednesday, July 5, 2017 8:11:03 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


I've seen something similar when node clocks were skewed.

Worth checking that NTP is running and they're all synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said <said.moha...@oist.jp> wrote:
> Thank you all for suggestions. I turned off firewall on both machines but
> still no luck. I can confirm that No managed switch is preventing the nodes
> from communicating. If you check the log file, there is communication for
> about 4mins and then the node state goes down.
> Any other idea?
> 
> From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
>> necessary network communication between the nodes, on which SLURM
>> relies. You should check if you're using managed switches to connect
>> nodes to the network and if so, if they're blocking communication on
>> slurm ports.
>
> Managed switches should permit IP layer 2 traffic just like unmanaged
> switches!  We only have managed Ethernet switches, and they work without
> problems.
>
> Perhaps you meant that Ethernet switches may perform some firewall
> functions by themselves?
>
> Firewalls must be off between Slurm compute nodes as well as the
> controller host.  See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> /Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Said Mohamed Said
Thank you all for the suggestions. I turned off the firewall on both machines but still 
no luck. I can confirm that no managed switch is preventing the nodes from 
communicating. If you check the log file, there is communication for about 
4 minutes and then the node state goes down.
Any other ideas?

From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
Sent: Wednesday, July 5, 2017 7:07:15 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> in my network I encountered that managed switches were preventing
> necessary network communication between the nodes, on which SLURM
> relies. You should check if you're using managed switches to connect
> nodes to the network and if so, if they're blocking communication on
> slurm ports.

Managed switches should permit IP layer 2 traffic just like unmanaged
switches!  We only have managed Ethernet switches, and they work without
problems.

Perhaps you meant that Ethernet switches may perform some firewall
functions by themselves?

Firewalls must be off between Slurm compute nodes as well as the
controller host.  See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen


On 07/05/2017 11:40 AM, Felix Willenborg wrote:

in my network I encountered that managed switches were preventing
necessary network communication between the nodes, on which SLURM
relies. You should check if you're using managed switches to connect
nodes to the network and if so, if they're blocking communication on
slurm ports.


Managed switches should pass layer-2 (Ethernet) and IP traffic just like unmanaged 
switches!  We only have managed Ethernet switches, and they work without 
problems.


Perhaps you meant that Ethernet switches may perform some firewall 
functions by themselves?


Firewalls must be off between Slurm compute nodes as well as the 
controller host.  See 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
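
If a firewall has to stay up on the controller host, a minimal sketch for
opening the default Slurm ports (6817 for slurmctld, 6818 for slurmd) with
firewalld would be:

  firewall-cmd --permanent --add-port=6817-6818/tcp
  firewall-cmd --reload

On the compute nodes themselves, the simplest course remains to keep the
firewall off entirely.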


/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Sean McGrath

On Wed, Jul 05, 2017 at 03:27:18AM -0600, Ole Holm Nielsen wrote:

> Could it be that you have enabled the firewall on the compute nodes? 
> The firewall must be turned off (this requirement isn't documented 

From my experience, I'll +1 that. Looks like a firewall or network level issue
to me.
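
One quick way to confirm that at the port level (assuming the default Slurm
ports and nmap-ncat installed; OBU-N6 and <controller> are placeholders) is:

  nc -zv OBU-N6 6818         # slurmd port, probed from the controller
  nc -zv <controller> 6817   # slurmctld port, probed from the compute node

If the probe times out while the service is running, something in between is
dropping the traffic.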

Best

Sean


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Felix Willenborg

Hi,

in my network I encountered that managed switches were preventing
necessary network communication between the nodes, on which SLURM
relies. You should check if you're using managed switches to connect
nodes to the network and if so, if they're blocking communication on
slurm ports.

Best,
Felix

Am 05.07.2017 um 11:30 schrieb Ole Holm Nielsen:
>
> On 07/05/2017 11:25 AM, Ole Holm Nielsen wrote:
>> Could it be that you have enabled the firewall on the compute nodes?
>> The firewall must be turned off (this requirement isn't documented
>> anywhere).
>>
>> You may want to go through my Slurm deployment Wiki at
>> https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see
>> if anything obvious is missing in your configuration.
>
> Correction to the web page: https://wiki.fysik.dtu.dk/niflheim/SLURM
>
> Sorry,
> Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen


On 07/05/2017 11:25 AM, Ole Holm Nielsen wrote:
Could it be that you have enabled the firewall on the compute nodes? The 
firewall must be turned off (this requirement isn't documented anywhere).


You may want to go through my Slurm deployment Wiki at 
https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if 
anything obvious is missing in your configuration.


Correction to the web page: https://wiki.fysik.dtu.dk/niflheim/SLURM

Sorry,
Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen


Hi Said,

Could it be that you have enabled the firewall on the compute nodes? 
The firewall must be turned off (this requirement isn't documented 
anywhere).


You may want to go through my Slurm deployment Wiki at 
https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if 
anything obvious is missing in your configuration.


Best regards,
Ole

On 07/05/2017 11:17 AM, Said Mohamed Said wrote:

Dear Sir/Madam


I am configuring Slurm for academic use at my university, but I have 
encountered the following problem, for which I could not find a solution 
on the Internet.



I followed all troubleshooting suggestions from your website with no luck.


Whenever I start the slurmd daemon on one of the compute nodes, the node starts 
in the IDLE state but goes DOWN after about 4 minutes with the reason "Node not 
responding".


I am using slurm version 17.02 on both nodes.


tail /var/log/slurmd.log on the faulty node gives:


***

[2017-07-05T16:56:55.118] Resource spec: Reserved system memory limit 
not configured for this node

[2017-07-05T16:56:55.120] slurmd version 17.02.2 started
[2017-07-05T16:56:55.121] slurmd started on Wed, 05 Jul 2017 16:56:55 +0900
[2017-07-05T16:56:55.121] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 
Memory=128661 TmpDisk=262012 Uptime=169125 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)

[2017-07-05T16:59:20.513] Slurmd shutdown completing
[2017-07-05T16:59:20.548] Message aggregation disabled
[2017-07-05T16:59:20.549] Resource spec: Reserved system memory limit 
not configured for this node

[2017-07-05T16:59:20.552] slurmd version 17.02.2 started
[2017-07-05T16:59:20.552] slurmd started on Wed, 05 Jul 2017 16:59:20 +0900
[2017-07-05T16:59:20.553] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 
Memory=128661 TmpDisk=262012 Uptime=169270 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)
*** 





tail /var/log/slurmctld.log on the controller node gives:



[2017-07-05T17:54:56.422] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0

[2017-07-05T17:55:09.004] Node OBU-N6 now responding
[2017-07-05T17:55:09.004] node OBU-N6 returned to service
[2017-07-05T17:59:52.677] error: Nodes OBU-N6 not responding
[2017-07-05T18:03:15.857] error: Nodes OBU-N6 not responding, setting DOWN




The following is my slurm.conf file content:


**

#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#

# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/linear
TreeWidth=50
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageType=accounting_storage/filetxt
#JobCompType=jobcomp/filetxt
#AccountingStorageLoc=/var/log/slurm/accounting
#JobCompLoc=/var/log/slurm/job_completions
ClusterName=obu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=OBU-N5 NodeAddr=10.251.17.170 CPUs=24 Sockets=2 
CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=OBU-N6 NodeAddr=10.251.17.171 CPUs=24 Sockets=2 
CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=slurm-partition Nodes=OBU-N[5-6] Default=YES 
MaxTime=INFINITE State=UP


**
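
For what it's worth, the hardware values in the NodeName lines above can be
cross-checked against what slurmd itself detects on each compute node:

  slurmd -C    # prints a NodeName=... CPUs=... Sockets=... line suitable for slurm.conf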



I can ssh successfully between the nodes, and the munge daemon runs on each machine.
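
The munge setup can also be verified across hosts; a minimal check (run from
the controller, with OBU-N6 as the target) is:

  munge -n | unmunge               # local sanity check on each machine
  munge -n | ssh OBU-N6 unmunge    # should decode and report STATUS: Success (0)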


Your help will be greatly appreciated,


Sincerely,


Said.




--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620