Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-21 Thread Matthieu Hautreux
Glad to hear that you made it work.

Regards
Matthieu


2018-05-21 21:21 GMT+02:00 Sean Caron :

> Just wanted to follow up. In addition to passing all traffic to and from
> the SLURM controller, I opened port 6818/TCP to all other compute nodes,
> and this seems to have resolved the issue. Thanks again, Matthieu!
>
> Best,
>
> Sean
>
>
> On Thu, May 17, 2018 at 8:06 PM, Sean Caron  wrote:
>
>> Awesome tip. Thanks so much, Matthieu. I hadn't considered that. I will
>> give that a shot and see what happens.
>>
>> Best,
>>
>> Sean
>>
>>
>> On Thu, May 17, 2018 at 4:49 PM, Matthieu Hautreux <
>> matthieu.hautr...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Communications in Slurm are not only performed from the controller to
>>> slurmd and from slurmd to the controller. You need to ensure that your
>>> login nodes can reach the controller and the slurmd nodes, and also that
>>> the slurmd daemons on the various nodes can contact each other. This last
>>> requirement is because of the tree logic used in Slurm communication:
>>>
>>> - to ensure scalability, slurmctld uses a communication tree (see
>>> TreeWidth in "man slurm.conf"), used for example to periodically check
>>> that all the nodes are working properly
>>> - the exact same logic is used by srun when it contacts the various
>>> slurmd daemons involved in its step
>>> - reverse-tree communications are performed among the slurmds of a step
>>> at its end to send accounting data and other information to the controller
>>>
>>> - only some communications are point-to-point between slurmd and the
>>> controller, in particular the "registering call" performed at slurmd startup.
>>>
>>> When the slurmd daemons cannot contact each other because of network
>>> failures (partitioning) or overly restrictive filtering, you see the kind
>>> of flapping that you have: point-to-point communication at slurmd
>>> registration makes the nodes appear to the controller, tree checks make
>>> some of them disappear, and retries can lead to point-to-point
>>> communications to some nodes when the number of destination nodes
>>> contacted by the controller at the same time is lower than the configured
>>> TreeWidth, so nodes suddenly reappear... until the next check... and so on.
>>>
>>> Two options for you:
>>>
>>> - be less restrictive in your filtering rules
>>> - set TreeWidth to 1 in slurm.conf, but you will lose the
>>> performance/scalability of Slurm's internal communication
>>>
>>> If your cluster is large, I would recommend the first one.
>>>
>>> HTH
>>> Matthieu
>>>
>>> PS: you can look at this presentation for a few details on the
>>> communication logic:
>>> https://slurm.schedmd.com/SUG14/message_aggregation.pdf
>>>
>>>
>>>
>>> 2018-05-17 22:21 GMT+02:00 Sean Caron :
>>>
 Sorry, how do you mean? The environment is very basic. Compute nodes
 and SLURM controller are on an RFC1918 subnet. Gateways are dual homed with
 one leg on a public IP and one leg on the RFC1918 cluster network. It used
 to be that nodes that only had a leg on the RFC1918 network (compute nodes
 and the SLURM controller) had no firewall at all and nodes that were dual
 homed basically were set to just permit all traffic from the cluster side
 NIC (i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).

 Now we're trying to go back to the gateways and compute nodes and
 actually codify, instead of just passing all traffic from the cluster side
 NIC, what ports and protocols are actually in use, or at least, what
 server-to-server communication is expected and normative, and then define a
 rule set to permit those while dropping other traffic not explicitly
 whitelisted.

 The compute and gateway nodes work fine with SLURM even when iptables
 is enabled and the policy is "permit all traffic from that NIC" but once we
 tighten it down just a little bit to "permit all traffic to and from the
 SLURM controller" we see these weird instances of node state flapping. It's
 not clear to me why this is the case since from the standpoint of node to
 controller communications, these policies are logically very similar, but
 there it is. The nodes shouldn't have to talk to anything else besides the
 SLURM controller for SLURM to work, so long as time is synched up between
 them and there are no issues with the nodes getting to slurm.conf.

 Best,

 Sean


 On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz 
 wrote:

> Does your SMS have a dedicated interface for node traffic?
>
> On 05/16/2018 04:00 PM, Sean Caron wrote:
>
>> I see some chatter on 6818/TCP from the compute node to the SLURM
>> controller, and from the SLURM controller to the compute node.
>>
>> The policy is to permit all packets inbound from SLURM controller
>> regardless of port and protocol, and perform no filtering whatsoever on 
>> any

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-21 Thread Sean Caron
Just wanted to follow up. In addition to passing all traffic to and from the
SLURM controller, I opened port 6818/TCP to all other compute nodes, and this
seems to have resolved the issue. Thanks again, Matthieu!

Best,

Sean
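
For anyone who hits this thread later: a minimal sketch of what such a
node-side iptables rule set might look like (the controller address is the
same placeholder used earlier in the thread, and 10.0.0.0/8 stands in for
whatever RFC1918 cluster subnet is actually in use):

# allow everything from the SLURM controller/DBD machine
-A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
# allow slurmd-to-slurmd traffic from the other compute nodes (tree fanout)
-A INPUT -s 10.0.0.0/8 -p tcp --dport 6818 -j ACCEPT
# allow return traffic for connections the node itself initiated
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT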


On Thu, May 17, 2018 at 8:06 PM, Sean Caron  wrote:

> Awesome tip. Thanks so much, Matthieu. I hadn't considered that. I will
> give that a shot and see what happens.
>
> Best,
>
> Sean
>
>
> On Thu, May 17, 2018 at 4:49 PM, Matthieu Hautreux <
> matthieu.hautr...@gmail.com> wrote:
>
>> Hi,
>>
>> Communications in Slurm are not only performed from the controller to
>> slurmd and from slurmd to the controller. You need to ensure that your
>> login nodes can reach the controller and the slurmd nodes, and also that
>> the slurmd daemons on the various nodes can contact each other. This last
>> requirement is because of the tree logic used in Slurm communication:
>>
>> - to ensure scalability, slurmctld uses a communication tree (see
>> TreeWidth in "man slurm.conf"), used for example to periodically check
>> that all the nodes are working properly
>> - the exact same logic is used by srun when it contacts the various
>> slurmd daemons involved in its step
>> - reverse-tree communications are performed among the slurmds of a step
>> at its end to send accounting data and other information to the controller
>>
>> - only some communications are point-to-point between slurmd and the
>> controller, in particular the "registering call" performed at slurmd startup.
>>
>> When the slurmd daemons cannot contact each other because of network
>> failures (partitioning) or overly restrictive filtering, you see the kind
>> of flapping that you have: point-to-point communication at slurmd
>> registration makes the nodes appear to the controller, tree checks make
>> some of them disappear, and retries can lead to point-to-point
>> communications to some nodes when the number of destination nodes
>> contacted by the controller at the same time is lower than the configured
>> TreeWidth, so nodes suddenly reappear... until the next check... and so on.
>>
>> Two options for you:
>>
>> - be less restrictive in your filtering rules
>> - set TreeWidth to 1 in slurm.conf, but you will lose the
>> performance/scalability of Slurm's internal communication
>>
>> If your cluster is large, I would recommend the first one.
>>
>> HTH
>> Matthieu
>>
>> PS: you can look at this presentation for a few details on the
>> communication logic:
>> https://slurm.schedmd.com/SUG14/message_aggregation.pdf
>>
>>
>>
>> 2018-05-17 22:21 GMT+02:00 Sean Caron :
>>
>>> Sorry, how do you mean? The environment is very basic. Compute nodes and
>>> SLURM controller are on an RFC1918 subnet. Gateways are dual homed with one
>>> leg on a public IP and one leg on the RFC1918 cluster network. It used to
>>> be that nodes that only had a leg on the RFC1918 network (compute nodes and
>>> the SLURM controller) had no firewall at all and nodes that were dual homed
>>> basically were set to just permit all traffic from the cluster side NIC
>>> (i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).
>>>
>>> Now we're trying to go back to the gateways and compute nodes and
>>> actually codify, instead of just passing all traffic from the cluster side
>>> NIC, what ports and protocols are actually in use, or at least, what
>>> server-to-server communication is expected and normative, and then define a
>>> rule set to permit those while dropping other traffic not explicitly
>>> whitelisted.
>>>
>>> The compute and gateway nodes work fine with SLURM even when iptables is
>>> enabled and the policy is "permit all traffic from that NIC" but once we
>>> tighten it down just a little bit to "permit all traffic to and from the
>>> SLURM controller" we see these weird instances of node state flapping. It's
>>> not clear to me why this is the case since from the standpoint of node to
>>> controller communications, these policies are logically very similar, but
>>> there it is. The nodes shouldn't have to talk to anything else besides the
>>> SLURM controller for SLURM to work, so long as time is synched up between
>>> them and there are no issues with the nodes getting to slurm.conf.
>>>
>>> Best,
>>>
>>> Sean
>>>
>>>
>>> On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz 
>>> wrote:
>>>
 Does your SMS have a dedicated interface for node traffic?

 On 05/16/2018 04:00 PM, Sean Caron wrote:

> I see some chatter on 6818/TCP from the compute node to the SLURM
> controller, and from the SLURM controller to the compute node.
>
> The policy is to permit all packets inbound from SLURM controller
> regardless of port and protocol, and perform no filtering whatsoever on 
> any
> output packets to anywhere. I wouldn't expect this to interfere.
>
> Anyway, it's not that it NEVER works once the firewall is switched on.
> It's that it flaps. The firewall is clearly passing enough traffic to have

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-17 Thread Sean Caron
Awesome tip. Thanks so much, Matthieu. I hadn't considered that. I will
give that a shot and see what happens.

Best,

Sean


On Thu, May 17, 2018 at 4:49 PM, Matthieu Hautreux <
matthieu.hautr...@gmail.com> wrote:

> Hi,
>
> Communications in Slurm are not only performed from the controller to
> slurmd and from slurmd to the controller. You need to ensure that your
> login nodes can reach the controller and the slurmd nodes, and also that
> the slurmd daemons on the various nodes can contact each other. This last
> requirement is because of the tree logic used in Slurm communication:
>
> - to ensure scalability, slurmctld uses a communication tree (see TreeWidth
> in "man slurm.conf"), used for example to periodically check that all the
> nodes are working properly
> - the exact same logic is used by srun when it contacts the various slurmd
> daemons involved in its step
> - reverse-tree communications are performed among the slurmds of a step at
> its end to send accounting data and other information to the controller
>
> - only some communications are point-to-point between slurmd and the
> controller, in particular the "registering call" performed at slurmd startup.
>
> When the slurmd daemons cannot contact each other because of network
> failures (partitioning) or overly restrictive filtering, you see the kind
> of flapping that you have: point-to-point communication at slurmd
> registration makes the nodes appear to the controller, tree checks make
> some of them disappear, and retries can lead to point-to-point
> communications to some nodes when the number of destination nodes contacted
> by the controller at the same time is lower than the configured TreeWidth,
> so nodes suddenly reappear... until the next check... and so on.
>
> Two options for you:
>
> - be less restrictive in your filtering rules
> - set TreeWidth to 1 in slurm.conf, but you will lose the
> performance/scalability of Slurm's internal communication
>
> If your cluster is large, I would recommend the first one.
>
> HTH
> Matthieu
>
> PS: you can look at this presentation for a few details on the
> communication logic:
> https://slurm.schedmd.com/SUG14/message_aggregation.pdf
>
>
>
> 2018-05-17 22:21 GMT+02:00 Sean Caron :
>
>> Sorry, how do you mean? The environment is very basic. Compute nodes and
>> SLURM controller are on an RFC1918 subnet. Gateways are dual homed with one
>> leg on a public IP and one leg on the RFC1918 cluster network. It used to
>> be that nodes that only had a leg on the RFC1918 network (compute nodes and
>> the SLURM controller) had no firewall at all and nodes that were dual homed
>> basically were set to just permit all traffic from the cluster side NIC
>> (i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).
>>
>> Now we're trying to go back to the gateways and compute nodes and
>> actually codify, instead of just passing all traffic from the cluster side
>> NIC, what ports and protocols are actually in use, or at least, what
>> server-to-server communication is expected and normative, and then define a
>> rule set to permit those while dropping other traffic not explicitly
>> whitelisted.
>>
>> The compute and gateway nodes work fine with SLURM even when iptables is
>> enabled and the policy is "permit all traffic from that NIC" but once we
>> tighten it down just a little bit to "permit all traffic to and from the
>> SLURM controller" we see these weird instances of node state flapping. It's
>> not clear to me why this is the case since from the standpoint of node to
>> controller communications, these policies are logically very similar, but
>> there it is. The nodes shouldn't have to talk to anything else besides the
>> SLURM controller for SLURM to work, so long as time is synched up between
>> them and there are no issues with the nodes getting to slurm.conf.
>>
>> Best,
>>
>> Sean
>>
>>
>> On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz 
>> wrote:
>>
>>> Does your SMS have a dedicated interface for node traffic?
>>>
>>> On 05/16/2018 04:00 PM, Sean Caron wrote:
>>>
 I see some chatter on 6818/TCP from the compute node to the SLURM
 controller, and from the SLURM controller to the compute node.

 The policy is to permit all packets inbound from SLURM controller
 regardless of port and protocol, and perform no filtering whatsoever on any
 output packets to anywhere. I wouldn't expect this to interfere.

 Anyway, it's not that it NEVER works once the firewall is switched on.
 It's that it flaps. The firewall is clearly passing enough traffic to have
 the node marked as up some of the time. But why the periodic "not
 responding" ... "responding" cycles? Once it says "not responding" I can
 still scontrol ping from the compute node in question, and standard ICMP
 ping from one to the other works as well.

 Best,

 Sean


 On Wed, May 16, 2018 at 2:13 PM, Alex Chekholko 

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-17 Thread Matthieu Hautreux
Hi,

Communications in Slurm are not only performed from the controller to slurmd
and from slurmd to the controller. You need to ensure that your login nodes
can reach the controller and the slurmd nodes, and also that the slurmd
daemons on the various nodes can contact each other. This last requirement
is because of the tree logic used in Slurm communication:

- to ensure scalability, slurmctld uses a communication tree (see TreeWidth
in "man slurm.conf"), used for example to periodically check that all the
nodes are working properly
- the exact same logic is used by srun when it contacts the various slurmd
daemons involved in its step
- reverse-tree communications are performed among the slurmds of a step at
its end to send accounting data and other information to the controller

- only some communications are point-to-point between slurmd and the
controller, in particular the "registering call" performed at slurmd startup.

When the slurmd daemons cannot contact each other because of network
failures (partitioning) or overly restrictive filtering, you see the kind of
flapping that you have: point-to-point communication at slurmd registration
makes the nodes appear to the controller, tree checks make some of them
disappear, and retries can lead to point-to-point communications to some
nodes when the number of destination nodes contacted by the controller at
the same time is lower than the configured TreeWidth, so nodes suddenly
reappear... until the next check... and so on.

Two options for you:

- be less restrictive in your filtering rules
- set TreeWidth to 1 in slurm.conf, but you will lose the
performance/scalability of Slurm's internal communication

If your cluster is large, I would recommend the first one.

HTH
Matthieu

PS: you can look at this presentation for a few details on the
communication logic:
https://slurm.schedmd.com/SUG14/message_aggregation.pdf
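
For reference, the second option amounts to a one-line change in slurm.conf
(the same file on the controller and all nodes), applied by restarting or
reconfiguring the daemons. A minimal sketch:

# slurm.conf
# Disable the slurmd communication fanout so nodes only need to reach
# slurmctld; this gives up the scalability benefit described above.
TreeWidth=1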



2018-05-17 22:21 GMT+02:00 Sean Caron :

> Sorry, how do you mean? The environment is very basic. Compute nodes and
> SLURM controller are on an RFC1918 subnet. Gateways are dual homed with one
> leg on a public IP and one leg on the RFC1918 cluster network. It used to
> be that nodes that only had a leg on the RFC1918 network (compute nodes and
> the SLURM controller) had no firewall at all and nodes that were dual homed
> basically were set to just permit all traffic from the cluster side NIC
> (i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).
>
> Now we're trying to go back to the gateways and compute nodes and actually
> codify, instead of just passing all traffic from the cluster side NIC, what
> ports and protocols are actually in use, or at least, what server-to-server
> communication is expected and normative, and then define a rule set to
> permit those while dropping other traffic not explicitly whitelisted.
>
> The compute and gateway nodes work fine with SLURM even when iptables is
> enabled and the policy is "permit all traffic from that NIC" but once we
> tighten it down just a little bit to "permit all traffic to and from the
> SLURM controller" we see these weird instances of node state flapping. It's
> not clear to me why this is the case since from the standpoint of node to
> controller communications, these policies are logically very similar, but
> there it is. The nodes shouldn't have to talk to anything else besides the
> SLURM controller for SLURM to work, so long as time is synched up between
> them and there are no issues with the nodes getting to slurm.conf.
>
> Best,
>
> Sean
>
>
> On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz 
> wrote:
>
>> Does your SMS have a dedicated interface for node traffic?
>>
>> On 05/16/2018 04:00 PM, Sean Caron wrote:
>>
>>> I see some chatter on 6818/TCP from the compute node to the SLURM
>>> controller, and from the SLURM controller to the compute node.
>>>
>>> The policy is to permit all packets inbound from SLURM controller
>>> regardless of port and protocol, and perform no filtering whatsoever on any
>>> output packets to anywhere. I wouldn't expect this to interfere.
>>>
>>> Anyway, it's not that it NEVER works once the firewall is switched on.
>>> It's that it flaps. The firewall is clearly passing enough traffic to have
>>> the node marked as up some of the time. But why the periodic "not
>>> responding" ... "responding" cycles? Once it says "not responding" I can
>>> still scontrol ping from the compute node in question, and standard ICMP
>>> ping from one to the other works as well.
>>>
>>> Best,
>>>
>>> Sean
>>>
>>>
>>> On Wed, May 16, 2018 at 2:13 PM, Alex Chekholko wrote:
>>>
>>> Add a logging rule to your iptables and look at what traffic is
>>> actually being blocked?
>>>
>>> On Wed, May 16, 2018 at 11:11 AM Sean Caron wrote:
>>>
>>> Hi all,
>>>
>>> Does anyone use SLURM in a scenario where there is an iptables
>>> 

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-17 Thread Sean Caron
Sorry, how do you mean? The environment is very basic. Compute nodes and
SLURM controller are on an RFC1918 subnet. Gateways are dual homed with one
leg on a public IP and one leg on the RFC1918 cluster network. It used to
be that nodes that only had a leg on the RFC1918 network (compute nodes and
the SLURM controller) had no firewall at all and nodes that were dual homed
basically were set to just permit all traffic from the cluster side NIC
(i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).

Now we're trying to go back to the gateways and compute nodes and actually
codify, instead of just passing all traffic from the cluster side NIC, what
ports and protocols are actually in use, or at least, what server-to-server
communication is expected and normative, and then define a rule set to
permit those while dropping other traffic not explicitly whitelisted.

The compute and gateway nodes work fine with SLURM even when iptables is
enabled and the policy is "permit all traffic from that NIC" but once we
tighten it down just a little bit to "permit all traffic to and from the
SLURM controller" we see these weird instances of node state flapping. It's
not clear to me why this is the case since from the standpoint of node to
controller communications, these policies are logically very similar, but
there it is. The nodes shouldn't have to talk to anything else besides the
SLURM controller for SLURM to work, so long as time is synched up between
them and there are no issues with the nodes getting to slurm.conf.

Best,

Sean


On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz 
wrote:

> Does your SMS have a dedicated interface for node traffic?
>
> On 05/16/2018 04:00 PM, Sean Caron wrote:
>
>> I see some chatter on 6818/TCP from the compute node to the SLURM
>> controller, and from the SLURM controller to the compute node.
>>
>> The policy is to permit all packets inbound from SLURM controller
>> regardless of port and protocol, and perform no filtering whatsoever on any
>> output packets to anywhere. I wouldn't expect this to interfere.
>>
>> Anyway, it's not that it NEVER works once the firewall is switched on.
>> It's that it flaps. The firewall is clearly passing enough traffic to have
>> the node marked as up some of the time. But why the periodic "not
>> responding" ... "responding" cycles? Once it says "not responding" I can
>> still scontrol ping from the compute node in question, and standard ICMP
>> ping from one to the other works as well.
>>
>> Best,
>>
>> Sean
>>
>>
>> On Wed, May 16, 2018 at 2:13 PM, Alex Chekholko wrote:
>>
>> Add a logging rule to your iptables and look at what traffic is
>> actually being blocked?
>>
>> On Wed, May 16, 2018 at 11:11 AM Sean Caron wrote:
>>
>> Hi all,
>>
>> Does anyone use SLURM in a scenario where there is an iptables
>> firewall on the compute nodes on the same network it uses to
>> communicate with the SLURM controller and DBD machine?
>>
>> I have the very basic situation where ...
>>
>> 1. There is no iptables firewall enabled at all on the SLURM
>> controller/DBD machine.
>>
>> 2. Compute nodes are set to permit all ports and protocols from
>> the SLURM controller with a rule like:
>>
>> -A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
>>
>> If I enable this on the compute nodes, they flap up and down in
>> "Not responding" state. If I switch off the firewall on the
>> compute nodes, they work fine.
>>
>> When firewall is up on the compute nodes, SLURM controller can
>> ping compute nodes, no problem. I have no reason to believe all
>> ports and protocols are not being passed. Time is synched. No
>> trouble accessing slurm.conf on any of the clients.
>>
>> Has anyone seen this before? There seems to be very little
>> information about SLURM's interactions with iptables. I know
>> this is kind of a funky scenario but regulatory requirements
>> have me needing to tighten down our cluster network a little
>> bit. Is this like a latency issue, or ...?
>>
>> Thanks,
>>
>> Sean
>>
>>
>>
>


Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-16 Thread Sean Caron
I see some chatter on 6818/TCP from the compute node to the SLURM
controller, and from the SLURM controller to the compute node.

The policy is to permit all packets inbound from SLURM controller
regardless of port and protocol, and perform no filtering whatsoever on any
output packets to anywhere. I wouldn't expect this to interfere.

Anyway, it's not that it NEVER works once the firewall is switched on. It's
that it flaps. The firewall is clearly passing enough traffic to have the
node marked as up some of the time. But why the periodic "not responding"
... "responding" cycles? Once it says "not responding" I can still scontrol
ping from the compute node in question, and standard ICMP ping from one to
the other works as well.

Best,

Sean
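
Since ICMP ping and scontrol ping only exercise the node-to-controller path,
it may also be worth checking the slurmd-to-slurmd path that turned out to
matter later in this thread. A quick sketch, assuming netcat is available
and the default ports (6817 for slurmctld, 6818 for slurmd; hostnames are
placeholders):

# from one compute node, can we reach slurmd on another compute node?
nc -zv other-compute-node 6818
# and slurmctld on the controller?
nc -zv slurm-controller 6817
# Slurm's own view of controller reachability and of down/drained nodes
scontrol ping
sinfo -R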


On Wed, May 16, 2018 at 2:13 PM, Alex Chekholko  wrote:

> Add a logging rule to your iptables and look at what traffic is actually
> being blocked?
>
> On Wed, May 16, 2018 at 11:11 AM Sean Caron  wrote:
>
>> Hi all,
>>
>> Does anyone use SLURM in a scenario where there is an iptables firewall
>> on the compute nodes on the same network it uses to communicate with the
>> SLURM controller and DBD machine?
>>
>> I have the very basic situation where ...
>>
>> 1. There is no iptables firewall enabled at all on the SLURM
>> controller/DBD machine.
>>
>> 2. Compute nodes are set to permit all ports and protocols from the SLURM
>> controller with a rule like:
>>
>> -A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
>>
>> If I enable this on the compute nodes, they flap up and down in "Not
>> responding" state. If I switch off the firewall on the compute nodes, they
>> work fine.
>>
>> When firewall is up on the compute nodes, SLURM controller can ping
>> compute nodes, no problem. I have no reason to believe all ports and
>> protocols are not being passed. Time is synched. No trouble accessing
>> slurm.conf on any of the clients.
>>
>> Has anyone seen this before? There seems to be very little information
>> about SLURM's interactions with iptables. I know this is kind of a funky
>> scenario but regulatory requirements have me needing to tighten down our
>> cluster network a little bit. Is this like a latency issue, or ...?
>>
>> Thanks,
>>
>> Sean
>>
>>


Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-16 Thread Alex Chekholko
Add a logging rule to your iptables and look at what traffic is actually
being blocked?
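
A sketch of such a rule, inserted just above the final DROP/REJECT in the
INPUT chain (the log prefix and rate limit are arbitrary choices, nothing
Slurm-specific):

# log what is about to be dropped, at most 5 entries per minute
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables-drop: " --log-level 4

The matching entries then show up in the kernel log (dmesg or journalctl -k).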

On Wed, May 16, 2018 at 11:11 AM Sean Caron  wrote:

> Hi all,
>
> Does anyone use SLURM in a scenario where there is an iptables firewall on
> the compute nodes on the same network it uses to communicate with the SLURM
> controller and DBD machine?
>
> I have the very basic situation where ...
>
> 1. There is no iptables firewall enabled at all on the SLURM
> controller/DBD machine.
>
> 2. Compute nodes are set to permit all ports and protocols from the SLURM
> controller with a rule like:
>
> -A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
>
> If I enable this on the compute nodes, they flap up and down in "Not
> responding" state. If I switch off the firewall on the compute nodes, they
> work fine.
>
> When firewall is up on the compute nodes, SLURM controller can ping
> compute nodes, no problem. I have no reason to believe all ports and
> protocols are not being passed. Time is synched. No trouble accessing
> slurm.conf on any of the clients.
>
> Has anyone seen this before? There seems to be very little information
> about SLURM's interactions with iptables. I know this is kind of a funky
> scenario but regulatory requirements have me needing to tighten down our
> cluster network a little bit. Is this like a latency issue, or ...?
>
> Thanks,
>
> Sean
>
>


[slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-16 Thread Sean Caron
Hi all,

Does anyone use SLURM in a scenario where there is an iptables firewall on
the compute nodes on the same network it uses to communicate with the SLURM
controller and DBD machine?

I have the very basic situation where ...

1. There is no iptables firewall enabled at all on the SLURM controller/DBD
machine.

2. Compute nodes are set to permit all ports and protocols from the SLURM
controller with a rule like:

-A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT

If I enable this on the compute nodes, they flap up and down in "Not
responding" state. If I switch off the firewall on the compute nodes, they
work fine.

When firewall is up on the compute nodes, SLURM controller can ping compute
nodes, no problem. I have no reason to believe all ports and protocols are
not being passed. Time is synched. No trouble accessing slurm.conf on any
of the clients.

Has anyone seen this before? There seems to be very little information
about SLURM's interactions with iptables. I know this is kind of a funky
scenario but regulatory requirements have me needing to tighten down our
cluster network a little bit. Is this like a latency issue, or ...?

Thanks,

Sean