Awesome tip. Thanks so much, Matthieu. I hadn't considered that. I will
give that a shot and see what happens.
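
For anyone reading this thread later, here is a toy model of the tree fan-out Matthieu describes below. It is only a sketch under stated assumptions, not Slurm's actual forwarding code: the function `reachable`, the node names `ctl` and `n1` to `n6`, and the round-robin split into branches are all made up for illustration. It assumes a sender contacts at most `tree_width` nodes directly and that each of those relays the message to the rest of its branch. Under those assumptions it shows why a firewall that only accepts packets from the controller hides most nodes from a tree check, while a small retry reaches them point-to-point.

```python
from typing import Callable, List, Set


def reachable(sender: str, targets: List[str], tree_width: int,
              allowed: Callable[[str, str], bool]) -> Set[str]:
    """Return the targets that receive a message fanned out by `sender`.

    Toy model (an assumption, not Slurm's real algorithm):
    - if there are at most `tree_width` targets, each is contacted directly;
    - otherwise targets are split round-robin into `tree_width` branches and
      the first node of each branch relays to the rest of its branch.
    `allowed(src, dst)` models the firewall on `dst`.
    """
    responded: Set[str] = set()
    if not targets:
        return responded
    if len(targets) <= tree_width:
        branches = [[t] for t in targets]
    else:
        branches = [targets[i::tree_width] for i in range(tree_width)]
    for branch in branches:
        head, rest = branch[0], branch[1:]
        if not allowed(sender, head):
            continue  # the head never gets the message, so its branch is lost
        responded.add(head)
        responded |= reachable(head, rest, tree_width, allowed)
    return responded


nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]


def only_controller(src: str, dst: str) -> bool:
    """Firewall like '-A INPUT -s <controller> -j ACCEPT' on every node."""
    return src == "ctl"


def allow_all(src: str, dst: str) -> bool:
    """No filtering on the cluster network."""
    return True


# Tree check with the restrictive firewall: only the branch heads answer.
print(sorted(reachable("ctl", nodes, 2, only_controller)))  # ['n1', 'n2']

# A retry to few enough nodes goes point-to-point, so they "reappear".
print(sorted(reachable("ctl", ["n3", "n4"], 2, only_controller)))  # ['n3', 'n4']

# With node-to-node traffic permitted, the same tree check reaches everyone.
print(sorted(reachable("ctl", nodes, 2, allow_all)))
```

In this model the nodes that vanish are exactly the ones that depend on another compute node to relay the message, which matches the flapping pattern: up after a direct contact, down after the next tree check.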
On Thu, May 17, 2018 at 4:49 PM, Matthieu Hautreux <
> Communications in Slurm are not only performed from controller to slurmd
> and from slurmd to controller. You need to ensure that your login nodes can
> reach the controller and the slurmd nodes as well as ensure that slurmd on
> the various nodes can contact each other. This last requirement is because
> of the tree logic used in slurm communication :
> - to ensure scalability, slurmctld uses a communication tree (see TreeWidth
> in "man slurm.conf"), used for example to periodically check that all the
> nodes are working properly
> - the same exact logic is used by srun when it contacts the various slurmd
> involved in its step
> - reversed tree communications are performed among slurmds of steps at
> their end to send accounting data and other stuff to the controller
> - only some communications are point-to-point between slurmd and
> controller, especially the "registering call" performed at slurmd startup.
> When the slurmd daemons cannot contact each other because of network
> failures (partitioning) or overly restrictive filtering, you see the kind
> of flapping that you have. This is because the point-to-point communication
> at slurmd registration makes the nodes appear to the controller, tree checks
> make some of them disappear, and retries can lead to point-to-point
> communications to some nodes when the number of destination nodes contacted
> by the controller at the same time is lower than the configured TreeWidth,
> thus nodes suddenly reappear... until the next check... and so on.
> Two options for you :
> - be less restrictive in your filtering rules
> - set TreeWidth to 1 in slurm.conf, but you will lose the
> performance/scalability of Slurm's internal communication
> If your cluster is large, I would recommend the first one.
> PS : you can look at that presentation for a few details on the
> communication logic :
> 2018-05-17 22:21 GMT+02:00 Sean Caron <sca...@umich.edu>:
>> Sorry, how do you mean? The environment is very basic. Compute nodes and
>> SLURM controller are on an RFC1918 subnet. Gateways are dual homed with one
>> leg on a public IP and one leg on the RFC1918 cluster network. It used to
>> be that nodes that only had a leg on the RFC1918 network (compute nodes and
>> the SLURM controller) had no firewall at all and nodes that were dual homed
>> basically were set to just permit all traffic from the cluster side NIC
>> (i.e. iptables rule like -A INPUT -i ethX -j ACCEPT).
>> Now we're trying to go back to the gateways and compute nodes and
>> actually codify, instead of just passing all traffic from the cluster side
>> NIC, what ports and protocols are actually in use, or at least, what
>> server-to-server communication is expected and normative, and then define a
>> rule set to permit those while dropping other traffic not explicitly
>> The compute and gateway nodes work fine with SLURM even when iptables is
>> enabled and the policy is "permit all traffic from that NIC" but once we
>> tighten it down just a little bit to "permit all traffic to and from the
>> SLURM controller" we see these weird instances of node state flapping. It's
>> not clear to me why this is the case since from the standpoint of node to
>> controller communications, these policies are logically very similar, but
>> there it is. The nodes shouldn't have to talk to anything else besides the
>> SLURM controller for SLURM to work, so long as time is synched up between
>> them and there are no issues with the nodes getting to slurm.conf.
>> On Thu, May 17, 2018 at 1:21 PM, Patrick Goetz <pgo...@math.utexas.edu>
>>> Does your SMS have a dedicated interface for node traffic?
>>> On 05/16/2018 04:00 PM, Sean Caron wrote:
>>>> I see some chatter on 6818/TCP from the compute node to the SLURM
>>>> controller, and from the SLURM controller to the compute node.
>>>> The policy is to permit all packets inbound from SLURM controller
>>>> regardless of port and protocol, and perform no filtering whatsoever on any
>>>> output packets to anywhere. I wouldn't expect this to interfere.
>>>> Anyway, it's not that it NEVER works once the firewall is switched on.
>>>> It's that it flaps. The firewall is clearly passing enough traffic to have
>>>> the node marked as up some of the time. But why the periodic "not
>>>> responding" ... "responding" cycles? Once it says "not responding" I can
>>>> still scontrol ping from the compute node in question, and standard ICMP
>>>> ping from one to the other works as well.
>>>> On Wed, May 16, 2018 at 2:13 PM, Alex Chekholko <a...@calicolabs.com> wrote:
>>>> Add a logging rule to your iptables and look at what traffic is
>>>> actually being blocked?
>>>> On Wed, May 16, 2018 at 11:11 AM Sean Caron <sca...@umich.edu> wrote:
>>>> Hi all,
>>>> Does anyone use SLURM in a scenario where there is an iptables
>>>> firewall on the compute nodes on the same network it uses to
>>>> communicate with the SLURM controller and DBD machine?
>>>> I have the very basic situation where ...
>>>> 1. There is no iptables firewall enabled at all on the SLURM
>>>> controller/DBD machine.
>>>> 2. Compute nodes are set to permit all ports and protocols from
>>>> the SLURM controller with a rule like:
>>>> -A INPUT -s IP.of.SLURM.controller/32 -j ACCEPT
>>>> If I enable this on the compute nodes, they flap up and down in
>>>> the "Not responding" state. If I switch off the firewall on the
>>>> compute nodes, they work fine.
>>>> When firewall is up on the compute nodes, SLURM controller can
>>>> ping compute nodes, no problem. I have no reason to believe all
>>>> ports and protocols are not being passed. Time is synched. No
>>>> trouble accessing slurm.conf on any of the clients.
>>>> Has anyone seen this before? There seems to be very little
>>>> information about SLURM's interactions with iptables. I know
>>>> this is kind of a funky scenario but regulatory requirements
>>>> have me needing to tighten down our cluster network a little
>>>> bit. Is this like a latency issue, or ...?