Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)

2020-01-11 Thread Andrei Borzenkov
04.01.2020 01:42, Valentin Vidić wrote:
> On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote:
>> What you've used appears to be akin to what this chunk of manpage
>> suggests (amongst others):
>> https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man
>>
>> which is (yet another) indicator to me that xt_cluster extension
>> doesn't carry that functionality on its own (like CLUSTERIP target
>> did, as mentioned).
> 
...
> 
>> * But it doesn't explain the suggested destination MAC renormalization
>> * on INPUT, which is currently yet to be heard of for our purpose...
> 
> I did not use the INPUT rules from the xt_cluster documentation and
> to be honest don't understand the setup described there.
> 

The ARP RFC (RFC 826) says that in a reply the source and target hardware
addresses are swapped, so the reply is supposed to carry the original
source MAC as its target MAC. AFAICT the Linux ARP driver does not check
this, but I guess it is good practice to make sure a received packet
conforms to the standard's requirement.
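
As far as I can tell, the INPUT rule from the xt_cluster man page fits this reading: a peer answering a request whose sender hardware address was rewritten to the cluster MAC puts that multicast MAC into the target hardware address of its reply, and the rule restores the node's own MAC there. The sketch below only prints that rule (a dry run, since arptables needs root). eth1 and the multicast MAC come from the example in this thread; NODE_MAC is a hypothetical placeholder to be replaced with the interface's real address.

```shell
#!/bin/sh
# Dry run only: prints the arptables command instead of executing it.
IFACE=${IFACE:-eth1}
CLUSTER_MAC=${CLUSTER_MAC:-01:00:5e:00:01:01}
# Hypothetical placeholder; substitute the real MAC of $IFACE.
NODE_MAC=${NODE_MAC:-NODE_MAC_PLACEHOLDER}

print_arp_input_rule() {
    # Match incoming ARP packets whose target hardware address is the
    # shared multicast MAC and rewrite it back to this node's own MAC.
    echo "arptables -A INPUT -i $IFACE --h-length 6 --destination-mac $CLUSTER_MAC -j mangle --mangle-mac-d $NODE_MAC"
}

print_arp_input_rule
```

Review the printed command before running it by hand as root.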
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)

2020-01-03 Thread Valentin Vidić
On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote:
> What you've used appears to be akin to what this chunk of manpage
> suggests (amongst others):
> https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man
> 
> which is (yet another) indicator to me that xt_cluster extension
> doesn't carry that functionality on its own (like CLUSTERIP target
> did, as mentioned).

Right, the old ipt_CLUSTERIP.c (908 lines) was a complete solution
with ARP mangling and /proc file to control node mapping. The new
xt_cluster.c (175 lines) is much more limited and only handles the
hashing part. The rest needs to be done externally, using iptables
and arptables commands (or some nft equivalents).
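
A rough sketch of those external pieces, modelled on the example in the libxt_cluster man page, could look as follows. It is a dry run that only prints the commands (they need root), not the exact rules used in the test above. The interface and multicast MAC are the ones from this thread; the VIP (a documentation address), the node count, the hash seed and the mark value are assumptions for illustration.

```shell
#!/bin/sh
# Dry run only: prints the commands instead of executing them.
IFACE=${IFACE:-eth1}
CLUSTER_MAC=${CLUSTER_MAC:-01:00:5e:00:01:01}
VIP=${VIP:-192.0.2.10}            # hypothetical shared address
TOTAL_NODES=${TOTAL_NODES:-2}     # assumed cluster size
LOCAL_NODE=${LOCAL_NODE:-1}       # this node's number, 1..TOTAL_NODES
HASH_SEED=${HASH_SEED:-0xdeadbeef}

print_cluster_rules() {
    # Let the NIC accept frames addressed to the shared multicast MAC.
    echo "ip maddr add $CLUSTER_MAC dev $IFACE"
    # Mark the flows to the VIP that hash to this node ...
    echo "iptables -t mangle -A PREROUTING -i $IFACE -d $VIP -m cluster --cluster-total-nodes $TOTAL_NODES --cluster-local-node $LOCAL_NODE --cluster-hash-seed $HASH_SEED -j MARK --set-mark 0xffff"
    # ... and drop the VIP traffic that was not marked (other nodes' share).
    echo "iptables -t mangle -A PREROUTING -i $IFACE -d $VIP -m mark ! --mark 0xffff -j DROP"
    # Advertise the multicast MAC instead of the real one in outgoing ARP.
    echo "arptables -A OUTPUT -o $IFACE --h-length 6 -j mangle --mangle-mac-s $CLUSTER_MAC"
}

print_cluster_rules
```

Each node would run the same rules with its own LOCAL_NODE value, so that every flow is accepted by exactly one node.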

> 2. Is the following, for me viable explanation correct?
> 
> That arrangement is to prevent here unexpectedly leaky specific
> associations (I'd call "fixations") of the interface's true (hence
> non-multicast) MAC address with meant-to-be-shared IP address at hand,
> and hence cancelling the effect of link-multicasted frames (to which
> at most a single recipient would respond per the firewall matching
> rules), and therefore botching the "shared IP" concept altogether from
> the perspective of network members that would undesirably learn
> non-multicast address association for the particular
> meant-to-be-shared IP leaked like this.

Yes, I added VIP to both nodes as a normal address and they would both
reply to ARP requests with their normal MAC addresses. The arptables
command rewrites the ARP reply to use the multicast MAC for VIP instead.

> * But it doesn't explain the suggested destination MAC renormalization
> * on INPUT, which is currently yet to be heard of for our purpose...

I did not use the INPUT rules from the xt_cluster documentation and
to be honest don't understand the setup described there.

> 4. Shall not even existing IPaddr2 (whether in CLUSTERIP-based mode
>    or not) actually verify that
>    /proc/sys/net/netfilter/nf_conntrack_tcp_loose
>    gets cleared, at least until told not to through configuration?
> 
> - looks like a good idea not to allow any after-cut packets
>   interaction (would only apply to anything outside of the
>   critical cluster infrastructure since it uses UDP), as
>   a matter of safety precautions (there are no liveness
>   aspects to wish for in such scenarios, which could
>   otherwise interfere, I think)

Did not use conntrack so not sure how important this setting is.

> 5. Here, I had a closer look at the code as well and have an option
>    to try -- does this help?
> 
> It appears as if that response in the (solicited)  Neighbour
> Advertisement is -- in Linux kernel -- unconditionally always
> picked from the very first address configured on the device (not to
> be confused with "permanent address").  Hence it looks to me that
> the way to go would be, so as to achieve feature parity IPv4 vs. IPv6,
> to either:
> 
> - give up on the sole identity of the interface, so that it either
>   operates under selected multicast link layer address or doesn't
>   operate at all (rationale: better not to confuse the network with
>   occasional MAC flips?)
> 
> - stick with a new macvlan pseudointerface, surprise-surprise, yet
>   another virtualization/mimicking/independence-increasing layer :-)
> 
> No experience with macvlan on my side, but bridge mode looks appealing,
> and would retain the interface addressable through its standard MAC
> address as well.  And importantly, the newly created interface would
> have the correct (multicast) MAC address to respond with to the
> respective Neighbour Solicitations (which is exactly what's asked,
> IIUIC), and I expect it would be the one selected to respond to
> the very matching IP in question?
> 
> Still, this doesn't resolve any concern around point 3. above
> (assuming it's not bogus, to begin with).

Yes, macvlan or some other similar trick to bind the VIP with multicast
MAC could be used here to avoid the whole packet rewrite mess. The only
iptables rules required in this setup would be to select using
xt_cluster if the incoming packet is to be handled by the current node
or just ignored (DROP).

-- 
Valentin


[ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)

2020-01-02 Thread Jan Pokorný
On 27/12/19 15:04 +0100, Valentin Vidić wrote:
> On Wed, Dec 04, 2019 at 02:44:49PM +0100, Jan Pokorný wrote:
>> For the record, based on my feedback, iptables-extensions man page is
>> headed to (finally) align with the actual in-kernel deprecation
>> message:
>> https://lore.kernel.org/netfilter-devel/20191204130921.2914-1-p...@nwl.cc/
> 
> From a quick run of xt_cluster it seems to be working as expected
> for IPv4

FTR, when the "netfilter"/nftables backend is available, you can either
make use of the iptables-translate conversion utility, or deduce a similar
takeaway from
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.txlate?h=v1.8.4
possibly allowing you to ditch any dependency on iptables-* tooling, and
on xt_cluster.ko just as well!
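
A minimal sketch of the first option, assuming iptables-translate is installed: the function below only prints an invocation that asks for the nft equivalent of a cluster match rule. The interface, node count, node number, seed and mark are illustrative assumptions, and the translated output is deliberately not reproduced here.

```shell
#!/bin/sh
# Prints an iptables-translate invocation; nothing is executed.
IFACE=${IFACE:-eth1}
TOTAL_NODES=${TOTAL_NODES:-2}     # assumed cluster size
LOCAL_NODE=${LOCAL_NODE:-1}       # this node's number
HASH_SEED=${HASH_SEED:-0xdeadbeef}

print_translate_cmd() {
    # iptables-translate takes ordinary iptables syntax and prints an
    # nft command; it does not change the running ruleset.
    echo "iptables-translate -t mangle -A PREROUTING -i $IFACE -m cluster --cluster-total-nodes $TOTAL_NODES --cluster-local-node $LOCAL_NODE --cluster-hash-seed $HASH_SEED -j MARK --set-mark 0xffff"
}

print_translate_cmd
```

Running the printed command shows the nft rule to adapt, which is the starting point for dropping the iptables-* dependency.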

As mentioned in a newer incarnation asking about xt_cluster:
https://lists.clusterlabs.org/pipermail/users/2020-January/026718.html

for the envisioned agent, it would be a way to (optionally) allow for
a rather lightweight operation in the future (where iptables may not
get installed by default with some Linux distros at all; well, even
firewalld-as-a-middleware variant controlled just via DBus calls might
be thinkable, meaning that "nft" tool wouldn't be required, too).

> It requires iptables rules and ARP reply rewrite like:
> 
> arptables -A OUTPUT -o eth1 --h-length 6 -j mangle --mangle-mac-s 01:00:5e:00:01:01

Pardon my ignorance, but you currently appear to be the greatest expert
with practical experience on this list regarding the topic.

* * *

1. Is this based solely on experience with xt_cluster extension that
   led you to this ARP-level rewrite unique to using netfilter backend,
   or would the same actually be needed with true CLUSTERIP target?

Actually, I took a look at the code of the CLUSTERIP extension, and it in
fact does the very same ARP-level mangling, even though it is slightly
more precise, akin to (with explanatory comments):

  # ARP replies: --h-type 1 = Ethernet, --proto-type 0x800 = IPv4,
  # --opcode 2 = Reply; --h-length 6 is perhaps redundant to --h-type.
  # A limitation on the size of the network address cannot be expressed,
  # but that would perhaps be redundant to --proto-type.
  arptables -A OUTPUT \
    --h-type 1 \
    --proto-type 0x800 \
    --h-length 6 \
    --opcode 2 \
    -j mangle --mangle-mac-s CLUSTERMAC

  # see also
  # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4095ebf1e641b0f37ee1cd04c903bb85cf4ed25b
  # Same match as above, this time for ARP requests (--opcode 1 = Request).
  arptables -A OUTPUT \
    --h-type 1 \
    --proto-type 0x800 \
    --h-length 6 \
    --opcode 1 \
    -j mangle --mangle-mac-s CLUSTERMAC

What you've used appears to be akin to what this chunk of manpage
suggests (amongst others):
https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man

which is (yet another) indicator to me that xt_cluster extension
doesn't carry that functionality on its own (like CLUSTERIP target
did, as mentioned).

*
* Anyway, I'd like to understand why this is necessary in the first
* place, getting to my second question.
*

2. Is the following explanation, which seems viable to me, correct?

That arrangement is there to prevent unexpectedly leaking specific
associations (I'd call them "fixations") of the interface's true (hence
non-multicast) MAC address with the meant-to-be-shared IP address at
hand. Such a leak would cancel the effect of link-multicasted frames (to
which at most a single recipient would respond per the firewall matching
rules), and would therefore botch the "shared IP" concept altogether from
the perspective of network members that would undesirably learn a
non-multicast address association for the particular meant-to-be-shared
IP.

*
* But it doesn't explain the suggested destination MAC renormalization
* on INPUT, which is currently yet to be heard of for our purpose...
*

3. Is, perhaps, the following plausible explanation sound?

- this is so as not to spoil the local ARP cache/dependent interactions,
  such as when an actual ARP request is sent with a link layer address
  identical to the particular multicast MAC in use -- likely
  feasible with the other host on the network configured the same
  way and with the already mentioned source-MAC rewriting on egress
  in place (so this would actually neutralize the harmful network
  effect it could otherwise be causing)

- or any other reasons?

*
* Finally, when referring to the suggestive example above, there is
* one more question to ask...
*

4. Should not even the existing IPaddr2 (whether in CLUSTERIP-based mode
   or not) actually verify that
   /proc/sys/net/netfilter/nf_conntrack_tcp_loose
   gets cleared, at least until told not to through configuration?

- looks like a good idea not to allow any after-cut packets
  interaction (would only apply to anything outside of the
  critical cluster infrastructure since it uses UDP), as
  a matter