Re: CARP stopped working after upgrade from 11 to 12

2019-01-18 Thread Steven Hartland

On 18/01/2019 10:34, Thomas Steen Rasmussen wrote:
> On 1/16/19 8:16 PM, Thomas Steen Rasmussen wrote:
>> On 1/16/19 6:56 PM, Steven Hartland wrote:
>>> PS: are you going to file a PR ?
>>
>> Yes here https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235005
>
> Hello all,
>
> A quick follow up for the archives:
>
> Steven Hartland smh@ found the issue and created
> https://reviews.freebsd.org/D18882 which resulted in this commit
> https://svnweb.freebsd.org/changeset/base/343130 from kp@ with a fix
> for the issue.
>
> All is well in carp/pfsync land again. Thank you Steven, kp@ and Pete
> French for your help.
>
> Best regards & have a nice weekend,
>
> Thomas Steen Rasmussen


Glad to help, thanks for the bug report!

    Regards
    Steve
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: CARP stopped working after upgrade from 11 to 12

2019-01-18 Thread Thomas Steen Rasmussen

On 1/16/19 8:16 PM, Thomas Steen Rasmussen wrote:
> On 1/16/19 6:56 PM, Steven Hartland wrote:
>> PS: are you going to file a PR ?
>
> Yes here https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235005



Hello all,

A quick follow up for the archives:

Steven Hartland smh@ found the issue and created 
https://reviews.freebsd.org/D18882 which resulted in this commit 
https://svnweb.freebsd.org/changeset/base/343130 from kp@ with a fix for 
the issue.


All is well in carp/pfsync land again. Thank you Steven, kp@ and Pete 
French for your help.


Best regards & have a nice weekend,

Thomas Steen Rasmussen



Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Thomas Steen Rasmussen



On 1/16/19 6:56 PM, Steven Hartland wrote:
>> PS: are you going to file a PR ?

Yes here https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235005

> You could also try setting net.pfsync.pfsync_buckets="1" in
> /boot/loader.conf which, reading the code, should ensure all items are
> processed in a single bucket, so if it's the bucketing split that has the
> issue then this will fix it. If the issue is more ingrained then it won't.

Setting net.pfsync.pfsync_buckets="1" in /boot/loader.conf doesn't fix
it. That would have been a neat workaround though :)


/Thomas




Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Steven Hartland



On 16/01/2019 17:33, Pete French wrote:
>> I have confirmed that pfsync is the culprit. Read on for details.
>
> Excellent work. I'm home now, so won't get a chance to put this into
> practice until tomorrow unfortunately, but it's brilliant that you have
> confirmed it.
>
>> I tried disabling pfsync and rebooting both nodes, they came up as
>> MASTER/SLAVE then.
>
> This is very useful to know - I will probably try tomorrow running my
> firewalls back up with pfsync disabled to see if it works for me too.
>
>> Then I tried enabling pfsync and starting it, and on the SLAVE node I
>> immediately got:
>
> That kind of confirms it really, doesn't it?
>
> So, is it possible to get r342051 backed out of STABLE for now? This
> is a bit of a 'gotcha' for anyone running a firewall pair with CARP after all.
>
> -pete.
>
> PS: are you going to file a PR ?

You could also try setting net.pfsync.pfsync_buckets="1" in
/boot/loader.conf which, reading the code, should ensure all items are
processed in a single bucket, so if it's the bucketing split that has the
issue then this will fix it. If the issue is more ingrained then it won't.
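
For anyone wanting to try the tunable, a minimal sketch of adding it without duplicating the line on repeated runs. The helper name set_loader_tunable is hypothetical (it is not a FreeBSD tool); the tunable name is the one given in this message, and loader tunables are read at boot, so a reboot would be needed before it has any effect.

```shell
#!/bin/sh
# set_loader_tunable FILE NAME VALUE
# Appends NAME="VALUE" to FILE unless a line starting with NAME= is
# already present. Hypothetical helper for illustration only.
set_loader_tunable() {
    file=$1
    name=$2
    value=$3
    if ! grep -q "^${name}=" "$file" 2>/dev/null; then
        printf '%s="%s"\n' "$name" "$value" >> "$file"
    fi
}

# Usage (as root), followed by a reboot:
#   set_loader_tunable /boot/loader.conf net.pfsync.pfsync_buckets 1
# After the reboot the value can be read back with:
#   sysctl net.pfsync.pfsync_buckets
```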


    Regards
    Steve


Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Pete French
> I have confirmed that pfsync is the culprit. Read on for details.

Excellent work. I'm home now, so won't get a chance to put this into
practice until tomorrow unfortunately, but it's brilliant that you have
confirmed it.

> I tried disabling pfsync and rebooting both nodes, they came up as 
> MASTER/SLAVE then.

This is very useful to know - I will probably try tomorrow running my
firewalls back up with pfsync disabled to see if it works for me too.

> Then I tried enabling pfsync and starting it, and on the SLAVE node I 
> immediately got:

That kind of confirms it really, doesn't it?

So, is it possible to get r342051 backed out of STABLE for now? This
is a bit of a 'gotcha' for anyone running a firewall pair with CARP after all.

-pete.

PS: are you going to file a PR ?


Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Thomas Steen Rasmussen

On 1/16/19 3:53 PM, Steven Hartland wrote:

I have confirmed that pfsync is the culprit. Read on for details.

> I can't see how any of those would impact carp unless pf is now
> incorrectly blocking carp packets, which seems unlikely from that commit.




Well I would agree, but nevertheless, here we are.



> Questions:
>
>   * Are you running a firewall?



Yes, pf, but it permits CARP packets, and MASTER/SLAVE works well up to 
and including r342050.


Rebuild to r342051 with the exact same configuration and now both nodes 
are MASTER.




>   * What does sysctl net.inet.carp report?


net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1


>   * What exactly does ifconfig report about your carp on both hosts?



with 12-STABLE r342050:

[tykling@fwclu2a ~]$ uname -a
FreeBSD fwclu2a 12.0-STABLE FreeBSD 12.0-STABLE r342050 GENERIC amd64
[tykling@fwclu2a ~]$ ifconfig | grep carp
    carp: MASTER vhid 1 advbase 1 advskew 100
    carp: MASTER vhid 1 advbase 1 advskew 100
    carp: MASTER vhid 1 advbase 1 advskew 100
[tykling@fwclu2a ~]$

[tykling@fwclu2b ~]$ uname -a
FreeBSD fwclu2b 12.0-STABLE FreeBSD 12.0-STABLE r342050 GENERIC amd64
[tykling@fwclu2b ~]$ ifconfig | grep carp
    carp: BACKUP vhid 1 advbase 1 advskew 200
    carp: BACKUP vhid 1 advbase 1 advskew 200
    carp: BACKUP vhid 1 advbase 1 advskew 200
[tykling@fwclu2b ~]$

and with 12-STABLE r342051:

[tykling@fwclu2a ~]$ uname -a
FreeBSD fwclu2a 12.0-STABLE FreeBSD 12.0-STABLE r342051 GENERIC amd64
[tykling@fwclu2a ~]$ ifconfig | grep carp
    carp: MASTER vhid 1 advbase 1 advskew 100
    carp: MASTER vhid 1 advbase 1 advskew 100
    carp: MASTER vhid 1 advbase 1 advskew 100
[tykling@fwclu2a ~]$

[tykling@fwclu2b ~]$ uname -a
FreeBSD fwclu2b 12.0-STABLE FreeBSD 12.0-STABLE r342051 GENERIC amd64
[tykling@fwclu2b ~]$ ifconfig | grep carp
    carp: MASTER vhid 1 advbase 1 advskew 200
    carp: MASTER vhid 1 advbase 1 advskew 200
    carp: MASTER vhid 1 advbase 1 advskew 200
[tykling@fwclu2b ~]$


>   * Have you tried enabling more detailed carp logging using sysctl
>     net.inet.carp.log?


It is at 1 and increasing it to 2 doesn't appear to log anything new.


I tried disabling pfsync and rebooting both nodes, they came up as 
MASTER/SLAVE then.


Then I tried enabling pfsync and starting it, and on the SLAVE node I 
immediately got:


Jan 16 16:34:56 fwclu2b kernel: carp: demoted by -240 to -240 (pfsync bulk done)
Jan 16 16:34:56 fwclu2b kernel: carp: 1@lagg2.52: BACKUP -> MASTER (preempting a slower master)
Jan 16 16:34:56 fwclu2b kernel: carp: 1@lagg2.51: BACKUP -> MASTER (preempting a slower master)
Jan 16 16:34:56 fwclu2b kernel: carp: 1@lagg3: BACKUP -> MASTER (preempting a slower master)


Stopping pfsync again does not make it go back to SLAVE.
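
The before/after comparison of `ifconfig | grep carp` above can be condensed into one line per node. The following is only a sketch: carp_summary is a hypothetical helper name, and it assumes the "carp: STATE vhid ..." line format shown in the output above.

```shell
#!/bin/sh
# Count CARP states in ifconfig output read from stdin.
# Prints one line of the form "MASTER=<n> BACKUP=<n>".
# carp_summary is a hypothetical helper, not part of FreeBSD.
carp_summary() {
    awk '
        $1 == "carp:" && $2 == "MASTER" { master++ }
        $1 == "carp:" && $2 == "BACKUP" { backup++ }
        END { printf "MASTER=%d BACKUP=%d\n", master, backup }
    '
}

# Usage on each node:
#   ifconfig | carp_summary
```

On a healthy pair one node would be expected to report only MASTER entries and the other only BACKUP entries; both nodes reporting MASTER for the same vhids is the failure described in this thread.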


Best regards,

Thomas Steen Rasmussen





Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Pete French
> I can't see how any of those would impact carp unless pf is now 
> incorrectly blocking carp packets, which seems unlikely from that commit.

Just looking at the code it does seem unlikely, true - but my working
system does not run pf+pfsync and the non-working one does, so it is
suspiciously in the right "place". If Thomas can bisect it and show it works
before but not after then it has to be in there somewhere I guess.

The dmesg "(preempting a slower master)" also makes me think
that it is receiving carp packets - though I haven't checked the code to
see if it produces that if it can't see any other masters at all.

> Questions:
>
>   * Are you running a firewall?

Yes, pf. The boxes are basically our external firewall/router. I also
run a load balancer on them - relayd before, but now haproxy after
yesterday's thread on here.

>   * What does sysctl net.inet.carp report?

$ sysctl net.inet.carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: -240
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1

>   * What exactly does ifconfig report about your carp on both hosts?

I only have carp enabled on one host for now, to prevent the downtime,
but ifconfig on the master is below. I am currently running with a separate
vhid for each address. I normally run with a separate vhid for each network
and address family though - i.e. 4 - but there's no difference in the
behaviour.

em0: flags=8943 metric 0 mtu 1500
    options=81249b
    ether 00:25:90:31:bf:a2
    inet 10.32.10.1 netmask 0xffff0000 broadcast 10.32.255.255
    inet 10.32.10.6 netmask 0xffff0000 broadcast 10.32.255.255 vhid 1
    inet6 fe80::225:90ff:fe31:bfa2%em0 prefixlen 64 scopeid 0x1
    inet6 2a02:1658:1:2:e550::1 prefixlen 64
    inet6 2a02:1658:1:2:e550::6 prefixlen 64 vhid 2
    carp: MASTER vhid 1 advbase 1 advskew 10
    carp: MASTER vhid 2 advbase 1 advskew 10
    media: Ethernet autoselect (1000baseT )
    status: active
    nd6 options=21
em1: flags=8943 metric 0 mtu 1500
    options=81249b
    ether 00:25:90:31:bf:a3
    inet 178.250.73.196 netmask 0xffffffc0 broadcast 178.250.73.255
    inet 178.250.73.198 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 3
    inet 178.250.73.199 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 5
    inet 178.250.73.200 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 6
    inet 178.250.73.221 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 7
    inet6 fe80::225:90ff:fe31:bfa3%em1 prefixlen 64 scopeid 0x2
    inet6 2a02:1658:1:1::1:2 prefixlen 64
    inet6 2a02:1658:1:1::1:1 prefixlen 64 vhid 4
    carp: MASTER vhid 3 advbase 1 advskew 10
    carp: MASTER vhid 5 advbase 1 advskew 10
    carp: MASTER vhid 6 advbase 1 advskew 10
    carp: MASTER vhid 7 advbase 1 advskew 10
    carp: MASTER vhid 4 advbase 1 advskew 10
    media: Ethernet autoselect (1000baseT )
    status: active
    nd6 options=21
lo0: flags=8049 metric 0 mtu 16384
    options=680003
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
    inet 127.0.0.1 netmask 0xff000000
    groups: lo
    nd6 options=21
pflog0: flags=0<> metric 0 mtu 33160
    groups: pflog
pfsync0: flags=41 metric 0 mtu 1500
    pfsync: syncdev: em0 syncpeer: 10.32.10.2 maxupd: 128 defer: off
    groups: pfsync

>   * Have you tried enabling more detailed carp logging using sysctl
> net.inet.carp.log?

I didn't have time unfortunately - at the point where all the alerts went off
and all of the systems were offline I just did what I needed to in
order to get it working again (i.e. shut down the passive side). This
is our main production firewall pair, so any downtime causes lots of problems
and we can't make any sales.

Is there anything in the above which looks fishy to you though?
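
One value that can be pulled out of the sysctl output above for comparison between nodes is the demotion counter, which reads -240 here and also appears as -240 in the "pfsync bulk done" log line elsewhere in this thread. The sketch below only extracts the number; carp_demotion is a hypothetical helper name, and treating a negative value as worth a closer look is an observation from this thread's outputs, not a statement about the root cause.

```shell
#!/bin/sh
# Extract the CARP demotion counter from `sysctl net.inet.carp`
# output read on stdin. Prints the bare number, e.g. "-240".
# carp_demotion is a hypothetical helper, not part of FreeBSD.
carp_demotion() {
    awk -F': ' '$1 == "net.inet.carp.demotion" { print $2 }'
}

# Usage on each node:
#   sysctl net.inet.carp | carp_demotion
```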

-pete.


Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Pete French
> Indeed. I am seeing the same thing. Which revision of 12 are you running?

Ah, now that is very interesting - I wasn't expecting a reply so fast!

I am running r342847 - note, though, that this is also the version I am running
on the two test systems which do work.

> I am currently (yesterday and today) bisecting revisions to find the 
> commit which broke this, because it worked in 12-BETA2 but doesn't work 
> on latest 12-STABLE.

Well done, that takes a lot of effort to do. Thank you for doing this.

> MFC r340394: ipfw.8: Fix part of the SYNOPSIS documenting
> LIST OF RULES AND PREPROCESSING that is still referred
> as last section of the SYNOPSIS later but was erroneously situated
> in the section IN-KERNEL NAT.

Docs only, so it can't be this one I think.

> MFC r341638:
> Let kern.trap_enotcap be set as a tunable.

Also can't be this one from eyeballing the code. It simply makes it writeable.

> MFC r340405:
> Add accounting to per-domain UMA full bucket caches.

This is not touching networking, so seems unlikely, though it is an actual
significant code change.

> r342051 | kp | 2018-12-13 20:00:11 +0000 (Thu, 13 Dec 2018) | 20 lines
>
> pfsync: Performance improvement

Ahh, now this is where things look likely, as the difference
between my test machines and my live machines is that the live machines
are using pf with pfsync enabled.

> Of these I thought r342051 sounded most likely, so I am currently 
> building r342050.

That's my feeling too going through the above. Have you also tried simply
disabling pfsync to see if CARP returns to normal? I could live without
pfsync to be honest, if that's what it takes to make this work.

> I will write again in a few hours when I have isolated the commit.

Thank you again for putting in the effort to bisect this - if you can isolate
it then we can back the change out and try it again and see if that helps.

-pete.


Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Steven Hartland
I can't see how any of those would impact carp unless pf is now 
incorrectly blocking carp packets, which seems unlikely from that commit.


Questions:

 * Are you running a firewall?
 * What does sysctl net.inet.carp report?
 * What exactly does ifconfig report about your carp on both hosts?
 * Have you tried enabling more detailed carp logging using sysctl
   net.inet.carp.log?

    Regards
    Steve


On 16/01/2019 14:31, Thomas Steen Rasmussen wrote:

On 1/16/19 3:14 PM, Pete French wrote:

I just upgraded my pair of firewalls from 11 to 12, and am now in the
situation where CARP no longer works between them to fail over the
virtual addresses. Both machines come up thinking that they
are the master. If I manually set the advskew on the interfaces to
a high number on what should be passive then it briefly goes to backup
mode, but then goes back to master with the message:

BACKUP -> MASTER (preempting a slower master)

This is kind of a big problem!


Indeed. I am seeing the same thing. Which revision of 12 are you running?

I am currently (yesterday and today) bisecting revisions to find the 
commit which broke this, because it worked in 12-BETA2 but doesn't 
work on latest 12-STABLE.


I have narrowed it down to somewhere between 12-STABLE-342037 which 
works, and 12-STABLE-342055 which does not.


Only 4 commits touch 12-STABLE branch in that range:


r342038 | eugen | 2018-12-13 10:52:40 +0000 (Thu, 13 Dec 2018) | 5 lines

MFC r340394: ipfw.8: Fix part of the SYNOPSIS documenting
LIST OF RULES AND PREPROCESSING that is still referred
as last section of the SYNOPSIS later but was erroneously situated
in the section IN-KERNEL NAT.


r342047 | markj | 2018-12-13 15:51:07 +0000 (Thu, 13 Dec 2018) | 3 lines

MFC r341638:
Let kern.trap_enotcap be set as a tunable.


r342048 | markj | 2018-12-13 16:07:35 +0000 (Thu, 13 Dec 2018) | 3 lines

MFC r340405:
Add accounting to per-domain UMA full bucket caches.


r342051 | kp | 2018-12-13 20:00:11 +0000 (Thu, 13 Dec 2018) | 20 lines

pfsync: Performance improvement

pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.

Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.

The number of buckets is tunable, but defaults to 2 x number of cpus.
Benchmarking has shown improvement, depending on hardware and setup,
from ~30% to ~100%.

Sponsored by:   Orange Business Services



Of these I thought r342051 sounded most likely, so I am currently 
building r342050.


I will write again in a few hours when I have isolated the commit.

Best regards,

Thomas Steen Rasmussen




Re: CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Thomas Steen Rasmussen

On 1/16/19 3:14 PM, Pete French wrote:

I just upgraded my pair of firewalls from 11 to 12, and am now in the
situation where CARP no longer works between them to fail over the
virtual addresses. Both machines come up thinking that they
are the master. If I manually set the advskew on the interfaces to
a high number on what should be passive then it briefly goes to backup
mode, but then goes back to master with the message:

BACKUP -> MASTER (preempting a slower master)

This is kind of a big problem!


Indeed. I am seeing the same thing. Which revision of 12 are you running?

I am currently (yesterday and today) bisecting revisions to find the 
commit which broke this, because it worked in 12-BETA2 but doesn't work 
on latest 12-STABLE.


I have narrowed it down to somewhere between 12-STABLE-342037 which 
works, and 12-STABLE-342055 which does not.


Only 4 commits touch 12-STABLE branch in that range:


r342038 | eugen | 2018-12-13 10:52:40 +0000 (Thu, 13 Dec 2018) | 5 lines

MFC r340394: ipfw.8: Fix part of the SYNOPSIS documenting
LIST OF RULES AND PREPROCESSING that is still referred
as last section of the SYNOPSIS later but was erroneously situated
in the section IN-KERNEL NAT.


r342047 | markj | 2018-12-13 15:51:07 +0000 (Thu, 13 Dec 2018) | 3 lines

MFC r341638:
Let kern.trap_enotcap be set as a tunable.


r342048 | markj | 2018-12-13 16:07:35 +0000 (Thu, 13 Dec 2018) | 3 lines

MFC r340405:
Add accounting to per-domain UMA full bucket caches.


r342051 | kp | 2018-12-13 20:00:11 +0000 (Thu, 13 Dec 2018) | 20 lines

pfsync: Performance improvement

pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.

Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.

The number of buckets is tunable, but defaults to 2 x number of cpus.
Benchmarking has shown improvement, depending on hardware and setup,
from ~30% to ~100%.

Sponsored by:   Orange Business Services



Of these I thought r342051 sounded most likely, so I am currently 
building r342050.


I will write again in a few hours when I have isolated the commit.
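
As an aside on the bisection itself: with only four candidate commits on the branch, a strict bisect would test the middle candidate first rather than going by which commit looks most suspicious, as was done here with r342050. A minimal sketch of that choice follows; pick_middle is a hypothetical helper, not an svn or FreeBSD command.

```shell
#!/bin/sh
# pick_middle REV...
# Given candidate revisions in ascending order, print the one to build
# and test next so the remaining range is roughly halved.
# Hypothetical helper for illustration only.
pick_middle() {
    # Drop the first (count - 1) / 2 arguments; the next one is the middle.
    shift $(( ($# - 1) / 2 ))
    echo "$1"
}

# Usage with the four commits listed above:
#   pick_middle 342038 342047 342048 342051
# If the printed revision works, the culprit is a later commit;
# if it fails, the culprit is that revision or an earlier one.
```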

Best regards,

Thomas Steen Rasmussen




CARP stopped working after upgrade from 11 to 12

2019-01-16 Thread Pete French
I just upgraded my pair of firewalls from 11 to 12, and am now in the
situation where CARP no longer works between them to fail over the
virtual addresses. Both machines come up thinking that they
are the master. If I manually set the advskew on the interfaces to
a high number on what should be passive then it briefly goes to backup
mode, but then goes back to master with the message:

BACKUP -> MASTER (preempting a slower master)

This is kind of a big problem! It's also unexpected as I tested CARP on 12
in my development environment and it works here - though here we only have
one address instead of several. But this has worked fine for a very long
time until now.

The setup looks like this:

ifconfig_em0="inet 10.32.10.1/16"
ifconfig_em0_ipv6="inet6 2a02:1658:1:2:e550::1/64"
ifconfig_em0_alias0="inet 10.32.10.6/16 vhid 10 advskew 10 pass redacted"
ifconfig_em0_alias1="inet6 2a02:1658:1:2:e550::6/64 vhid 30 advskew 10 pass redacted"

ifconfig_em1="inet 178.250.73.196/26"
ifconfig_em1_ipv6="inet6 2a02:1658:1:1::1:2/64"
ifconfig_em1_alias0="inet 178.250.73.198/26 vhid 20 advskew 10 pass redacted"
ifconfig_em1_alias1="inet6 2a02:1658:1:1::1:1/64 vhid 40 advskew 10 pass redacted"
ifconfig_em1_alias2="inet 178.250.73.199/26 vhid 20 advskew 10 pass redacted"
ifconfig_em1_alias3="inet 178.250.73.200/26 vhid 20 advskew 10 pass redacted"
ifconfig_em1_alias4="inet 178.250.73.221/26 vhid 20 advskew 10 pass redacted"

...and on the passive side almost identical except for the real IPs and the
advskew which is set to 128.

I have PF enabled with pfsync as well, and I have set net.inet.carp.preempt=1
in sysctl.conf. PF is configured to allow protocol 'carp' on both Ethernet
interfaces and 'pfsync' on the internal one.

I did wonder if having the same vhid for a number of the addresses might be
the issue so I then changed the config to have them all on separate vhid
numbers, but the problem persists.

This is now a bit of a major problem for me, as I am running on a single
firewall with no failover (which I don't like) and don't really know what
the path forward is.

As ever, all advice is welcome!

-pete.