Re: Firewall partially failing with high traffic (Updated)

2006-11-15 Thread Chris Cameron
Just building off my last message. Answering Ryan's questions first:

- Do you have dedicated addresses on the carp parent interfaces?

For sure.

- Are all the carp devices on the master firewall MASTER; what about the
  backup?

Before and after the network dies, primary firewall is all MASTER,
secondary stays as BACKUP.

- Can you reach the 'disappearing' network from the backup firewall?

Yes.

- Is preemption enabled? (sysctl net.inet.carp.preempt=1)

Yes.

- What is the output of 'netstat -sp carp' on both the master and backup
  firewalls?

Have it below.

- What about the output of 'netstat -i'? Are there output errors on the
  offending interface?

Exact output below, but no errors in or out, before or after.

- Have you tried running with carp debugging turned on? (sysctl
  net.inet.carp.log=1)

Did this on both firewalls; didn't see any output either way.
Restarted with it set in sysctl.conf just to be sure, but still didn't
see anything.




What else I know:

- set debug loud, lots of output, nothing looks different while the
problem is present.

- From the dead network, if I ping the firewall, tcpdump shows the
firewall making an arp request for the originating machine.
18:17:50.015307 arp who-has 172.168.120.50 tell 172.168.120.2

172.168.120.50 is the machine on the dead network, which was trying to
ping the firewall. This leads me to believe the firewall saw
-something-. Lots of traffic is going to that network, but none comes
back from it.

- I can ping the dead interface locally.

- Bringing interface down and up doesn't help

- I can also hang that interface from the firewall itself. Before, I was
doing it from my desktop, through the firewall.
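
The ARP observation can be checked mechanically against a saved tcpdump text capture. A who-has from 172.168.120.2 for 172.168.120.50 suggests the ping did reach the firewall, which then needs the sender's MAC address to answer; if no reply to that request ever shows up, the return path is the part that is broken. The following is only a rough sketch: the "arp reply ... is-at" line format is assumed from classic tcpdump output, and the second capture line is made up for illustration.

```python
import re

# Request format as seen in the capture quoted in this message.
WHO_HAS = re.compile(r"arp who-has (\S+) tell (\S+)")
# Reply format is an assumption (classic tcpdump text output).
REPLY = re.compile(r"arp reply (\S+) is-at (\S+)")


def unanswered_arp(lines):
    """Return target IPs that were asked for but never answered in the capture."""
    asked = []
    answered = set()
    for line in lines:
        m = WHO_HAS.search(line)
        if m:
            if m.group(1) not in asked:
                asked.append(m.group(1))
            continue
        m = REPLY.search(line)
        if m:
            answered.add(m.group(1))
    return [ip for ip in asked if ip not in answered]


capture = [
    "18:17:50.015307 arp who-has 172.168.120.50 tell 172.168.120.2",
    # Hypothetical second request, one second later, added for illustration.
    "18:17:51.015311 arp who-has 172.168.120.50 tell 172.168.120.2",
]
print(unanswered_arp(capture))  # ['172.168.120.50']
```

An empty list would mean every request in the capture got an answer, which would point away from the ARP path.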


Ifconfig explanation:

gem0 - external
gem1 - 120.x - network that disappears
hme0 - 0.x - pfsync traffic
hme1 - 121.x - Network my terminal is on
hme2 - 119.x

My ifconfig -A output from the master firewall:

$ ifconfig -A
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 33192
        groups: lo
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0xa
gem0: flags=8b63<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
        lladdr 00:03:ba:f2:bc:1c
        groups: egress
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet 216.2.22.123 netmask 0xffffffe0 broadcast 216.82.41.127
        inet6 fe80::203:baff:fef2:bc1c%gem0 prefixlen 64 scopeid 0x1
gem1: flags=8b63<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
        lladdr 00:03:ba:f2:bc:1d
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet 172.168.120.2 netmask 0xffffff00 broadcast 172.168.120.255
        inet6 fe80::203:baff:fef2:bc1d%gem1 prefixlen 64 scopeid 0x2
hme0: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        lladdr 08:00:20:ee:66:60
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
        inet6 fe80::a00:20ff:feee:6660%hme0 prefixlen 64 scopeid 0x3
hme1: flags=8b63<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
        lladdr 08:00:20:ee:66:61
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet 172.168.121.2 netmask 0xffffff00 broadcast 172.168.121.255
        inet6 fe80::a00:20ff:feee:6661%hme1 prefixlen 64 scopeid 0x4
hme2: flags=8b63<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
        lladdr 08:00:20:ee:66:62
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet 172.168.119.2 netmask 0xffffff00 broadcast 172.168.119.255
        inet6 fe80::a00:20ff:feee:6662%hme2 prefixlen 64 scopeid 0x5
hme3: flags=8822<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST> mtu 1500
        lladdr 08:00:20:ee:66:63
        media: Ethernet autoselect
pflog0: flags=141<UP,RUNNING,PROMISC> mtu 33192
pfsync0: flags=41<UP,RUNNING> mtu 1348
        pfsync: syncdev: hme0 maxupd: 128
enc0: flags=0<> mtu 1536
tun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500
        groups: tun
        inet 172.168.123.1 --> 172.168.123.2 netmask 0xffffffff
carp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        carp: MASTER carpdev gem0 vhid 1 advbase 1 advskew 0
        groups: carp
        inet 216.82.41.116 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.97 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.98 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.117 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.118 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.119 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.120 netmask 0xffffffe0 broadcast 216.82.41.127
        inet 216.82.41.125 netmask 0xffffffe0 broadcast 216.82.41.127
inet 
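
For anyone cross-checking the addresses in this listing: an address plus an ifconfig-style hex netmask determines the broadcast address, so the two can be verified against each other. The sketch below assumes the masks are 0xffffff00 for gem1 and 0xffffffe0 for carp0, which are the values consistent with the broadcast addresses shown; network_of is a hypothetical helper name.

```python
import ipaddress


def network_of(addr, hex_mask):
    """Build the IPv4 network from an address and an ifconfig-style hex netmask."""
    mask = ipaddress.IPv4Address(int(hex_mask, 16))
    return ipaddress.ip_network(f"{addr}/{mask}", strict=False)


# Masks are assumed from the broadcast addresses in the listing.
gem1 = network_of("172.168.120.2", "0xffffff00")
print(gem1, gem1.broadcast_address)    # 172.168.120.0/24 172.168.120.255

carp0 = network_of("216.82.41.116", "0xffffffe0")
print(carp0, carp0.broadcast_address)  # 216.82.41.96/27 216.82.41.127

# Is the gem0 address, as posted, inside the carp0 subnet?
print(ipaddress.ip_address("216.2.22.123") in carp0)  # False
```

The last check relates to the question about dedicated addresses on the carp parent interfaces. As posted, the gem0 address falls outside the carp0 subnet even though its broadcast address matches, which may simply be a slip made while sanitizing the output.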

Firewall partially failing with high traffic

2006-11-14 Thread Chris Cameron
I have a 3.8 PF/CARP setup that I can reproducibly screw up simply by
cat'ing lots of text over a telnet session.

It has several subnets, and several NICs, but only 1 subnet becomes
unavailable. Everything else continues to work. There are no errors in
the messages or daemon logs, even with PF debug set to misc. Counters
all look normal, as do the state table and netstat -m output. The only
reason I believe it's the firewall is that restarting it brings the
network back up.

I can't (easily) give direct output from things like ifconfig or pf.conf
as they're both huge and contain information I've been told we don't
want to send out. Hopefully this doesn't prevent anyone from helping me
out.


gem0 - external
gem1 - 120.x
hme0 - 0.x
hme1 - 121.x
hme2 - 119.x


Coming in on hme1 routed through gem1, I can cause everything off gem1
to stop working. The interface shows as up, but nothing works. All other
interfaces work fine. PF continues to work as NAT and external
firewalling still operates.

No errors anywhere, even with debugging turned on in PF. netstat -m
looks the same before and after.


I'm hoping someone can give me a better way to debug this, considering I
can reproduce it. I don't believe it's PF as I can disable and re-enable
it with no effect.

I've disabled ohci using config -e as those were the only errors I was
seeing. Specifically:
ohci0: 1 scheduling overruns

However they didn't happen anywhere near this problem.

dmesg (out of messages):
syncing disks... done
o
arpresolve
console is /[EMAIL PROTECTED],0/[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL 
PROTECTED],3f8
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2005 OpenBSD. All rights reserved.
http://www.OpenBSD.org
Copyright (c) 1995-2005 OpenBSD. All rights reserved.
http://www.OpenBSD.org
OpenBSD 3.8 (CARP) #0: Fri Feb 24 15:29:15 MST 2006
[EMAIL PROTECTED]:/usr/src/sys/arch/sparc64/compile/CARP
total memory = 1073741824
avail memory = 969023488
using 6553 buffers containing 53682176 bytes of memory
bootpath: /[EMAIL PROTECTED],0/[EMAIL PROTECTED],0/[EMAIL PROTECTED],0/[EMAIL 
PROTECTED],0
mainbus0 (root): Sun Fire V120 (UltraSPARC-IIe 648MHz)
cpu0 at mainbus0: SUNW,UltraSPARC-IIe @ 648 MHz, version 0 FPU
cpu0: physical 32K instruction (32 b/l), 16K data (32 b/l), 2048K
external (64 b/l)
psycho0 at mainbus0
SUNW,sabre: impl 0, version 0: ign 7c0 bus range 0 to 3; PCI bus 0
DVMA map: c000 to e000
IOTDB: 4d0a000 to 4d8a000
pci0 at psycho0
ppb0 at pci0 dev 1 function 1 Sun Simba PCI-PCI rev 0x13
pci1 at ppb0 bus 1
ebus0 at pci1 dev 12 function 0 Sun PCIO Ebus2 (US III) rev 0x01
flashprom at ebus0 addr 0-f not configured
clock1 at ebus0 addr 0-1fff: mk48t59: hostid 83f2bc1c
ebus_attach: idprom: incomplete
SUNW,lomh at ebus0 addr 20-23 ipl 42 not configured
gem0 at pci1 dev 12 function 1 Sun ERI Ether rev 0x01: ivec 3006,
address 00:03:ba:f2:bc:1c
bmtphy0 at gem0 phy 1: BCM5221 100baseTX PHY, rev. 4
ohci0 at pci1 dev 12 function 3 Sun USB rev 0x01: ivec 24, version
1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: Sun OHCI root hub, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
Acer Labs M7101 Power rev 0x00 at pci1 dev 3 function 0 not configured
Acer Labs M7101 Power rev 0x00 at pci1 dev 3 function 0 not configured
ebus1 at pci1 dev 7 function 0 Acer Labs M1533 ISA rev 0x00
power at ebus1 addr 800-82f ipl 37 not configured
com0 at ebus1 addr 3f8-3ff ipl 43: ns16550a, 16 byte fifo
com0: console
com1 at ebus1 addr 2e8-2ef ipl 43: ns16550a, 16 byte fifo
pciide0 at pci1 dev 13 function 0 Acer Labs M5229 UDMA IDE rev 0xc3:
DMA, channel 0 configured to native-PCI, channel 1 configured to
native-PCI
pciide0: using ivec 180c for native-PCI interrupt
pciide0: channel 0 disabled (no drives)
pciide0: channel 1 disabled (no drives)
gem1 at pci1 dev 5 function 1 Sun ERI Ether rev 0x01: ivec 301c,
address 00:03:ba:f2:bc:1d
bmtphy1 at gem1 phy 1: BCM5221 100baseTX PHY, rev. 4
ohci1 at pci1 dev 5 function 3 Sun USB rev 0x01: ivec 26, version 1.0,
legacy support
usb1 at ohci1: USB revision 1.0
uhub1 at usb1
uhub1: Sun OHCI root hub, rev 1.00/1.00, addr 1
uhub1: 4 ports with 4 removable, self powered
ppb1 at pci0 dev 1 function 0 Sun Simba PCI-PCI rev 0x13
pci2 at ppb1 bus 2
siop0 at pci2 dev 8 function 0 Symbios Logic 53c896 rev 0x07: ivec
1820, using 8K of on-board RAM
scsibus0 at siop0: 16 targets
sd0 at scsibus0 targ 0 lun 0: FUJITSU, MAT3073N SUN72G, 0602 SCSI4
0/direct fixed
sd0: 70007MB, 14100 cyl, 24 head, 423 sec, 512 bytes/sec, 143374738 sec
total
sd1 at scsibus0 targ 1 lun 0: FUJITSU, MAT3073N SUN72G, 0602 SCSI4
0/direct fixed
sd1: 70007MB, 14100 cyl, 24 head, 423 sec, 512 bytes/sec, 143374738 sec
total
siop1 at pci2 dev 8 function 1 Symbios Logic 53c896 rev 0x07: ivec
1820, using 8K of on-board RAM
scsibus1 at siop1: 16 targets
ppb2 at pci2 dev 5 function 0 Intel S21154AE/BE PCI-PCI rev 0x00
pci3 at ppb2 bus 

Re: Firewall partially failing with high traffic

2006-11-14 Thread Tobias Weingartner
In article [EMAIL PROTECTED], Chris Cameron wrote:
 
  I have a 3.8 PF/CARP setup that I can reproducibly screw up simply by
  cat'ing lots of text over a telnet session.

Chances are that you're hitting some bug in 3.8 that has likely been
fixed in 3.9 or 4.0.  Or the rule you're using to pass the traffic is
wrong.  Are you using keep state?  Are you using 'flags S/SA' on that
rule?
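
For readers unfamiliar with the notation: in pf.conf, 'flags a/b' restricts a rule to TCP packets that have exactly the flags a set out of the set b. With 'flags S/SA', only packets with SYN set and ACK clear match, so a keep state rule creates state from the initial SYN only, not from a packet in the middle of a connection. The following is an illustrative sketch of that matching logic, not pf's actual code; flags_match is a hypothetical name.

```python
# Standard TCP header flag bits.
TCP_FLAGS = {"F": 0x01, "S": 0x02, "R": 0x04, "P": 0x08, "A": 0x10, "U": 0x20}


def bits(letters):
    """Convert flag letters such as 'SA' into a bitmask."""
    value = 0
    for ch in letters:
        value |= TCP_FLAGS[ch]
    return value


def flags_match(packet_flags, spec="S/SA"):
    """True if, among the flags in the mask, exactly the wanted ones are set."""
    want, mask = spec.split("/")
    return (bits(packet_flags) & bits(mask)) == bits(want)


print(flags_match("S"))   # True:  initial SYN
print(flags_match("SA"))  # False: SYN+ACK from the other side
print(flags_match("A"))   # False: packet from an established connection
print(flags_match("SP"))  # True:  PSH is outside the mask, so it is ignored
```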

With the amount of information you've given, it is hard to even theorize
what could be wrong.  People would need more information.

--Toby.



Re: Firewall partially failing with high traffic

2006-11-14 Thread Will Maier
On Tue, Nov 14, 2006 at 09:28:47AM -0700, Chris Cameron wrote:
 Upgrading isn't an option. I mean it is, but as soon as I say
 "Don't know, let's just upgrade," that's a major hit to something
 that was tough to get in in the first place. This will be a
 Firewall-1 shop again quite quickly and any future thing I
 recommend isn't going to have much weight.

You need to upgrade anyway to properly keep up with security
updates. You're now running a system that is no longer supported;
upgrading to a supported system is a Good Thing regardless of the
issue you're currently dealing with.

As a bonus, things generally get better and 'more fixed' with each
new version and, as Tobias says, there's a good chance the problem
you're running up against is resolved.

-- 

o--{ Will Maier }--o
| web:...http://www.lfod.us/ | [EMAIL PROTECTED] |
*--[ BSD Unix: Live Free or Die ]--*



Re: Firewall partially failing with high traffic

2006-11-14 Thread Carlos A. Carnero Delgado

Hi,

On 11/14/06, Chris Cameron [EMAIL PROTECTED] wrote:

I have a 3.8 PF/CARP setup that I can reproducibly screw up simply by
cat'ing lots of text over a telnet session.


can you post `pfctl -s info` and `pfctl -s memory`?

Best regards,
Carlos.
--
<nick> grah windows just crashed again, unstable crap.
<yukito> Windows isn't unstable, it's just spontaneous.



Re: Firewall partially failing with high traffic

2006-11-14 Thread Chris Cameron
This is while it's working. I'll repost this tonight when I'm able to
hang it.

Status: Enabled for 0 days 16:47:54           Debug: Urgent

Interface Stats for gem0              IPv4             IPv6
  Bytes In                      1560279475              272
  Bytes Out                     1464940667              352
  Packets In
    Passed 23485100
    Blocked 883254
  Packets Out
    Passed 23883682
    Blocked 213

State Table                          Total             Rate
  current entries                      784
  searches                        18122501          299.7/s
  inserts                           106940            1.8/s
  removals                          106156            1.8/s
Counters
  match                             304496            5.0/s
  bad-offset                             0            0.0/s
  fragment                               2            0.0/s
  short                                  0            0.0/s
  normalize                              0            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                           129            0.0/s
  ip-option                              0            0.0/s
  proto-cksum                          301            0.0/s
  state-mismatch                      1519            0.0/s
  state-insert                         903            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s
$ sudo pfctl -s memory
states        hard limit    10000
src-nodes     hard limit    10000
frags         hard limit     5000
tables        hard limit     1000
table-entries hard limit   100000
$
$
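
A note on reading the pfctl -s info numbers: each rate appears to be the total divided by the time pf has been enabled, and the state table totals should satisfy inserts minus removals equals current entries. The sketch below checks both, reading the totals as 18122501 searches, 106940 inserts and 106156 removals; the function names are made up for the example.

```python
def uptime_seconds(days, hms):
    """Convert the 'Enabled for D days HH:MM:SS' status into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def rate(total, seconds):
    """Per-second rate rounded to one decimal, as pfctl displays it."""
    return round(total / seconds, 1)


up = uptime_seconds(0, "16:47:54")
print(up)                   # 60474
print(rate(18122501, up))   # 299.7  (searches)
print(rate(106940, up))     # 1.8    (inserts)
print(106940 - 106156)      # 784    (matches 'current entries')
```

Comparing the same figures taken while the network is hung would show whether any counter, state-mismatch or congestion for example, grows faster than its long-run rate.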


Chris

On Tue, 2006-11-14 at 13:05 -0500, Carlos A. Carnero Delgado wrote:
 Hi,
 
 On 11/14/06, Chris Cameron [EMAIL PROTECTED] wrote:
  I have a 3.8 PF/CARP setup that I can reproducibly screw up simply by
  cat'ing lots of text over a telnet session.
 
 can you post `pfctl -s info` and `pfctl -s memory`?
 
 Best regards,
 Carlos.



Re: Firewall partially failing with high traffic

2006-11-14 Thread Joachim Schipper
On Tue, Nov 14, 2006 at 06:03:51AM -0700, Chris Cameron wrote:
 I have a 3.8 PF/CARP setup that I can reproducibly screw up simply by
 cat'ing lots of text over a telnet session.
 
 It has several subnets, and several NICs, but only 1 subnet becomes
 unavailable. Everything else continues to work. There are no errors in
 messages, daemon, with PF debug set to misc. Counters all look normal,
 same with state table and netstat -m output. The only reason I believe
 it's the firewall is restarting it will bring the network back up.

 gem0 - external
 gem1 - 120.x
 hme0 - 0.x
 hme1 - 121.x
 hme2 - 119.x
 
 
 Coming in on hme1 routed through gem1, I can cause everything off gem1
 to stop working. The interface shows as up, but nothing works. All other
 interfaces work fine. PF continues to work as NAT and external
 firewalling still operates.
 
 No errors anywhere, even with debugging turned on in PF. netstat -m
 looks the same before and after.

 I'm hoping someone can give me a better way to debug this, considering I
 can reproduce it. I don't believe it's PF as I can disable and re-enable
 it with no effect.

What happens when you send the same data from the firewall?

 
 I've disabled ohci using config -e as those were the only errors I was
 seeing. Specifically:
 ohci0: 1 scheduling overruns
 
 However they didn't happen anywhere near this problem.

That does not look like a likely culprit, no.

Are you sure it's not just bad hardware?

Joachim



Re: Firewall partially failing with high traffic

2006-11-14 Thread Ryan McBride
At 2006-11-14 13:03:51, Chris Cameron wrote:
 I can't (easily) give direct output from things like ifconfig or pf.conf
 as they're both huge and contain information I've been told we don't
 want to send out. Hopefully this doesn't prevent anyone from helping me
 out.

If it's a problem with carp, it's going to be really difficult to
resolve without seeing the ifconfig output, but here are some questions
that you might want to consider...

- Do you have dedicated addresses on the carp parent interfaces?
- Are all the carp devices on the master firewall MASTER; what about the
  backup?
- Can you reach the 'disappearing' network from the backup firewall?
- Is preemption enabled? (sysctl net.inet.carp.preempt=1)
- What is the output of 'netstat -sp carp' on both the master and backup
  firewalls?
- What about the output of 'netstat -i'? Are there output errors on the
  offending interface?
- Have you tried running with carp debugging turned on? (sysctl
  net.inet.carp.log=1)
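
For the 'netstat -i' question, one way to compare the interface counters before and after the failure is to save the output twice and diff the link-level rows. This is a minimal sketch that assumes the usual column layout (Name, Mtu, Network, Address, Ipkts, Ierrs, Opkts, Oerrs, Colls); the sample numbers are invented for illustration and are not taken from the firewalls in this thread.

```python
FIELDS = ("ipkts", "ierrs", "opkts", "oerrs", "colls")


def link_counters(text):
    """Map interface name -> counters, using only the '<Link>' rows."""
    counters = {}
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 8 or cols[2] != "<Link>":
            continue
        name = cols[0].rstrip("*")  # a trailing '*' marks an interface that is down
        counters[name] = dict(zip(FIELDS, (int(c) for c in cols[-5:])))
    return counters


def deltas(before, after):
    """Per-interface counter changes between two saved netstat -i outputs."""
    b, a = link_counters(before), link_counters(after)
    return {n: {k: a[n][k] - b[n][k] for k in FIELDS} for n in a if n in b}


# Hypothetical samples: gem1 keeps sending but receives nothing, with no errors.
BEFORE = """\
Name    Mtu   Network     Address              Ipkts Ierrs    Opkts Oerrs Colls
gem1    1500  <Link>      00:03:ba:f2:bc:1d     1000     0      900     0     0
hme1    1500  <Link>      08:00:20:ee:66:61     5000     0     4800     0     0
"""
AFTER = """\
Name    Mtu   Network     Address              Ipkts Ierrs    Opkts Oerrs Colls
gem1    1500  <Link>      00:03:ba:f2:bc:1d     1000     0      950     0     0
hme1    1500  <Link>      08:00:20:ee:66:61     5600     0     5300     0     0
"""
for name, change in deltas(BEFORE, AFTER).items():
    print(name, change)
```

An interface whose Opkts keeps climbing while Ipkts stays flat, with no error counts, would fit the description of traffic going out to a network and nothing coming back.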