Re: Constant rate mbuf leak

2011-02-13 Thread Alan Wilkie
I have located the mbuf leak, but I suspect not the root cause.  New code 
was added in 4.8 concerning routing sockets that allocates an mbuf; if a 
subsequent operation fails, it schedules a timeout to retry but doesn't 
free the mbuf.  The rate of the timer is - no surprise - 5Hz.  The real 
question is why the routing socket operation fails in the first place. 
It must be something hardware-specific, or there would be lots more 
people suffering the same problem.
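
The failure pattern described above can be illustrated with a toy model. This is only a sketch in Python, not the kernel code: MbufPool, routing_socket_send, tick_leaky and tick_fixed are hypothetical names, and the operation is simply assumed to fail every time, as it apparently does on the affected machine.

```python
class MbufPool:
    """Toy stand-in for the kernel mbuf pool; it only counts buffers."""

    def __init__(self):
        self.in_use = 0

    def alloc(self):
        self.in_use += 1
        return object()

    def free(self, mbuf):
        self.in_use -= 1


def routing_socket_send(mbuf):
    """Hypothetical operation; assumed to fail on every attempt."""
    return False


def tick_leaky(pool):
    """One timer tick: allocate, fail, leave the retry to the next tick."""
    mbuf = pool.alloc()
    if not routing_socket_send(mbuf):
        return  # retry is rescheduled here, but mbuf is never freed
    pool.free(mbuf)


def tick_fixed(pool):
    """Same tick, but the buffer is released on every path."""
    mbuf = pool.alloc()
    try:
        routing_socket_send(mbuf)
    finally:
        pool.free(mbuf)


def simulate(tick, seconds, hz=5):
    """Run `tick` at `hz` ticks per second and return buffers still held."""
    pool = MbufPool()
    for _ in range(seconds * hz):
        tick(pool)
    return pool.in_use
```

Under these assumptions, simulate(tick_leaky, 26 * 60) leaves 7800 buffers outstanding, which is close to the 7844 data mbufs reported after 26 minutes of uptime, while simulate(tick_fixed, 26 * 60) leaves none.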


I'll put in a sendbug with all the details.

On 11/02/2011 3:31 PM, Alan Wilkie wrote:
> I have now upgraded my machine to "OpenBSD 4.9-beta (GENERIC) #654: 
> Wed Feb  9 14:50:38 MST 2011", and I am still seeing a constant rate 
> consumption of mbufs.  I have tried a number of things (shutting down 
> all non-essential user processes, turning off network interfaces, 
> etc), but none have made any difference, the system consumes 256 byte 
> mbufs at a constant rate of 5 mbufs per second:


...




Re: Constant rate mbuf leak

2011-02-13 Thread Lars Kotthoff
> ifconfig -A output

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 33200
priority: 0
groups: lo
inet 127.0.0.1 netmask 0xff00
inet6 ::1 prefixlen 128
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x6
rl0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1500
lladdr xxx
priority: 0
media: Ethernet autoselect (none)
status: no carrier
inet xxx netmask 0xff00 broadcast xxx
inet6 xxx prefixlen 64 scopeid 0x1
rl1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
lladdr xxx
priority: 0
media: Ethernet autoselect (100baseTX full-duplex)
status: active
inet6 xxx prefixlen 64 scopeid 0x2
rl2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
lladdr xxx
priority: 0
groups: egress
media: Ethernet autoselect (100baseTX full-duplex)
status: active
inet6 xxx prefixlen 64 scopeid 0x3
inet xxx netmask 0xf800 broadcast xxx
ral0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
lladdr xxx
priority: 4
groups: wlan
media: IEEE802.11 autoselect mode 11g hostap
status: active
ieee80211: nwid xxx chan xxx bssid xxx wpapsk xxx wpaprotos wpa1,wpa2 wpaakms psk wpaciphers ccmp wpagroupcipher ccmp
inet6 xxx prefixlen 64 scopeid 0x4
inet xxx netmask 0x broadcast xxx
enc0: flags=0<> mtu 1536
priority: 0
bridge0: flags=41<UP,RUNNING> mtu 1500
priority: 0
groups: bridge
pflog0: flags=141<UP,RUNNING,PROMISC> mtu 33200
priority: 0
groups: pflog

> altq? nfs?

Yes, both. Problem occurs without any of them as well though.

> are you using AES crypto?

Yes, encrypted partition backed by external USB disk.

> and describing what the system is doing might be helpful.

Not a whole lot. Mostly router and webserver. The load of the webserver does not
seem to make any difference though -- it's leaking mbufs at approximately the
same rate with virtually no load at all and at max load (i.e. the 2 Mbit/s
upstream that my ISP gives me). Both no load and max load have been sustained
for several weeks (I'm running a mirror now, hence the high load), so I can
definitely say that it doesn't make a difference.

Lars



Re: Constant rate mbuf leak

2011-02-13 Thread Stuart Henderson
ifconfig -A output (to show the interfaces including any tunnels/ppp/etc)
and describing what the system is doing might be helpful. any altq? nfs?
are you using AES crypto?

it would be good to get a good write-up into a PR so it's not lost
and so people who don't read misc will see it.


On 2011-02-11, Alan Wilkie wrote:
> I have now upgraded my machine to "OpenBSD 4.9-beta (GENERIC) #654: Wed 
> Feb  9 14:50:38 MST 2011", and I am still seeing a constant rate 
> consumption of mbufs.  I have tried a number of things (shutting down 
> all non-essential user processes, turning off network interfaces, etc), 
> but none have made any difference, the system consumes 256 byte mbufs at 
> a constant rate of 5 mbufs per second:
>
> # uptime && netstat -m
>   3:23PM  up 26 mins, 3 users, load averages: 0.24, 0.35, 0.34
> 8078 mbufs in use:
>  7844 mbufs allocated to data
>  117 mbufs allocated to packet headers
>  117 mbufs allocated to socket names and addresses
> 21/58/6144 mbuf 2048 byte clusters in use (current/peak/max)
> ...
>
> The system is very lightly loaded.  If I let it continue, the mbuf usage 
> increases to the point where the system becomes unusable.
>
> Can anybody point me in the right direction?  How can I figure out what 
> is allocating a data mbuf every 200ms?
>
> Thanks,
>
> Alan



Re: Constant rate mbuf leak

2011-02-11 Thread Abel Abraham Camarillo Ojeda
On Fri, Feb 11, 2011 at 3:44 PM, Lars Kotthoff wrote:
>> In the meantime knowing which board it is (or, even better, what network
>> drivers are in use) would help immensely.
>
> 3 like this
> rl0 at pci0 dev 18 function 0 "Realtek 8139" rev 0x10
>
> and one
> ral0 at pci0 dev 21 function 0 "Ralink RT2860" rev 0x00
> ral0: MAC/BBP RT2860 (rev 0x0101), RF RT2820 (MIMO 2T3R)
>
> Alan's network drivers seem to be completely different though.
>
> Lars
>
>

I have had a lot of problems with rl*. I didn't want to debug them, so I
just bought some re* cards (if you want something cheap).

I remember a comment by Jacob Meuser about how the rl driver (and cards)
suck and re* are more or less OK.



Re: Constant rate mbuf leak

2011-02-11 Thread Lars Kotthoff
> Are you all running bridged setups?

I am, but the problem also occurred without the bridge. I originally suspected
the wireless interface (which is bridged with one of the wired ones) and removed
the card and hence the bridge. Same problem.

Lars



Re: Constant rate mbuf leak

2011-02-11 Thread Christiano F. Haesbaert
Are you all running bridged setups?



Re: Constant rate mbuf leak

2011-02-11 Thread Lars Kotthoff
> In the meantime knowing which board it is (or, even better, what network
> drivers are in use) would help immensely.

3 like this
rl0 at pci0 dev 18 function 0 "Realtek 8139" rev 0x10

and one
ral0 at pci0 dev 21 function 0 "Ralink RT2860" rev 0x00
ral0: MAC/BBP RT2860 (rev 0x0101), RF RT2820 (MIMO 2T3R)

Alan's network drivers seem to be completely different though.

Lars



Re: Constant rate mbuf leak

2011-02-11 Thread Bret S. Lambert
Prime suspect here would be the network driver. dlg@ had a nice mbuf leak
detect-o-matic diff a while back. I'll have to see if I can find it.

In the meantime knowing which board it is (or, even better, what network
drivers are in use) would help immensely.

On Fri, Feb 11, 2011 at 06:20:50PM +, Lars Kotthoff wrote:
> Just to say that I've been having the same problem with a Soekris board since
> about 4.4. I haven't figured out what's going on, but strangely the problem is
> getting better with time (i.e. the rate at which mbufs are allocated decreases).
> I *think* that it was fine in 4.3 (though I never ran the machine for any length
> of time with that kernel), so you could try that if you want to investigate.
> 
> I haven't been able to establish a correlation between allocated mbufs and
> (network) load either.
> 
> The "solution" for me so far has been to keep a watchful eye and reboot the
> machine once too much memory is used, combined with a watchdog and monit to
> reboot the machine automatically if it becomes unresponsive.
> 
> Lars



Re: Constant rate mbuf leak

2011-02-11 Thread Chris
> "Lars" == Lars Kotthoff writes:

Lars> Just to say that I've been having the same problem with a
Lars> Soekris board since about 4.4. I haven't figured out what's
Lars> going on, but strangely the problem is getting better with
Lars> time (i.e. the rate at which mbufs are allocated decreases).
Lars> I *think* that it was fine in 4.3 (though I never ran the
Lars> machine for any length of time with that kernel), so you could
Lars> try that if you want to investigate.

Lars> I haven't been able to establish a correlation between
Lars> allocated mbufs and (network) load either.

Lars> The "solution" for me so far has been to keep a watchful eye
Lars> and reboot the machine once too much memory is used, combined
Lars> with a watchdog and monit to reboot the machine automatically
Lars> if it becomes unresponsive.


I've had a similar issue in the past (see PR kernel/6380).  First, a
small amount of background: I'm using an Alix 3d3 as a bridging
firewall.

ISP <--> vr2 <--> Bridge0 + PF <--> vr1 <--> MyHost

With this setup, whether PF was enabled or disabled, I would leak
2k-sized mbufs at a roughly linear rate, and the system would become
non-responsive once it could not allocate more mbufs.  Raising the
limit on mbufs would only postpone the hang; with the limit raised high
enough, the machine would hang when it ran out of memory.

I eventually found a way to mitigate this by filtering the MACs seen
through the bridge.  This isn't a fix for the real problem, just a
band-aid that seems to fit.  Basically I only allow frames to or from
the MAC of MyHost across vr1, with the following in
/etc/hostname.bridge0:

add vr2
add vr1
rule pass in on vr1 src 88:88:88:88:88:88 tag extbr
rule pass out on vr1 dst 88:88:88:88:88:88 tag extbr
rule block on vr1
up

This keeps my inside machine from having to see the ISP's usual
background packets (arp spam, etc).  With these filters in place the
firewall has been stable and non-leaking for > 100 days.
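
The effect of those three rules can be modelled in a few lines. This is a rough Python sketch, not how the bridge code is written. It assumes rules are checked in the order they were added and the first match decides, and that a frame matching no rule is passed; bridge_verdict and the frame dictionary layout are hypothetical.

```python
HOST = "88:88:88:88:88:88"      # placeholder MAC used in the post
BROADCAST = "ff:ff:ff:ff:ff:ff"

# (action, direction, interface, address field, MAC); None means "any".
RULES = [
    ("pass", "in", "vr1", "src", HOST),
    ("pass", "out", "vr1", "dst", HOST),
    ("block", None, "vr1", None, None),
]


def bridge_verdict(rules, frame):
    """Return "pass" or "block" for a frame, first matching rule wins."""
    for action, direction, iface, field, mac in rules:
        if iface != frame["iface"]:
            continue
        if direction is not None and direction != frame["dir"]:
            continue
        if field is not None and frame[field] != mac:
            continue
        return action
    return "pass"  # assumption: no matching rule means the frame is forwarded


# Broadcast traffic from the ISP side heading out of vr1 towards MyHost.
arp_spam = {"iface": "vr1", "dir": "out", "src": "00:11:22:33:44:55",
            "dst": BROADCAST}
# A frame sent by MyHost arriving on vr1.
from_host = {"iface": "vr1", "dir": "in", "src": HOST, "dst": BROADCAST}
```

With these definitions bridge_verdict(RULES, arp_spam) is "block" and bridge_verdict(RULES, from_host) is "pass", which matches the description of MyHost no longer seeing the ISP's background traffic.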

I don't understand the link between this filtering and the memory leaks
that are seen without it (I started to go through the code, but so far
RealLife(TM) has kept me from completely getting my head around it).

Anyway, I don't know if this will be at all applicable for what you are
seeing, but hopefully it's a nudge in the right direction.

-- 
Chris



Re: Constant rate mbuf leak

2011-02-11 Thread Lars Kotthoff
Just to say that I've been having the same problem with a Soekris board since
about 4.4. I haven't figured out what's going on, but strangely the problem is
getting better with time (i.e. the rate at which mbufs are allocated decreases).
I *think* that it was fine in 4.3 (though I never ran the machine for any length
of time with that kernel), so you could try that if you want to investigate.

I haven't been able to establish a correlation between allocated mbufs and
(network) load either.

The "solution" for me so far has been to keep a watchful eye and reboot the
machine once too much memory is used, combined with a watchdog and monit to
reboot the machine automatically if it becomes unresponsive.

Lars



Constant rate mbuf leak

2011-02-10 Thread Alan Wilkie
I have now upgraded my machine to "OpenBSD 4.9-beta (GENERIC) #654: Wed 
Feb  9 14:50:38 MST 2011", and I am still seeing a constant rate 
consumption of mbufs.  I have tried a number of things (shutting down 
all non-essential user processes, turning off network interfaces, etc.), 
but none have made any difference; the system consumes 256-byte mbufs at 
a constant rate of 5 mbufs per second:


# uptime && netstat -m
 3:23PM  up 26 mins, 3 users, load averages: 0.24, 0.35, 0.34
8078 mbufs in use:
7844 mbufs allocated to data
117 mbufs allocated to packet headers
117 mbufs allocated to socket names and addresses
21/58/6144 mbuf 2048 byte clusters in use (current/peak/max)
...

The system is very lightly loaded.  If I let it continue, the mbuf usage 
increases to the point where the system becomes unusable.


Can anybody point me in the right direction?  How can I figure out what 
is allocating a data mbuf every 200ms?
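
The "5 mbufs per second" and "every 200ms" figures follow directly from the netstat -m output above. The sketch below is a small Python helper that shows the arithmetic; parse_mbufs_in_use and leak_rate are hypothetical names, and it assumes the count was close to zero at boot. It only measures how fast the count grows and does not identify what is allocating.

```python
import re

SAMPLE = """\
8078 mbufs in use:
7844 mbufs allocated to data
117 mbufs allocated to packet headers
117 mbufs allocated to socket names and addresses
"""


def parse_mbufs_in_use(netstat_output):
    """Extract N from the 'N mbufs in use:' line of `netstat -m` output."""
    match = re.search(r"^\s*(\d+) mbufs in use", netstat_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'mbufs in use' line found")
    return int(match.group(1))


def leak_rate(first, second):
    """Growth in mbufs per second between two (seconds, count) samples."""
    (t0, n0), (t1, n1) = first, second
    return (n1 - n0) / (t1 - t0)


uptime_seconds = 26 * 60  # "up 26 mins"
rate = leak_rate((0, 0), (uptime_seconds, parse_mbufs_in_use(SAMPLE)))
period_ms = 1000 / rate
```

Here rate comes out at roughly 5.2 mbufs per second and period_ms at a little under 200. Taking two samples a few minutes apart on a running system, instead of assuming zero at boot, would give a cleaner number.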


Thanks,

Alan

dmesg output follows:
OpenBSD 4.9-beta (GENERIC) #654: Wed Feb  9 14:50:38 MST 2011
t...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) 500 MHz
cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX
real mem  = 536440832 (511MB)
avail mem = 517533696 (493MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 20/70/03, BIOS32 rev. 0 @ 0xfac40
pcibios0 at bios0: rev 2.0 @ 0xf/0x1
pcibios0: pcibios_get_intr_routing - function not supported
pcibios0: PCI IRQ Routing information unavailable.
pcibios0: PCI bus #0 is the last bus
bios0: ROM list: 0xc8000/0xa800
cpu0 at mainbus0: (uniprocessor)
amdmsr0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
io address conflict 0x6100/0x100
io address conflict 0x6200/0x200
pchb0 at pci0 dev 1 function 0 "AMD Geode LX" rev 0x31
glxsb0 at pci0 dev 1 function 2 "AMD Geode LX Crypto" rev 0x00: RNG AES
vr0 at pci0 dev 6 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 11, address 00:00:24:ca:b3:74
ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr1 at pci0 dev 7 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 5, address 00:00:24:ca:b3:75
ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr2 at pci0 dev 8 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 9, address 00:00:24:ca:b3:76
ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr3 at pci0 dev 9 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 12, address 00:00:24:ca:b3:77
ukphy3 at vr3 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
glxpcib0 at pci0 dev 20 function 0 "AMD CS5536 ISA" rev 0x03: rev 3, 32-bit 3579545Hz timer, watchdog, gpio
gpio0 at glxpcib0: 32 pins
pciide0 at pci0 dev 20 function 2 "AMD CS5536 IDE" rev 0x01: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 1: 
wd0: 1-sector PIO, LBA48, 238475MB, 488397168 sectors
wd0(pciide0:0:1): using PIO mode 4, Ultra-DMA mode 2
pciide0: channel 1 ignored (disabled)
ohci0 at pci0 dev 21 function 0 "AMD CS5536 USB" rev 0x02: irq 15, version 1.0, legacy support
ehci0 at pci0 dev 21 function 1 "AMD CS5536 USB" rev 0x02: irq 15
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 "AMD EHCI root hub" rev 2.00/1.00 addr 1
isa0 at glxpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
nsclpcsio0 at isa0 port 0x2e/2: NSC PC87366 rev 9: GPIO VLM TMS
gpio1 at nsclpcsio0: 29 pins
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
usb1 at ohci0: USB revision 1.0
uhub1 at usb1 "AMD OHCI root hub" rev 1.00/1.00 addr 1
biomask e5c5 netmask ffe5 ttymask 
mtrr: K6-family MTRR support (2 registers)
ulpt0 at uhub1 port 1 configuration 1 interface 0 "HewLett Packard HP LaserJet 1200" rev 1.10/1.00 addr 2
ulpt0: using bi-directional mode
vscsi0 at root
scsibus0 at vscsi0: 256 targets
softraid0 at root
root on wd0a swap on wd0b dump on wd0b