Re: Constant rate mbuf leak
I have located the mbuf leak, but I suspect not the root cause. New code
concerning routing sockets was added in 4.8. It allocates an mbuf, but if a
subsequent operation fails it schedules a timeout to retry and doesn't free
the mbuf. The timer rate is, no surprise, 5 Hz.

The real question is why the routing socket operation fails in the first
place; it must be something hardware specific or there would be lots more
people suffering the same problem.

I'll put in a sendbug with all the details.

On 11/02/2011 3:31 PM, Alan Wilkie wrote:

I have now upgraded my machine to "OpenBSD 4.9-beta (GENERIC) #654: Wed
Feb 9 14:50:38 MST 2011", and I am still seeing a constant rate
consumption of mbufs. I have tried a number of things (shutting down
all non-essential user processes, turning off network interfaces, etc),
but none have made any difference, the system consumes 256 byte mbufs at
a constant rate of 5 mbufs per second: ...
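The failure mode described in this message (allocate a buffer, fail, reschedule, never free) can be illustrated with a small userland simulation. This is only a sketch of the pattern, not the OpenBSD routing socket code: `BufferPool`, `notify_leaky`, `notify_fixed` and `run` are hypothetical names, and unlike the kernel the toy sender never takes ownership of the buffer.

```python
from dataclasses import dataclass


@dataclass
class BufferPool:
    """Toy stand-in for the mbuf pool; it only counts live buffers."""
    in_use: int = 0

    def get(self) -> object:
        self.in_use += 1
        return object()

    def free(self, buf: object) -> None:
        self.in_use -= 1


def notify_leaky(pool, send):
    """Buggy pattern: on failure, request a retry but forget the buffer."""
    buf = pool.get()
    if not send(buf):
        return True          # caller re-arms the timeout; buf is never freed
    pool.free(buf)
    return False


def notify_fixed(pool, send):
    """Same logic, but the buffer is released on both paths."""
    buf = pool.get()
    ok = send(buf)
    pool.free(buf)
    return not ok            # True means "retry later"


def run(notify, seconds, hz=5):
    """Fire the timeout hz times per second against a sender that always fails."""
    pool = BufferPool()

    def always_fails(buf):
        return False

    for _ in range(seconds * hz):
        notify(pool, always_fails)
    return pool.in_use
```

With a sender that always fails, `run(notify_leaky, 26 * 60)` returns 7800, the same order of magnitude as the 7844 data mbufs in the `netstat -m` sample from the original report, while `run(notify_fixed, 26 * 60)` returns 0.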
Re: Constant rate mbuf leak
> ifconfig -A output

lo0: flags=8049 mtu 33200
        priority: 0
        groups: lo
        inet 127.0.0.1 netmask 0xff00
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x6
rl0: flags=8802 mtu 1500
        lladdr xxx
        priority: 0
        media: Ethernet autoselect (none)
        status: no carrier
        inet xxx netmask 0xff00 broadcast xxx
        inet6 xxx prefixlen 64 scopeid 0x1
rl1: flags=8943 mtu 1500
        lladdr xxx
        priority: 0
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet6 xxx prefixlen 64 scopeid 0x2
rl2: flags=8943 mtu 1500
        lladdr xxx
        priority: 0
        groups: egress
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet6 xxx prefixlen 64 scopeid 0x3
        inet xxx netmask 0xf800 broadcast xxx
ral0: flags=8943 mtu 1500
        lladdr xxx
        priority: 4
        groups: wlan
        media: IEEE802.11 autoselect mode 11g hostap
        status: active
        ieee80211: nwid xxx chan xxx bssid xxx wpapsk xxx wpaprotos wpa1,wpa2 wpaakms psk wpaciphers ccmp wpagroupcipher ccmp
        inet6 xxx prefixlen 64 scopeid 0x4
        inet xxx netmask 0x broadcast xxx
enc0: flags=0<> mtu 1536
        priority: 0
bridge0: flags=41 mtu 1500
        priority: 0
        groups: bridge
pflog0: flags=141 mtu 33200
        priority: 0
        groups: pflog

> altq? nfs?

Yes, both. The problem occurs without either of them as well though.

> are you using AES crypto?

Yes, encrypted partition backed by external USB disk.

> and describing what the system is doing might be helpful.

Not a whole lot. Mostly router and webserver. The load of the webserver does
not seem to make any difference though -- it's leaking mbufs at approximately
the same rate with virtually no load at all and at max load (i.e. the 2MBit
upstream that my ISP gives me). Both no load and max load have been sustained
for several weeks (I'm running a mirror now, hence the high load), so I can
definitely say that it doesn't make a difference.

Lars
Re: Constant rate mbuf leak
ifconfig -A output (to show the interfaces including any tunnels/ppp/etc)
and describing what the system is doing might be helpful. any altq? nfs?
are you using AES crypto?

it would be good to get a good write-up into a PR so it's not lost
and so people who don't read misc will see it.

On 2011-02-11, Alan Wilkie wrote:
> I have now upgraded my machine to "OpenBSD 4.9-beta (GENERIC) #654: Wed
> Feb 9 14:50:38 MST 2011", and I am still seeing a constant rate
> consumption of mbufs. I have tried a number of things (shutting down
> all non-essential user processes, turning off network interfaces, etc),
> but none have made any difference, the system consumes 256 byte mbufs at
> a constant rate of 5 mbufs per second:
>
> # uptime && netstat -m
> 3:23PM up 26 mins, 3 users, load averages: 0.24, 0.35, 0.34
> 8078 mbufs in use:
> 7844 mbufs allocated to data
> 117 mbufs allocated to packet headers
> 117 mbufs allocated to socket names and addresses
> 21/58/6144 mbuf 2048 byte clusters in use (current/peak/max)
> ...
>
> The system is very lightly loaded. If I let it continue, the mbuf usage
> increases to the point where the system becomes unusable.
>
> Can anybody point me in the right direction? How can I figure out what
> is allocating a data mbuf every 200ms?
>
> Thanks,
>
> Alan
>
> dmesg output follows:
> OpenBSD 4.9-beta (GENERIC) #654: Wed Feb 9 14:50:38 MST 2011
> t...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
> cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD"
> 586-class) 500 MHz
> cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX
> real mem = 536440832 (511MB)
> avail mem = 517533696 (493MB)
> mainbus0 at root
> bios0 at mainbus0: AT/286+ BIOS, date 20/70/03, BIOS32 rev. 0 @ 0xfac40
> pcibios0 at bios0: rev 2.0 @ 0xf/0x1
> pcibios0: pcibios_get_intr_routing - function not supported
> pcibios0: PCI IRQ Routing information unavailable.
> pcibios0: PCI bus #0 is the last bus
> bios0: ROM list: 0xc8000/0xa800
> cpu0 at mainbus0: (uniprocessor)
> amdmsr0 at mainbus0
> pci0 at mainbus0 bus 0: configuration mode 1 (bios)
> io address conflict 0x6100/0x100
> io address conflict 0x6200/0x200
> pchb0 at pci0 dev 1 function 0 "AMD Geode LX" rev 0x31
> glxsb0 at pci0 dev 1 function 2 "AMD Geode LX Crypto" rev 0x00: RNG AES
> vr0 at pci0 dev 6 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 11,
> address 00:00:24:ca:b3:74
> ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
> 0x004063, model 0x0034
> vr1 at pci0 dev 7 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 5,
> address 00:00:24:ca:b3:75
> ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
> 0x004063, model 0x0034
> vr2 at pci0 dev 8 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 9,
> address 00:00:24:ca:b3:76
> ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
> 0x004063, model 0x0034
> vr3 at pci0 dev 9 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 12,
> address 00:00:24:ca:b3:77
> ukphy3 at vr3 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
> 0x004063, model 0x0034
> glxpcib0 at pci0 dev 20 function 0 "AMD CS5536 ISA" rev 0x03: rev 3,
> 32-bit 3579545Hz timer, watchdog, gpio
> gpio0 at glxpcib0: 32 pins
> pciide0 at pci0 dev 20 function 2 "AMD CS5536 IDE" rev 0x01: DMA,
> channel 0 wired to compatibility, channel 1 wired to compatibility
> wd0 at pciide0 channel 0 drive 1:
> wd0: 1-sector PIO, LBA48, 238475MB, 488397168 sectors
> wd0(pciide0:0:1): using PIO mode 4, Ultra-DMA mode 2
> pciide0: channel 1 ignored (disabled)
> ohci0 at pci0 dev 21 function 0 "AMD CS5536 USB" rev 0x02: irq 15,
> version 1.0, legacy support
> ehci0 at pci0 dev 21 function 1 "AMD CS5536 USB" rev 0x02: irq 15
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 "AMD EHCI root hub" rev 2.00/1.00 addr 1
> isa0 at glxpcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> com0: console
> com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5
> pckbd0 at pckbc0 (kbd slot)
> pckbc0: using irq 1 for kbd slot
> wskbd0 at pckbd0: console keyboard
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> nsclpcsio0 at isa0 port 0x2e/2: NSC PC87366 rev 9: GPIO VLM TMS
> gpio1 at nsclpcsio0: 29 pins
> npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
> usb1 at ohci0: USB revision 1.0
> uhub1 at usb1 "AMD OHCI root hub" rev 1.00/1.00 addr 1
> biomask e5c5 netmask ffe5 ttymask
> mtrr: K6-family MTRR support (2 registers)
> ulpt0 at uhub1 port 1 configuration 1 interface 0 "HewLett Packard HP
> LaserJet 1200" rev 1.10/1.00 addr 2
> ulpt0: using bi-directional mode
> vscsi0 at root
> scsibus0 at vscsi0: 256 targets
> softraid0 at root
> root on wd0a swap on wd0b dump on wd0b
Re: Constant rate mbuf leak
On Fri, Feb 11, 2011 at 3:44 PM, Lars Kotthoff wrote:
>> In the meantime knowing which board it is (or, even better, what network
>> drivers are in use) would help immensely.
>
> 3 like this
> rl0 at pci0 dev 18 function 0 "Realtek 8139" rev 0x10
>
> and one
> ral0 at pci0 dev 21 function 0 "Ralink RT2860" rev 0x00
> ral0: MAC/BBP RT2860 (rev 0x0101), RF RT2820 (MIMO 2T3R)
>
> Alan's network drivers seem to be completely different though.
>
> Lars
>

I have had a lot of problems with rl*. I didn't want to debug them, so I just
bought some re* cards (if you want something cheap). I remember a comment by
Jacob Meuser about how the rl driver (and cards) suck and re* are more or
less OK.
Re: Constant rate mbuf leak
> Are you all running bridged setups ?

I am, but the problem also occurred without the bridge. I originally suspected
the wireless interface (which is bridged with one of the wired ones) and
removed the card and hence the bridge. Same problem.

Lars
Re: Constant rate mbuf leak
Are you all running bridged setups ?
Re: Constant rate mbuf leak
> In the meantime knowing which board it is (or, even better, what network
> drivers are in use) would help immensely.

3 like this
rl0 at pci0 dev 18 function 0 "Realtek 8139" rev 0x10

and one
ral0 at pci0 dev 21 function 0 "Ralink RT2860" rev 0x00
ral0: MAC/BBP RT2860 (rev 0x0101), RF RT2820 (MIMO 2T3R)

Alan's network drivers seem to be completely different though.

Lars
Re: Constant rate mbuf leak
Prime suspect here would be the network driver. dlg@ had a nice mbuf leak
detect-o-matic diff a while back. I'll have to see if I can find it.

In the meantime knowing which board it is (or, even better, what network
drivers are in use) would help immensely.

On Fri, Feb 11, 2011 at 06:20:50PM +, Lars Kotthoff wrote:
> Just to say that I've been having the same problem with a Soekris board since
> about 4.4. I haven't figured out what's going on, but strangely the problem is
> getting better with time (i.e. the rate at which mbufs are allocated
> decreases).
> I *think* that it was fine in 4.3 (though I never run the machine for any
> length of time with that kernel), so you could try that if you want to
> investigate.
>
> I haven't been able to establish a correlation between allocated mbufs and
> (network) load either.
>
> The "solution" for me so far has been to keep a watchful eye and reboot the
> machine once too much memory is used, combined with a watchdog and monit to
> reboot the machine automatically if it becomes unresponsive.
>
> Lars
Re: Constant rate mbuf leak
> "Lars" == Lars Kotthoff writes:

Lars> Just to say that I've been having the same problem with a
Lars> Soekris board since about 4.4. I haven't figured out what's
Lars> going on, but strangely the problem is getting better with
Lars> time (i.e. the rate at which mbufs are allocated decreases).
Lars> I *think* that it was fine in 4.3 (though I never run the
Lars> machine for any length of time with that kernel), so you could
Lars> try that if you want to investigate.

Lars> I haven't been able to establish a correlation between
Lars> allocated mbufs and (network) load either.

Lars> The "solution" for me so far has been to keep a watchful eye
Lars> and reboot the machine once too much memory is used, combined
Lars> with a watchdog and monit to reboot the machine automatically
Lars> if it becomes unresponsive.

I've had a similar issue in the past (see PR kernel/6380).

First, a small amount of background: I'm using an Alix 3d3 as a bridging
firewall.

    ISP <--> vr2 <--> Bridge0 + PF <--> vr1 <--> MyHost

With this setup, whether PF was enabled or disabled, I would leak 2k sized
mbufs at a roughly linear rate, and the system would become non-responsive
once it could not allocate more mbufs. Raising the limit on mbufs only
postponed the hang, and with the limit raised high enough the machine would
hang when it ran off the end of memory.

I eventually found a way to mitigate this by filtering the MACs seen through
the bridge. This isn't a fix for the real problem, just a bandaid that seems
to fit. Basically I only allow packets carrying the MAC of MyHost on the
bridge, with the following in /etc/hostname.bridge0:

    add vr2
    add vr1
    rule pass in on vr1 src 88:88:88:88:88:88 tag extbr
    rule pass out on vr1 dst 88:88:88:88:88:88 tag extbr
    rule block on vr1
    up

This keeps my inside machine from having to see the ISP's usual background
packets (arp spam, etc). With these filters in place the firewall has been
stable and non-leaking for > 100 days.

I don't understand the link between this filtering and the memory leaks that
are seen without it (I started to go through the code, but so far
RealLife(TM) has kept me from completely getting my head around it).

Anyways, I don't know if this will be at all applicable for what you are
seeing, but hopefully it's a nudge in the right direction.

-- Chris
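To make the effect of the three bridge rules above concrete, here is a rough model of first-match rule evaluation on vr1. It is only a sketch under stated assumptions: the `Rule` class, `VR1_RULES`, `evaluate` and the example MAC `00:11:22:33:44:55` are made up for illustration, the matching is simplified to direction plus source and destination MAC, and the "pass" fallback when nothing matches is an assumption that the final catch-all block rule makes irrelevant here.

```python
from dataclasses import dataclass
from typing import Optional

MYHOST = "88:88:88:88:88:88"   # placeholder MAC used in the message above


@dataclass(frozen=True)
class Rule:
    action: str                       # "pass" or "block"
    direction: Optional[str] = None   # "in", "out", or None for both
    src: Optional[str] = None         # None matches any source MAC
    dst: Optional[str] = None         # None matches any destination MAC
    tag: Optional[str] = None


# The three rules from /etc/hostname.bridge0, in the order they were added.
VR1_RULES = [
    Rule("pass", "in", src=MYHOST, tag="extbr"),
    Rule("pass", "out", dst=MYHOST, tag="extbr"),
    Rule("block"),
]


def evaluate(rules, direction, src, dst):
    """Return (action, tag) of the first rule that matches the frame."""
    for rule in rules:
        if rule.direction not in (None, direction):
            continue
        if rule.src not in (None, src):
            continue
        if rule.dst not in (None, dst):
            continue
        return rule.action, rule.tag
    return "pass", None   # assumed default when no rule matches
```

For example, `evaluate(VR1_RULES, "out", "00:11:22:33:44:55", "ff:ff:ff:ff:ff:ff")` models a broadcast frame from the ISP side about to leave on vr1; it falls through to the catch-all and yields `("block", None)`, whereas a frame addressed to `MYHOST` yields `("pass", "extbr")`.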
Re: Constant rate mbuf leak
Just to say that I've been having the same problem with a Soekris board since
about 4.4. I haven't figured out what's going on, but strangely the problem is
getting better with time (i.e. the rate at which mbufs are allocated
decreases).
I *think* that it was fine in 4.3 (though I never ran the machine for any
length of time with that kernel), so you could try that if you want to
investigate.

I haven't been able to establish a correlation between allocated mbufs and
(network) load either.

The "solution" for me so far has been to keep a watchful eye and reboot the
machine once too much memory is used, combined with a watchdog and monit to
reboot the machine automatically if it becomes unresponsive.

Lars
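The "watchful eye" part of this workaround could be partly automated by fitting a line through periodic readings of the "mbufs in use" count and estimating how long remains before a chosen limit is reached. The sketch below is not something used in this thread: `leak_rate` and `seconds_until` are hypothetical helpers, and the sample readings and the limit are invented numbers.

```python
def leak_rate(samples):
    """Least-squares slope (mbufs per second) through (seconds, count) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def seconds_until(limit, samples):
    """Estimated seconds until the count reaches limit, or None if not growing."""
    rate = leak_rate(samples)
    if rate <= 0:
        return None
    _, last_count = samples[-1]
    return max(0.0, (limit - last_count) / rate)


# Invented readings taken one minute apart, growing by 5 mbufs per second.
readings = [(0, 100), (60, 400), (120, 700)]
print(leak_rate(readings))             # slope in mbufs per second
print(seconds_until(10000, readings))  # time left before the invented limit
```

A rate that decreases over time, as described above, would show up as a shrinking slope when the fit is repeated over a sliding window of recent readings.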
Constant rate mbuf leak
I have now upgraded my machine to "OpenBSD 4.9-beta (GENERIC) #654: Wed
Feb 9 14:50:38 MST 2011", and I am still seeing a constant rate
consumption of mbufs. I have tried a number of things (shutting down
all non-essential user processes, turning off network interfaces, etc),
but none have made any difference, the system consumes 256 byte mbufs at
a constant rate of 5 mbufs per second:

# uptime && netstat -m
3:23PM up 26 mins, 3 users, load averages: 0.24, 0.35, 0.34
8078 mbufs in use:
7844 mbufs allocated to data
117 mbufs allocated to packet headers
117 mbufs allocated to socket names and addresses
21/58/6144 mbuf 2048 byte clusters in use (current/peak/max)
...

The system is very lightly loaded. If I let it continue, the mbuf usage
increases to the point where the system becomes unusable.

Can anybody point me in the right direction? How can I figure out what
is allocating a data mbuf every 200ms?

Thanks,

Alan

dmesg output follows:
OpenBSD 4.9-beta (GENERIC) #654: Wed Feb 9 14:50:38 MST 2011
t...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) 500 MHz
cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX
real mem = 536440832 (511MB)
avail mem = 517533696 (493MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 20/70/03, BIOS32 rev. 0 @ 0xfac40
pcibios0 at bios0: rev 2.0 @ 0xf/0x1
pcibios0: pcibios_get_intr_routing - function not supported
pcibios0: PCI IRQ Routing information unavailable.
pcibios0: PCI bus #0 is the last bus
bios0: ROM list: 0xc8000/0xa800
cpu0 at mainbus0: (uniprocessor)
amdmsr0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
io address conflict 0x6100/0x100
io address conflict 0x6200/0x200
pchb0 at pci0 dev 1 function 0 "AMD Geode LX" rev 0x31
glxsb0 at pci0 dev 1 function 2 "AMD Geode LX Crypto" rev 0x00: RNG AES
vr0 at pci0 dev 6 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 11, address 00:00:24:ca:b3:74
ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr1 at pci0 dev 7 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 5, address 00:00:24:ca:b3:75
ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr2 at pci0 dev 8 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 9, address 00:00:24:ca:b3:76
ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr3 at pci0 dev 9 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 12, address 00:00:24:ca:b3:77
ukphy3 at vr3 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
glxpcib0 at pci0 dev 20 function 0 "AMD CS5536 ISA" rev 0x03: rev 3, 32-bit 3579545Hz timer, watchdog, gpio
gpio0 at glxpcib0: 32 pins
pciide0 at pci0 dev 20 function 2 "AMD CS5536 IDE" rev 0x01: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 1:
wd0: 1-sector PIO, LBA48, 238475MB, 488397168 sectors
wd0(pciide0:0:1): using PIO mode 4, Ultra-DMA mode 2
pciide0: channel 1 ignored (disabled)
ohci0 at pci0 dev 21 function 0 "AMD CS5536 USB" rev 0x02: irq 15, version 1.0, legacy support
ehci0 at pci0 dev 21 function 1 "AMD CS5536 USB" rev 0x02: irq 15
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 "AMD EHCI root hub" rev 2.00/1.00 addr 1
isa0 at glxpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
nsclpcsio0 at isa0 port 0x2e/2: NSC PC87366 rev 9: GPIO VLM TMS
gpio1 at nsclpcsio0: 29 pins
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
usb1 at ohci0: USB revision 1.0
uhub1 at usb1 "AMD OHCI root hub" rev 1.00/1.00 addr 1
biomask e5c5 netmask ffe5 ttymask
mtrr: K6-family MTRR support (2 registers)
ulpt0 at uhub1 port 1 configuration 1 interface 0 "HewLett Packard HP LaserJet 1200" rev 1.10/1.00 addr 2
ulpt0: using bi-directional mode
vscsi0 at root
scsibus0 at vscsi0: 256 targets
softraid0 at root
root on wd0a swap on wd0b dump on wd0b
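
The "5 mbufs per second" figure in this report can be cross-checked against the `uptime && netstat -m` sample it contains. The following is a small sketch that parses that sample and divides the counts by the uptime. The helper names `uptime_seconds` and `mbuf_counts` are made up for this example, and the uptime parser only handles the "up N mins" form shown in the report.

```python
import re

# The sample from the report, truncated to the lines used here.
SAMPLE = """\
3:23PM up 26 mins, 3 users, load averages: 0.24, 0.35, 0.34
8078 mbufs in use:
7844 mbufs allocated to data
117 mbufs allocated to packet headers
117 mbufs allocated to socket names and addresses
"""


def uptime_seconds(text):
    """Seconds of uptime; only understands the 'up N mins' form."""
    m = re.search(r"up\s+(\d+)\s+mins?\b", text)
    if m is None:
        raise ValueError("unsupported uptime format")
    return int(m.group(1)) * 60


def mbuf_counts(text):
    """Return (total in use, {usage type: count}) from netstat -m text."""
    total = re.search(r"(\d+)\s+mbufs in use", text)
    if total is None:
        raise ValueError("no 'mbufs in use' line found")
    by_type = {
        m.group(2).strip(): int(m.group(1))
        for m in re.finditer(r"(\d+)\s+mbufs allocated to ([a-z ]+)", text)
    }
    return int(total.group(1)), by_type


total, by_type = mbuf_counts(SAMPLE)
elapsed = uptime_seconds(SAMPLE)
data_rate = by_type["data"] / elapsed
total_rate = total / elapsed
print(f"data mbufs: {data_rate:.2f}/s, all mbufs: {total_rate:.2f}/s")
```

On this sample the three per-type counts add up to the 8078 total, and the data mbufs alone work out to about 5.03 per second over the 26 minutes of uptime, which agrees with one allocation every 200 ms.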