Re: mbufs growing in 7.2
Sadly, the problem remains. Over the course of a week, with no significant
network load, mbufs in use grew from about 650 to ~1100. I am reporting this
because it would be tempting to write the problem off to, for example, a
single piece of faulty hardware. But maybe this is not an igc problem at all:
running openrsync (with mfs and nfs) on localhost increases mbufs in use by
more than a thousand, and they never come back (I found a workaround to get
them back, but that is not a solution).
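For anyone who wants to try the localhost reproducer, a minimal sketch of the
idea (the paths, the mfs-backed directory, and the exact openrsync invocation
are illustrative assumptions, not the poster's actual command):

#!/bin/sh
# Sketch: copy a tree over the loopback with openrsync and compare
# mbuf usage before and after. /var/mfs/src is a hypothetical
# mfs-backed directory; any sizeable tree should do.
before=$(netstat -m | awk '/mbufs in use/ { print $1 }')

openrsync -a /var/mfs/src/ localhost:/tmp/dst/

# Let the stack settle, then sample again; on the affected system the
# count reportedly rises by over a thousand and never comes back.
sleep 5
after=$(netstat -m | awk '/mbufs in use/ { print $1 }')
echo "mbufs in use: $before -> $after"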
Re: mbufs growing in 7.2
Hi David,

Here are all the outputs you mentioned. Hopefully something in here will
improve our understanding of this story.

David Gwynne writes:

> Ofails are the sum of output errors and queue drops. Can you figure
> out which one it is with netstat -I igc0 -e and netstat -I igc0 -d?

% netstat -I igc2 -d
Name  Mtu   Network      Address               Ipkts Idrop    Opkts  Odrop Colls
igc2  1500  <Link>       00:e2:69:53:c0:0b  18487201     0 53088501 172714     0
igc2  1500  192.168.172  192.168.172.1      18487201     0 53088501 172714     0
igc2  1500  192.168.172  192.168.172.53     18487201     0 53088501 172714     0

% netstat -I igc2 -e
Name  Mtu   Network      Address               Ipkts Ierrs    Opkts  Oerrs Colls
igc2  1500  <Link>       00:e2:69:53:c0:0b  18487201     0 53088501      0     0
igc2  1500  192.168.172  192.168.172.1      18487201     0 53088501      0     0
igc2  1500  192.168.172  192.168.172.53     18487201     0 53088501      0     0

> The state of the rxring accounting according to "systat mb" output
> would be interesting too.

IFACE   RING      LIVELOCKS   SIZE  ALIVE   LWM   HWM   CWM
System  mbufs             0    256   5561                479
        mcl2k                 2048     40                 14
        mcl2k2                2112   4534                442
        mcl4k                 4096      0                  8
        mcl8k                 8192      0                  6
        mcl16k               16384      0                  1
lo0
igc0    0                     2048     24    10  1023     24
        1                     2048     14    10  1023     14
        2                     2048     15    10  1023     15
        3                     2048     12    10  1023     12
igc1    0                     2048     27    10  1023     27
        1                     2048     27    10  1023     27
        2                     2048     24    10  1023     24
        3                     2048     25    10  1023     25
igc2    0                     2048     18    10  1023     18
        1                     2048     14    10  1023     14
        2                     2048     17    10  1023     17
        3                     2048     17    10  1023     17
igc3    0                     2048     30    10  1023     30
        1                     2048     31    10  1023     31
        2                     2048     33    10  1023     33
        3                     2048     30    10  1023     30

> kstat output is easy to get too, though I'm not sure it will be useful
> in this situation.

igc2:0:rxq:0
       packets: 4544785 packets
         bytes: 2087452079 bytes
        qdrops: 0 packets
        errors: 0 packets
          qlen: 0 packets
igc2:0:rxq:1
       packets: 5722952 packets
         bytes: 3638339639 bytes
        qdrops: 0 packets
        errors: 0 packets
          qlen: 0 packets
igc2:0:rxq:2
       packets: 5479968 packets
         bytes: 2818395627 bytes
        qdrops: 0 packets
        errors: 0 packets
          qlen: 0 packets
igc2:0:rxq:3
       packets: 2739496 packets
         bytes: 1411808602 bytes
        qdrops: 0 packets
        errors: 0 packets
          qlen: 0 packets
igc2:0:txq:0
       packets: 19740629 packets
         bytes: 24868676639 bytes
        qdrops: 5 packets
        errors: 0 packets
          qlen: 0 packets
       maxqlen: 1023 packets
       oactive: false
igc2:0:txq:1
       packets: 11828063 packets
         bytes: 14495415780 bytes
        qdrops: 42113 packets
        errors: 0 packets
          qlen: 1023 packets
       maxqlen: 1023 packets
       oactive: false
igc2:0:txq:2
       packets: 7975725 packets
         bytes: 9745852229 bytes
        qdrops: 95687 packets
        errors: 0 packets
          qlen: 1023 packets
       maxqlen: 1023 packets
       oactive: false
igc2:0:txq:3
       packets: 13544084 packets
         bytes: 16273238465 bytes
        qdrops: 34909 packets
        errors: 0 packets
          qlen: 1023 packets
       maxqlen: 1023 packets
       oactive: false

> The mbuf (and all other) pool counters from vmstat -m are easy to get too.

Memory statistics by bucket size
    Size   In Use    Free     Requests  HighWater  Couldfree
      16     1060     732       117924       1280          0
      32      980     812       162242        640          4
      64     2117      59       758707        320       4131
     128   510540      52     25550193        160        637
     256     2252      87        63967         80      10123
     512      395      29        39457         40          0
    1024      125       7       163014         20          4
    2048    16443      71        46030         10      21091
    4096       58      10        83403          5       6680
    8192       58
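For the record, everything above can be captured in one shot with something
like the following sketch. Both systat's -b batch flag and the bare kstat
invocation are assumptions on my part; adjust if your systat or kstat behaves
differently:

#!/bin/sh
# One-shot diagnostic snapshot for the igc2 investigation.
# Assumes systat(1) has a batch mode (-b) and that kstat(1) with no
# arguments dumps every statistic, as its output above suggests.
out=igc2-diag-$(date +%Y%m%d-%H%M%S).txt
{
	echo '### netstat -I igc2 -d';	netstat -I igc2 -d
	echo '### netstat -I igc2 -e';	netstat -I igc2 -e
	echo '### systat mb';		systat -b mbufs
	echo '### kstat';		kstat
	echo '### vmstat -m';		vmstat -m
} > "$out"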
Re: mbufs growing in 7.2
> On 30 Nov 2022, at 14:36, Greg Steuck wrote:
>
> Greg Steuck writes:
>
>> The watched kettle never boiled. No more crashes in over two weeks
>> (instead of two in the first week). I tried a loop of alternating iperf3
>> tcp and udp to no ill effect. I still see the growth in the metrics I
>> reported, yet the system remained stable.
>>
>> I applied the patch below and am still collecting the metrics. I doubt
>> they are responsible for the original problem.
>
> This time the problem fired after 6 days of uptime. The system is
> running 7.2 + the igc off-by-one fix.
>
> The symptoms are:
>
> 1. A single interface is "stuck": sometimes ping replies come back,
>    incoming packets are visible in tcpdump, but no reply packets appear
>    in tcpdump (nor are any received on the other machine).
> 2. Other interfaces are fine, to the point that I can ssh over one of
>    them to debug.
> 3. The stuck interface remains stuck after ifconfig down/up.
> 4. The stuck interface remains stuck throughout pfctl -d/-e. This did
>    re-enable ping replies, which had been stuck for a bit.
> 5. netstat -i shows a large (and rising) value in the Ofail column.
> 6. netstat -m shows the number of 'mbuf 2112' clusters stuck fairly high:
>    5539 mbufs in use:
>        5441 mbufs allocated to data
>        7 mbufs allocated to packet headers
>        91 mbufs allocated to socket names and addresses
>    40/112 mbuf 2048 byte clusters in use (current/peak)
>    4510/6630 mbuf 2112 byte clusters in use (current/peak)
> 7. No established connections to speak of.
>
> The interface stickiness is mainly an inability to send higher-level
> protocol replies. E.g. a TCP connection from a remote system doesn't get
> completed. Or the SYN/ACK exchange completes, but then I can see data
> going to the machine and not even ACKs coming back. ktrace shows the
> application writes the data, which just never makes it down the stack to
> where tcpdump would see it.
>
> Obligatory graph of seemingly related counters: "mbufs in use",
> "mbuf 2112 byte clusters", and "Ofail":
>
> https://docs.google.com/spreadsheets/d/e/2PACX-1vRr61USv9VNvaIq9qEs8W1wy869ai6MwNmevDLmxLJOV3DaUBcrRUzwzNZP92syltrWfrmIUWq7qevG/pubchart?oid=202363413=interactive
>
> There's a long tail to the left covering 5 days of mostly nothing
> happening. The drop of "mbufs in use" from 7355 to 5739 is around the
> time I removed the system from service, and possibly when I cycled
> ifconfig down/up (I have no record).
>
> The system is still up and has been moved off to the side for debugging.
> It can remain up for as long as we have things to try (and the power
> utility cooperates).

Ofails are the sum of output errors and queue drops. Can you figure
out which one it is with netstat -I igc0 -e and netstat -I igc0 -d?

The state of the rxring accounting according to "systat mb" output
would be interesting too.

kstat output is easy to get too, though I'm not sure it will be useful
in this situation.

The mbuf (and all other) pool counters from vmstat -m are easy to get too.
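A small loop along these lines would show how the Ofail sum splits over time.
The interface name, the awk field positions, and the interval are
illustrative; the field numbers assume the usual netstat -I column layout:

#!/bin/sh
# Sample Oerrs (netstat -e) and Odrop (netstat -d) for one interface
# every 5 minutes; Ofail in netstat -i should equal their sum.
ifn=${1:-igc0}
while :; do
	date
	# NR == 2 is the link-level line; field 8 is Oerrs/Odrop in the
	# Name Mtu Network Address Ipkts Ierrs/Idrop Opkts ... layout.
	netstat -I "$ifn" -e | awk 'NR == 2 { print "Oerrs:", $8 }'
	netstat -I "$ifn" -d | awk 'NR == 2 { print "Odrop:", $8 }'
	sleep 300
done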
Re: mbufs growing in 7.2
Greg Steuck writes:

> The watched kettle never boiled. No more crashes in over two weeks
> (instead of two in the first week). I tried a loop of alternating iperf3
> tcp and udp to no ill effect. I still see the growth in the metrics I
> reported, yet the system remained stable.
>
> I applied the patch below and am still collecting the metrics. I doubt
> they are responsible for the original problem.

This time the problem fired after 6 days of uptime. The system is
running 7.2 + the igc off-by-one fix.

The symptoms are:

1. A single interface is "stuck": sometimes ping replies come back,
   incoming packets are visible in tcpdump, but no reply packets appear
   in tcpdump (nor are any received on the other machine).
2. Other interfaces are fine, to the point that I can ssh over one of
   them to debug.
3. The stuck interface remains stuck after ifconfig down/up.
4. The stuck interface remains stuck throughout pfctl -d/-e. This did
   re-enable ping replies, which had been stuck for a bit.
5. netstat -i shows a large (and rising) value in the Ofail column.
6. netstat -m shows the number of 'mbuf 2112' clusters stuck fairly high:
   5539 mbufs in use:
       5441 mbufs allocated to data
       7 mbufs allocated to packet headers
       91 mbufs allocated to socket names and addresses
   40/112 mbuf 2048 byte clusters in use (current/peak)
   4510/6630 mbuf 2112 byte clusters in use (current/peak)
7. No established connections to speak of.

The interface stickiness is mainly an inability to send higher-level
protocol replies. E.g. a TCP connection from a remote system doesn't get
completed. Or the SYN/ACK exchange completes, but then I can see data
going to the machine and not even ACKs coming back. ktrace shows the
application writes the data, which just never makes it down the stack to
where tcpdump would see it.

Obligatory graph of seemingly related counters: "mbufs in use",
"mbuf 2112 byte clusters", and "Ofail":

https://docs.google.com/spreadsheets/d/e/2PACX-1vRr61USv9VNvaIq9qEs8W1wy869ai6MwNmevDLmxLJOV3DaUBcrRUzwzNZP92syltrWfrmIUWq7qevG/pubchart?oid=202363413=interactive

There's a long tail to the left covering 5 days of mostly nothing
happening. The drop of "mbufs in use" from 7355 to 5739 is around the
time I removed the system from service, and possibly when I cycled
ifconfig down/up (I have no record).

The system is still up and has been moved off to the side for debugging.
It can remain up for as long as we have things to try (and the power
utility cooperates).

Thanks
Greg
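For symptom 1, a sketch of the kind of two-sided check described above,
confirming that replies die inside the stack rather than on the wire (igc2
and 192.0.2.10 are placeholders):

#!/bin/sh
# While a remote host pings/connects to the stuck interface, watch
# both directions on the wire. Incoming probes with no outgoing
# replies, plus writes visible in ktrace, point below the socket
# layer rather than at the application or pf.
tcpdump -n -i igc2 -c 100 'icmp or (tcp and host 192.0.2.10)'

# The transmit queues can be inspected at the same time; a qlen
# pinned at maxqlen with rising qdrops suggests a wedged tx ring.
# (Assumes kstat(1) with no arguments prints all statistics.)
kstat | grep -A 7 'igc2:0:txq'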
Re: mbufs growing in 7.2
I also was not able to recreate the issues I initially saw with high
traffic on 7.2 after I re-upgraded. I can't explain it, but OpenBSD 7.2
igc has been solid for about 1.5 weeks now, no patches.

On Tue, Nov 22, 2022 at 11:21 PM Greg Steuck wrote:
>
> The watched kettle never boiled. No more crashes in over two weeks
> (instead of two in the first week). I tried a loop of alternating iperf3
> tcp and udp to no ill effect. I still see the growth in the metrics I
> reported, yet the system remained stable.
>
> I applied the patch below and am still collecting the metrics. I doubt
> they are responsible for the original problem.
>
> Thanks
> Greg
>
> Moritz Buhl writes:
>
> > Hi Greg, Hi Joe,
> >
> > dlg@ hinted to me that the ring might overwrite its own starting
> > position with the current code.
> >
> > Does this help?
> > mbuhl
> >
> > Index: dev/pci/if_igc.c
> > ===================================================================
> > RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
> > retrieving revision 1.9
> > diff -u -p -r1.9 if_igc.c
> > --- dev/pci/if_igc.c	2 Jun 2022 07:41:17 -0000	1.9
> > +++ dev/pci/if_igc.c	8 Nov 2022 10:35:39 -0000
> > @@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
> >  	mask = sc->num_tx_desc - 1;
> >
> >  	for (;;) {
> > -		if (free <= IGC_MAX_SCATTER) {
> > +		if (free <= IGC_MAX_SCATTER + 1) {
> >  			ifq_set_oactive(ifq);
> >  			break;
> >  		}
> > @@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
> >  		/* Consume the first descriptor */
> >  		prod++;
> >  		prod &= mask;
> > +		free--;
> >  	}
> >
> >  	for (i = 0; i < map->dm_nsegs; i++) {
Re: mbufs growing in 7.2
The watched kettle never boiled. No more crashes in over two weeks
(instead of two in the first week). I tried a loop of alternating iperf3
tcp and udp to no ill effect. I still see the growth in the metrics I
reported, yet the system remained stable.

I applied the patch below and am still collecting the metrics. I doubt
they are responsible for the original problem.

Thanks
Greg

Moritz Buhl writes:

> Hi Greg, Hi Joe,
>
> dlg@ hinted to me that the ring might overwrite its own starting
> position with the current code.
>
> Does this help?
> mbuhl
>
> Index: dev/pci/if_igc.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
> retrieving revision 1.9
> diff -u -p -r1.9 if_igc.c
> --- dev/pci/if_igc.c	2 Jun 2022 07:41:17 -0000	1.9
> +++ dev/pci/if_igc.c	8 Nov 2022 10:35:39 -0000
> @@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
>  	mask = sc->num_tx_desc - 1;
>
>  	for (;;) {
> -		if (free <= IGC_MAX_SCATTER) {
> +		if (free <= IGC_MAX_SCATTER + 1) {
>  			ifq_set_oactive(ifq);
>  			break;
>  		}
> @@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
>  		/* Consume the first descriptor */
>  		prod++;
>  		prod &= mask;
> +		free--;
>  	}
>
>  	for (i = 0; i < map->dm_nsegs; i++) {
Re: mbufs growing in 7.2
No, they don't grow anymore. Apparently the initial growth is associated
with network services (tor, i2pd and the like), which take a while to ramp
up to full activity. Running iperf also does not cause mbufs to grow.

Thanks! You're the best ^.^
Re: mbufs growing in 7.2
> Does this help?
> mbuhl
>
> Index: dev/pci/if_igc.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
> retrieving revision 1.9
> diff -u -p -r1.9 if_igc.c
> --- dev/pci/if_igc.c	2 Jun 2022 07:41:17 -0000	1.9
> +++ dev/pci/if_igc.c	8 Nov 2022 10:35:39 -0000
> @@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
>  	mask = sc->num_tx_desc - 1;
>
>  	for (;;) {
> -		if (free <= IGC_MAX_SCATTER) {
> +		if (free <= IGC_MAX_SCATTER + 1) {
>  			ifq_set_oactive(ifq);
>  			break;
>  		}
> @@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
>  		/* Consume the first descriptor */
>  		prod++;
>  		prod &= mask;
> +		free--;
>  	}
>
>  	for (i = 0; i < map->dm_nsegs; i++) {

I wouldn't say it helps: mbufs in use grew by about a hundred (from ~550
to ~650) in four hours, with no network load during that time. It will,
of course, take a longer observation period to be sure.

654 mbufs in use:
	591 mbufs allocated to data
	52 mbufs allocated to packet headers
	11 mbufs allocated to socket names and addresses
255/320 mbuf 2048 byte clusters in use (current/peak)
311/450 mbuf 2112 byte clusters in use (current/peak)
0/48 mbuf 4096 byte clusters in use (current/peak)
16/32 mbuf 8192 byte clusters in use (current/peak)
0/14 mbuf 9216 byte clusters in use (current/peak)
0/10 mbuf 12288 byte clusters in use (current/peak)
0/8 mbuf 16384 byte clusters in use (current/peak)
0/16 mbuf 65536 byte clusters in use (current/peak)
3656/3656/1048576 Kbytes allocated to network (current/peak/max)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
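To put numbers on "grew by about a hundred in four hours", a timestamped
sampler like this makes the rate readable straight off the log (the interval
and file name are arbitrary):

#!/bin/sh
# Append epoch time, total mbufs in use, and the 2112-byte cluster
# current/peak pair every 10 minutes.
while :; do
	printf '%s ' "$(date +%s)"
	netstat -m | awk '/mbufs in use:/ { printf "%s ", $1 }
	    /mbuf 2112 byte/ { print $1 }'
	sleep 600
done >> mbuf-growth.log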
Re: mbufs growing in 7.2
Hi Greg, Hi Joe,

dlg@ hinted to me that the ring might overwrite its own starting
position with the current code.

Does this help?
mbuhl

Index: dev/pci/if_igc.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
retrieving revision 1.9
diff -u -p -r1.9 if_igc.c
--- dev/pci/if_igc.c	2 Jun 2022 07:41:17 -0000	1.9
+++ dev/pci/if_igc.c	8 Nov 2022 10:35:39 -0000
@@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
 	mask = sc->num_tx_desc - 1;
 
 	for (;;) {
-		if (free <= IGC_MAX_SCATTER) {
+		if (free <= IGC_MAX_SCATTER + 1) {
 			ifq_set_oactive(ifq);
 			break;
 		}
@@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
 		/* Consume the first descriptor */
 		prod++;
 		prod &= mask;
+		free--;
 	}
 
 	for (i = 0; i < map->dm_nsegs; i++) {
Re: mbufs growing in 7.2
I have had (somewhat) similar-sounding issues with igc and 7.2
immediately after upgrading. I am able to reproduce it reliably by
running an iperf3 test to another machine on the network: within a few
seconds the NIC stops working. No traffic, no ping, no errors in dmesg
or syslog. The issue also arises on its own, without iperf3 stressing
the NIC, if something on the network temporarily spikes traffic, such as
a large download. It's bad enough that I have had to roll back to 7.1 to
keep my network online.

The box is a Topton N5105 (Jasper Lake) fanless box from AliExpress.
Unfortunately I did not save a full dmesg from when I had 7.2 running,
but here is some info from the 7.1 dmesg re: the NICs:

igc0 at pci2 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc1 at pci3 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc2 at pci4 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc3 at pci5 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address

I have not tried pulling out the checksum offload patches in 7.2 but may
try that at some point, time permitting.

On Sun, Nov 6, 2022 at 12:45 PM Greg Steuck wrote:
>
> Greg Steuck writes:
>
> > My router has become unstable since upgrading from 7.1-stable to
> > 7.2. After several days of uptime the machine gets into a state where
> > some applications (unbound & dhcpd) report ENOBUFS (No buffer space
> > available). At that time the machine is pingable over all the
> > interfaces, but only the upstream interface seems functional (igc0).
> > The networks downstream of the router can't get much data across. I
> > don't have a good characterization of this.
> >
> > At first I suspected this had something to do with the igc checksum
> > offloading commit, so I am now running 7.2 with this reverted:
> > "Implement and enable IPv4, TCP, and UDP checksum offloading for igc."
>
> So far it appears that reverting improved stability. I had 2 crashes
> last week and 0 in the last 8 days.
>
> > I also started monitoring some counters that appeared relevant with
> > this trivial loop:
> >
> > $ while : ; do date; netstat -s | grep err; netstat -m; netstat -ni | grep '^[Ni]'; sleep 300; done | tee err-log
> >
> > I have some 38 hours worth of counters as of now. I observe an upward
> > trend in "mbuf 2112" and "mbufs in use". I extracted the values with
> >
> > $ perl -ne 'print "$x,$1\n" if m/^(\d+).*mbuf 2112/; $x=$1 if /^(\d+)\smbufs in use/;' err-log
> >
> > It starts out 610,410-ish and ends at 717,513. I have a picture for
> > those visually inclined: https://photos.app.goo.gl/DZGCrJnJDohPrVyZ8
>
> The growth is very slow, so I'm not sure it matters much. The 8-day
> graph still shows a very slow ramp, but it'll take a long time for
> that to become a problem: https://photos.app.goo.gl/H64FRMkrfrY3hi6f7
> (8 days worth of 5-minute-spaced samples)
>
> I'm reapplying the patch, keeping the same monitoring on. Hopefully
> something will be visible in those stats. If not, at least we'll learn
> whether the diff correlates with the failure.
>
> Thanks
> Greg
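A sketch of the kind of stress that triggers it, modeled on the iperf3 runs
described in this thread (the peer address and durations are placeholders):

#!/bin/sh
# Alternate TCP and UDP iperf3 runs against another machine on the
# LAN while logging the mbuf headline; on the affected kernel the
# NIC reportedly dies within seconds of starting this.
peer=192.168.1.10
while :; do
	iperf3 -c "$peer" -t 30
	iperf3 -c "$peer" -t 30 -u -b 0
	netstat -m | head -1
done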
Re: mbufs growing in 7.2
Greg Steuck writes:

> My router has become unstable since upgrading from 7.1-stable to
> 7.2. After several days of uptime the machine gets into a state where
> some applications (unbound & dhcpd) report ENOBUFS (No buffer space
> available). At that time the machine is pingable over all the
> interfaces, but only the upstream interface seems functional (igc0).
> The networks downstream of the router can't get much data across. I
> don't have a good characterization of this.
>
> At first I suspected this had something to do with the igc checksum
> offloading commit, so I am now running 7.2 with this reverted:
> "Implement and enable IPv4, TCP, and UDP checksum offloading for igc."

So far it appears that reverting improved stability. I had 2 crashes
last week and 0 in the last 8 days.

> I also started monitoring some counters that appeared relevant with
> this trivial loop:
>
> $ while : ; do date; netstat -s | grep err; netstat -m; netstat -ni | grep '^[Ni]'; sleep 300; done | tee err-log
>
> I have some 38 hours worth of counters as of now. I observe an upward
> trend in "mbuf 2112" and "mbufs in use". I extracted the values with
>
> $ perl -ne 'print "$x,$1\n" if m/^(\d+).*mbuf 2112/; $x=$1 if /^(\d+)\smbufs in use/;' err-log
>
> It starts out 610,410-ish and ends at 717,513. I have a picture for
> those visually inclined: https://photos.app.goo.gl/DZGCrJnJDohPrVyZ8

The growth is very slow, so I'm not sure it matters much. The 8-day
graph still shows a very slow ramp, but it'll take a long time for
that to become a problem: https://photos.app.goo.gl/H64FRMkrfrY3hi6f7
(8 days worth of 5-minute-spaced samples)

I'm reapplying the patch, keeping the same monitoring on. Hopefully
something will be visible in those stats. If not, at least we'll learn
whether the diff correlates with the failure.

Thanks
Greg