Re: mbufs growing in 7.2

2022-12-01 Thread 4
It's sad, but the problem still remains. During the week, in the absence of any
significant network load, mbufs grew from 650 to ~1100. I am reporting this
because it would be tempting to write the problem off to, for example, a single
piece of faulty hardware. But maybe this is not an igc problem at all, since
running openrsync (with mfs and nfs) over localhost increases mbufs by more
than a thousand, and they do not come back (I found a workaround to get them
back, but it's not a solution).
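
A rough sketch of that localhost reproduction; the paths are only placeholders,
not the exact setup:

$ netstat -m | head -n 1                                # mbufs in use before
$ openrsync -a /mfs/scratch/ localhost:/nfs/scratch/    # copy a tree over localhost
$ netstat -m | head -n 1                                # stays well over a thousand higher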



Re: mbufs growing in 7.2

2022-11-30 Thread Greg Steuck
Hi David,

Here are all the outputs you mentioned. Hopefully something will improve
our understanding of this story.

David Gwynne  writes:

> Ofails are the sum of output errors and queue drops. Can you figure
> out which one it is with netstat -I igc0 -e and netstat -I igc0 -d?

% netstat -I igc2 -d
Name    Mtu   Network      Address              Ipkts Idrop    Opkts  Odrop Colls
igc2    1500  <Link>       00:e2:69:53:c0:0b 18487201     0 53088501 172714     0
igc2    1500  192.168.172  192.168.172.1     18487201     0 53088501 172714     0
igc2    1500  192.168.172  192.168.172.53    18487201     0 53088501 172714     0
% netstat -I igc2 -e
Name    Mtu   Network      Address              Ipkts Ierrs    Opkts  Oerrs Colls
igc2    1500  <Link>       00:e2:69:53:c0:0b 18487201     0 53088501      0     0
igc2    1500  192.168.172  192.168.172.1     18487201     0 53088501      0     0
igc2    1500  192.168.172  192.168.172.53    18487201     0 53088501      0     0

> The state of the rxring accounting according to "systat mb" output
> would be interesting too.

IFACE                   RING  LIVELOCKS   SIZE  ALIVE   LWM   HWM   CWM
System                 mbufs          0    256   5561               479
                       mcl2k               2048     40                14
                       mcl2k2              2112   4534               442
                       mcl4k               4096      0                 8
                       mcl8k               8192      0                 6
                       mcl16k             16384      0                 1
lo0
igc0                       0               2048     24    10  1023    24
                           1               2048     14    10  1023    14
                           2               2048     15    10  1023    15
                           3               2048     12    10  1023    12
igc1                       0               2048     27    10  1023    27
                           1               2048     27    10  1023    27
                           2               2048     24    10  1023    24
                           3               2048     25    10  1023    25
igc2                       0               2048     18    10  1023    18
                           1               2048     14    10  1023    14
                           2               2048     17    10  1023    17
                           3               2048     17    10  1023    17
igc3                       0               2048     30    10  1023    30
                           1               2048     31    10  1023    31
                           2               2048     33    10  1023    33
                           3               2048     30    10  1023    30

> kstat output is easy to get too, though I'm not sure it will be useful
> in this situation.

igc2:0:rxq:0
 packets: 4544785 packets
   bytes: 2087452079 bytes
  qdrops: 0 packets
  errors: 0 packets
qlen: 0 packets
igc2:0:rxq:1
 packets: 5722952 packets
   bytes: 3638339639 bytes
  qdrops: 0 packets
  errors: 0 packets
qlen: 0 packets
igc2:0:rxq:2
 packets: 5479968 packets
   bytes: 2818395627 bytes
  qdrops: 0 packets
  errors: 0 packets
qlen: 0 packets
igc2:0:rxq:3
 packets: 2739496 packets
   bytes: 1411808602 bytes
  qdrops: 0 packets
  errors: 0 packets
qlen: 0 packets
igc2:0:txq:0
 packets: 19740629 packets
   bytes: 24868676639 bytes
  qdrops: 5 packets
  errors: 0 packets
qlen: 0 packets
 maxqlen: 1023 packets
 oactive: false
igc2:0:txq:1
 packets: 11828063 packets
   bytes: 14495415780 bytes
  qdrops: 42113 packets
  errors: 0 packets
qlen: 1023 packets
 maxqlen: 1023 packets
 oactive: false
igc2:0:txq:2
 packets: 7975725 packets
   bytes: 9745852229 bytes
  qdrops: 95687 packets
  errors: 0 packets
qlen: 1023 packets
 maxqlen: 1023 packets
 oactive: false
igc2:0:txq:3
 packets: 13544084 packets
   bytes: 16273238465 bytes
  qdrops: 34909 packets
  errors: 0 packets
qlen: 1023 packets
 maxqlen: 1023 packets
 oactive: false


> The mbuf (and all other) pool counters from vmstat -m are easy to get too.

Memory statistics by bucket size
Size   In Use   Free   Requests  HighWater  Couldfree
  16 1060732 1179241280  0
  32  980812 162242 640  4
  64 2117 59 758707 320   4131
 128   510540 52   25550193 160637
 256  225287  63967  80  10123
 512  395 29  39457  40  0
1024  125  7 163014  20  4
204816443 71  46030  10  21091
4096   58 10  83403   5   6680
8192   58  

Re: mbufs growing in 7.2

2022-11-30 Thread David Gwynne



> On 30 Nov 2022, at 14:36, Greg Steuck  wrote:
> 
> Greg Steuck  writes:
> 
>> The watched kettle never boiled. No more crashes in over two weeks
>> (instead of two in the first week). I tried a loop of alternating iperf3
>> tcp and udp to no ill effect. I still see the growth in the metrics I
>> reported, yet the system remained stable.
>> 
>> I applied the patch below and am still collecting the metrics. I doubt
>> they are responsible for the original problem.
> 
> This time the problem fired after 6 days of uptime. The system is
> running 7.2 + igc off-by-one fix.
> 
> The symptoms are:
> 
> 1 A single interface is "stuck", sometimes ping replies come back,
>  incoming packets are visible in tcpdump, no reply packets
>  appear in tcpdump (nor received on the other machine).
> 2 Other interfaces are fine to the point that I can ssh over one of
>  them to debug.
> 3 The stuck interface remains stuck after ifconfig down/up.
> 4 The stuck interface remains stuck throughout pfctl -d/-e. This did
>  reenable ping replies, that were stuck for a bit.
> 5 netstat -i shows large (and rising) value in Ofail column
> 6 netstat -m shows a number of 'mbuf 2112' stuck fairly high:
>  5539 mbufs in use:
>5441 mbufs allocated to data
>7 mbufs allocated to packet headers
>91 mbufs allocated to socket names and addresses
>  40/112 mbuf 2048 byte clusters in use (current/peak)
>  4510/6630 mbuf 2112 byte clusters in use (current/peak)
> 7 no established connections to speak of
> 
> The interface stickiness is mainly its inability to send higher level
> protocol replies. E.g. a TCP connection from a remote system doesn't get
> completed. Or SYN/ACK completes, but then I can see the data to the
> machine and not even ACKs coming back. ktrace shows the application
> writes the data which just never makes it down the stack to where
> tcpdump would see it.
> 
> Obligatory graph of seemingly related counters: "mbufs in use",
> "mbuf 2112 byte clusters", and "Ofail" counts
> 
> https://docs.google.com/spreadsheets/d/e/2PACX-1vRr61USv9VNvaIq9qEs8W1wy869ai6MwNmevDLmxLJOV3DaUBcrRUzwzNZP92syltrWfrmIUWq7qevG/pubchart?oid=202363413=interactive
> 
> There's a long tail to the left covering 5 days of mostly nothing
> happening. The drop of "mbufs in use" from 7355 to 5739 is around the
> time I removed the system from service and possibly when I cycled
> ifconfig down/up (I have no record).
> 
> The system is still up and moved off to the side for debugging. It can
> remain up for as long as we have things to try (and power utility
> cooperates).

Ofails are the sum of output errors and queue drops. Can you figure out which 
one it is with netstat -I igc0 -e and netstat -I igc0 -d?
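
A minimal way to sample both over time, assuming igc2 is the interface of
interest and a 5-minute interval is good enough (sed -n 2p picks the
link-level row out of each report):

$ while :; do date; netstat -I igc2 -d | sed -n 2p; \
      netstat -I igc2 -e | sed -n 2p; sleep 300; done | tee ofail-log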

The state of the rxring accounting according to "systat mb" output would be 
interesting too. kstat output is easy to get too, though I'm not sure it will 
be useful in this situation.

The mbuf (and all other) pool counters from vmstat -m are easy to get too.


Re: mbufs growing in 7.2

2022-11-29 Thread Greg Steuck
Greg Steuck  writes:

> The watched kettle never boiled. No more crashes in over two weeks
> (instead of two in the first week). I tried a loop of alternating iperf3
> tcp and udp to no ill effect. I still see the growth in the metrics I
> reported, yet the system remained stable.
>
> I applied the patch below and am still collecting the metrics. I doubt
> they are responsible for the original problem.

This time the problem fired after 6 days of uptime. The system is
running 7.2 + igc off-by-one fix.

The symptoms are:

1 A single interface is "stuck", sometimes ping replies come back,
  incoming packets are visible in tcpdump, no reply packets
  appear in tcpdump (nor received on the other machine).
2 Other interfaces are fine to the point that I can ssh over one of
  them to debug.
3 The stuck interface remains stuck after ifconfig down/up.
4 The stuck interface remains stuck throughout pfctl -d/-e. This did
  re-enable ping replies, which had been stuck for a bit.
5 netstat -i shows a large (and rising) value in the Ofail column
6 netstat -m shows a number of 'mbuf 2112' stuck fairly high:
  5539 mbufs in use:
5441 mbufs allocated to data
7 mbufs allocated to packet headers
91 mbufs allocated to socket names and addresses
  40/112 mbuf 2048 byte clusters in use (current/peak)
  4510/6630 mbuf 2112 byte clusters in use (current/peak)
7 no established connections to speak of

The interface stickiness mainly shows up as an inability to send higher-level
protocol replies. E.g. a TCP connection from a remote system doesn't get
completed. Or the SYN/ACK exchange completes, but then I can see data reaching
the machine and not even ACKs coming back. ktrace shows the application
writing the data, which just never makes it down the stack to where
tcpdump would see it.
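
One way to see that correlation side by side; the daemon, port, and interface
below are placeholders, not necessarily what was traced here:

$ tcpdump -ni igc2 port 53 &          # what actually reaches the wire
$ ktrace -p $(pgrep unbound)          # what the application does
$ kdump | grep -B1 -A1 sendto         # writes that never show up in tcpdump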

Obligatory graph of seemingly related counters: "mbufs in use",
"mbuf 2112 byte clusters", and "Ofail" counts

https://docs.google.com/spreadsheets/d/e/2PACX-1vRr61USv9VNvaIq9qEs8W1wy869ai6MwNmevDLmxLJOV3DaUBcrRUzwzNZP92syltrWfrmIUWq7qevG/pubchart?oid=202363413=interactive

There's a long tail to the left covering 5 days of mostly nothing
happening. The drop of "mbufs in use" from 7355 to 5739 is around the
time I removed the system from service and possibly when I cycled
ifconfig down/up (I have no record).

The system is still up and moved off to the side for debugging. It can
remain up for as long as we have things to try (and power utility
cooperates).

Thanks
Greg



Re: mbufs growing in 7.2

2022-11-23 Thread Joe Miller
I also was not able to recreate the issues I initially saw with high
traffic on 7.2 after I re-upgraded. I can't explain it, but OpenBSD 7.2
igc has been solid for about 1.5 weeks now, with no patches.

On Tue, Nov 22, 2022 at 11:21 PM Greg Steuck  wrote:
>
> The watched kettle never boiled. No more crashes in over two weeks
> (instead of two in the first week). I tried a loop of alternating iperf3
> tcp and udp to no ill effect. I still see the growth in the metrics I
> reported, yet the system remained stable.
>
> I applied the patch below and am still collecting the metrics. I doubt
> they are responsible for the original problem.
>
> Thanks
> Greg
>
> Moritz Buhl  writes:
>
> > Hi Greg, Hi Joe,
> >
> > dlg@ hinted to me that the ring might overwrite its own starting
> > position with the current code.
> >
> > Does this help?
> > mbuhl
> >
> > Index: dev/pci/if_igc.c
> > ===
> > RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
> > retrieving revision 1.9
> > diff -u -p -r1.9 if_igc.c
> > --- dev/pci/if_igc.c  2 Jun 2022 07:41:17 -   1.9
> > +++ dev/pci/if_igc.c  8 Nov 2022 10:35:39 -
> > @@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
> >   mask = sc->num_tx_desc - 1;
> >
> >   for (;;) {
> > - if (free <= IGC_MAX_SCATTER) {
> > + if (free <= IGC_MAX_SCATTER + 1) {
> >   ifq_set_oactive(ifq);
> >   break;
> >   }
> > @@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
> >   /* Consume the first descriptor */
> >   prod++;
> >   prod &= mask;
> > + free--;
> >   }
> >
> >   for (i = 0; i < map->dm_nsegs; i++) {
>



Re: mbufs growing in 7.2

2022-11-22 Thread Greg Steuck
The watched kettle never boiled. No more crashes in over two weeks
(instead of two in the first week). I tried a loop of alternating iperf3
tcp and udp to no ill effect. I still see the growth in the metrics I
reported, yet the system remained stable.

I applied the patch below and am still collecting the metrics. I doubt
they are responsible for the original problem.
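
For anyone wanting to try the same diff, applying it and rebuilding roughly
follows the usual kernel-build steps; the diff file name here is arbitrary:

$ cd /usr/src/sys && patch -p0 < igc-free.diff
$ cd arch/$(machine)/compile/GENERIC.MP
$ make obj && make config && make && doas make install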

Thanks
Greg

Moritz Buhl  writes:

> Hi Greg, Hi Joe,
>
> dlg@ hinted to me that the ring might overwrite its own starting
> position with the current code.
>
> Does this help?
> mbuhl
>
> Index: dev/pci/if_igc.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
> retrieving revision 1.9
> diff -u -p -r1.9 if_igc.c
> --- dev/pci/if_igc.c  2 Jun 2022 07:41:17 -   1.9
> +++ dev/pci/if_igc.c  8 Nov 2022 10:35:39 -
> @@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
>   mask = sc->num_tx_desc - 1;
>  
>   for (;;) {
> - if (free <= IGC_MAX_SCATTER) {
> + if (free <= IGC_MAX_SCATTER + 1) {
>   ifq_set_oactive(ifq);
>   break;
>   }
> @@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
>   /* Consume the first descriptor */
>   prod++;
>   prod &= mask;
> + free--;
>   }
>  
>   for (i = 0; i < map->dm_nsegs; i++) {



Re: mbufs growing in 7.2

2022-11-21 Thread Masturbating monkey
No, they don't grow anymore. Apparently the initial growth was associated with
network services (tor, i2pd and the like), which do not immediately come up to
full speed. Using iperf also does not cause mbufs to grow.
thx! you're the best ^.^



Re: mbufs growing in 7.2

2022-11-19 Thread Masturbating monkey
> Does this help?
> mbuhl

> Index: dev/pci/if_igc.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
> retrieving revision 1.9
> diff -u -p -r1.9 if_igc.c
> --- dev/pci/if_igc.c2 Jun 2022 07:41:17 -   1.9
> +++ dev/pci/if_igc.c8 Nov 2022 10:35:39 -
> @@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
> mask = sc->num_tx_desc - 1;
>  
> for (;;) {
> -   if (free <= IGC_MAX_SCATTER) {
> +   if (free <= IGC_MAX_SCATTER + 1) {
> ifq_set_oactive(ifq);
> break;
> }
> @@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
> /* Consume the first descriptor */
> prod++;
> prod &= mask;
> +   free--;
> }
>  
> for (i = 0; i < map->dm_nsegs; i++) {

I wouldn't say it helps. It grew by a hundred (from ~550 to ~650) in four
hours, and during that time there was no network load. But of course it needs
to be observed for longer.
654 mbufs in use:
591 mbufs allocated to data
52 mbufs allocated to packet headers
11 mbufs allocated to socket names and addresses
255/320 mbuf 2048 byte clusters in use (current/peak)
311/450 mbuf 2112 byte clusters in use (current/peak)
0/48 mbuf 4096 byte clusters in use (current/peak)
16/32 mbuf 8192 byte clusters in use (current/peak)
0/14 mbuf 9216 byte clusters in use (current/peak)
0/10 mbuf 12288 byte clusters in use (current/peak)
0/8 mbuf 16384 byte clusters in use (current/peak)
0/16 mbuf 65536 byte clusters in use (current/peak)
3656/3656/1048576 Kbytes allocated to network (current/peak/max)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines



Re: mbufs growing in 7.2

2022-11-08 Thread Moritz Buhl
Hi Greg, Hi Joe,

dlg@ hinted to me that the ring might overwrite its own starting
position with the current code.

Does this help?
mbuhl

Index: dev/pci/if_igc.c
===
RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
retrieving revision 1.9
diff -u -p -r1.9 if_igc.c
--- dev/pci/if_igc.c2 Jun 2022 07:41:17 -   1.9
+++ dev/pci/if_igc.c8 Nov 2022 10:35:39 -
@@ -978,7 +978,7 @@ igc_start(struct ifqueue *ifq)
mask = sc->num_tx_desc - 1;
 
for (;;) {
-   if (free <= IGC_MAX_SCATTER) {
+   if (free <= IGC_MAX_SCATTER + 1) {
ifq_set_oactive(ifq);
break;
}
@@ -1005,6 +1005,7 @@ igc_start(struct ifqueue *ifq)
/* Consume the first descriptor */
prod++;
prod &= mask;
+   free--;
}
 
for (i = 0; i < map->dm_nsegs; i++) {
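
For readers trying to follow the accounting, here is a small stand-alone toy
model of what the two hunks change. It is not driver code; it only shows how
consuming the context descriptor without decrementing "free" lets the producer
index wrap onto the consumer, i.e. the ring overwriting its own start:

#include <stdio.h>

#define NDESC	8
#define MAXSEG	3	/* stand-in for IGC_MAX_SCATTER */

static void
run(int fixed)
{
	unsigned int prod = 0, cons = 0, mask = NDESC - 1;
	int free = NDESC, pkt;

	for (pkt = 0; ; pkt++) {
		/* first hunk: also leave room for the context descriptor */
		if (free <= MAXSEG + (fixed ? 1 : 0)) {
			printf("%s: stop at pkt %d, free=%d\n",
			    fixed ? "fixed" : "buggy", pkt, free);
			break;
		}
		/* consume the context descriptor */
		prod = (prod + 1) & mask;
		if (fixed)
			free--;		/* second hunk: the missing decrement */
		/* consume the data descriptors */
		prod = (prod + MAXSEG) & mask;
		free -= MAXSEG;
		printf("%s: pkt %d prod=%u cons=%u free=%d%s\n",
		    fixed ? "fixed" : "buggy", pkt, prod, cons, free,
		    prod == cons ? "  <- producer wrapped onto the consumer" : "");
	}
}

int
main(void)
{
	run(0);
	run(1);
	return 0;
}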



Re: mbufs growing in 7.2

2022-11-07 Thread Joe Miller
I have had (somewhat) similar sounding issues with igc and 7.2
immediately after upgrading. I am able to reproduce it reliably by
running an iperf3 test to another machine on the network. Within a few
seconds the NIC will stop working. No traffic, no ping. No errors in
dmesg or syslog.
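
Roughly the kind of load that triggers it, with placeholder addresses:

$ iperf3 -s                              # on another machine on the LAN
$ iperf3 -c 192.168.1.10 -P 4 -t 60      # from/through the router, TCP
$ iperf3 -c 192.168.1.10 -u -b 1G -t 60  # same again with UDP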

The issue also arises on its own, without iperf3 stressing the NIC, if
something on the network temporarily spikes traffic, such as a large
download. It's bad enough that I have had to roll back to 7.1 to keep
my network online.

The box is a topton N5105 (jasper lake) fanless box from aliexpress.
Unfortunately I did not save a full dmesg from when I had 7.2 running,
however here is some info from 7.1 dmesg re: the NICs:

igc0 at pci2 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc1 at pci3 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc2 at pci4 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc3 at pci5 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address

I have not tried pulling out the offload checksum patches in 7.2 but
may try that at some point, time permitting.

On Sun, Nov 6, 2022 at 12:45 PM Greg Steuck  wrote:
>
> Greg Steuck  writes:
>
> > My router has become unstable since upgrading from 7.1-stable to
> > 7.2. After several days of uptime the machine gets into a state where
> > some applications (unbound & dhcpd) report ENOBUFS (No buffer space
> > available). At that time the machine is pingable over all the
> > interfaces, but only the upstream interface seems functional (igc0).
> > The networks downstream of the router can't get much data across. I
> > don't have a good characterization of this.
> >
> > At first I suspected this had something to do with the igc checksum
> > offloading commit, so I am now running 7.2 with this reverted:
> > "Implement and enable IPv4, TCP, and UDP checksum offloading for igc."
>
> So far it appears that reverting improved stability. I had 2 crashes
> last week and 0 in the last 8 days.
>
> > I also started monitoring some counters that appeared relevant with
> > this trivial loop:
> >
> > $ while : ; do date; netstat -s | grep err; netstat -m; netstat -ni | grep 
> > '^[Ni]'; sleep 300; done | tee err-log
> >
> > I have some 38 hours worth of counters as of now. I observe an upward
> > trend in "mbuf 2112" and "mbufs in use", I extracted the values with
> >
> > $ perl -ne 'print "$x,$1\n" if m/^(\d+).*mbuf 2112/; $x=$1 if 
> > /^(\d+)\smbufs in use/;' err-log
> >
> > It starts out 610,410-ish and ends at 717,513. I have a picture for
> > those visually inclined: https://photos.app.goo.gl/DZGCrJnJDohPrVyZ8
>
> The growth is very slow, so I'm not sure it matters much. The 8 day
> graph still shows a very slow ramp but it'll take a long time for
> that to become a problem: https://photos.app.goo.gl/H64FRMkrfrY3hi6f7
> (8 days worth of 5-minute-spaced samples)
>
> I'm reapplying the patch keeping the same monitoring on. Hopefully
> something will be visible in those stats. If not, at least we'll learn
> whether the diff correlates with the failure.
>
> Thanks
> Greg
>



Re: mbufs growing in 7.2

2022-11-06 Thread Greg Steuck
Greg Steuck  writes:

> My router has become unstable since upgrading from 7.1-stable to
> 7.2. After several days of uptime the machine gets into a state where
> some applications (unbound & dhcpd) report ENOBUFS (No buffer space
> available). At that time the machine is pingable over all the
> interfaces, but only the upstream interface seems functional (igc0).
> The networks downstream of the router can't get much data across. I
> don't have a good characterization of this.
>
> At first I suspected this had something to do with the igc checksum
> offloading commit, so I am now running 7.2 with this reverted:
> "Implement and enable IPv4, TCP, and UDP checksum offloading for igc."

So far it appears that reverting improved stability. I had 2 crashes
last week and 0 in the last 8 days.

> I also started monitoring some counters that appeared relevant with
> this trivial loop:
>
> $ while : ; do date; netstat -s | grep err; netstat -m; netstat -ni | grep 
> '^[Ni]'; sleep 300; done | tee err-log
>
> I have some 38 hours worth of counters as of now. I observe an upward
> trend in "mbuf 2112" and "mbufs in use", I extracted the values with
>
> $ perl -ne 'print "$x,$1\n" if m/^(\d+).*mbuf 2112/; $x=$1 if /^(\d+)\smbufs 
> in use/;' err-log
>
> It starts out 610,410-ish and ends at 717,513. I have a picture for
> those visually inclined: https://photos.app.goo.gl/DZGCrJnJDohPrVyZ8

The growth is very slow, so I'm not sure it matters much. The 8 day
graph still shows a very slow ramp but it'll take a long time for
that to become a problem: https://photos.app.goo.gl/H64FRMkrfrY3hi6f7
(8 days worth of 5-minute-spaced samples)

I'm reapplying the patch keeping the same monitoring on. Hopefully
something will be visible in those stats. If not, at least we'll learn
whether the diff correlates with the failure.

Thanks
Greg