I have had (somewhat) similar-sounding issues with igc on 7.2,
starting immediately after upgrading. I can reproduce it reliably by
running an iperf3 test to another machine on the network. Within a few
seconds the NIC stops working entirely: no traffic, no ping, and no
errors in dmesg or syslog.
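
For reference, the test is nothing fancy; roughly the following, where
the address is just a placeholder for another box on the LAN:

$ iperf3 -s                      # on the other machine
$ iperf3 -c 192.168.1.20 -t 30   # on the N5105 box; it wedges within seconds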

The issue also arises on its own, without iperf3 stressing the NIC, if
something on the network temporarily spikes traffic, such as a large
download. It's bad enough that I have had to roll back to 7.1 to keep
my network online.

The box is a Topton N5105 (Jasper Lake) fanless box from AliExpress.
Unfortunately I did not save a full dmesg from when 7.2 was running;
however, here is the relevant info from the 7.1 dmesg for the NICs:

igc0 at pci2 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc1 at pci3 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc2 at pci4 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
igc3 at pci5 dev 0 function 0 "Intel I225-V" rev 0x03, msix, 4 queues, address
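
(Those lines are just grepped from the saved boot dmesg, e.g.:

$ grep '^igc' /var/run/dmesg.boot

so the 7.2 equivalent can be pulled the same way if I upgrade again.)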

I have not yet tried pulling out the checksum offload patches from 7.2,
but may try that at some point, time permitting.
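
If I do get to it, the rough plan would be to back the diff out of the
7.2 source tree and rebuild the kernel; something like the sketch
below, where 1.X/1.Y are placeholder revisions I would look up with
cvs log first:

$ cd /usr/src/sys/dev/pci
$ cvs log if_igc.c                    # find the checksum offload commit
$ cvs diff -up -r1.X -r1.Y if_igc.c > igc-cksum.diff
$ patch -R -p0 < igc-cksum.diff       # back the change out, then rebuild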

On Sun, Nov 6, 2022 at 12:45 PM Greg Steuck <gne...@openbsd.org> wrote:
>
> Greg Steuck <gne...@openbsd.org> writes:
>
> > My router has become unstable since upgrading from 7.1-stable to
> > 7.2. After several days of uptime the machine gets into a state where
> > some applications (unbound & dhcpd) report ENOBUFS (No buffer space
> > available). At that time the machine is pingable over all the
> > interfaces, but only the upstream interface seems functional (igc0).
> > The networks downstream of the router can't get much data across. I
> > don't have a good characterization of this.
> >
> > At first I suspected this had something to do with the igc checksum
> > offloading commit, so I am now running 7.2 with this reverted:
> > "Implement and enable IPv4, TCP, and UDP checksum offloading for igc."
>
> So far it appears that reverting improved stability. I had 2 crashes
> last week and 0 in the last 8 days.
>
> > I also started monitoring some counters that appeared relevant with
> > this trivial loop:
> >
> > $ while : ; do date; netstat -s | grep err; netstat -m; netstat -ni | grep 
> > '^[Ni]'; sleep 300; done | tee err-log
> >
> > I have some 38 hours' worth of counters as of now. I observe an upward
> > trend in "mbuf 2112" and "mbufs in use"; I extracted the values with
> >
> > $ perl -ne 'print "$x,$1\n" if m/^(\d+).*mbuf 2112/; $x=$1 if 
> > /^(\d+)\smbufs in use/;' err-log
> >
> > It starts out 610,410-ish and ends at 717,513. I have a picture for
> > those visually inclined: https://photos.app.goo.gl/DZGCrJnJDohPrVyZ8
>
> The growth is very slow, so I'm not sure it matters much. The 8-day
> graph still shows a very slow ramp, but it'll take a long time for
> that to become a problem: https://photos.app.goo.gl/H64FRMkrfrY3hi6f7
> (8 days worth of 5-minute-spaced samples)
>
> I'm reapplying the patch and keeping the same monitoring running.
> Hopefully something will be visible in those stats. If not, at least
> we'll learn whether the diff correlates with the failure.
>
> Thanks
> Greg
>
