Greg Steuck <gne...@openbsd.org> writes:

> The watched kettle never boiled. No more crashes in over two weeks
> (instead of two in the first week). I tried a loop of alternating iperf3
> tcp and udp to no ill effect. I still see the growth in the metrics I
> reported, yet the system remained stable.
>
> I applied the patch below and am still collecting the metrics. I doubt
> they are responsible for the original problem.

This time the problem fired after 6 days of uptime. The system is
running 7.2 + igc off-by-one fix.

The symptoms are:

1 A single interface is "stuck": ping replies sometimes come back and
  incoming packets are visible in tcpdump, but no other reply packets
  appear in tcpdump (nor are they received on the other machine).
2 Other interfaces are fine to the point that I can ssh over one of
  them to debug.
3 The stuck interface remains stuck after ifconfig down/up.
4 The stuck interface remains stuck throughout pfctl -d/-e. This did
  re-enable ping replies, which had been stuck for a bit.
5 netstat -i shows a large (and rising) value in the Ofail column.
6 netstat -m shows the number of 'mbuf 2112' clusters stuck fairly high:
  5539 mbufs in use:
        5441 mbufs allocated to data
        7 mbufs allocated to packet headers
        91 mbufs allocated to socket names and addresses
  40/112 mbuf 2048 byte clusters in use (current/peak)
  4510/6630 mbuf 2112 byte clusters in use (current/peak)
7 no established connections to speak of
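
For anyone who wants to track the same counters over time, here is a
minimal sketch of pulling them out of netstat -m output. It assumes the
output has the line format shown in item 6 above; the function name and
the SAMPLE text are illustrative only, not part of any existing tool.

```python
import re

# Sample text copied from the netstat -m excerpt in this message.
SAMPLE = """\
5539 mbufs in use:
        5441 mbufs allocated to data
        7 mbufs allocated to packet headers
        91 mbufs allocated to socket names and addresses
40/112 mbuf 2048 byte clusters in use (current/peak)
4510/6630 mbuf 2112 byte clusters in use (current/peak)
"""

def parse_netstat_m(text):
    """Return (mbufs_in_use, {cluster_size: (current, peak)}).

    Hypothetical helper: it only understands the two line shapes
    seen in the excerpt above and ignores everything else.
    """
    mbufs = None
    clusters = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"(\d+) mbufs in use", line)
        if m:
            mbufs = int(m.group(1))
            continue
        m = re.match(r"(\d+)/(\d+) mbuf (\d+) byte clusters in use", line)
        if m:
            current, peak, size = (int(g) for g in m.groups())
            clusters[size] = (current, peak)
    return mbufs, clusters

mbufs, clusters = parse_netstat_m(SAMPLE)
print(mbufs, clusters)
```

Feeding it periodic netstat -m snapshots would give a time series of
the 2112 byte cluster count to set against the Ofail counter.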

The interface's stickiness is mainly an inability to send higher level
protocol replies. E.g. a TCP connection from a remote system doesn't get
completed. Or the SYN/ACK exchange completes, but then I can see the data
arriving at the machine and not even ACKs coming back. ktrace shows the
application writing the data, which just never makes it down the stack to
where tcpdump would see it.

Obligatory graph of seemingly related counters: "mbufs in use",
"mbuf 2112 byte clusters", and "Ofail" counts:

https://docs.google.com/spreadsheets/d/e/2PACX-1vRr61USv9VNvaIq9qEs8W1wy869ai6MwNmevDLmxLJOV3DaUBcrRUzwzNZP92syltrWfrmIUWq7qevG/pubchart?oid=202363413&format=interactive

There's a long tail to the left covering 5 days of mostly nothing
happening. The drop of "mbufs in use" from 7355 to 5739 is around the
time I removed the system from service and possibly when I cycled
ifconfig down/up (I have no record).

The system is still up and has been moved off to the side for debugging.
It can remain up for as long as we have things to try (and the power
utility cooperates).

Thanks
Greg
