Friendly reminder. I would really like to keep debugging this issue
and send a more useful report, but I need some direction on where to
look.
Copying tech@, hoping to attract the attention of some developers.
All the best
On 2021-12-07 19:39, Alessandro De Laurenzis wrote:
Greetings,
I recently installed OpenBSD 7.0 on an old Core 2 Duo machine (Compaq
610, complete dmesg attached). Some years ago it ran 5.5/5.6/5.7
without any relevant issues; after that, it was used as a home server
with Debian.
mskc0 at pci4 dev 0 function 0 "Marvell Yukon 88E8042" rev 0x10, Yukon-2 FE+ rev. A0 (0x0): msi
msk0 at mskc0 port A: address 18:a9:05:94:ab:19
eephy0 at msk0 phy 0: 88E3016 10/100 PHY, rev. 0
I noticed that trunk(4) failover is broken when the Ethernet cable is
plugged in. If the machine starts with the cable connected, no lease
is acquired from the DHCP server; if the cable is plugged in later,
switching from wifi to Ethernet breaks the connection. In both cases,
the trunk interface and msk0 both report "no carrier".
It's worth noting that when msk0 is configured stand-alone (i.e.,
without trunk(4) failover), the connection works and is stable.
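For readers unfamiliar with this kind of setup, a failover trunk
between a wired and a wireless interface is typically configured along
the lines of the sketch below. It is illustrative only and is not
copied from this machine: the wireless interface name (iwn0), the
network name, and the passphrase are placeholders. msk0 is listed
first so that it acts as the master port.

```
# /etc/hostname.msk0
up

# /etc/hostname.iwn0  (iwn0, mynet and mypassphrase are placeholders)
nwid mynet wpakey mypassphrase
up

# /etc/hostname.trunk0
trunkproto failover trunkport msk0 trunkport iwn0
dhcp
```

With such a configuration, the address is requested on trunk0 only,
and traffic is expected to move to the wireless port whenever msk0
loses its link.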
Since I didn't remember any similar problems with 5.x, I did some
bisecting. My conclusion is that the functionality broke between 6.2
and 6.3, specifically with the following commit:
RCS file: /cvs/src/sys/dev/pci/if_msk.c,v
----------------------------
revision 1.131
date: 2018/01/06 03:11:04; author: dlg; state: Exp; lines: +251 -311; commitid: BhB8LisF92o4xfOK;
rework the transmit and receive paths to address reliability issues.
phessler@ has been having trouble with msk on overdrive 1000s. some
of the issues relate to the driver not coping with exhaustion of
mbufs for the rx ring, the other issues are corruption of the mcl9k
pool that msk uses.
this diff adds a timeout that the rx refill code uses when the rx
ring is empty and cannot be filled. it'll periodically retry the
ring refill until it can get some mbufs in the air again.
the current code made hunting for the mcl9k issue too hard, so this
rewrites it to be simpler and more like other drivers. there's now
just arrays of mbuf pointers and dmamaps to shadow the hardware
ring entries, and producer and consumer indexes. what was there
before had linkes lists of something to hold mbuf pointers and
dmamaps, and some way to go from the ring to go back to that. i
think, it was hard to tell what was happening.
this also copies the ADDR64 handling on the tx ring to the rx ring.
this potentially makes more rx descriptors available, but that can
happen later.
in hindsight the mcl9k problem could have been from letting if_rxr
allocate the entier ring. if every descriptor was filled, the chip
may have run around the ring when it shouldnt have. giving rxr one
less descriptor than there is on the ring may have fixed the problem
too.
this work also makes it easier to make msk mpsafe.
tested by an ok phessler@
ok kettenis@ deraadt@
=============================================================================
and the corresponding one for sys/dev/pci/if_mskvar.h (revision 1.14,
same log).
On a fresh 6.3 install, which showed the issue, I reverted the two
files to revisions 1.130 and 1.13 respectively, and trunk(4) failover
worked again.
The diff is too long and complex for me to say exactly where the
problem lies, but I hope this report contains enough information to
start an analysis (I'm copying the developers involved, in case they
are not reading this list). Of course, I'm available to test any
patches (on 7.0 or -current) and to add further details if needed.
Please note that the dmesg is from OpenBSD 6.3, since that is the
version currently installed on the laptop; if you're interested in
the dmesg from 7.0 or -current, just let me know.
All the best