On Mon, Nov 25, 2024 at 1:09 PM Chris Packham <[email protected]> wrote:
>
> On Sat, Nov 23, 2024 at 3:40 PM Tom Rini <[email protected]> wrote:
> >
> > On Wed, Nov 20, 2024 at 11:29:43AM +1300, Chris Packham wrote:
> > > Hi U-Boot,
> > >
> > > We've hit a weird problem at $dayjob with a board using the Marvell
> > > CN9130 SoC and using the asix88179 USB-Eth adapter.
> > >
> > > The problem is after enabling and unrelated feature in u-boot the
> > > asix88179 fails to receive data (I can confirm that the link partner
> > > does see packets in the transmit direction)
> > >
> > > => version
> > > U-Boot 2022.01 (Nov 08 2024 - 09:45:44 +0000)
> > > => usb start
> > > starting USB...
> > > Bus usb3@500000: Register 2000120 NbrPorts 2
> > > Starting the controller
> > > USB XHCI 1.00
> > > scanning bus usb3@500000 for devices... 2 USB Device(s) found
> > > scanning usb for storage devices... 0 Storage Device(s) found
> > > => ping ${serverip}
> > > Waiting for Ethernet connection... unable to connect.
> > > Reset Ethernet Device
> > > Waiting for Ethernet connection... done.
> > > Using ax88179_eth device
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > >
> > > Abort
> > > ping failed; host 10.37.233.65 is not alive
> > > => <INTERRUPT>
> > >
> > > Debugging a little we can see that the -EIO is actually because
> > > xhci_bulk_tx() hits a timeout from xhci_wait_for_event().
> > >
> > > We think this is triggered by the u-boot image size crossing some
> > > boundary (the problem seems to start when .bss_end crosses
> > > 0x00000000000f0000) although I've so far been unable to find
> > > specifically why that might be. As far as I can tell u-boot is being
> > > built relocatably and nothing is overlapping. I also considered that
> > > ATF might be preventing access to something but so far I see no
> > > evidence of this.
> > >
> > > If I turn off some features to reduce the build size the problem goes
> > > away. That is actually how we've avoided the immediate issue, although
> > > that means the problem will likely come back and an inopportune time.
> > >
> > > Does anyone have any ideas as to what the true root cause might be?
> > > I'm a bit stumped.
> >
> > Hummmm. Since you note it seems to be when a threshold is crossed in BSS
> > size, add something to the BSS of a variable size that you control, and
> > after confirming that you can replicate the problem this way, grow it
> > just past the limit and compare u-boot.map files in the works/fails
> > cases to see just what's being moved around?
>
> So I tried a little experiment
>
> diff --git a/net/net.c b/net/net.c
> index b003b84b3537..a6def9785133 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -180,6 +180,10 @@ u32 net_boot_file_size;
>  /* Boot file size in blocks as reported by the DHCP server */
>  u32 net_boot_file_expected_size_in_blocks;
>
> +#define DUMMY_SIZE (1 << 11)
> +
> +int dummy[DUMMY_SIZE] = {0};
> +
>  static uchar net_pkt_buf[(PKTBUFSRX+1) * PKTSIZE_ALIGN + PKTALIGN];
>  /* Receive packets */
>  uchar *net_rx_packets[PKTBUFSRX];
> @@ -211,6 +215,7 @@ int __maybe_unused net_busy_flag;
>  static int on_ipaddr(const char *name, const char *value, enum env_op op,
>         int flags)
>  {
> +       dummy[DUMMY_SIZE - 1] = -1;
>         if (flags & H_PROGRAMMATIC)
>                 return 0;
>
>
> If I make DUMMY_SIZE (1 << 10) I don't see the problem. With
> DUMMY_SIZE (1 << 11) I can see the problem. If I make it DUMMY_SIZE (1
> << 14) then the problem goes away again.
>
> The obvious things that are moving are net_rx_packet,
> net_rx_packet_len and net_rx_packets. I'll see if I can narrow things
> down to specifically which of these is being problematic.
>
The plot thickens on this one. First I found that even if I moved my
dummy block after the symbols I suspected, the failure remained. I kept
narrowing things down and found that my dummy array needed to have a
length between 0x800 and 0x8e0 to cause an issue. While trying to debug
why that was, I found that I could fix a failing system with a
`usb reset`. I now suspect that something in the mix is relying on
uninitialised memory (or perhaps the calculation for clearing out .bss
is slightly off).
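
One way to test the second theory (the .bss clear being slightly off) would be to scan the region for stray non-zero bytes before anything has had a chance to write to it. The following is only a rough sketch: first_nonzero() is a made-up helper, not an existing U-Boot function. It also assumes the linker symbols __bss_start and __bss_end are usable at the point where it is called (for example, very early after relocation). That assumption would need checking against the actual board code.

```c
#include <stddef.h>

/*
 * Hypothetical debug helper (not part of U-Boot): return the offset of
 * the first non-zero byte in [start, start + len), or -1 if the whole
 * range is zero.
 *
 * Assumed usage inside U-Boot, before any code has written to .bss:
 *
 *   long off = first_nonzero((const unsigned char *)__bss_start,
 *                            (size_t)(__bss_end - __bss_start));
 *   if (off >= 0)
 *           printf(".bss not clear at offset 0x%lx\n", off);
 */
static long first_nonzero(const unsigned char *start, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (start[i] != 0)
			return (long)i;
	}

	return -1;
}
```

If the scan comes back clean, the other theory (something reading memory it never initialised, outside of .bss) becomes the more likely one.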

