On 2023-03-16 21:44, Attila Nagy wrote:
Hi,

As this is super annoying, I'm willing to pay a $500 bounty for solving this issue (whoever is first, though I don't anticipate much competition :) An invoice would be best, but I'm willing to pay individuals as well). I can't give remote access, but I can run debug builds with a serial console. The machines run the stable/13 branch.

I have a bunch of netbooted machines in a cluster: one set is older (HP DL80 G9, 2x8C, Intel I350 -igb- NICs), the other is newer (HP XL225n G10, AMD EPYC 2x16C, BCM57412 -bnxt- NICs).
All of these boot from the network, which is basically:
- get IP and options with DHCP with the help of the NIC's PXE stack
- get the loader and kernel, start it
- do another round of DHCP from the kernel (bootp_subr.c)
- mount the root via NFS and let everything work as usual

The problem is that the newer machines take an unpredictable amount of time to boot. The older ones (with igb NICs) work reliably and always boot fast. On the newer ones, getting an IP address via DHCP (bootpc_call from bootp_subr.c) either succeeds normally (in a few seconds) or takes a very long time. Measured boot times commonly range from tens of minutes to a few hours (1-6).
Sometimes it just gets stuck and can't get past bootpc_call (getting the DHCP lease).

Do you have STP/RSTP enabled on the switch ports? When the link goes down during the switch from firmware mode to kernel mode, the port goes back to blocking. If the DHCP requests don't reach the DHCP server because of this, and the link goes down and up again while retrying (I don't know if this happens), you will hit the same problem on the next try. As a simple test, you could put a dumb unmanaged switch between your core network and the server.
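
To illustrate the effect I mean, here is a rough toy model in Python. It is only a sketch under assumptions: it takes the classic STP default of a 30-second wait (15 s listening + 15 s learning) after every link-up before the port forwards frames, and the request and link-flap timestamps are invented for illustration. They are not the actual retry schedule of bootpc_call or measurements from your machines.

```python
# Toy model: which DHCP requests reach the server if the switch port must
# wait out the STP forward delay after every link-up event?

# Assumption: classic STP defaults (15 s listening + 15 s learning).
FORWARD_DELAY = 30.0


def forwarding_windows(link_up_times, link_down_times, delay=FORWARD_DELAY):
    """Return (start, end) intervals during which the port forwards frames."""
    windows = []
    for up in sorted(link_up_times):
        # The port can forward until the next link-down after this link-up.
        later_downs = [d for d in link_down_times if d > up]
        end = min(later_downs) if later_downs else float("inf")
        start = up + delay
        if start < end:
            windows.append((start, end))
    return windows


def delivered(request_times, windows):
    """Return the request times that fall inside a forwarding window."""
    return [t for t in request_times if any(s <= t < e for s, e in windows)]


# Hypothetical request timestamps in seconds (not the real bootpc_call schedule).
requests = [1, 3, 7, 15, 31]

# Case 1: link comes up once and stays up; only the late request gets through.
stable = delivered(requests, forwarding_windows([0], []))

# Case 2: link flaps every ~20 s; the port never finishes the forward delay.
flapping = delivered(requests, forwarding_windows([0, 21, 42], [20, 41, 62]))

print(stable, flapping)
```

In the stable case only the request sent after the 30-second mark is delivered. In the flapping case nothing is delivered, which would match a boot that stalls for a long time. An edge port setting on the switch (often called portfast) or the unmanaged switch test takes the forward delay out of the picture.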

best regards, Matthias
