I'm seeing an intermittent problem where xCAT randomly fails to boot a stateless node. The node receives its DHCP information, but then hangs because it fails to download initrd-stateless.gz. I'm using xCAT 2.10, but I've seen this issue with previous versions (2.9.x) as well. Our nodes (HP BL460c Gen9) use UEFI, so I'm using xnba as the boot method rather than legacy PXE. As a result, httpd serves almost all of the initial files after the first one (xnba.efi).
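Since httpd is serving nearly everything after xnba.efi, one way to take the nodes out of the picture would be to download the same file from several clients at once and see whether any transfer stalls or errors out. The following is only a minimal sketch under stated assumptions, not anything xCAT ships: the function names (fetch, fetch_concurrently), the 30 second timeout, and the default of 12 clients are all hypothetical choices, and the URL of initrd-stateless.gz on the xCAT master has to be supplied by hand.

```python
import concurrent.futures
import sys
import time
import urllib.request
from typing import List, NamedTuple


class Result(NamedTuple):
    ok: bool        # True if the whole body was read without an exception
    nbytes: int     # bytes received before success or failure
    seconds: float  # wall-clock time spent on this download
    error: str      # repr() of the exception, empty on success


def fetch(url: str, timeout: float = 30.0) -> Result:
    """Download url completely, discarding the data, and time it."""
    start = time.monotonic()
    total = 0
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            while True:
                chunk = resp.read(64 * 1024)
                if not chunk:
                    break
                total += len(chunk)
        return Result(True, total, time.monotonic() - start, "")
    except Exception as exc:  # HTTP errors, timeouts, connection resets, ...
        return Result(False, total, time.monotonic() - start, repr(exc))


def fetch_concurrently(url: str, clients: int = 12,
                       timeout: float = 30.0) -> List[Result]:
    """Start `clients` downloads of the same URL at (roughly) the same time."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=clients) as pool:
        return list(pool.map(lambda _: fetch(url, timeout), range(clients)))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 fetch_test.py URL [CLIENTS]
    url = sys.argv[1]
    clients = int(sys.argv[2]) if len(sys.argv) > 2 else 12
    for i, r in enumerate(fetch_concurrently(url, clients)):
        status = "ok" if r.ok else "FAILED " + r.error
        print(f"client {i}: {r.nbytes} bytes in {r.seconds:.1f}s {status}")
```

If every download completes here while nodes still hang, that would point away from httpd concurrency and toward the network boot stage itself; if some downloads stall or fail, the httpd access and error logs for the same time window would be the next place to look.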
I've tried raising the number of concurrent connections httpd will handle (setting MaxClients 1000 and ServerLimit 1000), but nothing has helped so far. The problem also occurs with small numbers of concurrent boots (as few as 12 simultaneous boots). When a node ends up in this failed state, I can simply power cycle it and it comes up fine, so I don't think it's a configuration issue with xCAT. The issue occurs randomly and, as far as I can tell, doesn't follow any pattern based on nodes or switches. I also don't think it's a throughput issue, since the xCAT master uses two 10Gb NICs in a bonded configuration to serve out images.

You can see where it hangs in the attached image. I'd appreciate any help tracking down what's going on here.

Thanks,
John Lopilato
------------------------------------------------------------------------------
_______________________________________________ xCAT-user mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/xcat-user
