So I had a chance to have a bit of a look today and got some mixed results.
Initially I tested that the node was able to boot fine, and it could. Then I ran nodeset osimage, confirmed that it had updated dhcp, and rebooted.

root@mgt4:~# nodeset comp078 osimage
comp078: statelite centos7.5-x86_64-compute
root@mgt4:~# grep comp078 /var/lib/dhc
dhclient/ dhcpd/
root@mgt4:~# grep comp078 /var/lib/dhcpd/dhcpd.leases
host comp078 {
supersede server.ddns-hostname = "comp078";
supersede host-name = "comp078";
"http://${next-server}:80/tftpboot/xcat/xnba/nodes/comp078";
"http://${next-server}:80/tftpboot/xcat/xnba/nodes/comp078.uefi";
root@mgt4:~# ssh comp078 shutdown -r now

The machine did boot successfully, but only after retrying the download, as seen on the console:

[2019-07-22T15:20:40+10:00] Station IP address is 100.64.1.78
[2019-07-22T15:20:40+10:00]
[2019-07-22T15:20:40+10:00] Server IP address is 100.64.0.1
[2019-07-22T15:20:40+10:00] NBP filename is xcat/xnba.efi
[2019-07-22T15:20:40+10:00] NBP filesize is 139200 Bytes
[2019-07-22T15:20:40+10:00] Downloading NBP file...
[2019-07-22T15:20:40+10:00]
[2019-07-22T15:20:40+10:00] NBP file downloaded successfully.
[2019-07-22T15:20:40+10:00] xNBA initialising devices...ok
[2019-07-22T15:20:40+10:00]
[2019-07-22T15:20:40+10:00]
[2019-07-22T15:20:40+10:00] xCAT Network Boot Agent
[2019-07-22T15:20:40+10:00] iPXE 1.0.3-131028 (d603e) -- Open Source Network Boot Firmware -- http://ipxe.org
[2019-07-22T15:20:40+10:00] Features: HTTP HTTPS iSCSI DNS TFTP EFI
[2019-07-22T15:20:40+10:00] net0: 00:0a:f7:be:fc:de using <NULL> on EFI SNP (open)
[2019-07-22T15:20:40+10:00] [Link:up, TX:0 TXE:0 RX:0 RXE:0]
[2019-07-22T15:20:40+10:00] DHCP (net0 00:0a:f7:be:fc:de)... ok
[2019-07-22T15:20:40+10:00] net0: 100.64.1.78/255.255.248.0 gw 100.64.0.1
[2019-07-22T15:20:40+10:00] Next server: 100.64.0.1
[2019-07-22T15:20:40+10:00] Filename: http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi
[2019-07-22T15:20:40+10:00] http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi.................. Connection timed out (http://ipxe.org/4c0a6012)
[2019-07-22T15:20:56+10:00] No more network devices
[2019-07-22T15:20:56+10:00] xNBA initialising devices...ok
[2019-07-22T15:20:56+10:00]
[2019-07-22T15:20:56+10:00]
[2019-07-22T15:20:56+10:00] xCAT Network Boot Agent
[2019-07-22T15:20:56+10:00] iPXE 1.0.3-131028 (d603e) -- Open Source Network Boot Firmware -- http://ipxe.org
[2019-07-22T15:20:56+10:00] Features: HTTP HTTPS iSCSI DNS TFTP EFI
[2019-07-22T15:20:56+10:00] net1: 00:0a:f7:be:fc:de using <NULL> on EFI SNP (open)
[2019-07-22T15:20:56+10:00] [Link:up, TX:0 TXE:0 RX:0 RXE:0]
[2019-07-22T15:20:56+10:00] DHCP (net1 00:0a:f7:be:fc:de)... ok
[2019-07-22T15:20:56+10:00] net1: 100.64.1.78/255.255.248.0 gw 100.64.0.1
[2019-07-22T15:20:56+10:00] Next server: 100.64.0.1
[2019-07-22T15:20:56+10:00] Filename: http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi
[2019-07-22T15:20:56+10:00] http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi........... ok
[2019-07-22T15:21:04+10:00] http://100.64.0.1:80/tftpboot/xcat/elilo-x64.efi... ok
[2019-07-22T15:21:04+10:00] ELILO v3.14 for EFI/x86_64
[2019-07-22T15:21:05+10:00] Loading kernel /tftpboot/xcat/osimage/centos75-gpfs5.0.2.0-compute/kernel... done
[2019-07-22T15:21:05+10:00] Loading file /tftpboot/xcat/osimage/centos75-gpfs5.0.2.0-compute/initrd-stateless.gz...done

So I tried again, this time specifying the osimage explicitly (which probably made little difference):

root@mgt4:~# nodeset comp078 osimage=centos75-gpfs5.0.2.0-compute
comp078: statelite centos7.5-x86_64-compute
root@mgt4:~# ssh comp078 shutdown -r now

And this time it failed.
Looking at the Wireshark capture, you can see that the node downloads xnba over TFTP, but after that there is only some ARP traffic and no HTTP GET requests.

Just to confirm, I re-edited the leases file and removed the :80 from the entries, and the node is now booting fine.

So in summary, the node has managed to boot with :80 specified, as shown above, but it generally fails, and removing port 80 from the URL appears to be the most reliable way to fix it when it's not able to boot. Definitely strange behaviour, and clearly I'm missing something else here.

I have attached text versions of the pcap files from a normal boot (normal-boot.txt.gz), the first kind of broken boot (broken-boot01.txt.gz) and the completely broken boot (broken-boot02.txt.gz). All of the above files have had irrelevant lines (e.g. ARP requests for other nodes) removed. Let me know if you need the actual PCAP files.

Cheers,

Carl.

On Thu, 18 Jul 2019 at 23:17, Carl <mutantll...@gmail.com> wrote:

> Great, thanks.
>
> I'm happy to contribute back to the community, so I'll have a look to see
> what I can do.
>
> Cheers,
>
> Carl.
>
> On Thu, 18 Jul. 2019, 23:08 Jarrod Johnson, <jjohns...@lenovo.com> wrote:
>
>> It should come in the rpm prebuilt, so shouldn’t be different…
>>
>> So the most ‘make the problem go away’ solution would be to have xnba.pm
>> only do this when needed. Off hand I think this would be right (untested):
>>
>> https://github.com/jjohnson42/xcat-core/commit/cd61fd9db468cd142537e5bd495b71310e6a6d07
>>
>> If I were in the situation, I would probably satisfy curiosity by running
>> wireshark to see if any packets are emitted with :80 and if so, what looks
>> odd about them.
>>
>> Of course, another thing I’d be tempted to do would be to try a newer
>> ipxe build. I happen to have one built to see if newer codebase would
>> behave differently. However last time I had checked it seemed to have
>> compatibility issues with elilo.
>> Elilo is no longer required for CentOS7 and up (in conjunction with a
>> modified xnba.pm I have), but CentOS6 kernels still need elilo.
>>
>> So I suppose there are three options, depending on how little time you
>> want to spend to make the problem go away or understand more.
>>
>> *From:* Carl <mutantll...@gmail.com>
>> *Sent:* Thursday, July 18, 2019 8:53 AM
>> *To:* xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
>> *Subject:* Re: [xcat-user] [External] Re: Unable to pxe boot node after
>> mainboard replacement
>>
>> Thanks Jarrod,
>>
>> Yes it is a little strange.
>>
>> I'm not seeing anything on the http server logs when the dhcp lease has
>> :80 in the entry.
>>
>> I don't fully understand how xnba is built, could it be bringing in
>> something from the management node (CentOS 6.5) that might be part of the
>> issue?
>>
>> Cheers,
>>
>> Carl.
>>
>> On Thu, 18 Jul. 2019, 22:35 Jarrod Johnson, <jjohns...@lenovo.com> wrote:
>>
>> The change is from:
>>
>> commit 1889ec879d2ba721869217ad2e4f03d47b7fba40
>> Author: yangsbj <yang...@cn.ibm.com>
>> Date: Thu Nov 1 23:29:01 2018 -0400
>>
>> support site.httpport in nodeset and mknb
>>
>> Prior to that change, non-80 ports did not work.
>>
>> What is unusual is that 80 should be the normal port and the url parsing
>> should be xNBA and not UEFI specific, so I’m uncertain why :80 would cause
>> a problem in your environment.
>>
>> Nodes that have not been ‘nodeset’ since your upgrade would not have the
>> :80….
>>
>> A reasonable mitigation in the code would be to skip the port designation
>> if it is default, though it is still fairly odd that this would do anything
>> different…
>>
>> *From:* Carl <mutantll...@gmail.com>
>> *Sent:* Thursday, July 18, 2019 4:01 AM
>> *To:* xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
>> *Subject:* [External] Re: [xcat-user] Unable to pxe boot node after
>> mainboard replacement
>>
>> Hi all,
>>
>> Further to the above I have managed to isolate the issue.
>>
>> It looks like when nodeset is run, it is adding :80 to the boot options
>> in the leases file.
>>
>> Eg:
>>
>> host comp078 {
>>   dynamic;
>>   hardware ethernet 00:0a:f7:be:fc:de;
>>   uid 00:0a:f7:be:fc:de;
>>   fixed-address 100.64.1.78;
>>   supersede server.ddns-hostname = "comp078";
>>   supersede host-name = "comp078";
>>   if option user-class-identifier = "xNBA" and option client-architecture = 00:00 {
>>     supersede server.always-broadcast = 01;
>>     supersede server.filename = "http://${next-server}:80/tftpboot/xcat/xnba/nodes/comp078";
>>   } elsif option user-class-identifier = "xNBA" and option client-architecture = 00:09 {
>>     supersede server.filename = "http://${next-server}:80/tftpboot/xcat/xnba/nodes/comp078.uefi";
>>   } elsif option client-architecture = 00:07 {
>>     supersede server.filename = "xcat/xnba.efi";
>>   } elsif option client-architecture = 00:00 {
>>     supersede server.filename = "xcat/xnba.kpxe";
>>   } else {
>>     supersede server.filename = "";
>>   }
>> }
>>
>> If I manually edit the leases file and remove :80 from the two filename
>> entries above, the node is able to boot fine.
>>
>> Is anyone able to advise on why my environment might be now doing this?
>>
>> Thanks,
>>
>> Carl.
>>
>> On Thu, 18 Jul 2019 at 16:22, Carl <mutantll...@gmail.com> wrote:
>>
>> Hi Folks,
>>
>> We recently replaced the mainboard on a Dell R640.
>>
>> I removed the mac address from the node definition and let switch based
>> discovery take care of discovering the new MAC address and running BMC
>> setup. Everything went well and the node ended at the xcat shell.
>>
>> However when I tried to boot the node (statelite) it's failing to find the
>> image and if I persist it dies with a horrible UEFI error. The node also has
>> this problem if I nodeset it to boot to shell.
>>
>> As other nodes are able to boot statelite fine, I assumed that it was a
>> hardware error. Dell has replaced the mainboard a second time, but the
>> issue still persists.
>>
>> It might be worth mentioning that the last time that we had a mainboard
>> replacement on a comp node was about 9 months ago and we have updated xCAT
>> a couple of times since then. Attached is the console log of the UEFI crash
>> and the pxe boot messages that are seen on a working and non-working node.
>>
>> Is anyone able to suggest any tricks to further debug this issue? I'm
>> reluctant to pin the problem on xCAT, but find it unlikely that I have hit
>> two mainboards with the same fault.
>>
>> Thanks,
>>
>> Carl.
>>
>> #### These are the pxe boot messages for the node that isn't working ####
>> [2019-07-10T10:45:47+10:00] Booting from PXE Device 2: Integrated NIC 1 Port 3 Partition 1
>> [2019-07-10T10:45:48+10:00]
>> [2019-07-10T10:45:48+10:00] >>Start PXE over IPv4.
>> [2019-07-10T10:45:52+10:00] Station IP address is 100.64.1.78
>> [2019-07-10T10:45:52+10:00]
>> [2019-07-10T10:45:52+10:00] Server IP address is 100.64.0.1
>> [2019-07-10T10:45:52+10:00] NBP filename is xcat/xnba.efi
>> [2019-07-10T10:45:52+10:00] NBP filesize is 139200 Bytes
>> [2019-07-10T10:45:52+10:00] Downloading NBP file...
>> [2019-07-10T10:45:52+10:00]
>> [2019-07-10T10:45:52+10:00] NBP file downloaded successfully.
>> [2019-07-10T10:45:52+10:00] xNBA initialising devices...ok
>> [2019-07-10T10:45:52+10:00]
>> [2019-07-10T10:45:52+10:00]
>> [2019-07-10T10:45:52+10:00] xCAT Network Boot Agent
>> [2019-07-10T10:45:52+10:00] iPXE 1.0.3-131028 (d603e) -- Open Source Network Boot Firmware -- http://ipxe.org
>> [2019-07-10T10:45:52+10:00] Features: HTTP HTTPS iSCSI DNS TFTP EFI
>> [2019-07-10T10:45:52+10:00] net0: 00:0a:f7:be:b7:d2 using <NULL> on EFI SNP (open)
>> [2019-07-10T10:45:52+10:00] [Link:up, TX:0 TXE:0 RX:0 RXE:0]
>> [2019-07-10T10:45:52+10:00] DHCP (net0 00:0a:f7:be:b7:d2)... ok
>> [2019-07-10T10:45:52+10:00] net0: 100.64.1.78/255.255.248.0 gw 100.64.0.1
>> [2019-07-10T10:45:52+10:00] Next server: 100.64.0.1
>> [2019-07-10T10:45:52+10:00] Filename: http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi
>> [2019-07-10T10:45:52+10:00] http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi.................. Connection timed out (http://ipxe.org/4c0a6012)
>> [2019-07-10T10:46:08+10:00] No more network devices
>> [2019-07-10T10:46:08+10:00] xNBA initialising devices...ok
>> [2019-07-10T10:46:08+10:00]
>> [2019-07-10T10:46:08+10:00]
>> [2019-07-10T10:46:08+10:00] xCAT Network Boot Agent
>> [2019-07-10T10:46:08+10:00] iPXE 1.0.3-131028 (d603e) -- Open Source Network Boot Firmware -- http://ipxe.org
>> [2019-07-10T10:46:08+10:00] Features: HTTP HTTPS iSCSI DNS TFTP EFI
>> [2019-07-10T10:46:08+10:00] net1: 00:0a:f7:be:b7:d2 using <NULL> on EFI SNP (open)
>> [2019-07-10T10:46:08+10:00] [Link:up, TX:0 TXE:0 RX:0 RXE:0]
>> [2019-07-10T10:46:08+10:00] DHCP (net1 00:0a:f7:be:b7:d2)... ok
>> [2019-07-10T10:46:08+10:00] net1: 100.64.1.78/255.255.248.0 gw 100.64.0.1
>> [2019-07-10T10:46:08+10:00] Next server: 100.64.0.1
>> [2019-07-10T10:46:08+10:00] Filename: http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi
>> [2019-07-10T10:46:08+10:00] http://100.64.0.1:80/tftpboot/xcat/xnba/nodes/comp078.uefi.................. Connection timed out (http://ipxe.org/4c0a6012)
>> [2019-07-10T10:46:24+10:00] No more network devices
>>
>> #### As a comparison, this is what we see on a node that boots fine ####
>> [2019-07-18T11:59:45+10:00] Booting from PXE Device 1: Integrated NIC 1 Port 3 Partition 1
>> [2019-07-18T11:59:46+10:00]
>> [2019-07-18T11:59:46+10:00] >>Start PXE over IPv4.
>> [2019-07-18T11:59:50+10:00] Station IP address is 100.64.1.86
>> [2019-07-18T11:59:50+10:00]
>> [2019-07-18T11:59:50+10:00] Server IP address is 100.64.0.1
>> [2019-07-18T11:59:50+10:00] NBP filename is xcat/xnba.efi
>> [2019-07-18T11:59:50+10:00] NBP filesize is 139200 Bytes
>> [2019-07-18T11:59:50+10:00] Downloading NBP file...
>> [2019-07-18T11:59:50+10:00]
>> [2019-07-18T11:59:50+10:00] NBP file downloaded successfully.
>> [2019-07-18T11:59:50+10:00] xNBA initialising devices...ok
>> [2019-07-18T11:59:50+10:00]
>> [2019-07-18T11:59:50+10:00]
>> [2019-07-18T11:59:50+10:00] xCAT Network Boot Agent
>> [2019-07-18T11:59:50+10:00] iPXE 1.0.3-131028 (d603e) -- Open Source Network Boot Firmware -- http://ipxe.org
>> [2019-07-18T11:59:50+10:00] Features: HTTP HTTPS iSCSI DNS TFTP EFI
>> [2019-07-18T11:59:50+10:00] net0: 00:0a:f7:bd:e6:b8 using <NULL> on EFI SNP (open)
>> [2019-07-18T11:59:50+10:00] [Link:up, TX:0 TXE:0 RX:0 RXE:0]
>> [2019-07-18T11:59:50+10:00] DHCP (net0 00:0a:f7:bd:e6:b8)... ok
>> [2019-07-18T11:59:50+10:00] net0: 100.64.1.86/255.255.248.0 gw 100.64.0.1
>> [2019-07-18T11:59:50+10:00] Next server: 100.64.0.1
>> [2019-07-18T11:59:50+10:00] Filename: http://100.64.0.1/tftpboot/xcat/xnba/nodes/comp086.uefi
>> [2019-07-18T11:59:51+10:00] http://100.64.0.1/tftpboot/xcat/xnba/nodes/comp086.uefi... ok
>> [2019-07-18T11:59:51+10:00] http://100.64.0.1/tftpboot/xcat/elilo-x64.efi... ok
>> [2019-07-18T11:59:51+10:00] ELILO v3.14 for EFI/x86_64
>> [2019-07-18T11:59:51+10:00] Loading kernel /tftpboot/xcat/osimage/centos75-gpfs5.0.2.0-compute/kernel... done
>> [2019-07-18T11:59:51+10:00] Loading file /tftpboot/xcat/osimage/centos75-gpfs5.0.2.0-compute/initrd-stateless.gz...done
>>
>> _______________________________________________
>> xCAT-user mailing list
>> xCAT-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/xcat-user
>
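
For illustration, the manual workaround described at the top of this message (removing the explicit :80 from the filename URLs in dhcpd.leases) and the mitigation suggested in the quoted reply (skipping the port when it is the default) could be sketched in Python roughly as below. This is only a sketch under assumptions: the names strip_default_port and boot_url are hypothetical, nothing here is part of xCAT, and the actual fix belongs in xCAT's own code (xnba.pm), which this does not reproduce. It only transforms text passed to it and does not touch any file.

```python
import re

# Matches an explicit default HTTP port in URLs such as
# http://${next-server}:80/tftpboot/... or http://100.64.0.1:80/tftpboot/...
# The lookahead on "/" keeps ports like :8080 untouched.
_DEFAULT_PORT = re.compile(r'(http://[^/:\s"]+):80(?=/)')


def strip_default_port(text: str) -> str:
    """Hypothetical helper: drop ':80' from http:// URLs found in text."""
    return _DEFAULT_PORT.sub(r"\1", text)


def boot_url(host: str, path: str, port: int = 80) -> str:
    """Hypothetical helper: build an http:// URL, omitting a default port."""
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Sample entry modelled on the leases excerpt quoted in this thread.
    sample = (
        "supersede server.filename = "
        '"http://${next-server}:80/tftpboot/xcat/xnba/nodes/comp078.uefi";'
    )
    print(strip_default_port(sample))
    print(boot_url("${next-server}", "tftpboot/xcat/xnba/nodes/comp078.uefi"))
```

A hand-edited leases file will presumably be rewritten the next time nodeset is run for the node, so the first helper only mirrors a stopgap; the second mirrors the idea of not emitting the port at all when it is 80.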
normal-boot.txt.gz
Description: GNU Zip compressed data
broken-boot01.txt.gz
Description: GNU Zip compressed data
broken-boot02.txt.gz
Description: GNU Zip compressed data