Hello all,

Having run xCAT 2.16 smoothly on a small cluster of 4 compute
nodes for about 18 months, I have hit a problem after the latest CentOS 7 system
update, from kernel 3.10.0-1127.13.1 to 3.10.0-1127.18.2: none of the
clients boot any more, and each stops with this error:

   CLIENT MAC ADDR: 00 25 90 5A EB BA GUID: 00000000 0025905AEBBA
   CLIENT IP: 130.246.32.141 MASK: 255.255.252.0 DHCP IP: 130.246.32.140
   GATEWAY IP: 130.246.32.254

   PXE Boot aborted. Booting to next device...
   PXE-M0F: Exiting Intel Boot Agent

There was no such problem with previous CentOS 7 updates; the last one, to
3.10.0-1127.13.1 on 24 June, went through without trouble.

I have attempted a number of checks and recovery steps, without success:
a. confirmed the master answers DHCP requests (DHCPACK records in /var/log/messages).
b. confirmed the master serves TFTP downloads to a different system.
c. confirmed the master serves the kernel and ramdisk files over HTTP.
d. power-cycled the master
e. power-cycled the clients
f. manually ran lsdef -t osimage, chdef -t osimage, genimage, packimage, nodeset

Unfortunately the fault persists, and all four compute nodes are still down.

I think I have run out of ideas.
Before giving up and making a fresh xCAT installation, I wonder if anyone can
offer some clues for troubleshooting "PXE Boot aborted" errors.

Many thanks.
Peter Chiu
STFC RAL Space, UK
==============================================================================
Here are some details on the systems:

Master node: main.bnsc.rl.ac.uk 130.246.32.140/22 gateway 130.246.32.254
Compute node1: proc01.bnsc.rl.ac.uk 130.246.32.141/22 00:25:90:5a:eb:8a
Operating system: CentOS Linux release 7.8.2003 (Core)
xCAT: # rpm -qf  /opt/xcat/sbin/xcatd
xCAT-server-2.16-snap202006161607.noarch
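
A minimal sketch (Python, standard library only) of how the addressing listed above can be sanity-checked. It only derives the netmask implied by the /22 prefix and confirms that the gateway and proc01 sit on the master's subnet; the constant and function names are illustrative and not part of xCAT.

```python
import ipaddress

# Addresses copied from the details above; the /22 prefix is as stated there.
MASTER = ipaddress.ip_interface("130.246.32.140/22")
GATEWAY = ipaddress.ip_address("130.246.32.254")
NODE1 = ipaddress.ip_address("130.246.32.141")

# Network that the master's interface belongs to.
network = MASTER.network


def on_cluster_subnet(addr):
    """Return True if addr falls inside the master's network."""
    return addr in network


print("network:", network)
print("netmask:", network.netmask)
print("gateway on subnet:", on_cluster_subnet(GATEWAY))
print("proc01 on subnet:", on_cluster_subnet(NODE1))
```

The netmask printed here is the value the PXE client screen would be expected to show for a /22 network.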

Checks:

a. DHCP records in /var/log/messages on the master: no errors.

The master picks up the DHCP request, offers the address and acknowledges it,
but there is no further communication afterwards.

Aug 4 15:00:27 main dhcpd: DHCPDISCOVER from 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:27 main dhcpd: DHCPOFFER on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:29 main dhcpd: Dynamic and static leases present for 130.246.32.141.
Aug 4 15:00:29 main dhcpd: Remove host declaration proc01 or remove 130.246.32.141
Aug 4 15:00:29 main dhcpd: from the dynamic address pool for bond0
Aug 4 15:00:29 main dhcpd: DHCPREQUEST for 130.246.32.141 (130.246.32.140) from 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:29 main dhcpd: DHCPACK on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
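
To make it easier to see how many DHCP exchanges a node actually completes, the log lines above can be reduced to the ordered message types per MAC address. This is a rough sketch only: the regular expression is written against the dhcpd lines quoted here, and the function names are made up for illustration.

```python
import re

# Matches the dhcpd message type and the client MAC address on the same line.
DHCP_LINE = re.compile(
    r"dhcpd: (DHCP[A-Z]+)\b.*?\b((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\b",
    re.IGNORECASE,
)


def dhcp_sequence(lines, mac):
    """Return the ordered DHCP message types logged for one MAC address."""
    sequence = []
    for line in lines:
        match = DHCP_LINE.search(line)
        if match and match.group(2).lower() == mac.lower():
            sequence.append(match.group(1).upper())
    return sequence


def handshake_complete(sequence):
    """True if DISCOVER, OFFER, REQUEST and ACK appear in that order."""
    expected = ["DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"]
    remaining = iter(sequence)
    # "step in remaining" consumes the iterator, so order is enforced.
    return all(step in remaining for step in expected)


# Log lines copied from the excerpt above (wrapped lines joined).
SAMPLE = [
    "Aug 4 15:00:27 main dhcpd: DHCPDISCOVER from 00:25:90:5a:eb:8a via bond0",
    "Aug 4 15:00:27 main dhcpd: DHCPOFFER on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0",
    "Aug 4 15:00:29 main dhcpd: Dynamic and static leases present for 130.246.32.141.",
    "Aug 4 15:00:29 main dhcpd: DHCPREQUEST for 130.246.32.141 (130.246.32.140) "
    "from 00:25:90:5a:eb:8a via bond0",
    "Aug 4 15:00:29 main dhcpd: DHCPACK on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0",
]

seq = dhcp_sequence(SAMPLE, "00:25:90:5a:eb:8a")
print(seq, handshake_complete(seq))
```

Run against the whole of /var/log/messages, the length of the sequence shows whether a node performs one exchange or several during a boot attempt.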

b. /var/log/xcat/cluster.log
No errors, just a record of a new image produced.

Aug 4 14:24:54 main xcat[28101]: INFO xCAT: Allowing lsdef -t site -o clustersite -i installdir for root from localhost
Aug 4 14:24:54 main xcat[28103]: INFO xCAT: Allowing genimage -i eth0 -n dca,ixgbe,igb,e1000e,e1000,tg3 -o centos7.6 -p compute --tempfile /tmp/xcat_genimage.28086 for root from localhost
Aug 4 14:27:29 main xcat[25483]: INFO xCAT: Allowing packimage centos7.6-x86_64-netboot-compute for root from localhost
Aug 4 14:27:30 main xcat[25499]: INFO xCAT: Allowing ilitefile centos7.6-x86_64-statelite-compute for root from localhost
Aug 4 14:30:07 main xcat[26073]: INFO xCAT: Allowing nodeset to compute osimage=centos7.6-x86_64-netboot-compute for root from localhost
Aug 4 14:34:33 main xcat[26958]: INFO xCAT: Allowing rpower to compute reset for root from localhost
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc03: changing status=powering-on
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc04: changing status=powering-on
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc01: changing status=powering-on
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc02: changing status=powering-on

c. Checked the DHCP lease file for the files to be downloaded:
less /var/lib/dhcpd/dhcpd.leases
host proc01.bnsc.rl.ac.uk {
deleted;
}
host proc04.bnsc.rl.ac.uk {
deleted;
}
host proc01 {
dynamic;
hardware ethernet 00:25:90:5a:eb:8a;
uid 00:25:90:5a:eb:8a;
fixed-address 130.246.32.141;
supersede server.ddns-hostname = "proc01";
supersede host-name = "proc01";
if option user-class-identifier = "xNBA" and option client-architecture
= 00:00 {
supersede server.always-broadcast = 01;
supersede server.filename =
"http://${next-server}:80/tftpboot/xcat/xnba/nodes/proc01";;
} elsif option user-class-identifier = "xNBA" and option
client-architecture = 00:09 {
supersede server.filename =
"http://${next-server}:80/tftpboot/xcat/xnba/nodes/proc01.uefi";;
} elsif option client-architecture = 00:07 {
supersede server.filename = "xcat/xnba.efi";
} elsif option client-architecture = 00:00 {
supersede server.filename = "xcat/xnba.kpxe";
} else {
supersede server.filename = "";
}
}
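
The if/elsif chain in the proc01 host entry can be restated as a small function, which makes it easier to see which boot file each kind of client is handed. This is only a sketch of the conditions as written in the lease file: the function name and parameters are hypothetical, client_arch is the client-architecture option as an integer (00:00 is 0, 00:07 is 7, 00:09 is 9), and the next-server value used in the example is an assumption (the master's address).

```python
def select_boot_filename(user_class, client_arch, next_server, node):
    """Mirror the if/elsif chain of the host entry in dhcpd.leases."""
    if user_class == "xNBA" and client_arch == 0:
        # xNBA already running on a legacy BIOS client: per-node script over HTTP.
        return f"http://{next_server}:80/tftpboot/xcat/xnba/nodes/{node}"
    if user_class == "xNBA" and client_arch == 9:
        # xNBA already running on a UEFI client.
        return f"http://{next_server}:80/tftpboot/xcat/xnba/nodes/{node}.uefi"
    if client_arch == 7:
        return "xcat/xnba.efi"
    if client_arch == 0:
        # Plain PXE ROM on a legacy BIOS client: fetch xNBA itself via TFTP.
        return "xcat/xnba.kpxe"
    return ""


# Assumption: next-server is the master, 130.246.32.140.
print(select_boot_filename(None, 0, "130.246.32.140", "proc01"))
print(select_boot_filename("xNBA", 0, "130.246.32.140", "proc01"))
```

Read this way, a legacy BIOS client is first given xcat/xnba.kpxe, and only a later request carrying the xNBA user class is given the per-node HTTP URL.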

I followed the filenames in the lease entry above and downloaded each file from
a separate CentOS server.

d. tftp 130.246.32.140
[root@cds1 xcat]# tftp 130.246.32.140
tftp> get xcat/xnba.kpxe
tftp> get xcat/xnba.efi
tftp> get yaboot
tftp> get xcat/xnba/nets/130.246.32.0_22
tftp> get xcat/xnba/nets/130.246.32.0_22.uefi
tftp> quit
[root@cds1 xcat]# ls
130.246.32.0_22 130.246.32.0_22.uefi elilo.efi xnba.efi xnba.kpxe yaboot
[root@cds1 xcat]# ls -ls
total 536
4 -rw-r--r-- 1 root root 252 Aug 4 09:46 130.246.32.0_22
4 -rw-r--r-- 1 root root 116 Aug 4 09:46 130.246.32.0_22.uefi
0 -rw-r--r-- 1 root root 0 Aug 4 09:45 elilo.efi
140 -rw-r--r-- 1 root root 139169 Aug 4 09:45 xnba.efi
80 -rw-r--r-- 1 root root 74786 Aug 4 09:45 xnba.kpxe
308 -rw-r--r-- 1 root root 310187 Aug 4 09:46 yaboot

e. Used wget to download the node start-up file
wget http://130.246.32.140:80/tftpboot/xcat/xnba/nodes/proc01

[root@cds1 xcat]# wget http://130.246.32.140:80/tftpboot/xcat/xnba/nodes/proc01
--2020-08-04 11:57:18-- http://130.246.32.140/tftpboot/xcat/xnba/nodes/proc01
Connecting to 130.246.32.140:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 528
Saving to: `proc01'

100%[======================================>] 528 --.-K/s in 0s

2020-08-04 11:57:18 (85.2 MB/s) - `proc01' saved [528/528]

f. This file in turn contains the instructions to download the kernel and 
ramdisk
[root@cds1 xcat]# less proc01
#!gpxe
#netboot centos7.6-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/kernel
imgload kernel
imgargs kernel imgurl=http://130.246.32.140:80//install/netboot/centos7.6/x86_64/compute/rootimg.cpio.gz XCAT=130.246.32.140:3001 NODE=proc01 FC=yes XCATHTTPPORT=80 netdev=eth0 selinux=0 biosdevname=0 net.ifnames=0 BOOTIF=01-${netX/machyp}
imgfetch http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/initrd-stateless.gz
imgexec kernel

Both the kernel and the ramdisk can also be downloaded with wget.
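
For repeating this check on the other nodes, the URLs named in a node's boot script can be pulled out mechanically and then fetched one by one. The sketch below only extracts the URLs from the script text shown above; the function name is illustrative, and expanding ${next-server} to 130.246.32.140 is an assumption (the master's address).

```python
import re


def boot_urls(script, next_server):
    """Return the http URLs named in a gPXE script, with ${next-server} expanded."""
    expanded = script.replace("${next-server}", next_server)
    return re.findall(r"http://\S+", expanded)


# Script text copied from the proc01 file above (wrapped lines joined).
SCRIPT = """#!gpxe
#netboot centos7.6-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/kernel
imgload kernel
imgargs kernel imgurl=http://130.246.32.140:80//install/netboot/centos7.6/x86_64/compute/rootimg.cpio.gz XCAT=130.246.32.140:3001 NODE=proc01
imgfetch http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/initrd-stateless.gz
imgexec kernel
"""

# Assumption: next-server is the master, 130.246.32.140.
for url in boot_urls(SCRIPT, "130.246.32.140"):
    print(url)
```

Each printed URL can then be passed to wget, as was done by hand for proc01.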

