Re: [xcat-user] Fw: Re: RHEL-7.3 provisioning error

Saurabh Barve Thu, 28 Sep 2017 13:27:02 -0700

Hi,

>>>> since your MN have 2 nics connected to compute node

There is only one NIC connected to the compute nodes. There are two interfaces configured on the MN - one faces the external network [enp130s0f1] and the second faces the internal cluster network [enp130s0f0]. DHCP is configured to listen only on enp130s0f0. I even added the interface name [enp130s0f0] to the file /etc/systemd/system/dhcpd.service to make sure that DHCP listens only on enp130s0f0.

If you look at the logs I had included, all of the DHCP requests are coming only to one interface - enp130s0f0.

I'm using the same xCAT configuration that worked for me with RHEL-6.8 and xCAT version 2.9. I'm not sure what changed with xCAT 2.13.4 and/or RHEL-7.3.

Regards,

Saurabh
------
Saurabh Barve
Tata Consultancy Services

-----"Song BJ Yang" <yang...@cn.ibm.com> wrote: -----

To: xcat-user@lists.sourceforge.net, barve_saur...@cat.com
From: "Song BJ Yang" <yang...@cn.ibm.com>
Date: 09/28/2017 11:45AM
Subject: Re: Fw: Re: [xcat-user] RHEL-7.3 provisioning error

hi,

there is a site attribute site.dhcpinterfaces:

# tabdump -d site|grep dhcpinter -A10
dhcpinterfaces: The network interfaces DHCP should listen on. If it is the same for all
nodes, use a comma-separated list of the NICs. To specify different NICs
for different nodes, use the format: "xcatmn|eth1,eth2;service|bond0",
where xcatmn is the name of the management node, DHCP should listen on
the eth1 and eth2 interfaces. All the nodes in group 'service' should
listen on the 'bond0' interface.

To disable the genesis kernel from being sent to specific interfaces, a
':noboot' option can be appended to the interface name. For example,
if the management node has two interfaces, eth1 and eth2, disable
genesis from being sent to eth1 using: "eth1:noboot,eth2".

since your MN has 2 NICs connected to the compute node, suppose the NIC names are ETH0 and ETH1 on the MN;

My compute nodes have six interfaces:
--> 4 1GbE interfaces --- eno1 (eth0), eno2 (eth1), eno3 (eth2), eno4 (eth3) ==> connected to ETH0 on MN
--> 2 10GbE interfaces --- enp130s0f0 (eth4), enp130s0f0 (eth5) ===> connected to ETH1 on MN

the 2 10GbE interfaces are the provisioning network, so you might need to set site.dhcpinterfaces to "ETH0:noboot,ETH1" with:

chdef -t site -o clustersite dhcpinterfaces="ETH0:noboot,ETH1"

then run "makedhcp -n", then reboot the node or reprovision the node
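
A minimal sketch of the steps above, assuming the placeholder NIC names ETH0 and ETH1 from this message. join_dhcpinterfaces is a hypothetical helper (not an xCAT command) that only builds the attribute value; the xCAT commands are printed rather than executed so they can be reviewed before being run on the MN.

```shell
#!/bin/sh
# join_dhcpinterfaces is a hypothetical helper, not part of xCAT.
# $1: space-separated NICs that should get the ':noboot' suffix
# $2: space-separated NICs that should serve the genesis kernel normally
join_dhcpinterfaces() {
    out=""
    for nic in $1; do
        out="${out:+$out,}${nic}:noboot"
    done
    for nic in $2; do
        out="${out:+$out,}${nic}"
    done
    printf '%s\n' "$out"
}

# ETH0 and ETH1 are the placeholder names used in this message.
value=$(join_dhcpinterfaces "ETH0" "ETH1")

# Print the commands instead of running them.
echo "chdef -t site -o clustersite dhcpinterfaces=\"$value\""
echo "makedhcp -n"
```

On a real MN the placeholder names would have to be replaced with the actual interface names before running the printed commands.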

best regards

------------------------------------------------------------------------------
YANG Song (杨嵩)
IBM China System Technology Laboratory
Tel: 86-10-82452903
Email: yang...@cn.ibm.com
Address: Building 28, ZhongGuanCun Software Park,
No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC

北京市海淀区东北旺西路8号中关村软件园28号楼
邮编: 100193

----- Original message -----
From: Wei Hua WH Hu/China/IBM
To: Song BJ Yang/China/IBM@IBMCN
Cc:
Subject: Fw: Re: [xcat-user] RHEL-7.3 provisioning error
Date: Thu, Sep 28, 2017 2:03 PM

Best Regards!
--------------------------------------------------------------
Hu, Wei Hua (胡卫华)
IBM China System Technology Laboratory
Email: huwei...@cn.ibm.com
Tel: 86-10-82453253
Address: Building 28, ZhongGuanCun Software Park,
No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC

北京市海淀区东北旺西路8号中关村软件园28号楼
邮编: 100193

----- Original message -----
From: Daniel Letai <d...@letai.org.il>
To: xcat-user@lists.sourceforge.net
Cc:
Subject: Re: [xcat-user] RHEL-7.3 provisioning error
Date: Wed, Sep 27, 2017 4:20 PM

Try modifying the pxe file according to https://www.ibm.com/mysupport/s/article/ka150000000H5ohAAC/Problem-booting-from-local-disk-after-node-provisioning

That is:

in /tftpboot/pxelinux.cfg/adqan001

change:

label xCAT
localboot 0

to:

label xCAT
kernel chain.c32
append hd0
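
The change above could be scripted roughly as follows. This is only a sketch: chain_localboot is a hypothetical helper that rewrites a PXE config read from stdin, and the sample input is the file content posted later in this thread. It matches the keyword case-insensitively because the posted file uses "LOCALBOOT 0". xCAT may regenerate the file on the next nodeset, so treat the edit as a test rather than a permanent fix.

```shell
#!/bin/sh
# chain_localboot is a hypothetical helper, not part of xCAT or syslinux.
# It reads a pxelinux-style config on stdin and replaces any
# "localboot ..." line with the chain.c32 pair suggested above.
chain_localboot() {
    awk '
        tolower($1) == "localboot" {
            print "kernel chain.c32"
            print "append hd0"
            next
        }
        { print }
    '
}

# Demo on the file content posted in this thread; nothing on disk is changed.
chain_localboot <<'EOF'
#boot
DEFAULT xCAT
LABEL xCAT
LOCALBOOT 0
EOF
```

To apply it for real, the output would be written to a temporary file and then moved over /tftpboot/pxelinux.cfg/adqan001 after keeping a backup of the original.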

On 09/26/2017 04:24 PM, Saurabh Barve wrote:

>>>>What’s the updatestatus of the compute node after that first successful PXE and Anaconda run? Check it with “lsdef adqan001”. Also what is the value of currchain after Anaconda runs and the machine reboots?

Here's the update status of the node while the OS is installing on it (I've kept only the entries that I thought were relevant to the question):

[root@admwqamgr ~]# lsdef adqan001
Object name: adqan001
addkcmdline=edd=off ipv6.disable=1
arch=x86_64
currchain=boot
currstate=install rhels7.3-x86_64-compute
installnic=eth4
interface=eth4
ip=192.168.40.7
mac=90:E2:BA:74:90:A4
netboot=xnba
nfsdir=/install
nfsserver=admwqamgr
os=rhels7.3
postscripts=syslog,remoteshell
primarynic=eth4
profile="">
provmethod=cat-compute-rhels7.3-x86_64
status=installing
statustime=09-26-2017 08:09:13
tftpdir=/tftpboot
tftpserver=admwqamgr
updatestatus=synced
updatestatustime=09-05-2017 11:15:23

Here's the status of the node after OS installation completes and the machine reboots (this is where it gets stuck):

[root@admwqamgr ~]# lsdef adqan001
Object name: adqan001
addkcmdline=edd=off ipv6.disable=1
arch=x86_64
currchain=boot
currstate=boot
installnic=eth4
interface=eth4
ip=192.168.40.7
mac=90:E2:BA:74:90:A4
netboot=xnba
nfsdir=/install
nfsserver=admwqamgr
os=rhels7.3
postscripts=syslog,remoteshell
primarynic=eth4
profile="">
provmethod=cat-compute-rhels7.3-x86_64
status=booting
statustime=09-26-2017 08:17:25
tftpdir=/tftpboot
tftpserver=admwqamgr
updatestatus=synced
updatestatustime=09-05-2017 11:15:23

>>>> Are your DNS settings correct? Can the compute node resolve the master node in the Anaconda shell? Forward and reverse DNS must work.

Yes. If I manually boot the node to disk after the install, it boots into the OS. After logging in to the compute node, I can verify that forward and reverse DNS are working fine. I also tried specifying the IP address of the xCAT management node instead of its name in the 'noderes' table but got the same result.

>>>> What’s in the PXE file on the master node after the Anaconda run? /tftpboot/pxelinux.cfg/adqan001

Here you go:

[root@admwqamgr ~]# cat /tftpboot/pxelinux.cfg/adqan001
#boot
DEFAULT xCAT
LABEL xCAT
LOCALBOOT 0

>>>> At the end of the postscripts run the ‘updatestatus.awk’ script needs to work – that’s what calls back to the master node and updates the status of the node

Does this need to be run manually? My postscripts table is pretty simple:

[root@admwqamgr ~]# tabdump postscripts
#node,postscripts,postbootscripts,comments,disable
"compute","syslog,remoteshell",,,

Regards,
Saurabh

From: <russa...@comcast.net>
To: "'xCAT Users Mailing list'" <xcat-user@lists.sourceforge.net>
Date: 26-09-2017 18:11
Subject: Re: [xcat-user] RHEL-7.3 provisioning error

What’s the updatestatus of the compute node after that first successful PXE and Anaconda run? Check it with “lsdef adqan001”. Also what is the value of currchain after Anaconda runs and the machine reboots?

Are your DNS settings correct? Can the compute node resolve the master node in the Anaconda shell? Forward and reverse DNS must work.

What’s in the PXE file on the master node after the Anaconda run? /tftpboot/pxelinux.cfg/adqan001
That file is what instructs the machine to boot from disk.

At the end of the postscripts run the ‘updatestatus.awk’ script needs to work – that’s what calls back to the master node and updates the status of the node.

From: Saurabh Barve [mailto:barve_saur...@cat.com]
Sent: Tuesday, September 26, 2017 5:15 AM
To: xcat-user@lists.sourceforge.net
Subject: [xcat-user] RHEL-7.3 provisioning error

Hi,

I'm trying to deploy RHEL-7.3 on my cluster compute nodes but am running into problems with PXE after the node is successfully installed.

Overview

These are the details of my xCAT management node:

[root@admwqamgr ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)
[root@admwqamgr ~]# uname -r
3.10.0-514.26.2.el7.x86_64

[root@admwqamgr ~]# lsxcatd -a
Version 2.13.4 (git commit 6ee3741498768994e4bb10d2a77c9699bcabde90, built Tue May 16 10:03:13 EDT 2017)
This is a Management Node
cfgloc=Pg:dbname=xcatdb;host=192.168.40.4|xcatadm
dbengine=Pg
dbname=xcatdb
dbhost=192.168.40.4
dbadmin=xcatadm

My compute nodes have six interfaces:
--> 4 1GbE interfaces --- eno1 (eth0), eno2 (eth1), eno3 (eth2), eno4 (eth3)
--> 2 10GbE interfaces --- enp130s0f0 (eth4), enp130s0f0 (eth5)

Two things about the network:
(i) I'm deploying the compute nodes over the "eth4" interface
(ii) There is network connectivity on both eth0 and eth4 - this is beyond my control

The boot order for the node as specified in the BIOS is:
enp130s0f0 (eth4)
enp130s0f0 (eth5)
eno1 (eth0)
eno2 (eth1)
eno3 (eth2)
eno4 (eth3)
HDD 1
HDD 2
HDD 3
HDD 4

The xCAT management node provides both the DHCP and DNS services for the cluster. I have NetworkManager running on the xCAT management node. IPV6 is disabled on the xCAT management node.

I also want to use NetworkManager on the compute nodes.

Problem

I deploy the node using the commands:

nodeset adqan001 osimage=compute-rhels7.3-x86_64
rsetboot adqan001 net
rpower adqan001 on

The node deploys over eth4 without any problem. However, when it reboots after the installation, the node doesn't boot from disk. I see the following error messages in the logs on the xCAT management server:

+++++++++++++++++++
Sep 25 13:34:28 admwqamgr dhcpd: DHCPDISCOVER from 90:e2:ba:74:90:a4 via enp130s0f0
Sep 25 13:34:28 admwqamgr dhcpd: DHCPOFFER on 192.168.40.7 to 90:e2:ba:74:90:a4 via enp130s0f0
Sep 25 13:34:28 admwqamgr dhcpd: DHCPREQUEST for 192.168.40.7 (192.168.40.4) from 90:e2:ba:74:90:a4 via enp130s0f0
Sep 25 13:34:28 admwqamgr dhcpd: DHCPACK on 192.168.40.7 to 90:e2:ba:74:90:a4 via enp130s0f0
Sep 25 13:34:50 admwqamgr dhcpd: DHCPDISCOVER from 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:51 admwqamgr dhcpd: DHCPOFFER on 192.168.40.82 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:54 admwqamgr dhcpd: DHCPREQUEST for 192.168.40.82 (192.168.40.4) from 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:54 admwqamgr dhcpd: DHCPACK on 192.168.40.82 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:54 admwqamgr in.tftpd[7342]: RRQ from 192.168.40.82 filename xcat/xnba.kpxe
Sep 25 13:34:54 admwqamgr in.tftpd[7342]: tftp: client does not accept options
Sep 25 13:34:54 admwqamgr in.tftpd[7343]: RRQ from 192.168.40.82 filename xcat/xnba.kpxe
Sep 25 13:34:54 admwqamgr in.tftpd[7343]: Client 192.168.40.82 finished xcat/xnba.kpxe
Sep 25 13:34:54 admwqamgr dhcpd: DHCPDISCOVER from 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:55 admwqamgr dhcpd: DHCPOFFER on 192.168.40.80 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:55 admwqamgr dhcpd: DHCPREQUEST for 192.168.40.80 (192.168.40.4) from 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:55 admwqamgr dhcpd: DHCPACK on 192.168.40.80 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:35:31 admwqamgr dhcpd: DHCPDISCOVER from 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:35:31 admwqamgr dhcpd: DHCPOFFER on 192.168.40.82 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:35:31 admwqamgr dhcpd: DHCPREQUEST for 192.168.40.82 (192.168.40.4) from 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:35:31 admwqamgr dhcpd: DHCPACK on 192.168.40.82 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 03:35:32 192.168.40.82 (none) dhclient[1054]: XMT: Solicit on eth0, interval 4170ms.
Sep 25 13:35:33 192.168.40.82 (none) ntpd[1633]: 0.0.0.0 c61c 0c clock_step +36000.081070 s
Sep 25 13:35:33 192.168.40.82 (none) ntpd[1633]: 0.0.0.0 c614 04 freq_mode
Sep 25 13:35:34 192.168.40.82 (none) ntpd[1633]: 0.0.0.0 c618 08 no_sys_peer
Sep 25 13:35:36 192.168.40.82 (none) dhclient[1922]: DHCPDISCOVER on eth4 to 255.255.255.255 port 67 interval 7 (xid=0x3fd26f33)
Sep 25 13:35:36 192.168.40.82 (none) dhclient[1054]: XMT: Solicit on eth0, interval 8110ms.
Sep 25 13:35:39 192.168.40.82 (none) dhclient[2017]: Bound to *:546
Sep 25 13:35:39 192.168.40.82 (none) dhclient[2017]: XMT: Solicit on eth4, interval 1010ms.
Sep 25 13:35:40 192.168.40.82 (none) dhclient[2017]: XMT: Solicit on eth4, interval 1970ms.
Sep 25 13:35:41 192.168.40.82 (none) ntpd[1633]: Listen normally on 6 eth4 fe80::92e2:baff:fe74:90a4 UDP 123
+++++++++++++++++++

192.168.40.7 is the address that the DHCP server is supposed to hand out to the eth4 interface of the node, which has the MAC address 90:e2:ba:74:90:a4. But even after it gets the IP, the node doesn't boot to disk. The eth0 interface with the MAC address 00:1e:67:8e:df:39 then tries and succeeds in getting an IP address from the DHCP server. The node then ends up booting into the genesis shell.
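
To make it easier to see which MAC address was acknowledged on which interface, the dhcpd lines can be condensed with a small filter. This is a sketch that assumes the syslog layout shown in the excerpt above (program tag "dhcpd:" in the fifth field); dhcp_acks is a hypothetical helper name, and the sample input is copied from the log.

```shell
#!/bin/sh
# dhcp_acks is a hypothetical helper. It reads syslog lines on stdin and
# prints "MAC IP interface" once for every distinct DHCPACK.
# Assumes the layout: Mon DD HH:MM:SS host dhcpd: DHCPACK on IP to MAC via IFACE
dhcp_acks() {
    awk '$5 == "dhcpd:" && $6 == "DHCPACK" { print $10, $8, $12 }' |
        LC_ALL=C sort -u
}

# Demo on lines taken from the log excerpt above.
dhcp_acks <<'EOF'
Sep 25 13:34:28 admwqamgr dhcpd: DHCPOFFER on 192.168.40.7 to 90:e2:ba:74:90:a4 via enp130s0f0
Sep 25 13:34:28 admwqamgr dhcpd: DHCPACK on 192.168.40.7 to 90:e2:ba:74:90:a4 via enp130s0f0
Sep 25 13:34:54 admwqamgr dhcpd: DHCPACK on 192.168.40.82 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:34:55 admwqamgr dhcpd: DHCPACK on 192.168.40.80 to 00:1e:67:8e:df:39 via enp130s0f0
Sep 25 13:35:31 admwqamgr dhcpd: DHCPACK on 192.168.40.82 to 00:1e:67:8e:df:39 via enp130s0f0
EOF
# prints:
# 00:1e:67:8e:df:39 192.168.40.80 enp130s0f0
# 00:1e:67:8e:df:39 192.168.40.82 enp130s0f0
# 90:e2:ba:74:90:a4 192.168.40.7 enp130s0f0
```

The same filter could be fed from the live log on the management node; the log file path depends on the syslog configuration, so it is not shown here.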

Additional xCAT configuration information

[root@admwqamgr ~]# tabdump noderes
#node,servicenode,netboot,tftpserver,tftpdir,nfsserver,monserver,nfsdir,installnic,primarynic,discoverynics,cmdinterface,xcatmaster,current_osimage,next_osimage,nimserver,routenames,nameservers,proxydhcp,syslog,comments,disable
"compute",,"xnba","admwqamgr","/tftpboot","admwqamgr",,"/install","eth4","eth4","eth4",,,,,,,,,,,

[root@admwqamgr ~]# tabdump mac
#node,interface,mac,comments,disable
"adqan001","eth4","90:E2:BA:74:90:A4",,

[root@admwqamgr ~]# tabdump bootparams
#node,kernel,initrd,kcmdline,addkcmdline,dhcpstatements,adddhcpstatements,comments,disable
"compute",,,,"edd=off ipv6.disable=1",,,,

I'm enabling and disabling the following services in my compute node template:

# System services
services --enabled="chronyd,NetworkManager,postfix,nfs,nfs-server" --disabled="firewalld"

What I've tried so far

(1) As you can see in the 'bootparams' table above, I've disabled IPV6 for the newly deployed node
(2) I've also edited /opt/xcat/lib/perl/xCAT/Template.pm and changed line 1066 to add "--noipv6" to the default Kickstart deployment parameter:
$line .= "dhcp --device=$suffix --noipv6";
(3) I've tried specifying both eth4 and enp130s0f0 in both the 'mac' and 'noderes' tables with the same result
(4) I've used both 'xnba' and 'pxe' as provisioning methods with the same result
(5) From the compute node installation template provided by xCAT, I've removed the following line as the script called here enables all network interfaces and disables NetworkManager
echo "Running Kickstart Post-Installation script..."
#INCLUDE:#ENV:XCATROOT#/share/xcat/install/scripts/post.xcat#
#INCLUDE:#ENV:XCATROOT#/share/xcat/install/scripts/post.rhels7# <<<<------- Deleted this line

I can get the node to boot into PXE the first time and install without problems, but on the subsequent reboot it doesn't boot to disk like it should. This configuration had worked very well for me on RHEL-6.8, minus the NetworkManager.

Regards,
Saurabh

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
