Hi,

I would like some help with a problem that I have not been able to figure
out.

I am using the following xCAT packages:

xCAT-2.16-snap202006161607.x86_64
xCAT-genesis-scripts-ppc64-2.16-snap202006161607.noarch
xCAT-genesis-base-x86_64-2.14.5-snap201811190037.noarch
perl-xCAT-2.16-snap202006161607.noarch
ipmitool-xcat-1.8.18-3.x86_64
xCAT-probe-2.16-snap202006161607.noarch
xCAT-genesis-scripts-x86_64-2.16-snap202006161607.noarch
xCAT-genesis-base-ppc64-2.14.5-snap201811160710.noarch
xCAT-server-2.16-snap202006161607.noarch
xCAT-client-2.16-snap202006161607.noarch
elilo-xcat-3.14-4.noarch
syslinux-xcat-3.86-2.noarch
grub2-xcat-2.02-0.76.el7.1.snap201905160255.noarch
xCAT-buildkit-2.16-snap202006161607.noarch

Here is my networks table:

"compute_net_1","10.240.58.0","255.255.254.0","eno1","10.240.58.1",,"<xcatmaster>","10.240.58.4,8.8.8.8,128.200.192.202",,,"10.240.58.221-10.240.58.240","10.240.58.4-10.240.59.220",,,,,"local","1500",,
"mgmt-net1","10.240.62.0","255.255.254.0","eno1.1","10.240.62.1",,,,,,"10.240.62.244-10.240.62.253","10.240.62.4-10.240.63.220",,,,,,,,
"ib-net1","10.240.60.0","255.255.254.0","eno1.2","10.240.60.1",,,,,,"10.240.60.221-10.240.60.240","10.240.60.4-10.240.61.220",,,,,,,,
"hpc2-net1","10.240.64.0","255.255.254.0","eno1.4","10.240.64.1",,,,,,,"10.240.64.2-10.240.64.254",,,,,,,,
"crsp-net1","10.20.20.0","255.255.0.0","eno1.3","10.20.20.1",,,,,,"10.20.20.151-10.20.20.200","10.20.20.7-10.20.20.150",,,,,,,,

'compute_net_1' is the net I am using to build systems.

So, I was building a node.  Here is the info on the node:

[root@hpc3-xcat-1 ~]# lsdef -t node hpc3-21-12
Object name: hpc3-21-12
    arch=x86_64
    currchain=boot
    currstate=boot
    groups=centos78
    ip=10.240.59.6
    mac=78:ac:44:37:56:20
    netboot=xnba
    nichostnamesuffixes.ib0=-ib0
    nichostnamesuffixes.ipmi=-ipmi
    nicips.ib0=10.240.61.6
    nicips.ipmi=10.240.63.6
    os=centos7.8
    postbootscripts=otherpkgs,hpc3-postscripts/hpc3postbootscript
    postscripts=syslog,remoteshell,syncfiles,setupntp,hpc3-postscripts/hpc3postscript.1,confignetwork -s
    profile=compute
    provmethod=centos7.8-x86_64-install-compute
    status=booted
    statustime=07-08-2021 00:00:46

After the initial build completed, the node lost network connectivity.  I
have been scratching my head over why that happened, since we have built
over 200 nodes in this cluster.

When I logged on to the node via the console (as root, since I could not
SSH in without network connectivity), I noticed that the OS shows the NICs
the following way:

[root@hpc3-21-12 log]# ip a
...
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 78:ac:44:37:56:20 brd ff:ff:ff:ff:ff:ff
    inet 10.240.59.6/23 brd 10.240.59.255 scope global em1
       valid_lft forever preferred_lft forever
    inet6 fe80::7aac:44ff:fe37:5620/64 scope link
       valid_lft forever preferred_lft forever
3: em3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 78:ac:44:37:56:40 brd ff:ff:ff:ff:ff:ff
4: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 78:ac:44:37:56:22 brd ff:ff:ff:ff:ff:ff
5: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 78:ac:44:37:56:41 brd ff:ff:ff:ff:ff:ff
...

In the /etc/sysconfig/network-scripts/ directory, I see the following:

[root@hpc3-21-12 log]# ls -l /etc/sysconfig/network-scripts/ifcfg*
-rw-r--r--  1 root root 102 Jul  7 22:41 /etc/sysconfig/network-scripts/ifcfg-eno1
-rw-r--r--. 1 root root 277 Jul  7 21:47 /etc/sysconfig/network-scripts/ifcfg-eno2
-rw-r--r--. 1 root root 277 Jul  7 21:47 /etc/sysconfig/network-scripts/ifcfg-eno3
-rw-r--r--. 1 root root 277 Jul  7 21:47 /etc/sysconfig/network-scripts/ifcfg-eno4
-rwxr-xr-x. 1 root root 105 Jul  7 21:47 /etc/sysconfig/network-scripts/ifcfg-ib0
-rw-r--r--. 1 root root 254 Aug 19  2019 /etc/sysconfig/network-scripts/ifcfg-lo

So ifcfg-eno1 gets populated, but the OS calls the interface em1 and looks
for ifcfg-em1 in the /etc/sysconfig/network-scripts directory.  When it
does not find that file, the network does not come up.

In /var/log/xcat/xcat.log, I see this:

[I]: network service is active
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[I]: configure the install nic eno1.
[I]: NMCLI_USED=0 configeth -s eno1
[I]: configeth on hpc3-21-12: os type: redhat
ls: cannot access /var/lib/dhclient/*eno1*: No such file or directory
['/etc/sysconfig/network-scripts/ifcfg-eno1']
[I]: >> DEVICE=eno1
[I]: >> IPADDR=10.240.59.6
[I]: >> NETMASK=255.255.254.0
[I]: >> BOOTPROTO=none
[I]: >> ONBOOT=yes
[I]: >> NAME=xcat-eno1
[I]: >> HWADDR=78:ac:44:37:56:20
[I]: >> MTU=1500
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[I]: back up /etc/sysconfig/network-scripts to /etc/sysconfig/network-scripts.xcatbak
[E]:Error: nicips,nictypes and nicnetworks should be configured in nics table for ib0.
[E]:Error: nicips,nictypes and nicnetworks should be configured in nics table for ipmi.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
configure nic and its device :
[E]:Error: Check the NIC data in the 'nics' table.
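
Comparing those [E] lines with the lsdef output above, the node only has
nicips and nichostnamesuffixes set for ib0 and ipmi.  Here is a small
Python sketch of that comparison.  The list of required attributes is
taken only from the wording of the error message, and the function name
missing_nic_attrs is my own, not something from xCAT:

```python
# Attributes the error message says must be set per NIC (assumption:
# taken from the log text above, not from the xCAT source).
REQUIRED = ("nicips", "nictypes", "nicnetworks")


def missing_nic_attrs(attrs):
    """Return {nic: [missing attribute names]} for per-NIC lsdef attributes."""
    nics = set()
    for key in attrs:
        # Per-NIC attributes look like "nicips.ib0" in lsdef output.
        if key.startswith("nic") and "." in key:
            nics.add(key.split(".", 1)[1])
    missing = {}
    for nic in sorted(nics):
        absent = [name for name in REQUIRED if f"{name}.{nic}" not in attrs]
        if absent:
            missing[nic] = absent
    return missing


# Per-NIC attributes copied from the lsdef output for hpc3-21-12.
node_attrs = {
    "nichostnamesuffixes.ib0": "-ib0",
    "nichostnamesuffixes.ipmi": "-ipmi",
    "nicips.ib0": "10.240.61.6",
    "nicips.ipmi": "10.240.63.6",
}

result = missing_nic_attrs(node_attrs)
print(result)
```

For this node it reports nictypes and nicnetworks as missing for both ib0
and ipmi, which lines up with the two errors in the log.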

So, as you might have noticed, xCAT configures the NICs as 'enoX' (e.g.
eno1), while the OS (udev) labels them 'emX' (e.g. em1).  I am not sure
why this is happening.
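
To make the mismatch concrete, here is a minimal Python sketch that matches
the HWADDR in an ifcfg file against the MAC addresses the kernel reports and
flags any case where the same MAC carries two different names.  The sample
data is copied from the xcat.log and 'ip a' output above.  The function
names are mine, and on a live node the MAC table would have to be collected
from the system rather than typed in:

```python
def parse_ifcfg(text):
    """Parse the KEY=VALUE lines of an ifcfg file into a dict."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip().strip('"')
    return cfg


def find_name_mismatches(ifcfgs, live_macs):
    """Return (configured_name, live_name, mac) where one MAC has two names.

    ifcfgs:    dict of ifcfg file name -> file content
    live_macs: dict of interface name as seen by the OS -> MAC address
    """
    mac_to_live = {mac.lower(): name for name, mac in live_macs.items()}
    mismatches = []
    for text in ifcfgs.values():
        cfg = parse_ifcfg(text)
        device = cfg.get("DEVICE")
        mac = cfg.get("HWADDR", "").lower()
        live_name = mac_to_live.get(mac)
        if device and live_name and live_name != device:
            mismatches.append((device, live_name, mac))
    return mismatches


# ifcfg-eno1 content as logged by configeth in xcat.log.
ifcfg_eno1 = """DEVICE=eno1
IPADDR=10.240.59.6
NETMASK=255.255.254.0
BOOTPROTO=none
ONBOOT=yes
NAME=xcat-eno1
HWADDR=78:ac:44:37:56:20
MTU=1500
"""

# Interface names and MACs from the 'ip a' output on the node.
live = {
    "em1": "78:ac:44:37:56:20",
    "em2": "78:ac:44:37:56:22",
    "em3": "78:ac:44:37:56:40",
    "em4": "78:ac:44:37:56:41",
}

mismatches = find_name_mismatches({"ifcfg-eno1": ifcfg_eno1}, live)
print(mismatches)
```

With the data above it reports that the MAC configured as eno1 is the one
the OS calls em1.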

My DHCP range comes from 10.240.58.X and the static IP address is from
10.240.59.X, which is part of the 10.240.58.0/23 network.  I am not sure
whether that is the cause of this or not.
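
As a sanity check on the addressing, here is a short Python sketch using the
standard ipaddress module.  The addresses come from the compute_net_1 row
and the node definition above.  I am assuming that the
10.240.58.221-10.240.58.240 field in that row is the dynamic (DHCP) range,
and the variable names are just for illustration:

```python
import ipaddress

# compute_net_1 as defined in the networks table (net + mask).
network = ipaddress.ip_network("10.240.58.0/255.255.254.0")

# Static IP of hpc3-21-12 from lsdef.
static_ip = ipaddress.ip_address("10.240.59.6")

# Assumed dynamic range from the networks table row.
dyn_start = ipaddress.ip_address("10.240.58.221")
dyn_end = ipaddress.ip_address("10.240.58.240")

in_network = static_ip in network
in_dynamic_range = dyn_start <= static_ip <= dyn_end

print(network)                    # network in prefix notation
print(network.broadcast_address)  # should match the 'brd' shown by 'ip a'
print(in_network, in_dynamic_range)
```

The static address falls inside the /23 and outside the dynamic range, and
the broadcast address matches the one shown by 'ip a', so the addressing
itself looks consistent to me.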

I am at a loss why this is happening though.

Could someone please help shed some light on this for me?

Thanks a lot!
_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
