Can you check the xCAT configuration with "xcatprobe xcatmn -i
<provision_network_interface>"?

I suggest using the IP address instead of the hostname for master in the
site table:    "master",",maestro-xcat.maestro.pasteur.fr",,
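
A minimal Python sketch of that check, assuming the row is copied from
"tabdump site" output in the usual key,value,comments,disable CSV layout.
The helper name check_master_row is made up for illustration and is not
part of xCAT. Besides testing whether the value is an IP address, it flags
the comma that the quoted value starts with, which may or may not be
intended.

```python
import csv
import ipaddress


def check_master_row(row_text):
    """Return a list of problems found in one CSV row of the site table."""
    # csv handles the quoting, so a comma inside "..." stays in the value.
    key, value = next(csv.reader([row_text]))[:2]
    problems = []
    if key != "master":
        return problems
    cleaned = value.strip(", ")
    if cleaned != value:
        problems.append("value has a stray leading/trailing comma or space")
    try:
        ipaddress.ip_address(cleaned)
    except ValueError:
        problems.append("value is not an IP address")
    return problems


print(check_master_row('"master",",maestro-xcat.maestro.pasteur.fr",,'))
```

An empty list means the row passed both checks.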

From your log, the switch-based discovery worked:
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO  xcat.discovery.switch:
(ac:1f:6b:8b:65:87) Found node: maestro-300

Did you see the node definition updated with this MAC address?

If you define the MTMS/serial number in the predefined node definition,
MTMS-based discovery will be performed.

You can grep for the node name in /var/log/xcat/compute.log; it should
contain all the logs for that compute node.
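
If a plain grep returns too much (for example "maestro-300" also matching
a longer node name), the same filter can be sketched in Python with a
whole-name match. This is only an illustration: lines_for_node is a
made-up helper, and the sample lines are taken from the logs quoted in
this thread.

```python
import re


def lines_for_node(lines, node):
    """Return the log lines that mention `node` as a whole name."""
    # Reject matches glued to other name characters, so that
    # "maestro-300" does not also select "maestro-3001".
    pattern = re.compile(r"(?<![\w-])" + re.escape(node) + r"(?![\w-])")
    return [line for line in lines if pattern.search(line)]


sample = [
    "Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO  xcat.discovery.switch: "
    "(ac:1f:6b:8b:65:87) Found node: maestro-300",
    "Jul 12 19:42:12 maestro-xcat xcat[33151]: INFO  xCAT: Allowing getpostscript",
]
for line in lines_for_node(sample, "maestro-300"):
    print(line)
```

To run it against the real file, pass an open file object for the log
path mentioned above instead of the sample list.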


Thanks,
Casandra

...................................................................
Casandra Hong Qiu
Phone: (845) 433-9291, t/l 293-9291
Office: Building 8, 3-B-04
cxh...@us.ibm.com





From:   Thomas HUMMEL <thomas.hum...@pasteur.fr>
To:     xcat-user@lists.sourceforge.net
Date:   07/12/2019 02:17 PM
Subject:        [EXTERNAL] [xcat-user] Discovery errors



Hello,


I'm building a brand new HPC cluster provisioned with
xCAT-server-2.14.6 on CentOS 7.6 x86_64.

A few "infrastructure" nodes are stateful, compute will be stateless.

Stateless nodes will be switch-based discovered physical nodes.

I have done just this on a previous cluster (older CentOS and xCAT
versions), but with a simpler setup. Here it kind of works, but some logs
confuse me:

I only configured one compute node. As I was not in front of the console
and I remotely powered down/up a 4-server chassis, some errors may be
normal because they come from unconfigured hosts that PXE boot as well.

My setup

- site :

#key,value,comments,disable
"blademaxp","64",,
"domain","maestro.pasteur.fr",,
"fsptimeout","0",,
"installdir","/install",,
"ipmimaxp","64",,
"ipmiretries","3",,
"ipmitimeout","2",,
"consoleondemand","no",,
"master",",maestro-xcat.maestro.pasteur.fr",,
"nameservers","192.168.149.101,192.168.149.102",,
"maxssh","8",,
"ppcmaxp","64",,
"ppcretry","3",,
"ppctimeout","0",,
"powerinterval","0",,
"syspowerinterval","0",,
"sharedtftp","1",,
"SNsyncfiledir","/var/xcat/syncfiles",,
"nodesyncfiledir","/var/xcat/node/syncfiles",,
"tftpdir","/tftpboot",,
"xcatdport","3001",,
"xcatiport","3002",,
"xcatconfdir","/etc/xcat",,
"timezone","Europe/Paris",,
"useNmapfromMN","no",,
"enableASMI","no",,
"db2installloc","/mntdb2",,
"databaseloc","/var/lib",,
"sshbetweennodes","ALLGROUPS",,
"dnshandler","ddns",,
"vsftp","n",,
"cleanupxcatpost","no",,
"dhcplease","43200",,
"auditnosyslog","0",,
"auditskipcmds","ALL",,
"dnsinterfaces","eth0",,
"dhcpinterfaces","eth0",,
"externaldns","1",,

- no service node

- DNS is on separate hosts (provisioned with stateful images using the
same xCAT)

makedns works for forward and reverse zone

- a node I want to be discovered through switch-based discovery:

Object name: maestro-300
     addkcmdline=ipv6.disable=1 biosdevname=0 net.ifnames=0
rd.driver.blacklist=nouveau nouveau.modeset=0
     bmc=10.7.97.48
     bmcport=0
     chain=osimage=netboot-cpu-centos7.6
     groups=maestro_compute,maestro_ipmi,maestro,standard,a12
     ip=192.168.153.48
     mgt=ipmi
     netboot=xnba
     nfsserver=maestro-xcat
     postbootscripts=otherpkgs
     postscripts=syslog,remoteshell,syncfiles
     switch=a12c2.dc1.pasteur.fr
     switchport=37
     tftpserver=maestro-xcat

I removed bmcsetup from chain to be in a simpler situation.

- switches table

"sw","2c",,"<XXXX>",,,,,,,,,

Note: an snmpwalk works fine against the switch, although the MIB
returns a12c2.pasteur.fr instead of a12c2.DC1.pasteur.fr (but the same
is true for the older cluster, where it works just fine).

- switch is created as a node as seen in switch table

"a12c2.dc1.pasteur.fr","sw,all",,,,,,,,,,,

- noderes looks fine to me

"maestro",,"xnba","maestro-xcat",,"maestro-xcat",,,,,,,,,,,,,,,,

- chain also

"maestro_compute",,,"osimage=netboot-cpu-centos7.6",,,

- networks also

When booting, node does get an IP from the dynamic range

2019-07-12T19:31:29.349206+02:00 maestro-xcat dhcpd: DHCPDISCOVER from
ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:30.151476+02:00 maestro-xcat dhcpd: DHCPDISCOVER from
ac:1f:6b:8b:65:8b via eth0
2019-07-12T19:31:30.349611+02:00 maestro-xcat dhcpd: DHCPOFFER on
192.168.144.6 to ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:30.610112+02:00 maestro-xcat dhcpd: DHCPDISCOVER from
ac:1f:6b:8b:65:83 via eth0
2019-07-12T19:31:31.152223+02:00 maestro-xcat dhcpd: DHCPOFFER on
192.168.144.5 to ac:1f:6b:8b:65:8b via eth0
2019-07-12T19:31:31.391140+02:00 maestro-xcat dhcpd: DHCPREQUEST for
192.168.144.6 (192.168.148.10) from ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:31.391172+02:00 maestro-xcat dhcpd: DHCPACK on
192.168.144.6 to ac:1f:6b:8b:65:87 via eth0


But afterwards, some things in the logs that I could not interpret seem
wrong:

1) TFTP Aborted

2019-07-12T19:31:31.395078+02:00 maestro-xcat in.tftpd[31860]: RRQ from
192.168.144.6 filename xcat/xnba.kpxe
2019-07-12T19:31:31.395188+02:00 maestro-xcat in.tftpd[31860]: Error
code 0: TFTP Aborted
2019-07-12T19:31:31.396765+02:00 maestro-xcat in.tftpd[31861]: RRQ from
192.168.144.6 filename xcat/xnba.kpxe
2019-07-12T19:31:31.400618+02:00 maestro-xcat in.tftpd[31861]: Client
192.168.144.6 finish


2) getcredentials

Jul 12 19:33:07 maestro-xcat xcat[31945]: INFO  xCAT: Allowing
getcredentials x509cert
Jul 12 19:33:07 maestro-xcat xcat[31946]: ERR  Received getcredentials
from , which couldn't be correlated to a node (domain mismatch?)


3) switch-based discovery seems to work for my configured node:

Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO
xcat.discovery.aaadiscovery: (ac:1f:6b:8b:65:87) Got a discovery
request, attempting to discover the node...
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO  xcat.discovery.blade:
(ac:1f:6b:8b:65:87) Warning: Could not find any nodes using blade-based
discovery
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO  xcat.discovery.switch:
(ac:1f:6b:8b:65:87) Found node: maestro-300
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO
xcat.discovery.nodediscover: remove gocons session for
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO
xcat.discovery.nodediscover: maestro-300 has been discovered
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO
xcat.discovery.zzzdiscovery: (ac:1f:6b:8b:65:87) Successfully discovered
the node using switch discovery method.
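
To list which MAC address was matched to which node, the "Found node"
messages can be pulled out with a short Python sketch. The regular
expression is an assumption based only on the message format shown above
(each message on a single line, as in the real log file), and
discovered_nodes is a made-up helper name.

```python
import re

# Matches e.g. "xcat.discovery.switch: (ac:1f:6b:8b:65:87) Found node: maestro-300"
FOUND = re.compile(
    r"xcat\.discovery\.(?P<method>\w+):\s+"
    r"\((?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})\)\s+"
    r"Found node:\s+(?P<node>\S+)"
)


def discovered_nodes(lines):
    """Map MAC address -> (discovery method, node name)."""
    found = {}
    for line in lines:
        m = FOUND.search(line)
        if m:
            found[m["mac"]] = (m["method"], m["node"])
    return found


log = [
    "Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO  xcat.discovery.blade: "
    "(ac:1f:6b:8b:65:87) Warning: Could not find any nodes using blade-based discovery",
    "Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO  xcat.discovery.switch: "
    "(ac:1f:6b:8b:65:87) Found node: maestro-300",
]
print(discovered_nodes(log))
```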

4) malformed getpostscript

I see a lot of these:

Jul 12 19:42:12 maestro-xcat xcat[33151]: INFO  xCAT: Allowing
getpostscript
Jul 12 19:42:12 maestro-xcat xcat[33152]: ERR  Received malformed
getpostscript requesting, ignore it.

but I only configured postscripts for stateful nodes (my only stateless
node, maestro-300, is not in the postscripts table):

#node,postscripts,postbootscripts,comments,disable
"xcatdefaults","syslog,remoteshell,syncfiles","otherpkgs",,
"service","servicenode",,,
"maestro-sched","confignetwork -s",,,
"maestro-submit","confignetwork -s",,,
"maestro-bind0","confignetwork -s",,,
"maestro-bind1","confignetwork -s",,,
"maestro-monitor","confignetwork -s",,,

What do you think about those errors?

For some of them, it's not easy to see whether they concern my configured
node or the other servers of the chassis, which PXE boot as well.
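
One way to tell the hosts apart is to map each leased IP back to its MAC
address using the dhcpd DHCPACK lines, then label the tftpd requests with
that MAC. The Python sketch below assumes the dhcpd and in.tftpd message
formats shown earlier in this message, each on a single line;
tftp_requests_by_mac is a made-up helper name. It cannot help with
messages that carry no address at all, such as the getcredentials error.

```python
import re

ACK = re.compile(r"DHCPACK on (?P<ip>\S+) to (?P<mac>[0-9a-f:]{17})")
RRQ = re.compile(r"RRQ from (?P<ip>\S+) filename (?P<file>\S+)")


def tftp_requests_by_mac(lines):
    """Return (mac, filename) pairs for each TFTP read request."""
    leases = {}  # leased IP -> MAC, filled from DHCPACK lines
    requests = []
    for line in lines:
        ack = ACK.search(line)
        if ack:
            leases[ack["ip"]] = ack["mac"]
            continue
        rrq = RRQ.search(line)
        if rrq:
            requests.append((leases.get(rrq["ip"], "unknown"), rrq["file"]))
    return requests


log = [
    "2019-07-12T19:31:31.391172+02:00 maestro-xcat dhcpd: DHCPACK on "
    "192.168.144.6 to ac:1f:6b:8b:65:87 via eth0",
    "2019-07-12T19:31:31.395078+02:00 maestro-xcat in.tftpd[31860]: RRQ from "
    "192.168.144.6 filename xcat/xnba.kpxe",
]
print(tftp_requests_by_mac(log))
```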


One last thing: MTMS discovery seems to be performed even when
switch-based discovery is used. Am I right?


Thanks for your help.

--
Thomas HUMMEL



_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user



