Can you check the xCAT configuration with "xcatprobe xcatmn -i <provision_network_interface>"?
I suggest using the IP address instead of the hostname for master in the site table:

"master",",maestro-xcat.maestro.pasteur.fr",,

From your log, the switch-based discovery worked:

Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.switch: (ac:1f:6b:8b:65:87) Found node: maestro-300

Did you see the node definition updated with this MAC address?

If you define the mtms/serial number in the predefined node definition, the mtms-based discovery will be performed.

You can grep the node name from /var/log/xcat/compute.log; it should catch all the logs for that compute node.

Thanks,
Casandra
...................................................................
Casandra Hong Qiu
Phone: (845) 433-9291, t/l 293-9291
Office: Building 8, 3-B-04
cxh...@us.ibm.com

From: Thomas HUMMEL <thomas.hum...@pasteur.fr>
To: xcat-user@lists.sourceforge.net
Date: 07/12/2019 02:17 PM
Subject: [EXTERNAL] [xcat-user] Discovery errors

Hello,

I'm building a brand new HPC cluster provisioned with xCAT-server-2.14.6 on CentOS 7.6 x86_64. A few "infrastructure" nodes are stateful; compute nodes will be stateless. Stateless nodes will be switch-based discovered physical nodes.

I'm used to doing just this on a previous cluster (older CentOS and xCAT versions), but on a simpler setup.

Here it kind of works, but some logs confuse me: I only configured one compute node. As I was not in front of the console and I remotely powered down/up a 4-server chassis, some errors may be normal because they come from non-configured PXE-booting hosts.

My setup

- site:

#key,value,comments,disable
"blademaxp","64",,
"domain","maestro.pasteur.fr",,
"fsptimeout","0",,
"installdir","/install",,
"ipmimaxp","64",,
"ipmiretries","3",,
"ipmitimeout","2",,
"consoleondemand","no",,
"master",",maestro-xcat.maestro.pasteur.fr",,
"nameservers","192.168.149.101,192.168.149.102",,
"maxssh","8",,
"ppcmaxp","64",,
"ppcretry","3",,
"ppctimeout","0",,
"powerinterval","0",,
"syspowerinterval","0",,
"sharedtftp","1",,
"SNsyncfiledir","/var/xcat/syncfiles",,
"nodesyncfiledir","/var/xcat/node/syncfiles",,
"tftpdir","/tftpboot",,
"xcatdport","3001",,
"xcatiport","3002",,
"xcatconfdir","/etc/xcat",,
"timezone","Europe/Paris",,
"useNmapfromMN","no",,
"enableASMI","no",,
"db2installloc","/mntdb2",,
"databaseloc","/var/lib",,
"sshbetweennodes","ALLGROUPS",,
"dnshandler","ddns",,
"vsftp","n",,
"cleanupxcatpost","no",,
"dhcplease","43200",,
"auditnosyslog","0",,
"auditskipcmds","ALL",,
"dnsinterfaces","eth0",,
"dhcpinterfaces","eth0",,
"externaldns","1",,

- no service node

- DNS is on separate hosts (provisioned with stateful images using the same xCAT); makedns works for the forward and reverse zones

- a node I want to be discovered through switch-based discovery:

Object name: maestro-300
    addkcmdline=ipv6.disable=1 biosdevname=0 net.ifnames=0 rd.driver.blacklist=nouveau nouveau.modeset=0
    bmc=10.7.97.48
    bmcport=0
    chain=osimage=netboot-cpu-centos7.6
    groups=maestro_compute,maestro_ipmi,maestro,standard,a12
    ip=192.168.153.48
    mgt=ipmi
    netboot=xnba
    nfsserver=maestro-xcat
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    switch=a12c2.dc1.pasteur.fr
    switchport=37
    tftpserver=maestro-xcat

I removed bmcsetup from chain to keep the situation simpler.

- switches table

"sw","2c",,"<XXXX>",,,,,,,,,

Note: an snmpwalk works fine against the switch,
although the MIB returns a12c2.pasteur.fr instead of a12c2.DC1.pasteur.fr (but the same is true for the older cluster, where it works just fine)

- the switch is created as a node, as seen in the switch table

"a12c2.dc1.pasteur.fr","sw,all",,,,,,,,,,,

- noderes looks fine to me

"maestro",,"xnba","maestro-xcat",,"maestro-xcat",,,,,,,,,,,,,,,,

- chain also

"maestro_compute",,,"osimage=netboot-cpu-centos7.6",,,

- networks also

When booting, the node does get an IP from the dynamic range:

2019-07-12T19:31:29.349206+02:00 maestro-xcat dhcpd: DHCPDISCOVER from ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:30.151476+02:00 maestro-xcat dhcpd: DHCPDISCOVER from ac:1f:6b:8b:65:8b via eth0
2019-07-12T19:31:30.349611+02:00 maestro-xcat dhcpd: DHCPOFFER on 192.168.144.6 to ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:30.610112+02:00 maestro-xcat dhcpd: DHCPDISCOVER from ac:1f:6b:8b:65:83 via eth0
2019-07-12T19:31:31.152223+02:00 maestro-xcat dhcpd: DHCPOFFER on 192.168.144.5 to ac:1f:6b:8b:65:8b via eth0
2019-07-12T19:31:31.391140+02:00 maestro-xcat dhcpd: DHCPREQUEST for 192.168.144.6 (192.168.148.10) from ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:31.391172+02:00 maestro-xcat dhcpd: DHCPACK on 192.168.144.6 to ac:1f:6b:8b:65:87 via eth0

but afterward, some things in the logs that I could not interpret seem wrong:

1) TFTP Aborted

2019-07-12T19:31:31.395078+02:00 maestro-xcat in.tftpd[31860]: RRQ from 192.168.144.6 filename xcat/xnba.kpxe
2019-07-12T19:31:31.395188+02:00 maestro-xcat in.tftpd[31860]: Error code 0: TFTP Aborted
2019-07-12T19:31:31.396765+02:00 maestro-xcat in.tftpd[31861]: RRQ from 192.168.144.6 filename xcat/xnba.kpxe
2019-07-12T19:31:31.400618+02:00 maestro-xcat in.tftpd[31861]: Client 192.168.144.6 finish

2) getcredentials

Jul 12 19:33:07 maestro-xcat xcat[31945]: INFO xCAT: Allowing getcredentials x509cert
Jul 12 19:33:07 maestro-xcat xcat[31946]: ERR Received getcredentials from , which couldn't be correlated to a node (domain mismatch?)

3) switch-based discovery seems to work for my configured node:

Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.aaadiscovery: (ac:1f:6b:8b:65:87) Got a discovery request, attempting to discover the node...
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.blade: (ac:1f:6b:8b:65:87) Warning: Could not find any nodes using blade-based discovery
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.switch: (ac:1f:6b:8b:65:87) Found node: maestro-300
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO xcat.discovery.nodediscover: remove gocons session for
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO xcat.discovery.nodediscover: maestro-300 has been discovered
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO xcat.discovery.zzzdiscovery: (ac:1f:6b:8b:65:87) Successfully discovered the node using switch discovery method.

4) malformed getpostscript

I see a lot of these:

Jul 12 19:42:12 maestro-xcat xcat[33151]: INFO xCAT: Allowing getpostscript
Jul 12 19:42:12 maestro-xcat xcat[33152]: ERR Received malformed getpostscript requesting, ignore it.

but I only configured postscripts for stateful nodes (my only stateless node, maestro-300, is not in the postscripts table):

#node,postscripts,postbootscripts,comments,disable
"xcatdefaults","syslog,remoteshell,syncfiles","otherpkgs",,
"service","servicenode",,,
"maestro-sched","confignetwork -s",,,
"maestro-submit","confignetwork -s",,,
"maestro-bind0","confignetwork -s",,,
"maestro-bind1","confignetwork -s",,,
"maestro-monitor","confignetwork -s",,,

What do you think about those errors? For some of them, it's not easy to see whether they concern my configured node or the other servers of the chassis, which PXE boot as well.

Last thing: MTMS discovery seems to be performed even when switch-based discovery is used. Am I right?

Thanks for your help.

--
Thomas HUMMEL
_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
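
The thread notes that it is hard to tell which log lines belong to the configured node and which come from the other servers in the chassis. One possible way to sort them, sketched here in Python, is to group lines by MAC address and to attribute lines that only carry an IP (such as the in.tftpd ones) through the DHCPOFFER/DHCPACK lines. The function name group_by_mac and the "unattributed" key are hypothetical, and the regular expressions assume the dhcpd and in.tftpd line formats quoted in this thread; this is not an xCAT tool.

```python
import re
from collections import defaultdict

# Patterns assume the syslog formats quoted in the thread (an assumption).
MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
MAC_RE = re.compile(r"\b" + MAC + r"\b", re.IGNORECASE)
LEASE_RE = re.compile(
    r"DHCP(?:OFFER|ACK) on (\d{1,3}(?:\.\d{1,3}){3}) to (" + MAC + r")",
    re.IGNORECASE,
)
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def group_by_mac(lines):
    """Group log lines by client MAC (hypothetical helper, not part of xCAT)."""
    lines = list(lines)

    # First pass: learn which dynamic IP was offered or acked to which MAC.
    ip_to_mac = {}
    for line in lines:
        lease = LEASE_RE.search(line)
        if lease:
            ip_to_mac[lease.group(1)] = lease.group(2).lower()

    # Second pass: attribute each line by MAC, then by leased IP, else give up.
    groups = defaultdict(list)
    for line in lines:
        mac = MAC_RE.search(line)
        if mac:
            groups[mac.group(0).lower()].append(line)
            continue
        for ip in IP_RE.findall(line):
            if ip in ip_to_mac:
                groups[ip_to_mac[ip]].append(line)
                break
        else:
            groups["unattributed"].append(line)
    return dict(groups)
```

Lines that carry neither a MAC nor a leased IP, such as the getcredentials error or the "TFTP Aborted" line itself, end up under "unattributed"; matching those would need extra keys (for example the in.tftpd process ID), which this sketch does not attempt.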