Hello,

I'm building a brand new HPC cluster provisioned with xCAT-server-2.14.6 on CentOS 7.6 x86_64.

A few "infrastructure" nodes are stateful, compute will be stateless.

The stateless nodes will be physical nodes discovered through switch-based discovery.

I have done exactly this on a previous cluster (older CentOS and xCAT versions), but with a simpler setup. Here it kind of works, but some of the logs confuse me:

I have configured only one compute node. Since I was not in front of the console and remotely power-cycled a 4-server chassis, some of the errors may be normal, coming from the unconfigured hosts that PXE boot as well.

My setup

- site :

#key,value,comments,disable
"blademaxp","64",,
"domain","maestro.pasteur.fr",,
"fsptimeout","0",,
"installdir","/install",,
"ipmimaxp","64",,
"ipmiretries","3",,
"ipmitimeout","2",,
"consoleondemand","no",,
"master",",maestro-xcat.maestro.pasteur.fr",,
"nameservers","192.168.149.101,192.168.149.102",,
"maxssh","8",,
"ppcmaxp","64",,
"ppcretry","3",,
"ppctimeout","0",,
"powerinterval","0",,
"syspowerinterval","0",,
"sharedtftp","1",,
"SNsyncfiledir","/var/xcat/syncfiles",,
"nodesyncfiledir","/var/xcat/node/syncfiles",,
"tftpdir","/tftpboot",,
"xcatdport","3001",,
"xcatiport","3002",,
"xcatconfdir","/etc/xcat",,
"timezone","Europe/Paris",,
"useNmapfromMN","no",,
"enableASMI","no",,
"db2installloc","/mntdb2",,
"databaseloc","/var/lib",,
"sshbetweennodes","ALLGROUPS",,
"dnshandler","ddns",,
"vsftp","n",,
"cleanupxcatpost","no",,
"dhcplease","43200",,
"auditnosyslog","0",,
"auditskipcmds","ALL",,
"dnsinterfaces","eth0",,
"dhcpinterfaces","eth0",,
"externaldns","1",,

- no service node

- DNS is on separate hosts (provisioned with stateful images using the same xCAT)

makedns works for forward and reverse zone

- a node I want to be discovered through switch-based discovery:

Object name: maestro-300
    addkcmdline=ipv6.disable=1 biosdevname=0 net.ifnames=0 rd.driver.blacklist=nouveau nouveau.modeset=0
    bmc=10.7.97.48
    bmcport=0
    chain=osimage=netboot-cpu-centos7.6
    groups=maestro_compute,maestro_ipmi,maestro,standard,a12
    ip=192.168.153.48
    mgt=ipmi
    netboot=xnba
    nfsserver=maestro-xcat
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    switch=a12c2.dc1.pasteur.fr
    switchport=37
    tftpserver=maestro-xcat

I removed bmcsetup from the chain to keep the situation simpler.

- switches table

"sw","2c",,"<XXXX>",,,,,,,,,

Note: an snmpwalk works fine against the switch, although the MIB returns a12c2.pasteur.fr instead of a12c2.DC1.pasteur.fr (but the same is true for the older cluster, where it works just fine).

- switch is created as a node as seen in switch table

"a12c2.dc1.pasteur.fr","sw,all",,,,,,,,,,,

- noderes looks fine to me

"maestro",,"xnba","maestro-xcat",,"maestro-xcat",,,,,,,,,,,,,,,,

- chain also

"maestro_compute",,,"osimage=netboot-cpu-centos7.6",,,

- networks also

When booting, the node does get an IP from the dynamic range:

2019-07-12T19:31:29.349206+02:00 maestro-xcat dhcpd: DHCPDISCOVER from ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:30.151476+02:00 maestro-xcat dhcpd: DHCPDISCOVER from ac:1f:6b:8b:65:8b via eth0
2019-07-12T19:31:30.349611+02:00 maestro-xcat dhcpd: DHCPOFFER on 192.168.144.6 to ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:30.610112+02:00 maestro-xcat dhcpd: DHCPDISCOVER from ac:1f:6b:8b:65:83 via eth0
2019-07-12T19:31:31.152223+02:00 maestro-xcat dhcpd: DHCPOFFER on 192.168.144.5 to ac:1f:6b:8b:65:8b via eth0
2019-07-12T19:31:31.391140+02:00 maestro-xcat dhcpd: DHCPREQUEST for 192.168.144.6 (192.168.148.10) from ac:1f:6b:8b:65:87 via eth0
2019-07-12T19:31:31.391172+02:00 maestro-xcat dhcpd: DHCPACK on 192.168.144.6 to ac:1f:6b:8b:65:87 via eth0


But afterwards, several things in the logs look wrong to me, and I have not managed to interpret them:

1) TFTP Aborted

2019-07-12T19:31:31.395078+02:00 maestro-xcat in.tftpd[31860]: RRQ from 192.168.144.6 filename xcat/xnba.kpxe
2019-07-12T19:31:31.395188+02:00 maestro-xcat in.tftpd[31860]: Error code 0: TFTP Aborted
2019-07-12T19:31:31.396765+02:00 maestro-xcat in.tftpd[31861]: RRQ from 192.168.144.6 filename xcat/xnba.kpxe
2019-07-12T19:31:31.400618+02:00 maestro-xcat in.tftpd[31861]: Client 192.168.144.6 finish


2) getcredentials

Jul 12 19:33:07 maestro-xcat xcat[31945]: INFO xCAT: Allowing getcredentials x509cert
Jul 12 19:33:07 maestro-xcat xcat[31946]: ERR Received getcredentials from , which couldn't be correlated to a node (domain mismatch?)


3) switch-based discovery seems to work for my configured node:

Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.aaadiscovery: (ac:1f:6b:8b:65:87) Got a discovery request, attempting to discover the node...
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.blade: (ac:1f:6b:8b:65:87) Warning: Could not find any nodes using blade-based discovery
Jul 12 19:34:34 maestro-xcat xcat[31472]: INFO xcat.discovery.switch: (ac:1f:6b:8b:65:87) Found node: maestro-300
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO xcat.discovery.nodediscover: remove gocons session for
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO xcat.discovery.nodediscover: maestro-300 has been discovered
Jul 12 19:34:35 maestro-xcat xcat[31472]: INFO xcat.discovery.zzzdiscovery: (ac:1f:6b:8b:65:87) Successfully discovered the node using switch discovery method.

4) malformed getpostscript

I see a lot of these:

Jul 12 19:42:12 maestro-xcat xcat[33151]: INFO  xCAT: Allowing getpostscript
Jul 12 19:42:12 maestro-xcat xcat[33152]: ERR Received malformed getpostscript requesting, ignore it.

But I only configured postscripts for stateful nodes (my only stateless node, maestro-300, is not in the postscripts table):

#node,postscripts,postbootscripts,comments,disable
"xcatdefaults","syslog,remoteshell,syncfiles","otherpkgs",,
"service","servicenode",,,
"maestro-sched","confignetwork -s",,,
"maestro-submit","confignetwork -s",,,
"maestro-bind0","confignetwork -s",,,
"maestro-bind1","confignetwork -s",,,
"maestro-monitor","confignetwork -s",,,

What do you think about these errors?

For some of them, it's not easy to tell whether they concern my configured node or the other servers of the chassis, which PXE boot as well.
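
One way to attribute log lines to a host might be to group them by MAC address, using the DHCPOFFER/DHCPACK lines to map a leased IP back to its MAC. The following is only a minimal sketch under assumptions: it relies on the dhcpd and in.tftpd line formats shown above, and the function name `group_by_mac` is hypothetical, not part of xCAT.

```python
import re
from collections import defaultdict

# Patterns assume the dhcpd / in.tftpd syslog formats quoted in this message.
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)
LEASE_RE = re.compile(
    r"DHCP(?:OFFER|ACK) on (\d+\.\d+\.\d+\.\d+) to ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def group_by_mac(lines):
    """Group syslog lines by client MAC (hypothetical helper).

    Lines carrying a MAC are filed under it. Lines carrying only an IP are
    filed under the MAC that was last offered/acked that IP. Everything
    else goes under "unknown".
    """
    ip_to_mac = {}
    groups = defaultdict(list)
    for line in lines:
        lease = LEASE_RE.search(line)
        if lease:
            ip_to_mac[lease.group(1)] = lease.group(2).lower()
        mac = MAC_RE.search(line)
        if mac:
            key = mac.group(0).lower()
        else:
            key = "unknown"
            for ip in IP_RE.findall(line):
                if ip in ip_to_mac:
                    key = ip_to_mac[ip]
                    break
        groups[key].append(line)
    return dict(groups)
```

Lines that carry neither a MAC nor a leased IP (the "TFTP Aborted" and "malformed getpostscript" lines, for instance) end up under "unknown", so this only narrows things down; it does not settle which host sent them.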


One last thing: MTMS discovery seems to be performed even when switch-based discovery is used. Am I right?


Thanks for your help.

--
Thomas HUMMEL


