Hi Brian,
Sorry for the late response.
With `site.xcatdebugmode` set to `1` or `2`, rerun `nodeset` + `rsetboot` + `rpower`. The postscripts will then run with `set -x`, so you can
find the debug trace of the postscript execution in "/var/log/xcat/xcat.log" on the provisioned compute node (/mnt/sysimage/var/log.... inside the installer). With `site.xcatdebugmode=2`, you can even ssh to the installer during installation.
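As a rough sketch, the sequence above could look like the following. The node name `n0033` is taken from your logs; the osimage name is a placeholder I made up, so substitute your own. The helper names `run` and `reprovision_with_debug` are only for illustration, and nothing is executed unless `DRY_RUN=0` is set, so the commands can be reviewed first.

```shell
# Sketch: enable xCAT debug mode and reprovision one node.
# NODE comes from the logs in this thread; OSIMAGE is a made-up placeholder.
NODE="${NODE:-n0033}"
OSIMAGE="${OSIMAGE:-rhels6.8-x86_64-install-compute}"
DEBUGMODE="${DEBUGMODE:-2}"   # 1 or 2: postscripts run with set -x; 2 also allows ssh to the installer
DRY_RUN="${DRY_RUN:-1}"       # default: only print the commands

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reprovision_with_debug() {
    run chdef -t site -o clustersite "xcatdebugmode=$DEBUGMODE"
    run nodeset "$NODE" "osimage=$OSIMAGE"
    run rsetboot "$NODE" net
    run rpower "$NODE" boot
}

reprovision_with_debug
```

Run it once as-is on the MN to see the four commands, then again with `DRY_RUN=0` when they look right for your cluster.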
Back to your issue: based on "xcat deploy logs.txt", I am sure the issue is the same as https://sourceforge.net/p/xcat/bugs/4579/ .
The error messages like "Error: Unable to read private ECDSA key from /etc/xcat/hostkeys" and "
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]: Disabling protocol version 1. Could not load host key
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]: WARNING: /etc/ssh/moduli does not exist, using fixed modulus
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]: Accepted none for root from 172.16.1.2 port 41488 ssh2
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]: Received disconnect from 172.16.1.2: 11: disconnected by user
" are just innocuous messages caused by ssh version compatibility.
The key information is that sshd failed to start because of:
```
computes.log-20180812:Aug 10 19:29:08 n0033 sshd[15995]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
computes.log-20180812:Aug 10 19:29:08 n0033 sshd[15995]: error: Bind to port 22 on :: failed: Address already in use.
computes.log-20180812:Aug 10 19:29:08 n0033 sshd[15995]: fatal: Cannot bind any address.
```
It seems some other service has occupied port 22. The logs you provided do not give more information on this.
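Once you have a shell on the node while this is happening, something like the sketch below may show which process owns port 22. It assumes `netstat -tlnp` output in the usual net-tools column layout (local address in column 4, state in column 6); the helper name `listeners_on_port` is mine and not part of xCAT.

```shell
# Sketch: filter `netstat -tlnp` output (read from stdin) for listeners on one TCP port.
# Assumes the net-tools column layout: $4 = local address, $6 = state.
listeners_on_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            n = split($4, parts, ":")      # handles both 0.0.0.0:22 and :::22
            if (parts[n] == port) print $0
        }'
}

# Hypothetical usage on the node:
#   netstat -tlnp | listeners_on_port 22
#   ps axf
```

The last column of each matching line is the PID and program name, which you can then look up in the `ps axf` tree.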
Since I cannot access your env, here are some debug hints:
1. Look into "/var/log/xcat/xcat.log" on the provisioned compute node and see whether the debug trace uncovers anything.
2. Add a long sleep (such as `sleep 3600`) at the end of `/install/postscripts/remoteshell` (line #498) on the MN:
```
467 if [[ $OSVER == ubuntu* || $OSVER == debian* ]]
468 then
469 if [ ! -d /var/run/sshd ];then
470 #"/var/run/sshd":
471 #Contains the process ID of the sshd listening for connections
472 #(if there are several daemons running concurrently for different ports,
473 #this contains the process ID of the one started last).
474 #The content of this file is not sensitive; it can be world-read-able.
475 #prepare the "/var/run/sshd" for ubuntu
476 mkdir /var/run/sshd
477 chmod 0755 /var/run/sshd
478 fi
479 #service ssh restart
480 restartservice ssh
481 else
482 #service sshd restart
483 # sshd is not enabled on SLES 12 by default
484 # does not hurt anything to re-enable if it is enabled already
485 # and disable enable service for diskless and statelite
486 if [[ "$NODESETSTATE" != netboot && "$NODESETSTATE" != statelite ]]; then
487 enableservice sshd
488 fi
489 restartservice sshd
490 fi
491 #if the service restart with "service/systemctl" failed
492 #try to kill the process and start
493 if [ "$?" != "0" ];then
494 PIDLIST=`ps aux | grep -v grep | grep "/usr/sbin/sshd"|awk -F" " '{print $2}'|xargs`
495 [ -n "$PIDLIST" ] && kill 9 $PIDLIST
496 /usr/sbin/sshd
497 fi
498 kill -9 $CREDPID
```
Then set `site.xcatdebugmode=2`, rerun `nodeset` to apply the debug mode, and provision the node. ssh to the installer with `ssh <compute node name>` and check the `ps` tree and the `netstat` output for port 22 to find the real issue.
3. Add some debug prints (`echo xxxx`) or log entries (`logger -t xcat -p local4.info `) around lines #467 to #498 in `/install/postscripts/remoteshell` on the MN, then provision the node to see what happens. You will find the debug prints in /var/log/xcat/xcat.log on the provisioned compute node and the log entries in /var/log/xcat/computes.log on the MN.
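For hint 3, a small helper along these lines might be convenient. The name `dbg` and the message prefix are hypothetical. One thing to watch: line #493 tests `$?` right after the restart, so any marker placed between the restart and that test must not change the exit status. The sketch below hands the previous exit status back to the caller for that reason.

```shell
# Sketch: debug marker for /install/postscripts/remoteshell (the name dbg is hypothetical).
# It returns the exit status of whatever ran just before it, so the
# `if [ "$?" != "0" ]` test at line #493 still sees the restart result.
dbg() {
    _dbg_rc=$?                                   # status of the previous command
    _dbg_msg="remoteshell debug: $*"
    echo "$_dbg_msg"                             # shows up in xcat.log on the node
    if command -v logger >/dev/null 2>&1; then
        logger -t xcat -p local4.info "$_dbg_msg" 2>/dev/null || true
    fi
    return "$_dbg_rc"
}

# Hypothetical placement around line #489:
#   dbg "before restartservice sshd"
#   restartservice sshd
#   dbg "after restartservice sshd, rc=$?"
```

Define the function near the top of the script and remove the markers again once the cause is found.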
best regards
------------------------------------------------------------------------------
YANG Song (杨嵩)
IBM China System Technology Laboratory
Tel: 86-10-82452903
Email: yang...@cn.ibm.com
Address: Building 28, ZhongGuanCun Software Park,
No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC
北京市海淀区东北旺西路8号中关村软件园28号楼
邮编: 100193
----- Original message -----
From: Brian Joiner <martinitime1...@gmail.com>
To: Song BJ Yang <yang...@cn.ibm.com>
Cc:
Subject: Re: [xcat-user] Syncfiles getting deleted after reboot....
Date: Tue, Aug 14, 2018 6:23 AM
Song Yang,

See attached for deployment logs from Friday. There are ssh key errors, and I know that remoteshell tries to restart ssh to prep for sync files running. But postscripts finishes with exit code 0.

Thanks,
Brian Joiner

On Mon, Aug 13, 2018 at 2:22 PM, Brian Joiner <martinitime1...@gmail.com> wrote:

We set the debug mode to 2 to get extra data, but so far I haven't seen anything that sticks out. I always get an exit code of 0 on 'syncfiles', but I'll dig around some more. Also, there are no service nodes on the cluster.

Thanks,
Brian Joiner
--

On Sun, Aug 12, 2018 at 10:05 PM, Song BJ Yang <yang...@cn.ibm.com> wrote:

Hi Brian,

Good catch. Based on the description, I think it may be similar to the issue described in this ticket.

Two hints:
1. Would you please take a look at the file `/var/log/xcat/xcat.log` on the compute node after reboot? This file contains some logs from the postscripts. If the information is not sufficient to pin down the real cause, you can enable `site.xcatdebugmode` with `chdef -t site -o clustersite xcatdebugmode=1` and reprovision the node to get more verbose information.
2. Is it a hierarchical cluster with service nodes? If yes, have you upgraded xCATsn on the SN?

Best regards
------------------------------------------------------------------------------
----- Original message -----
From: Brian Joiner <martinitime1...@gmail.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] Syncfiles getting deleted after reboot....
Date: Sat, Aug 11, 2018 7:31 AM
Yang Song,

Could it be related to https://sourceforge.net/p/xcat/bugs/4579/ ? I checked the remoteshell script and it has /usr/sbin/sshd at the bottom, so I'm not sure what's happening.

On Fri, Aug 10, 2018 at 1:22 PM, Brian Joiner <martinitime1...@gmail.com> wrote:

Song,

Yes, there are otherpkgs that run after the reboot. I'll try to get the OS and node defs to you at some point today. The client did respond back after doing some investigation of his own:

"..it appears that the script is running without a chroot and writing the files to the genimage transient file system."

I did a further test by creating a 600 second sleep postscript, and found that the syncfiles are in /etc/ but not in /mnt/sysimage/etc/.

I put a test file in /mnt/sysimage/etc/ and it survived the reboot. None of the 'syncfiles' were there, but my test file was.

Why would syncfiles not write to the correct directory during deployment? I'm concerned that something didn't go right during the upgrade.

Thanks,
Brian Joiner
--

On Thu, Aug 9, 2018 at 10:10 PM, Song BJ Yang <yang...@cn.ibm.com> wrote:

Hi Brian Joiner,

Are there any packages specified in `otherpkglist` and `otherpkgdir`, which will be installed by `otherpkgs` during the post-installation reboot?

Would you please provide the osimage definition and node definition? Thanks.
------------------------------------------------------------------------------
----- Original message -----
From: Brian Joiner <martinitime1...@gmail.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: [xcat-user] Syncfiles getting deleted after reboot....
Date: Wed, Aug 8, 2018 8:31 AM
Hardware: Dell
Deployment OS: RHEL 6.8
No changes were made to the OS definition, other than adding the syncfile list.

Our client upgraded their xCAT version from 2.7 to 2.14.1 and we're seeing some bizarre behavior when deploying the nodes.

Just to make everything as clean as possible, I created a separate group with no postscripts (so only the default postscripts run), and removed all other postscripts and postbootscripts from the node definition.

What's happening is: during initial deployment, 'syncfiles' copies over files, I've verified that they exist with 'ls', then the normal post-install reboot occurs. After the reboot, all of the synced files are GONE. Multiple files, in multiple directories (mostly in /etc). I even created a dummy test file to make sure, and it's there during install but not after the reboot. Syncfiles always exits with 0.

updatenode -F will resync the files, and they survive a reboot.

This problem is so strange, I've never seen anything like it. Any ideas?

--
Brian Joiner
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
xcat-u...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user