Hi Brian,
 
Sorry for the late response.
 
With `site.xcatdebugmode` set to "1" or "2", rerun `nodeset`, `rsetboot` and `rpower`. The postscripts will then run with `set -x`, so you can
find the debug trace of postscript execution in "/var/log/xcat/xcat.log" on the provisioned compute node (/mnt/sysimage/var/log.... inside the installer). With `site.xcatdebugmode=2`, you can even ssh to the installer during installation.
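
For reference, here is a minimal sketch of that sequence. The function names `run` and `reprovision` are only illustrative, the osimage name has to be your own, and the `net` and `boot` arguments to `rsetboot` and `rpower` are my assumption about how you normally kick off a deployment. With `DRY_RUN=1` the commands are only printed, not executed.

```shell
#!/bin/bash
# Sketch: enable xCAT debug mode and reprovision one node.
# "run" and "reprovision" are hypothetical helper names, not xCAT commands.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: reprovision <node> <osimage> [debugmode]
reprovision() {
    local node="$1" osimage="$2" debugmode="${3:-2}"
    # 1 or 2; with 2 you can also ssh into the installer
    run chdef -t site -o clustersite "xcatdebugmode=$debugmode" &&
    run nodeset "$node" "osimage=$osimage" &&
    run rsetboot "$node" net &&   # assumed: boot from network
    run rpower "$node" boot       # assumed: power-cycle the node
}
```

For example, `DRY_RUN=1 reprovision n0033 <your osimage>` prints the four commands so you can check them before running them for real.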
 
Back to your issue: based on the "xcat deploy logs.txt", I am sure the issue is the same as https://sourceforge.net/p/xcat/bugs/4579/.
The error messages like "Error: Unable to read private ECDSA key from /etc/xcat/hostkeys" and "
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]:  Disabling protocol version 1. Could not load host key
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]:  WARNING: /etc/ssh/moduli does not exist, using fixed modulus
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]:  Accepted none for root from 172.16.1.2 port 41488 ssh2
computes.log-20180812:Aug 10 19:29:09 n0033 sshd[16020]:  Received disconnect from 172.16.1.2: 11: disconnected by user
" are just innocuous messages caused by ssh version compatibility.
 
 
 
The key information is that sshd failed to start because of:
```
computes.log-20180812:Aug 10 19:29:08 n0033 sshd[15995]:  error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
computes.log-20180812:Aug 10 19:29:08 n0033 sshd[15995]:  error: Bind to port 22 on :: failed: Address already in use.
computes.log-20180812:Aug 10 19:29:08 n0033 sshd[15995]:  fatal: Cannot bind any address.
```
It seems some other service has already occupied port 22. The logs you provided do not give more information on this.
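
One way to see which process is holding port 22 is to filter the listening sockets. This is only a sketch: it assumes `netstat -tlnp` is available on the node (or inside the installer) and prints its usual column layout, and `port22_listeners` is a name I made up for illustration.

```shell
#!/bin/bash
# Sketch: list the sockets listening on port 22 from `netstat -tlnp` output.
# Reads netstat output on stdin, so it also works on saved output.
# "port22_listeners" is a hypothetical helper name.
port22_listeners() {
    # Column 4 is the local address (e.g. 0.0.0.0:22 or :::22),
    # column 6 is the socket state, column 7 is PID/program name.
    awk '$6 == "LISTEN" && $4 ~ /:22$/ { print $4, $7 }'
}

# On the node or inside the installer:
#   netstat -tlnp | port22_listeners
#   ps -ef | grep '[s]shd'
```

If the program shown in the last column is not the sshd you expect, that is the process to look at.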
 
 
Since I cannot access your environment, here are some debugging hints:
1. Look into "/var/log/xcat/xcat.log" on the provisioned compute node and see whether the debug trace uncovers anything.
2. Add a long sleep (such as `sleep 3600`) at the end of `/install/postscripts/remoteshell` (line #498) on the MN:
```
467 if [[ $OSVER == ubuntu* || $OSVER == debian* ]]
468 then
469     if [ ! -d /var/run/sshd ];then
470         #"/var/run/sshd":
471         #Contains the process ID of the sshd listening for connections
472         #(if there are several daemons running concurrently for different ports,
473         #this contains the process ID of the one started last).
474         #The content of this file is not sensitive; it can be world-read-able.
475         #prepare the "/var/run/sshd" for ubuntu
476         mkdir /var/run/sshd
477         chmod 0755 /var/run/sshd
478     fi
479     #service ssh restart
480     restartservice ssh
481 else
482     #service sshd restart
483     # sshd is not enabled on SLES 12 by default
484     # does not hurt anything to re-enable if it is enabled already
485     # and disable enable service for diskless and statelite
486     if [[ "$NODESETSTATE" != netboot && "$NODESETSTATE" != statelite ]]; then
487         enableservice sshd
488     fi
489     restartservice sshd
490 fi
491 #if the service restart with "service/systemctl" failed
492 #try to kill the process and start
493 if [ "$?" != "0" ];then
494    PIDLIST=`ps aux | grep -v grep | grep "/usr/sbin/sshd"|awk -F" " '{print $2}'|xargs`
495    [ -n "$PIDLIST" ] && kill 9 $PIDLIST
496    /usr/sbin/sshd
497 fi
498 kill -9 $CREDPID
```
 
Then set `site.xcatdebugmode=2`, rerun `nodeset` to apply the debug mode, and provision the node. ssh to the installer with `ssh <compute node name>` and check the `ps` tree and the `netstat` output for port 22 to find the real issue.
 
3. Add some debug prints (`echo xxxx`) or log entries (`logger -t xcat -p local4.info `) around lines #467 to #498 in `/install/postscripts/remoteshell` on the MN, then provision the node to see what happens. You will find the debug prints in /var/log/xcat/xcat.log on the provisioned compute node and the log entries in /var/log/xcat/computes.log on the MN.
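
As a rough sketch of such debug lines: the helper below combines the `echo` and the `logger` call. `dbg` and the `remoteshell-debug:` prefix are hypothetical names, not part of xCAT. Since line #493 tests `$?` right after the service restart, the helper returns the exit status it found on entry, so that inserting it does not change what that test sees.

```shell
#!/bin/bash
# Sketch of a debug helper for a postscript; "dbg" is a hypothetical name.
dbg() {
    local rc=$?   # exit status of the command that ran just before dbg
    # echo ends up in the postscript output on the compute node
    echo "remoteshell-debug: $*"
    # logger sends the same text to syslog; skip quietly if it is unavailable
    if command -v logger >/dev/null 2>&1; then
        logger -t xcat -p local4.info "remoteshell-debug: $*" 2>/dev/null || true
    fi
    return $rc    # hand the original exit status back to the caller
}

# Example use around the sshd restart:
#   dbg "before restartservice sshd"
#   restartservice sshd
#   dbg "restartservice sshd returned $?"
```

In the last example line `$?` is expanded before `dbg` runs, so the message shows the exit status of `restartservice`.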
 
 
 
best regards
 
 
------------------------------------------------------------------------------
YANG Song (杨嵩)
IBM China System Technology Laboratory
Tel: 86-10-82452903
Email: yang...@cn.ibm.com
Address: Building 28, ZhongGuanCun Software Park,
No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC

北京市海淀区东北旺西路8号中关村软件园28号楼
邮编: 100193
 
 
----- Original message -----
From: Brian Joiner <martinitime1...@gmail.com>
To: Song BJ Yang <yang...@cn.ibm.com>
Cc:
Subject: Re: [xcat-user] Syncfiles getting deleted after reboot....
Date: Tue, Aug 14, 2018 6:23 AM
 
Song Yang,
 
See attached for deployment logs from Friday.  There are ssh key errors, and I know that remoteshell tries to restart ssh to prep for sync files running.  But postscripts finishes with exit code 0.
 
 
Thanks,
Brian Joiner
 
On Mon, Aug 13, 2018 at 2:22 PM, Brian Joiner <martinitime1...@gmail.com> wrote:
We set the debug mode to 2 to get extra data, but so far I haven't seen anything that sticks out.  I always get an exit code of 0 on 'syncfiles', but I'll dig around some more.  Also, there are no service nodes on the cluster.  
 
Thanks,
Brian Joiner
 
On Sun, Aug 12, 2018 at 10:05 PM, Song BJ Yang <yang...@cn.ibm.com> wrote:
Hi Brian,
 
Good catch. Based on the description, I think it may be similar to the issue described in this ticket.
 
2 hints:
 
1. Would you please take a look at the file `/var/log/xcat/xcat.log` on the compute node after the reboot? This file contains logs from the postscripts. If the information is not sufficient to pinpoint the real cause, you can enable `site.xcatdebugmode` with `chdef -t site -o clustersite xcatdebugmode=1` and reprovision the node to get more verbose information.
 
2. Is it a hierarchical cluster with service nodes? If yes, have you upgraded xCATsn on the SN?
 
best regards
 
 
----- Original message -----
From: Brian Joiner <martinitime1...@gmail.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] Syncfiles getting deleted after reboot....
Date: Sat, Aug 11, 2018 7:31 AM
 
Yang Song,
 
 
I checked the remoteshell script and it has /usr/sbin/sshd at the bottom, so I'm not sure what's happening.
 
On Fri, Aug 10, 2018 at 1:22 PM, Brian Joiner <martinitime1...@gmail.com> wrote:
Song,
 
Yes there are otherpkgs that run after the reboot.  I'll try to get the OS and node defs to you at some point today.  The client did respond back after doing some investigation of his own:
 
"..it appears that the script is running without a chroot and writing the files to the genimage transient file system."  I did a further test by creating a 600 second sleep postscript, and found that the syncfiles are in the /etc/  but not in /mnt/sysimage/etc/
 
I put a test file in /mnt/sysimage/etc/ and it survived the reboot.  None of the 'syncfiles' were there, but my test file was.
 
Why would syncfiles not write to the correct directory during deployment?   I'm concerned that something didn't go right during the upgrade.
 
 
Thanks,
Brian Joiner
 
On Thu, Aug 9, 2018 at 10:10 PM, Song BJ Yang <yang...@cn.ibm.com> wrote:
Hi Brian Joiner,
 
Are there any packages specified in `otherpkglist` and `otherpkgdir`? These will be installed by `otherpkgs` during the post-installation reboot.
 
 
Would you please provide the osimage definition and the node definition? Thanks.
 
 
----- Original message -----
From: Brian Joiner <martinitime1...@gmail.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: [xcat-user] Syncfiles getting deleted after reboot....
Date: Wed, Aug 8, 2018 8:31 AM
 
 
Hardware: Dell
Deployment OS:  RHEL 6.8
No changes were made to OS definition, other than adding the syncfile list
 
 
Our client upgraded their xCAT version from 2.7 to 2.14.1 and we're seeing some bizarre behavior when deploying the nodes.
 
Just to make everything as clean as possible, I created a separate group with no postscripts (so only the default postscripts run), and removed all other post scripts and postbootscripts from the node definition.
 
What's happening is:  during initial deployment, 'syncfiles' copies over files, I've verified that they exist with 'ls', then the normal post install reboot occurs.  After the reboot, all of the synced files are GONE.  Multiple files, in multiple directories (mostly in /etc).  I even created a dummy test file to make sure, and it's there during install but not after the reboot.  Syncfiles always exits with 0
 
updatenode -F will resync the files, and they survive a reboot.  
 
This problem is so strange, I've never seen anything like it.  Any ideas?
 
--
Brian Joiner
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
xcat-u...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
 

