Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

2018-05-20 Thread Yuan Y Bai
Hi Jeff,
 
You can check the following:
 
1. Check that every NFS server exports its directories correctly, including the NFS servers listed in the `statelite` and `litetree` tables:
    showmount -e 
 
2. Use "lsdef -t osimage  -i pkglist" to find the pkglist path, then add the missing package names (such as yum) to that file. Then run:
  genimage 
  liteimg 
  nodeset  osimage=
  rsetboot  net    # if your node is not a VM
  rpower  reset
 
3. I installed RHEL 7.4 and have the following dracut packages, though I think you are using a different OS:
  dracut-033-502.el7.x86_64
  dracut-network-033-502.el7.x86_64
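 
As a minimal sketch, the command sequence in step 2 could be wrapped in a small script. The image name and node name are taken from the original post in this thread and are assumptions here; the `run` helper and the `DRY_RUN` variable are hypothetical and, by default, only print the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only. IMAGE and NODE are example values from the original post in
# this thread; `run` and DRY_RUN are hypothetical helpers, not part of xCAT.
IMAGE="${IMAGE:-SL7.4-compute-v1-201804}"
NODE="${NODE:-node-i01}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_statelite() {
    run genimage "$IMAGE" &&                 # rebuild the root image from the pkglist
    run liteimg "$IMAGE" &&                  # prepare the image for statelite
    run nodeset "$NODE" "osimage=$IMAGE" &&  # point the node at the image
    run rsetboot "$NODE" net &&              # skip this for a VM, per step 2
    run rpower "$NODE" reset
}

rebuild_statelite
```

Running it with `DRY_RUN=0` on the management node would execute the commands in order and stop at the first failure.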
 
Best Regards
--
Yuan Bai (白媛)

CSTL HPC System Management Development
Tel:86-10-82451401
E-mail: by...@cn.ibm.com
Address: IBM ZGC Campus. Ring Building 28,
ZhongGuanCun Software Park,No.8 Dong Bei Wang West Road, Haidian District,
Beijing P.R.China 100193

IBM环宇大厦
北京市海淀区东北旺西路8号,中关村软件园28号楼
邮编:100193
 
 

Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

2018-05-18 Thread Jeff Berry
Some more digging suggests that the image is missing a lot of useful (needed) 
packages, including yum, and the only installed dracut package is just 
dracut.x86_64.

It looks like maybe the image build didn’t get the needed packages?
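
One way to confirm which packages the image lacks is to query the RPM database inside the image root. This is only a sketch: the `ROOTIMG` path assumes the image root lives in a `rootimg` directory under the osimage's `rootimgdir` (as the kernel command line in the original post suggests), and `missing_pkgs` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only. ROOTIMG assumes the image root sits in "rootimg" under the
# osimage's rootimgdir; missing_pkgs is a hypothetical helper.
ROOTIMG="/install/netboot/SL7/x86_64/compute-v1/rootimg"

# Read installed package names on stdin (one per line) and print every
# wanted name (given as arguments) that is not among them.
missing_pkgs() {
    installed="$(cat)"
    for pkg in "$@"; do
        printf '%s\n' "$installed" | grep -qxF "$pkg" || echo "$pkg"
    done
}

# On the management node one might run:
#   rpm --root "$ROOTIMG" -qa --qf '%{NAME}\n' | missing_pkgs dracut dracut-network yum

# Self-contained demonstration with a made-up package list:
printf 'dracut\nbash\n' | missing_pkgs dracut dracut-network yum
```

With the made-up list above, the helper reports `dracut-network` and `yum` as missing.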

I haven’t been able to get rd.debug output or get any breakpoints to work – I 
tried them all.

Thanks for everyone’s time, and sorry if I’m making obvious rookie mistakes ...

Jeff Berry
jeff.be...@mrc-cbu.cam.ac.uk



Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

2018-05-17 Thread Yuan Y Bai
 
Did you try all of the break points (pre-mount, mount, and pre-pivot), and did every one of them show the same problem?
 
 
Best Regards
--
Yuan Bai (白媛)

CSTL HPC System Management Development
Tel:86-10-82451401
E-mail: by...@cn.ibm.com
Address: IBM ZGC Campus. Ring Building 28,
ZhongGuanCun Software Park,No.8 Dong Bei Wang West Road, Haidian District,
Beijing P.R.China 100193

IBM环宇大厦
北京市海淀区东北旺西路8号,中关村软件园28号楼
邮编:100193
 
 

Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

2018-05-17 Thread Jeff Berry
Good afternoon,

Thanks for the pointers.
The xCAT version is: 2.13.11

As per Gilad’s suggestion, I tried booting to shell and that worked just fine.

I then tried your suggestions below with no luck.  However, it looks like 
there is a more fundamental problem.  None of the rd.break breakpoints worked, 
and the node booted to the same point before hanging.  This suggests to me that 
the dracut hooks are not working properly.  I’m investigating that more 
thoroughly.

I did want to thank you both for the replies.

Jeff Berry
jeff.be...@mrc-cbu.cam.ac.uk



From: Yuan Y Bai [mailto:by...@cn.ibm.com]
Sent: 16 May 2018 06:37
To: xcat-user@lists.sourceforge.net
Cc: xcat-user@lists.sourceforge.net
Subject: Re: [xcat-user] [External] Problem with statelite boot - no console or 
ssh

Hi Jeff Berry

I looked at cluster.log, and I suspect your xCAT version is not the latest. What 
is your xCAT version? You can run "lsxcatd -v" to get it.

From the log, your node hangs during "Allowing litetree from node-i01". You can 
add a break point and get onto node-i01 to debug and find more useful information.
Run the following to reach node-i01 through the console:

 chdef node-i01 addkcmdline=rd.break=cleanup
 rinstall node-i01 osimage
 rcons node-i01

Once you are on node-i01, you can find statelite.log under 
"/sysroot/.statelite" and manually check whether the mounts are OK, etc. After 
you have checked everything, run "exit" (more than once if needed); if the 
system is fine, it will continue booting into the normal statelite system.
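
One thing to watch, as far as I can tell: setting `addkcmdline` with a plain `chdef` replaces the existing value, and node-i01 already carries a bonding option there (see the lsdef output in the original post). A minimal sketch that prints one `chdef` command per dracut break point while keeping that option follows. `breakpoint_cmd` is a hypothetical helper, and the break point names are the ones mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: print a chdef command per dracut break point, keeping the
# bonding option that node-i01 already has in addkcmdline.
NODE="node-i01"
BASE_OPTS="bond=bond0:eth0,eth1:mode=4"   # from the lsdef output in the original post

# $1 = dracut break point name (for example pre-mount, mount, pre-pivot, cleanup)
breakpoint_cmd() {
    printf 'chdef %s addkcmdline="%s rd.break=%s"\n' "$NODE" "$BASE_OPTS" "$1"
}

for bp in pre-mount mount pre-pivot cleanup; do
    breakpoint_cmd "$bp"
done
```

Each printed command would be followed by the same `rinstall` and `rcons` steps as above, trying one break point per boot.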


Best Regards
--
Yuan Bai (白媛)

CSTL HPC System Management Development
Tel:86-10-82451401
E-mail: by...@cn.ibm.com
Address: IBM ZGC Campus. Ring Building 28,
ZhongGuanCun Software Park,No.8 Dong Bei Wang West Road, Haidian District,
Beijing P.R.China 100193

IBM环宇大厦
北京市海淀区东北旺西路8号,中关村软件园28号楼
邮编:100193



Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

2018-05-15 Thread Gilad Berman
Do you have any issues booting to shell or installing a standard image?


Gilad Berman
HPC Architect
Lenovo EMEA

Phone: +972-52-2554262
Email: gber...@lenovo.com







From: Jeff Berry 
Sent: Tuesday, May 15, 2018 2:43 PM
To: xcat-user@lists.sourceforge.net
Subject: [External] [xcat-user] Problem with statelite boot - no console or ssh

Good day,

I am trying to bring up a new cluster to replace our aging system, and am 
trying to duplicate our basic setup  - which is to say statelite provision on 
Dell PowerEdge servers.  After working through the documentation and looking at 
our old setup, I've gotten close (I think) to success but have now been beating 
my head against a wall for a few weeks.

In short, when I boot the node, it seems to come up successfully, at least as 
far as networking and syslogging goes, but I can neither ssh to the node, nor 
can I get a console login.  I'm afraid I'm missing something obvious, but it's 
starting to drive me crazy.  Data to follow:


The node is a Dell C6420, the OS is SciLinux 7.4, and I'm trying for a 
statelite boot.
lsdef on the node:
Object name: node-i01
addkcmdline=bond=bond0:eth0,eth1:mode=4
arch=x86_64
bmc=
bmcpassword=
bmcusername=
cons=ipmi
consoleenabled=1
currstate=statelite SL7-x86_64-compute
groups=all,node-i,c6420
hostnames=node-i01
initrd=xcat/netboot/SL7/x86_64/compute-v1/initrd-statelite.gz
ip=
kcmdline=root=nfs::/export/install/netboot/SL7/x86_64/compute-v1/rootimg:ro STATEMNT=:/state XCAT=!myipfn!:3001 console=tty0 console=ttyS1,115200n8r MNTOPTS=
kernel=xcat/netboot/SL7/x86_64/compute-v1/kernel
mac=
mgt=ipmi
netboot=pxe
nfsserver=
nodetype=osi
os=SL7
postbootscripts=otherpkgs
postscripts=syslog,remoteshell,syncfiles
primarynic=mac
profile=compute
provmethod=SL7.4-compute-v1-201804
serialflow=hard
serialport=1
serialspeed=115200
status=netbooting
statustime=05-15-2018 10:01:01

lsdef on the osimage
Object name: SL7.4-compute-v1-201804
exlist=/opt/xcat/share/xcat/netboot/SL/compute.centos7.exlist
imagetype=linux
osarch=x86_64
osdistroname=SL
osname=Linux
osvers=SL7
otherpkgdir=/install/post/otherpkgs/SL7/x86_64
permission=755
pkgdir=/install/SL7.x/x86_64
pkglist=/opt/xcat/share/xcat/netboot/SL/compute.centos7.pkglist
profile=compute
provmethod=statelite
rootimgdir=/install/netboot/SL7/x86_64/compute-v1

When I boot with rcons running to the node, I get the usual boot data and then:
PXELINUX 4.05 0x581bd748  Copyright (C) 1994-2011 H. Peter Anvin et al
!PXE entry point found (we hope) at 9878:0106 via plan A
UNDI code segment at 9878 len 4A10
UNDI data segment at 90FF len 7790
Getting cached packet  01 02 03
My IP address seems to be 
ip=
BOOTIF=
SYSUUID=
TFTP prefix:
Trying to load: pxelinux.cfg/   ok
Loading xcat/osimage/SL7.4-compute-v1-201804/kernel
Loading xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz
ready.

So that looks good, but then it hangs.

From the iDRAC console itself, I get that data and then a lot of other boot 
data before it hangs.  Useful snippets from that data:
the kernel command line is being passed properly.
console tty0 and ttyS1 are enabled
the console USB Keyboard/Mouse say they are found.
systemd says no hostname, random generator for ID
swap is started
"Started Dispatch Password Requests to Console Directory Watch"
"Starting dracut cmdline hook ..."
"random: crng init done"
the network comes up then ... and is pingable
dns_resolver registered
then hangs, and eventually systemd-journald crashes and restarts.


The log messages below indicate that the node is not getting the correct time 
settings (the time stamps in the logs are an hour off), which may or may not be 
related.

/var/log/messages gives me this:
May 15 09:34:54 master02 xcat[188522]: xCAT: Allowing rpower to node-i01 status for root from localhost
May 15 09:35:03 master02 xcat[188529]: xCAT: Allowing rpower to node-i01 reset for root from localhost
May 15 09:35:03 master02 xcat[188530]: node-i01 status: powering-on statustime: 05-15-2018 09:35:03
May 15 09:35:15 master02 xcat[188560]: xCAT: Allowing tabdump site for root from localhost
May 15 09:35:16 master02 xcat[188570]: xCAT: Allowing tabdump site for root from localhost
May 15 09:35:16 master02 xcat[188591]: xCAT: Allowing nodels to node-i01 nodehm.conserver for root from localhost
May 15 09:36:15 master02 in.tftpd[188658]: RRQ from  filename