Hi Jeff:
 
You can check the following:
 
1. Check that every NFS server exports its directories correctly, including the NFS servers referenced in the `statelite` and `litetree` tables:
    "showmount -e <nfs-server>"
 
2. Use "lsdef -t osimage <your_osimage_name> -i pkglist" to find the path of the pkglist file, then add the missing package names (yum, for example) to that file. After that, run:
      genimage <your osimage name>
      liteimg <your osimage name>
      nodeset <your compute node> osimage=<your osimage name>
      rsetboot <your compute node> net    #if your node is not VM
      rpower <your compute node> reset
 
3. I installed rh7.4, and my dracut packages are the following, but I think you are using a different OS:
dracut-033-502.el7.x86_64
dracut-network-033-502.el7.x86_64
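
The checks in steps 1 and 2 can be scripted on the management node. The following is only a sketch under stated assumptions: the helper names `export_listed` and `missing_pkgs` are made up for illustration and are not part of xCAT, and the directory and package names in the usage comments are placeholders to replace with your own values. `showmount -e` and `rpm --root ... -qa` are the only real commands involved.

```shell
#!/bin/sh
# Hypothetical helpers (not part of xCAT) for checking steps 1 and 2.

# export_listed DIR: read `showmount -e <nfs-server>` output on stdin and
# succeed if DIR appears as an exported directory (first column).
export_listed() {
    awk -v dir="$1" '$1 == dir { found = 1 } END { exit found ? 0 : 1 }'
}

# missing_pkgs PKG...: read installed package names on stdin (one per line)
# and print every requested PKG that is not in that list.
missing_pkgs() {
    installed=$(cat)
    for pkg in "$@"; do
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Example usage (server, directory and package names are placeholders):
#   showmount -e <nfs-server> | export_listed /install || echo "/install is not exported"
#   rpm --root <rootimgdir>/rootimg -qa --qf '%{NAME}\n' | missing_pkgs yum dracut-network
```

If `missing_pkgs` prints anything, those names are candidates to add to the pkglist file before rerunning genimage and liteimg.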
 
Best Regards
--------------------------------------------------
Yuan Bai (白媛)

CSTL HPC System Management Development
Tel:86-10-82451401
E-mail: by...@cn.ibm.com
Address: IBM ZGC Campus. Ring Building 28,
ZhongGuanCun Software Park,No.8 Dong Bei Wang West Road, Haidian District,
Beijing P.R.China 100193

IBM环宇大厦
北京市海淀区东北旺西路8号,中关村软件园28号楼
邮编:100193
 
 
----- Original message -----
From: Jeff Berry <jeff.be...@mrc-cbu.cam.ac.uk>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] [External] Problem with statelite boot - no console or ssh
Date: Fri, May 18, 2018 7:35 PM
 

Some more digging suggests that the image is missing a lot of useful (needed) packages, including yum, and the only installed dracut package is just dracut.x86_64.

 

It looks like maybe the image build didn’t get the needed packages?

 

I haven’t been able to get rd.debug output or get any breakpoints to work – I tried them all. 

 

Thanks for everyone’s time, and sorry if I’m making obvious rookie mistakes ...

 

Jeff Berry

jeff.be...@mrc-cbu.cam.ac.uk

 

 

From: Yuan Y Bai [mailto:by...@cn.ibm.com]
Sent: 18 May 2018 02:19
To: xcat-user@lists.sourceforge.net
Cc: xcat-user@lists.sourceforge.net
Subject: Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

 

 

Did you try all of the pre-mount, mount, and pre-pivot breakpoints, and did every one of them have the problem?

 

 


 

 

----- Original message -----
From: Jeff Berry <jeff.be...@mrc-cbu.cam.ac.uk>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] [External] Problem with statelite boot - no console or ssh
Date: Thu, May 17, 2018 7:22 PM
 

Good afternoon,

 

Thanks for the pointers.

The xCAT version is 2.13.11.

 

As per Gilad’s suggestion, I tried booting to shell and that worked just fine.

 

I then tried your suggestions below with no luck. However, it looks like there is a more fundamental problem. None of the rd.break breakpoints worked, and the node booted to the same point before hanging. This suggests to me that the dracut hooks are not working properly. I’m investigating that more thoroughly.
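
Before concluding that the hooks are broken, it may be worth confirming that the `rd.break` option actually reached the kernel. For comparison, the command line logged in the original report (captured before any breakpoint was added) contains `rd.shell rd.debug` but no `rd.break`. A minimal sketch of such a check follows; the function name `has_karg` is hypothetical.

```shell
#!/bin/sh
# has_karg CMDLINE ARG: succeed if the kernel command line CMDLINE contains
# the option ARG, either bare (rd.break) or with a value (rd.break=cleanup).
has_karg() {
    case " $1 " in
        *" $2 "* | *" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on a booted node or from the dracut shell:
#   has_karg "$(cat /proc/cmdline)" rd.break || echo "rd.break was not passed to the kernel"
```

If the option is missing from `/proc/cmdline`, the problem is more likely in how the boot configuration was generated than in the dracut hooks themselves.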

 

I did want to thank you both for the replies.

 

Jeff Berry

jeff.be...@mrc-cbu.cam.ac.uk

 

 

 

From: Yuan Y Bai [mailto:by...@cn.ibm.com]
Sent: 16 May 2018 06:37
To: xcat-user@lists.sourceforge.net
Cc: xcat-user@lists.sourceforge.net
Subject: Re: [xcat-user] [External] Problem with statelite boot - no console or ssh

 

Hi Jeff Berry

 

I looked at cluster.log, and I suspect your xCAT version is not the latest. Which xCAT version are you running? You can run "lsxcatd -v" to find out.

 

According to the log, your node hangs at "Allowing litetree from node-i01". You can add a breakpoint and get a shell on node-i01 to debug and find more useful information.

Run the following to get into node-i01 through the console:

 

     chdef node-i01 addkcmdline=rd.break=cleanup

     rinstall node-i01 osimage

     rcons node-i01

 

Once you are on node-i01, you can find statelite.log under "/sysroot/.statelite" and manually check whether the mounts are OK, and so on. When you have finished checking, execute "exit"; if the system is fine, it will continue booting into the normal statelite system.
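
One caveat: the node in this thread already has `addkcmdline=bond=bond0:eth0,eth1:mode=4`, and a plain `attr=value` assignment with chdef is understood to replace the attribute's current value, so the bonding option may be lost when the breakpoint is set. That is worth verifying with lsdef. The sketch below shows one way to keep the existing options; `merge_kargs` is a hypothetical helper, not an xCAT command.

```shell
#!/bin/sh
# merge_kargs EXISTING NEW: print EXISTING followed by NEW, separated by one
# space. NEW is not added again if it is already present in EXISTING.
merge_kargs() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;     # already present
        "  ")     printf '%s\n' "$2" ;;     # EXISTING is empty
        *)        printf '%s\n' "$1 $2" ;;  # append
    esac
}

# Example usage on the management node (node name taken from this thread):
#   old=$(lsdef node-i01 -i addkcmdline | sed -n 's/^ *addkcmdline=//p')
#   chdef node-i01 addkcmdline="$(merge_kargs "$old" rd.break=cleanup)"
#
# Then, in the shell on the node, the checks described above would be, e.g.:
#   cat /sysroot/.statelite/statelite.log
#   mount | grep /sysroot
```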

 

 


 

 

----- Original message -----
From: Gilad Berman <gber...@lenovo.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] [External] Problem with statelite boot - no console or ssh
Date: Wed, May 16, 2018 12:02 AM
 

Do you have any issues booting to shell or installing standard image?

 


Gilad Berman
HPC Architect
Lenovo EMEA

Phone: +972-52-2554262
Email: gber...@lenovo.com

 

 


 

 

 

From: Jeff Berry <jeff.be...@mrc-cbu.cam.ac.uk>
Sent: Tuesday, May 15, 2018 2:43 PM
To: xcat-user@lists.sourceforge.net
Subject: [External] [xcat-user] Problem with statelite boot - no console or ssh

 

Good day,

 

I am trying to bring up a new cluster to replace our aging system, and am trying to duplicate our basic setup, which is to say statelite provisioning on Dell PowerEdge servers. After working through the documentation and looking at our old setup, I’ve gotten close (I think) to success but have now been beating my head against a wall for a few weeks.

 

In short, when I boot the node, it seems to come up successfully, at least as far as networking and syslogging goes, but I can neither ssh to the node, nor can I get a console login.  I’m afraid I’m missing something obvious, but it’s starting to drive me crazy.  Data to follow:

 

 

The node is a Dell C6420, the OS is SciLinux 7.4, and I’m trying for a statelite boot.

lsdef on the node:

Object name: node-i01
    addkcmdline=bond=bond0:eth0,eth1:mode=4
    arch=x86_64
    bmc=<redactBMCIP>
    bmcpassword=<redact>
    bmcusername=<redact>
    cons=ipmi
    consoleenabled=1
    currstate=statelite SL7-x86_64-compute
    groups=all,node-i,c6420
    hostnames=node-i01<domain>
    initrd=xcat/netboot/SL7/x86_64/compute-v1/initrd-statelite.gz
    ip=<redactNodeIP>
    kcmdline=root=nfs:<redactMasterIP>:/export/install/netboot/SL7/x86_64/compute-v1/rootimg:ro STATEMNT=<redactMasterIP>:/state XCAT=!myipfn!:3001 console=tty0 console=ttyS1,115200n8r MNTOPTS=
    kernel=xcat/netboot/SL7/x86_64/compute-v1/kernel
    mac=<redact>
    mgt=ipmi
    netboot=pxe
    nfsserver=<redactMasterIP>
    nodetype=osi
    os=SL7
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    primarynic=mac
    profile="">
    provmethod=SL7.4-compute-v1-201804
    serialflow=hard
    serialport=1
    serialspeed=115200
    status=netbooting
    statustime=05-15-2018 10:01:01

 

lsdef on the osimage:

Object name: SL7.4-compute-v1-201804
    exlist=/opt/xcat/share/xcat/netboot/SL/compute.centos7.exlist
    imagetype=linux
    osarch=x86_64
    osdistroname=SL
    osname=Linux
    osvers=SL7
    otherpkgdir=/install/post/otherpkgs/SL7/x86_64
    permission=755
    pkgdir=/install/SL7.x/x86_64
    pkglist=/opt/xcat/share/xcat/netboot/SL/compute.centos7.pkglist
    profile="">
    provmethod=statelite
    rootimgdir=/install/netboot/SL7/x86_64/compute-v1

 

When I boot with rcons running to the node, I get the usual boot data and then:

PXELINUX 4.05 0x581bd748  Copyright (C) 1994-2011 H. Peter Anvin et al
!PXE entry point found (we hope) at 9878:0106 via plan A
UNDI code segment at 9878 len 4A10
UNDI data segment at 90FF len 7790
Getting cached packet  01 02 03
My IP address seems to be <redact>
ip=<redact>
BOOTIF=<redact>
SYSUUID=<redact>
TFTP prefix:
Trying to load: pxelinux.cfg/<redact>                               ok
Loading xcat/osimage/SL7.4-compute-v1-201804/kernel........
Loading xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz................
....................ready.

 

So that looks good, but then it hangs.

 

From the idrac console itself, I get that data and then a lot of other boot data before it hangs.  Useful snippets from that data:

the kernel command line is being passed properly.
console tty0 and ttyS1 are enabled
the console USB Keyboard/Mouse say they are found.
systemd says no hostname, random generator for ID
swap is started
“Started Dispatch Password Requests to Console Directory Watch”
“Starting dracut cmdline hook ...”
“random: crng init done”
the network comes up then ... and is pingable
dns_resolver registered
then hangs, and eventually systemd-journald crashes and restarts.

 

 

The log messages below indicate that the node is not getting the correct time settings (the time stamps in the logs are an hour off), which may or may not be related.

 

/var/log/messages gives me this:

May 15 09:34:54 master02 xcat[188522]: xCAT: Allowing rpower to node-i01 status for root from localhost
May 15 09:35:03 master02 xcat[188529]: xCAT: Allowing rpower to node-i01 reset for root from localhost
May 15 09:35:03 master02 xcat[188530]: node-i01 status: powering-on statustime: 05-15-2018 09:35:03
May 15 09:35:15 master02 xcat[188560]: xCAT: Allowing tabdump site for root from localhost
May 15 09:35:16 master02 xcat[188570]: xCAT: Allowing tabdump site for root from localhost
May 15 09:35:16 master02 xcat[188591]: xCAT: Allowing nodels to node-i01 nodehm.conserver for root from localhost
May 15 09:36:15 master02 in.tftpd[188658]: RRQ from <redactNodeIP> filename xcat/osimage/SL7.4-compute-v1-201804/kernel
May 15 09:36:15 master02 in.tftpd[188658]: Client <redactNodeIP> finished xcat/osimage/SL7.4-compute-v1-201804/kernel
May 15 09:36:15 master02 in.tftpd[188659]: RRQ from <redactNodeIP> filename xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz
May 15 09:36:17 master02 in.tftpd[188659]: Client <redactNodeIP> finished xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz
May 15 08:58:11 node-i01 kernel: [    0.000000] Command line: initrd=xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz root=nfs:<redactMasterIP>:/install/netboot/SL7/x86_64/compute-v1/rootimg:ro STATEMNT=<redactMasterIP>:/state XCAT=<redactMasterIP>:3001 NODE=node-i01 LOGSERVER=<redactMasterIP>syslog.server=<redactMasterIP>syslog.type=rsyslogd syslog.filter=*.* xcatdebugmode=1 console=tty0 console=ttyS1,115200n8r MNTOPTS= bond=bond0:eth0,eth1:mode=4 selinux=0 rd.shell rd.debug BOOT_IMAGE=xcat/osimage/SL7.4-compute-v1-201804/kernel BOOTIF=<redact>
May 15 10:01:01 master02 xcat[140913]: INFO xcatd received a connection request from <redactNodeIP>
May 15 10:01:01 master02 xcat[140913]: node-i01 status: netbooting statustime: 05-15-2018 10:01:01
May 15 09:01:08 node-i01 xcat: ready
May 15 09:01:08 node-i01 xcat: done
May 15 10:08:01 master02 xcat[189323]: xCAT: Allowing litefile from node-i01
May 15 10:08:01 master02 xcat[189325]: xCAT: Allowing litetree from node-i01
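
The one-hour skew in these timestamps can be quantified from two log lines that belong to the same boot step. As an illustration, the sketch below compares the master's `netbooting` entry (10:01:01) with the node's `xcat: ready` entry (09:01:08). Pairing these two lines is an assumption, and the helper names `secs_of_day` and `skew_seconds` are hypothetical.

```shell
#!/bin/sh
# secs_of_day HH:MM:SS: print the number of seconds since midnight.
# awk is used so that fields with leading zeros (e.g. 09) are read as decimal.
secs_of_day() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# skew_seconds A B: print A minus B in seconds (both HH:MM:SS, same day).
skew_seconds() {
    echo $(( $(secs_of_day "$1") - $(secs_of_day "$2") ))
}

# Example usage with the two timestamps from the log above:
#   skew_seconds 10:01:01 09:01:08
```

For these two entries the difference is 3593 seconds, just under one hour. A skew that close to 3600 seconds would be consistent with a time-zone mismatch (for example, one side logging in UTC and the other in local time) rather than clock drift, although the logs alone do not prove that.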

 

/var/log/xcat/cluster.log attached.

 

nmap indicates that port 22 is closed and ssh returns connection refused.

 

It feels like I’m missing something obvious, but (clearly) I don’t know what ... any pointers?

 

Jeff Berry

jeff.be...@mrc-cbu.cam.ac.uk

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
