Do you have any issues booting to a shell or installing a standard image?
Gilad Berman
HPC Architect
Lenovo EMEA
Phone: +972-52-2554262
Email: gber...@lenovo.com
From: Jeff Berry <jeff.be...@mrc-cbu.cam.ac.uk>
Sent: Tuesday, May 15, 2018 2:43 PM
To: xcat-user@lists.sourceforge.net
Subject: [External] [xcat-user] Problem with statelite boot - no console or ssh
Good day,
I am trying to bring up a new cluster to replace our aging system, and am
trying to duplicate our basic setup - which is to say statelite provisioning on
Dell PowerEdge servers. After working through the documentation and looking at
our old setup, I've gotten close (I think) to success but have now been beating
my head against a wall for a few weeks.
In short, when I boot the node, it seems to come up successfully, at least as
far as networking and syslogging goes, but I can neither ssh to the node, nor
can I get a console login. I'm afraid I'm missing something obvious, but it's
starting to drive me crazy. Data to follow:
The node is a Dell C6420, the OS is Scientific Linux 7.4, and I'm trying for a
statelite boot.
lsdef on the node:
Object name: node-i01
addkcmdline=bond=bond0:eth0,eth1:mode=4
arch=x86_64
bmc=<redactBMCIP>
bmcpassword=<redact>
bmcusername=<redact>
cons=ipmi
consoleenabled=1
currstate=statelite SL7-x86_64-compute
groups=all,node-i,c6420
hostnames=node-i01<domain>
initrd=xcat/netboot/SL7/x86_64/compute-v1/initrd-statelite.gz
ip=<redactNodeIP>
kcmdline=root=nfs:<redactMasterIP>:/export/install/netboot/SL7/x86_64/compute-v1/rootimg:ro
STATEMNT=<redactMasterIP>:/state XCAT=!myipfn!:3001 console=tty0
console=ttyS1,115200n8r MNTOPTS=
kernel=xcat/netboot/SL7/x86_64/compute-v1/kernel
mac=<redact>
mgt=ipmi
netboot=pxe
nfsserver=<redactMasterIP>
nodetype=osi
os=SL7
postbootscripts=otherpkgs
postscripts=syslog,remoteshell,syncfiles
primarynic=mac
profile=compute
provmethod=SL7.4-compute-v1-201804
serialflow=hard
serialport=1
serialspeed=115200
status=netbooting
statustime=05-15-2018 10:01:01
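One cheap sanity check on a definition like the one above is whether the serial console on the kernel command line agrees with the serialport/serialspeed attributes. The following is only a minimal sketch in Python: the helper names (parse_lsdef, serial_consoles, console_matches) are hypothetical and not part of xCAT, the sample text is a shortened copy of the lsdef output with the wrapped kcmdline joined onto one line, and MASTER stands in for the redacted address.

```python
import re

def parse_lsdef(text):
    """Parse 'key=value' lines of lsdef-style output into a dict (hypothetical helper)."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            # Split on the first '=' only, since values such as kcmdline contain '='.
            key, _, value = line.partition("=")
            attrs[key] = value
    return attrs

def serial_consoles(kcmdline):
    """Return (port, speed) for every console=ttyS<N>,<speed>... argument."""
    return [(int(p), int(s))
            for p, s in re.findall(r"console=ttyS(\d+),(\d+)", kcmdline)]

def console_matches(attrs):
    """True if serialport/serialspeed appear as a ttyS console in kcmdline."""
    wanted = (int(attrs["serialport"]), int(attrs["serialspeed"]))
    return wanted in serial_consoles(attrs.get("kcmdline", ""))

# Shortened sample; MASTER is a placeholder for the redacted master IP.
sample = """\
Object name: node-i01
kcmdline=root=nfs:MASTER:/export/install/netboot/SL7/x86_64/compute-v1/rootimg:ro STATEMNT=MASTER:/state XCAT=!myipfn!:3001 console=tty0 console=ttyS1,115200n8r MNTOPTS=
serialport=1
serialspeed=115200
"""

attrs = parse_lsdef(sample)
print(console_matches(attrs))
```

For the values shown here the check passes (ttyS1 at 115200 on both sides), so at least on paper the console arguments and the serial attributes are consistent with each other.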
lsdef on the osimage:
Object name: SL7.4-compute-v1-201804
exlist=/opt/xcat/share/xcat/netboot/SL/compute.centos7.exlist
imagetype=linux
osarch=x86_64
osdistroname=SL
osname=Linux
osvers=SL7
otherpkgdir=/install/post/otherpkgs/SL7/x86_64
permission=755
pkgdir=/install/SL7.x/x86_64
pkglist=/opt/xcat/share/xcat/netboot/SL/compute.centos7.pkglist
profile=compute
provmethod=statelite
rootimgdir=/install/netboot/SL7/x86_64/compute-v1
When I boot with rcons running to the node, I get the usual boot data and then:
PXELINUX 4.05 0x581bd748 Copyright (C) 1994-2011 H. Peter Anvin et al
!PXE entry point found (we hope) at 9878:0106 via plan A
UNDI code segment at 9878 len 4A10
UNDI data segment at 90FF len 7790
Getting cached packet 01 02 03
My IP address seems to be <redact>
ip=<redact>
BOOTIF=<redact>
SYSUUID=<redact>
TFTP prefix:
Trying to load: pxelinux.cfg/<redact> ok
Loading xcat/osimage/SL7.4-compute-v1-201804/kernel........
Loading xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz................
....................ready.
So that looks good, but then it hangs.
From the iDRAC console itself, I get that data and then a lot of other boot
data before it hangs. Useful snippets from that data:
the kernel command line is being passed properly.
console tty0 and ttyS1 are enabled
the console USB Keyboard/Mouse say they are found.
systemd says no hostname, random generator for ID
swap is started
"Started Dispatch Password Requests to Console Directory Watch"
"Starting dracut cmdline hook ..."
"random: crng init done"
the network comes up then ... and is pingable
dns_resolver registered
then hangs, and eventually systemd-journald crashes and restarts.
The log messages below indicate that the node is not getting the correct time
settings (the time stamps in the logs are an hour off), which may or may not be
related.
/var/log/messages gives me this:
May 15 09:34:54 master02 xcat[188522]: xCAT: Allowing rpower to node-i01 status
for root from localhost
May 15 09:35:03 master02 xcat[188529]: xCAT: Allowing rpower to node-i01 reset
for root from localhost
May 15 09:35:03 master02 xcat[188530]: node-i01 status: powering-on statustime:
05-15-2018 09:35:03
May 15 09:35:15 master02 xcat[188560]: xCAT: Allowing tabdump site for root
from localhost
May 15 09:35:16 master02 xcat[188570]: xCAT: Allowing tabdump site for root
from localhost
May 15 09:35:16 master02 xcat[188591]: xCAT: Allowing nodels to node-i01
nodehm.conserver for root from localhost
May 15 09:36:15 master02 in.tftpd[188658]: RRQ from <redactNodeIP> filename
xcat/osimage/SL7.4-compute-v1-201804/kernel
May 15 09:36:15 master02 in.tftpd[188658]: Client <redactNodeIP> finished
xcat/osimage/SL7.4-compute-v1-201804/kernel
May 15 09:36:15 master02 in.tftpd[188659]: RRQ from <redactNodeIP> filename
xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz
May 15 09:36:17 master02 in.tftpd[188659]: Client <redactNodeIP> finished
xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz
May 15 08:58:11 node-i01 kernel: [ 0.000000] Command line:
initrd=xcat/osimage/SL7.4-compute-v1-201804/initrd-statelite.gz
root=nfs:<redactMasterIP>:/install/netboot/SL7/x86_64/compute-v1/rootimg:ro
STATEMNT=<redactMasterIP>:/state XCAT=<redactMasterIP>:3001 NODE=node-i01
LOGSERVER=<redactMasterIP> syslog.server=<redactMasterIP> syslog.type=rsyslogd
syslog.filter=*.* xcatdebugmode=1 console=tty0 console=ttyS1,115200n8r MNTOPTS=
bond=bond0:eth0,eth1:mode=4 selinux=0 rd.shell rd.debug
BOOT_IMAGE=xcat/osimage/SL7.4-compute-v1-201804/kernel BOOTIF=<redact>
May 15 10:01:01 master02 xcat[140913]: INFO xcatd received a connection request
from <redactNodeIP>
May 15 10:01:01 master02 xcat[140913]: node-i01 status: netbooting statustime:
05-15-2018 10:01:01
May 15 09:01:08 node-i01 xcat: ready
May 15 09:01:08 node-i01 xcat: done
May 15 10:08:01 master02 xcat[189323]: xCAT: Allowing litefile from node-i01
May 15 10:08:01 master02 xcat[189325]: xCAT: Allowing litetree from node-i01
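The one-hour offset mentioned before the log can be estimated from two neighbouring entries: the master's "netbooting" status line (10:01:01) and the node's "xcat: ready" line (09:01:08). This is a rough sketch only. The two events are not simultaneous, so the result is approximate, the year is an assumption because syslog stamps omit it, and parse_syslog_time is a hypothetical helper. The %b month parsing also assumes an English or C locale.

```python
from datetime import datetime

def parse_syslog_time(stamp, year=2018):
    """Parse a classic syslog stamp such as 'May 15 10:01:01' (year assumed)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

# Stamps copied from the log excerpt above.
master = parse_syslog_time("May 15 10:01:01")  # master02: status netbooting
node = parse_syslog_time("May 15 09:01:08")    # node-i01: xcat: ready

skew = master - node
print(skew)                                # raw difference between the entries
print(round(skew.total_seconds() / 3600))  # nearest whole hour
```

A difference that rounds to exactly one hour looks more like a timezone or UTC-versus-local-time mismatch on the node than like clock drift, though that is an inference from two log lines and not something the logs prove.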
/var/log/xcat/cluster.log attached.
nmap indicates that port 22 is closed and ssh returns connection refused.
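For anyone wanting to repeat the port check without nmap, here is a small Python sketch that separates "refused" from "no answer at all". The function name probe_tcp and the three result labels are my own for illustration. A refusal normally means the host was reached and answered but nothing is listening on the port, although a firewall that rejects connections can look the same.

```python
import socket

def probe_tcp(host, port=22, timeout=3.0):
    """Classify a TCP port as 'open', 'refused', or 'unreachable'."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no service accepted the connection.
        return "refused"
    except OSError:
        # Timeouts, routing failures, and name-resolution errors end up here.
        return "unreachable"
```

Calling probe_tcp("node-i01") from the master would be expected to return "refused" in the situation described, which would fit sshd never having been started on the node.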
It feels like I'm missing something obvious, but (clearly) I don't know what
... any pointers?
Jeff Berry
jeff.be...@mrc-cbu.cam.ac.uk