1) When you run asu, are you running it in band or out of band?
I ask because I'm curious whether the output of 'rinv <node> mac' would be
more helpful. Also, letting the nodes PXE boot naturally and using either
switch-based discovery or the 'nodediscover*' commands might spare you from
having to transcribe anything by hand.
2) If I had to guess, I suspect this is Legacy boot (i.e. no /sys/firmware/efi/
directory) and the missing output is what comes after the legacy PCI ROM output
(starting with 'Initializing Legacy USB devices' and ending before the kernel
actually starts executing). Basically the output of a legacy PXE ROM/grub/whathaveyou
(though grub can take over the serial of its own accord). If true, then with
'rsetboot <node> net -u' (UEFI mode) I think no output would be missing. If this
guess is correct, it's due to a limitation of the way we were willing to do the
default serial console. In the M4 days, you always had to explicitly configure
remote serial console for everything; in M5 the factory default is now to
autosense the console and provide the output, so a lot of folks no longer think
about it. However, one setting still needs to be set explicitly for legacy
boot to have full output (during the discussion of autoboot there were concerns
about potential conflicts in legacy boot with some DOS applications, so the
firmware team did not want to enable it automatically):
DevicesandIOPorts.Com1ActiveAfterBoot=Enable
3) Can I see the output of lsinitrd
/install/netboot/centos6.5/x86_64/comp/initrd-stateless.gz? In particular, if it
doesn't show tg3, then geninitrd may be needed to add the tg3 driver to the
initrd. geninitrd is called just like genimage, but only regenerates the
initrd.
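
For the check in 3), here is a rough Python sketch of what I'd look for in the lsinitrd listing. It assumes each listing line ends with a file path and that driver modules appear as <name>.ko (optionally compressed). The function name missing_drivers and the sample listing are made up for illustration; they are not part of xCAT, and the sample is not real output from your image.

```python
import re

def missing_drivers(lsinitrd_listing, wanted):
    """Return the names in `wanted` that have no matching .ko file in the listing."""
    present = set()
    for line in lsinitrd_listing.splitlines():
        # Take the module file name at the end of the line,
        # e.g. ".../drivers/net/tg3.ko" or ".../tg3.ko.gz" gives "tg3".
        m = re.search(r"([\w-]+)\.ko(?:\.\w+)?\s*$", line)
        if m:
            present.add(m.group(1))
    return [name for name in wanted if name not in present]

# Hypothetical listing lines, for illustration only.
SAMPLE = """\
-rw-r--r-- 1 root root 0 Jan 1 00:00 lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/net/igb/igb.ko
-rw-r--r-- 1 root root 0 Jan 1 00:00 lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/net/bnx2.ko
"""

if __name__ == "__main__":
    # Drivers taken from the -n list of the genimage call quoted in this thread.
    print(missing_drivers(SAMPLE, ["igb", "bnx2", "tg3"]))
```

If tg3 comes back as missing for the real listing, that would point at regenerating the initrd with the driver included.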
From: David D Johnson [mailto:[email protected]]
Sent: Thursday, June 25, 2015 3:46 PM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] NextScale deployment kernel crash
Well, that got me a shell, and the system got the correct address on eth0.
Waiting for device with address 40:f2:e9:c5:54:00 to appear..Done
Acquired IPv4 address on eth0: 172.20.204.56/16
FATAL: Module ipmi_si not found.
Dropping to debug shell, exit to check for further action
[xCAT Genesis running on node855 /]# ***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
***A request variable unknown to the server
One of these messages every second.
So several annoying issues:
1) the ASU PXE lines do not show up in any predictable order.
The second new host has them scrambled, #2 is the shared 1GbE port...
PXE.NicPortMacAddress.1=E4:1D:2D:73:57:B1
PXE.NicPortMacAddress.2=40:F2:E9:C5:54:00
PXE.NicPortMacAddress.3=40:F2:E9:C5:54:01
PXE.NicPortMacAddress.4=E4:1D:2D:73:57:B2
I have been used to grabbing Address.1 and stuffing the MAC into tabedit mac.
2) view of Lenovo M5 node booting is very different, some lines do not show
up in the serial redirected rcons view that do show up with a real KVM console.
3) But the main remaining issue is why the boot fails.
I just tried again with nodeset node855 osimage=centos6.5-x86_64-netboot-comp
and rebooted...
Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 20283k freed
dmar: Unsupported device scope
audit: initializing netlink socket (disabled)
type=2000 audit(1435290228.438:1): initialized
HugeTLB registered 2 MB page size, pre-allocated 0 pages
...
usb 2-1.1.5: Manufacturer: IBM
usb 2-1.1.5: configuration #1 chosen from 2 choices
dracut Warning: No root device "1" found
dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
kernel command line.
dracut Warning: Signal caught!
dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
kernel command line.
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Tainted: G --------------- H
2.6.32-358.23.2.el6.x86_64 #1
Call Trace:
[<ffffffff8150daac>] ? panic+0xa7/0x16f
[<ffffffff81073be2>] ? do_exit+0x862/0x870
[<ffffffff81182c85>] ? fput+0x25/0x30
[<ffffffff81073c48>] ? do_group_exit+0x58/0xd0
[<ffffffff81073cd7>] ? sys_exit_group+0x17/0x20
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
On Jun 25, 2015, at 2:04 PM, Jarrod Johnson
<[email protected]<mailto:[email protected]>> wrote:
You can nodeset <nodes> shell
That'll get you an environment that should boot in them regardless, complete
with ssh and all.
From: David D Johnson [mailto:[email protected]]
Sent: Thursday, June 25, 2015 2:00 PM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] NextScale deployment kernel crash
I may have jumped to conclusions about the reason, but in any case the two new
M5 machines don't boot.
This is the line specifying drivers from our build script:
./genimage -i eth0 -n dca,8021q,igb,bnx2,tg3 -o centos6.5 -k
2.6.32-358.23.2.el6.x86_64 -p comp
As to the ethernet interfaces, from M4 machine the relevant ASU output looks
like
PXE.NicPortMacAddress.1=6C:AE:8B:08:94:ED
PXE.NicPortMacAddress.2=6C:AE:8B:08:94:EE
IntelRI350GigabitNetworkConnection-6CAE8B0894ED.LinkStatus=Connected
IntelRI350GigabitNetworkConnection-6CAE8B0894ED.AlternateMACAddress=6C:AE:8B:08:94:ED
IntelRI350GigabitNetworkConnection-6CAE8B0894ED.LinkSpeed=AutoNeg
IntelRI350GigabitNetworkConnection-6CAE8B0894ED.WakeonLAN=Enabled
IntelRI350GigabitNetworkConnection-6CAE8B0894EE.LinkStatus=Disconnected
IntelRI350GigabitNetworkConnection-6CAE8B0894EE.AlternateMACAddress=6C:AE:8B:08:94:EE
IntelRI350GigabitNetworkConnection-6CAE8B0894EE.LinkSpeed=AutoNeg
IntelRI350GigabitNetworkConnection-6CAE8B0894EE.WakeonLAN=Enabled
On the new M5 machines, there are only two ports on the front (no dedicated IMM
port), but there are now four PXE Mac lines, #3 is shared port.
PXE.NicPortMacAddress.1=E4:1D:2D:73:56:01
PXE.NicPortMacAddress.2=E4:1D:2D:73:56:02
PXE.NicPortMacAddress.3=40:F2:E9:C5:51:12
PXE.NicPortMacAddress.4=40:F2:E9:C5:51:13
Now I realize the first two are from the dual port FDR IB mezzanine card
(ConnectX-3 Pro). They can be used as 10/40/56 GbE, I suppose, but we want to
use one of them for FDR IB only, and the other one isn't connected to anything.
The other two are Broadcom (tg3).
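
Since the position of a PXE.NicPortMacAddress line says nothing reliable about which NIC it is, one option is to select by MAC prefix instead of by index. The Python sketch below assumes the ASU output is available as text and that the onboard ports can be recognized by their first three bytes (40:F2:E9 on the M5 nodes quoted in this thread). That prefix, and the helper names pxe_macs and macs_with_prefix, are assumptions for illustration and not anything ASU or xCAT provides.

```python
def pxe_macs(asu_output):
    """Map port index -> MAC from 'PXE.NicPortMacAddress.N=MAC' lines."""
    macs = {}
    for line in asu_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or not key.startswith("PXE.NicPortMacAddress."):
            continue
        index = key.rsplit(".", 1)[1]
        if index.isdigit():
            macs[int(index)] = value.strip().upper()
    return macs

def macs_with_prefix(asu_output, prefix):
    """Return the MACs that start with `prefix`, lowest address first."""
    prefix = prefix.upper()
    return sorted(m for m in pxe_macs(asu_output).values() if m.startswith(prefix))

# The four PXE lines from the M5 node quoted above.
M5_SAMPLE = """\
PXE.NicPortMacAddress.1=E4:1D:2D:73:56:01
PXE.NicPortMacAddress.2=E4:1D:2D:73:56:02
PXE.NicPortMacAddress.3=40:F2:E9:C5:51:12
PXE.NicPortMacAddress.4=40:F2:E9:C5:51:13
"""

if __name__ == "__main__":
    # Assumed prefix of the onboard tg3 ports; verify it on your own hardware.
    print(macs_with_prefix(M5_SAMPLE, "40:F2:E9"))
```

In the M5 samples in this thread the shared port is the lower of the two 40:F2:E9 addresses, so the first element would be the one to put in the mac table. Whether that holds for every node is something to check rather than rely on.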
I wish I could ssh to the machine so I could poke around and see what the NICs
are called. Maybe I will have to boot off a USB key. No disks in these hosts.
-- ddj
On Jun 25, 2015, at 1:33 PM, Jarrod Johnson
<[email protected]<mailto:[email protected]>> wrote:
What nic driver was built in the initrd? m4 was igb, m5 uses tg3.
" extra unusable Ethernet ports on the motherboard that mess up the interface
naming. Is there a workaround for this???"
I'm interested in what this means and if I can help on that.
From: David Johnson [mailto:[email protected]]
Sent: Thursday, June 25, 2015 11:30 AM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] NextScale deployment kernel crash
Yes, we are seeing exactly the same problem. 300 nodes from Nehalem to
NextScale M4 all work fine with the same CentOS 6.5 image, but not so for
the Lenovo NextScale M5 nodes. They seem to have extra unusable Ethernet ports
on the motherboard that mess up the interface naming. Is there a workaround for
this?
-- ddj
Dave Johnson
On Jun 25, 2015, at 10:49 AM, Damir Krstic
<[email protected]<mailto:[email protected]>> wrote:
We are trying to boot NextScale nodes with our RedHat 6.4 stateless image. They
are crashing during the initrd boot process with the following error:
dracut Warning: No root device "1" found
dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
kernel command line.
dracut Warning: Signal caught!
dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
kernel command line.
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Tainted: G --------------- H
2.6.32-358.el6.x86_64 #1
Call Trace:
[<ffffffff8150cfc8>] ? panic+0xa7/0x16f
[<ffffffff81073ae2>] ? do_exit+0x862/0x870
[<ffffffff81182885>] ? fput+0x25/0x30
[<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
[<ffffffff81073bd7>] ? sys_exit_group+0x17/0x20
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60()
(Tainted: G --------------- H )
Hardware name: IBM NeXtScale nx360 M5: -[5465AC1]-
Modules linked in: sd_mod crc_t10dif ahci mlx4_core [last unloaded:
scsi_wait_scan]
Pid: 1, comm: init Tainted: G --------------- H
2.6.32-358.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106e33a>] ? warn_slowpath_null+0x1a/0x20
[<ffffffff8102dd9c>] ? native_smp_send_reschedule+0x5c/0x60
[<ffffffff8105ae28>] ? scheduler_tick+0x208/0x260
[<ffffffff810a7fd0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff810811de>] ? update_process_times+0x6e/0x90
[<ffffffff810a8036>] ? tick_sched_timer+0x66/0xc0
[<ffffffff8109b38e>] ? __run_hrtimer+0x8e/0x1a0
[<ffffffff810a182f>] ? ktime_get_update_offsets+0x4f/0xd0
[<ffffffff8107700f>] ? __do_softirq+0x11f/0x1e0
[<ffffffff8109b6f6>] ? hrtimer_interrupt+0xe6/0x260
[<ffffffff81516d7b>] ? smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff8150d06d>] ? panic+0x14c/0x16f
[<ffffffff8150cffa>] ? panic+0xd9/0x16f
[<ffffffff81073ae2>] ? do_exit+0x862/0x870
[<ffffffff81182885>] ? fput+0x25/0x30
[<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
[<ffffffff81073bd7>] ? sys_exit_group+0x17/0x20
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Any help would be appreciated.
Thanks,
Damir
------------------------------------------------------------------------------
Monitor 25 network devices or servers for free with OpManager!
OpManager is web-based network management software that monitors
network devices and physical & virtual servers, alerts via email & sms
for fault. Monitor 25 devices for free with no restriction. Download now
http://ad.doubleclick.net/ddm/clk/292181274;119417398;o
_______________________________________________
xCAT-user mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/xcat-user