On Jun 25, 2015, at 4:12 PM, Jarrod Johnson <[email protected]> wrote:
> 1) At the time of running asu, are you trying to run it in band or out of
> band? I ask because I'm curious about the output of 'rinv <node> mac' and if
> that is more helpful. Also if letting them naturally pxe boot and using
> either switch based discovery or 'nodediscover*' commands could be
> interesting to avoid you having to transcribe anything in a particularly
> tedious fashion.
>
>
Out of band.
We've used various node discovery methods, including switch SNMP and grepping
the DHCP lines from /var/log/messages à la Rocks cluster, and I've been
happiest with out-of-band ASU. I collect the whole "show" output for each host,
and then grep the MAC line out. I have to use alternate methods for the other
manufacturers' nodes; it's tedious but not arduous. The switch methods seem OK,
but at various points during the last decade (including the xCAT 1 days) they
broke, and I could always rely on brute force to deploy new nodes. With ASU I
can fire up a loop in the shell and come back when it's done.
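A rough sketch of the "grep the MAC line out" step, assuming each host's "show"
output has already been saved to a file. The function name pick_boot_mac and the
selection rule are assumptions for illustration: it skips the E4:1D:2D addresses
(the ConnectX-3 ports on these two M5 nodes) and takes the lowest remaining
address, which matches the shared 1GbE port in the listings quoted below. It
avoids relying on the PXE.NicPortMacAddress numbering, which is not stable.

```shell
# pick_boot_mac: read saved ASU "show" output on stdin, print one MAC.
# Assumptions (not confirmed by ASU documentation):
#   - OUI E4:1D:2D marks the ConnectX-3 mezzanine ports, as on node855/node856
#   - the lowest remaining address is the shared 1GbE port
pick_boot_mac() {
    tr -d '\r' |
        grep '^PXE\.NicPortMacAddress\.[0-9]*=' |
        cut -d= -f2 |
        tr 'a-f' 'A-F' |
        grep -v '^E4:1D:2D:' |
        LC_ALL=C sort |
        head -n 1
}

# Hypothetical usage, one saved file per node:
#   pick_boot_mac < node856.show
```

On an M4 node, where only the two Intel ports are listed, the same rule simply
returns the first port.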
Here is rinv mac for the two new M5 machines:
[root@mgt1 settings]# rinv node855,node856 mac
[root@mgt1 settings]#
I.E. -- no output --
> 2) If I had to guess, I suspect this is Legacy boot (i.e. no
> /sys/firmware/efi/ directory) and the missing output is after the legacy PCI
> rom output (starting with 'Initializing Legacy USB devices' and before the
> kernel actually starts executing). Basically the output of a legacy pxe
> rom/grub/whathaveyou (though grub can take over the serial of its own
> accord). If true, then rsetboot <node> net -u I think wouldn't be missing
> any output. If this guess is correct, it's due to a limitation of the way we
> were willing to do the default serial console. In the M4 days, you had to
> always explicitly configure remote serial console for everything; in M5 the
> factory default is now to autosense the console and provide the output and
> thus now a lot of folks don't think about it. However one setting is still
> needed to be explicitly set for legacy boot to have full output (during the
> discussion of autoboot there were concerns about potential conflicts in
> legacy boot versus some DOS applications the firmware team did not want to
> enable automatically):
>
> DevicesandIOPorts.Com1ActiveAfterBoot=Enable
>
>
I set this manually using this asu batch script:
set BootOrder.BootOrder "Legacy Only=PXE Network=Hard Disk 0"
set Processors.Hyper-Threading Disable
set DevicesandIOPorts.RemoteConsole Enable
set DevicesandIOPorts.SerialPortSharing Enable
set DevicesandIOPorts.SerialPortAccessMode Dedicated
set DevicesandIOPorts.SPRedirection Disable
set DevicesandIOPorts.Com1TerminalEmulation VT100
set DevicesandIOPorts.Com1ActiveAfterBoot Enable
set DevicesandIOPorts.Com1FlowControl Hardware
set BootModes.OptimizedBoot Disable
I don't remember where the last line came from or why I added it.
SPRedirection might be relevant. Flow control was once a problem (Nehalem?).
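To confirm that a node actually holds these values (Com1ActiveAfterBoot in
particular), something like the following could compare the batch file against
a saved "show" dump. This is only a sketch: check_asu_settings is a made-up
helper name, and it assumes the dump consists of "name=value" lines and that
the quotes around the BootOrder value are not part of the stored value.

```shell
# check_asu_settings BATCH SHOW
#   BATCH: file of "set <name> <value>" lines (as in the batch script above)
#   SHOW:  saved ASU "show" output, assumed to be "<name>=<value>" lines
# Prints each mismatch; returns non-zero if any setting differs or is absent.
check_asu_settings() {
    awk '
        # First file (the show dump): remember the current value per setting.
        FILENAME == ARGV[1] {
            sub(/\r$/, "")
            eq = index($0, "=")
            if (eq > 0) have[substr($0, 1, eq - 1)] = substr($0, eq + 1)
            next
        }
        # Second file (the batch script): compare each "set" line.
        $1 == "set" {
            name = $2
            want = $0
            sub(/\r$/, "", want)
            sub(/^[ \t]*set[ \t]+[^ \t]+[ \t]+/, "", want)
            gsub(/^"|"$/, "", want)   # assumed: quotes only group the value
            if (!(name in have)) {
                printf "%s: missing\n", name
                bad = 1
            } else if (have[name] != want) {
                printf "%s: want \"%s\", have \"%s\"\n", name, want, have[name]
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$2" "$1"
}

# Hypothetical usage:
#   check_asu_settings m5.batch node855.show
```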
Is there a best-recipe for M5 diskless-headless nodes like there was back in
the M2 days? I think there was once one associated with 1350/1410 cluster best
practices.
> 3) Can I see the output of lsinitrd
> /install/netboot/centos6.5/x86_64/comp/initrd-stateless.gz? Particularly if
> it doesn't show tg3, then geninitrd may be needed to add the tg3 driver to
> the initrd. geninitrd is called just like genimage, but does only the initrd
> regeneration.
[root@mgt1 settings]# lsinitrd /install/netboot//centos6.5/x86_64/comp/initrd-stateless.gz | grep tg3
-rw-r--r-- 1 root root 2668 Nov 21 2013 lib/firmware/tigon/tg3.bin
-rw-r--r-- 1 root root 7004 Nov 21 2013 lib/firmware/tigon/tg3_tso.bin
-rw-r--r-- 1 root root 3884 Nov 21 2013 lib/firmware/tigon/tg3_tso5.bin
-rwxr--r-- 1 root root 225896 Apr 20 13:07 lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/net/tg3.ko
[root@mgt1 settings]#
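The same check can be run for every driver named on the genimage line rather
than tg3 alone. The sketch below is illustrative only: missing_drivers is a
hypothetical helper, and it assumes each module's file name matches the driver
name given to genimage -n.

```shell
# missing_drivers LIST: read an lsinitrd listing on stdin and print every
# driver from the comma-separated LIST that has no "/<name>.ko" entry.
# Returns non-zero if at least one driver is missing.
missing_drivers() {
    listing=$(cat)
    rc=0
    for drv in $(printf '%s' "$1" | tr ',' ' '); do
        case "$listing" in
            *"/${drv}.ko"*) ;;                    # module file present
            *) printf '%s\n' "$drv"; rc=1 ;;      # not found in the initrd
        esac
    done
    return "$rc"
}

# Usage with the image and driver list from this thread:
#   lsinitrd /install/netboot/centos6.5/x86_64/comp/initrd-stateless.gz |
#       missing_drivers dca,8021q,igb,bnx2,tg3
```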
Thanks for the help. Should I talk to xcat support? (we have a contract).
-- ddj
> From: David D Johnson [mailto:[email protected]]
> Sent: Thursday, June 25, 2015 3:46 PM
> To: xCAT Users Mailing list
> Subject: Re: [xcat-user] NextScale deployment kernel crash
>
> Well, that got me a shell, and the system got the correct address on eth0.
> Waiting for device with address 40:f2:e9:c5:54:00 to appear..Done
> Acquired IPv4 address on eth0: 172.20.204.56/16
> FATAL: Module ipmi_si not found.
> Dropping to debug shell, exit to check for further action
> [xCAT Genesis running on node855 /]# ***A request variable unknown to the
> server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
>
> One of these messages every second.
>
> So several annoying issues:
> 1) the ASU PXE lines do not show up in any predictable order.
> The second new host has them scrambled, #2 is the shared 1GbE port...
> PXE.NicPortMacAddress.1=E4:1D:2D:73:57:B1
> PXE.NicPortMacAddress.2=40:F2:E9:C5:54:00
> PXE.NicPortMacAddress.3=40:F2:E9:C5:54:01
> PXE.NicPortMacAddress.4=E4:1D:2D:73:57:B2
>
> I have been used to grabbing Address.1 and stuffing the MAC into tabedit mac.
>
> 2) view of Lenovo M5 node booting is very different, some lines do not show
> up in the serial redirected rcons view that do show up with a real KVM
> console.
>
> But the main issue remaining 3) is why the boot fails.
> I just tried again nodeset node855 osimage=centos6.5-x86_64-netboot-comp
> and rebooted...
>
> Trying to unpack rootfs image as initramfs...
> Freeing initrd memory: 20283k freed
> dmar: Unsupported device scope
> audit: initializing netlink socket (disabled)
> type=2000 audit(1435290228.438:1): initialized
> HugeTLB registered 2 MB page size, pre-allocated 0 pages
>
> ...
>
> usb 2-1.1.5: Manufacturer: IBM
> usb 2-1.1.5: configuration #1 chosen from 2 choices
>
>
> dracut Warning: No root device "1" found
>
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
> kernel command line.
>
>
> dracut Warning: Signal caught!
>
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
> kernel command line.
> Kernel panic - not syncing: Attempted to kill init!
> Pid: 1, comm: init Tainted: G --------------- H
> 2.6.32-358.23.2.el6.x86_64 #1
> Call Trace:
> [<ffffffff8150daac>] ? panic+0xa7/0x16f
> [<ffffffff81073be2>] ? do_exit+0x862/0x870
> [<ffffffff81182c85>] ? fput+0x25/0x30
> [<ffffffff81073c48>] ? do_group_exit+0x58/0xd0
> [<ffffffff81073cd7>] ? sys_exit_group+0x17/0x20
> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
>
> On Jun 25, 2015, at 2:04 PM, Jarrod Johnson <[email protected]> wrote:
>
>
> You can nodeset <nodes> shell
>
> That'll get you an environment that should boot in them regardless, complete
> with ssh and all.
>
> From: David D Johnson [mailto:[email protected]]
> Sent: Thursday, June 25, 2015 2:00 PM
> To: xCAT Users Mailing list
> Subject: Re: [xcat-user] NextScale deployment kernel crash
>
> I may have jumped to conclusions about the reason, but in any case the two
> new M5 machines don't boot.
>
> This is the line specifying drivers from our build script:
> ./genimage -i eth0 -n dca,8021q,igb,bnx2,tg3 -o centos6.5 -k
> 2.6.32-358.23.2.el6.x86_64 -p comp
>
>
> As to the ethernet interfaces, from M4 machine the relevant ASU output looks
> like
> PXE.NicPortMacAddress.1=6C:AE:8B:08:94:ED
> PXE.NicPortMacAddress.2=6C:AE:8B:08:94:EE
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.LinkStatus=Connected
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.AlternateMACAddress=6C:AE:8B:08:94:ED
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.LinkSpeed=AutoNeg
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.WakeonLAN=Enabled
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.LinkStatus=Disconnected
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.AlternateMACAddress=6C:AE:8B:08:94:EE
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.LinkSpeed=AutoNeg
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.WakeonLAN=Enabled
>
> On the new M5 machines, there are only two ports on the front (no dedicated
> IMM port), but there are now four PXE Mac lines, #3 is shared port.
> PXE.NicPortMacAddress.1=E4:1D:2D:73:56:01
> PXE.NicPortMacAddress.2=E4:1D:2D:73:56:02
> PXE.NicPortMacAddress.3=40:F2:E9:C5:51:12
> PXE.NicPortMacAddress.4=40:F2:E9:C5:51:13
>
> Now I realize the first two are from the dual port FDR IB mezzanine card
> (ConnectX-3 Pro). They can be used as 10/40/56 GbE, I suppose, but we want to
> use one of them for FDR IB only, and the other one isn't connected to
> anything.
> The other two are Broadcom / tg3.
>
> I wish I could ssh to the machine so I could poke around and see what the
> NICs are called. Maybe I will have to boot off a USB key. No disks in these
> hosts.
>
> -- ddj
>
> On Jun 25, 2015, at 1:33 PM, Jarrod Johnson <[email protected]> wrote:
>
>
>
> What nic driver was built in the initrd? m4 was igb, m5 uses tg3.
>
> " extra unusable Ethernet ports on the motherboard that mess up the interface
> naming. Is there a workaround for this???"
>
> I'm interested in what this means and if I can help on that.
>
> From: David Johnson [mailto:[email protected]]
> Sent: Thursday, June 25, 2015 11:30 AM
> To: xCAT Users Mailing list
> Subject: Re: [xcat-user] NextScale deployment kernel crash
>
> Yes, we are seeing exactly the same problem. 300 nodes from nehalem to
> nextscale m4 all work fine with the same centos 6.5 image, but not so for the
> Lenovo nextscale M5 nodes. They seem to have extra unusable Ethernet
> ports on the motherboard that mess up the interface naming. Is there a
> workaround for this???
>
> -- ddj
> Dave Johnson
>
> On Jun 25, 2015, at 10:49 AM, Damir Krstic <[email protected]> wrote:
>
> We are trying to boot NextScale nodes with our RedHat 6.4 stateless image.
> They are crashing during the initrd boot process with following error:
>
> dracut Warning: No root device "1" found
>
>
>
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
> kernel command line.
>
>
>
> dracut Warning: Signal caught!
>
>
>
>
>
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the
> kernel command line.
>
> Kernel panic - not syncing: Attempted to kill init!
>
> Pid: 1, comm: init Tainted: G --------------- H
> 2.6.32-358.el6.x86_64 #1
>
> Call Trace:
>
> [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
>
> [<ffffffff81073ae2>] ? do_exit+0x862/0x870
>
> [<ffffffff81182885>] ? fput+0x25/0x30
>
> [<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
>
> [<ffffffff81073bd7>] ? sys_exit_group+0x17/0x20
>
> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
>
> ------------[ cut here ]------------
>
> WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60()
> (Tainted: G --------------- H )
>
> Hardware name: IBM NeXtScale nx360 M5: -[5465AC1]-
>
> Modules linked in: sd_mod crc_t10dif ahci mlx4_core [last unloaded:
> scsi_wait_scan]
>
> Pid: 1, comm: init Tainted: G --------------- H
> 2.6.32-358.el6.x86_64 #1
>
> Call Trace:
>
> <IRQ> [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
>
> [<ffffffff8106e33a>] ? warn_slowpath_null+0x1a/0x20
>
> [<ffffffff8102dd9c>] ? native_smp_send_reschedule+0x5c/0x60
>
> [<ffffffff8105ae28>] ? scheduler_tick+0x208/0x260
>
> [<ffffffff810a7fd0>] ? tick_sched_timer+0x0/0xc0
>
> [<ffffffff810811de>] ? update_process_times+0x6e/0x90
>
> [<ffffffff810a8036>] ? tick_sched_timer+0x66/0xc0
>
> [<ffffffff8109b38e>] ? __run_hrtimer+0x8e/0x1a0
>
> [<ffffffff810a182f>] ? ktime_get_update_offsets+0x4f/0xd0
>
> [<ffffffff8107700f>] ? __do_softirq+0x11f/0x1e0
>
> [<ffffffff8109b6f6>] ? hrtimer_interrupt+0xe6/0x260
>
> [<ffffffff81516d7b>] ? smp_apic_timer_interrupt+0x6b/0x9b
>
> [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
>
> <EOI> [<ffffffff8150d06d>] ? panic+0x14c/0x16f
>
> [<ffffffff8150cffa>] ? panic+0xd9/0x16f
>
> [<ffffffff81073ae2>] ? do_exit+0x862/0x870
>
> [<ffffffff81182885>] ? fput+0x25/0x30
>
> [<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
>
> [<ffffffff81073bd7>] ? sys_exit_group+0x17/0x20
>
> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
>
>
>
> Any help would be appreciated.
>
>
>
> Thanks,
>
> Damir
>
> ------------------------------------------------------------------------------
> Monitor 25 network devices or servers for free with OpManager!
> OpManager is web-based network management software that monitors
> network devices and physical & virtual servers, alerts via email & sms
> for fault. Monitor 25 devices for free with no restriction. Download now
> http://ad.doubleclick.net/ddm/clk/292181274;119417398;o
> _______________________________________________
> xCAT-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/xcat-user