On Jun 25, 2015, at 4:12 PM, Jarrod Johnson <[email protected]> wrote:

> 1) At the time of running asu, are you trying to run it in band or out of 
> band?  I ask because I'm curious about the output of 'rinv <node> mac' and 
> whether that is more helpful.  Also, letting them naturally PXE boot and 
> using either switch-based discovery or the 'nodediscover*' commands could be 
> interesting, to avoid you having to transcribe anything in a particularly 
> tedious fashion.
>  
> 
Out of band.
We've used various node discovery methods, including switch SNMP and grepping 
the DHCP lines from /var/log/messages à la Rocks cluster, and I've been 
happiest with out-of-band ASU.  I collect the whole "show" output for each 
host, and then grep the MAC line out.  I have to use alternate methods for the 
other manufacturers' nodes; it's tedious but not arduous.  The switch methods 
seem OK, but at various points during the last decade (including the xCAT I 
days) there have been times when they broke, and I could always rely on brute 
force to deploy new nodes.  With ASU I can fire up a loop in the shell, and 
come back when it's done.
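
Since the PXE lines do not come back in a predictable order, one way to do the 
grep step is to select the MAC by vendor prefix rather than by position.  A 
minimal sketch, assuming the "show" output has already been saved to one file 
per host; the function name pick_mac and the file naming are hypothetical, and 
the 40:F2:E9 prefix is simply the one the onboard ports show in the listings 
quoted below:

```shell
#!/bin/sh
# pick_mac PREFIX
# Read saved ASU "show" output on stdin and print the first
# PXE.NicPortMacAddress value whose MAC starts with PREFIX
# (compared case-insensitively). Prints nothing if no line matches.
pick_mac() {
    prefix=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    awk -F= -v p="$prefix" '
        /^PXE\.NicPortMacAddress\.[0-9]+=/ {
            mac = toupper($2)
            if (index(mac, p) == 1) { print mac; exit }
        }'
}

# Hypothetical usage, one saved dump per node:
#   for n in node855 node856; do
#       printf "%s %s\n" "$n" "$(pick_mac 40:F2:E9 < "$n.show")"
#   done
```

In both listings quoted in this thread, the first address with that prefix is 
the shared port, but that ordering is an observation from two hosts, not a 
guarantee.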

Here is rinv mac for the two new M5 machines:
[root@mgt1 settings]# rinv node855,node856 mac
[root@mgt1 settings]# 

I.e., no output.



> 2) If I had to guess, I suspect this is Legacy boot (i.e. no 
> /sys/firmware/efi/ directory) and the missing output is after the legacy PCI 
> rom output (starting with 'Initializing Legacy USB devices' and before the 
> kernel actually starts executing).  Basically the output of a legacy pxe 
> rom/grub/whathaveyou (though grub can take over the serial of its own 
> accord).  If true, then rsetboot <node> net -u I think wouldn't be missing 
> any output.  If this guess is correct, it's due to a limitation of the way we 
> were willing to do the default serial console.  In the M4 days, you always 
> had to explicitly configure remote serial console for everything; in M5 the 
> factory default is now to autosense the console and provide the output, so 
> a lot of folks no longer think about it.  However, one setting still needs 
> to be set explicitly for legacy boot to have full output (during the 
> discussion of autoboot there were concerns about potential conflicts in 
> legacy boot versus some DOS applications, so the firmware team did not want 
> to enable it automatically):
>  
> DevicesandIOPorts.Com1ActiveAfterBoot=Enable
>  
> 
I set this manually using this asu batch script:
set BootOrder.BootOrder "Legacy Only=PXE Network=Hard Disk 0"
set Processors.Hyper-Threading Disable
set DevicesandIOPorts.RemoteConsole Enable
set DevicesandIOPorts.SerialPortSharing Enable
set DevicesandIOPorts.SerialPortAccessMode Dedicated
set DevicesandIOPorts.SPRedirection Disable
set DevicesandIOPorts.Com1TerminalEmulation VT100
set DevicesandIOPorts.Com1ActiveAfterBoot Enable
set DevicesandIOPorts.Com1FlowControl Hardware
set BootModes.OptimizedBoot Disable

I don't remember where the last line came from or why I did it.
SPRedirection might be relevant.  Flow control was once a problem (Nehalem?).
Is there a best-practice recipe for M5 diskless, headless nodes like there was 
back in the M2 days?  I think there was once one associated with the 1350/1410 
cluster best practices.
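
To confirm that the batch script actually took effect on a node, the "set" 
lines can be compared against a saved "show" dump.  A rough sketch, assuming 
the dump consists of plain NAME=VALUE lines like the PXE lines quoted below; 
check_settings and both file arguments are hypothetical names, not part of 
ASU:

```shell
#!/bin/sh
# check_settings BATCH_FILE SHOW_FILE
# For every "set NAME VALUE" line in BATCH_FILE, print a line if NAME is
# absent from SHOW_FILE or its value there differs. Prints nothing when
# everything matches. Assumes SHOW_FILE holds NAME=VALUE lines.
check_settings() {
    awk '
        mode == "show" {
            # Split on the first "=" only; values may contain "=" themselves.
            i = index($0, "=")
            if (i > 0) have[substr($0, 1, i - 1)] = substr($0, i + 1)
            next
        }
        $1 == "set" {
            name = $2
            want = $0
            sub(/^[ \t]*set[ \t]+[^ \t]+[ \t]+/, "", want)  # keep only the value
            gsub(/^"|"$/, "", want)                          # drop surrounding quotes
            if (!(name in have)) {
                print name ": not in show output"
            } else if (have[name] != want) {
                print name ": want " want ", have " have[name]
            }
        }' mode=show "$2" mode=batch "$1"
}

# Hypothetical usage:
#   check_settings m5.batch node855.show
```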


> 3) Can I see the output of lsinitrd 
> /install/netboot/centos6.5/x86_64/comp/initrd-stateless.gz?  Particularly if 
> it doesn't show tg3, then geninitrd may be needed to add the tg3 driver to 
> the initrd.  geninitrd is called just like genimage, but does only the initrd 
> regeneration.
[root@mgt1 settings]# lsinitrd /install/netboot//centos6.5/x86_64/comp/initrd-stateless.gz | grep tg3
-rw-r--r--   1 root     root         2668 Nov 21  2013 lib/firmware/tigon/tg3.bin
-rw-r--r--   1 root     root         7004 Nov 21  2013 lib/firmware/tigon/tg3_tso.bin
-rw-r--r--   1 root     root         3884 Nov 21  2013 lib/firmware/tigon/tg3_tso5.bin
-rwxr--r--   1 root     root       225896 Apr 20 13:07 lib/modules/2.6.32-358.23.2.el6.x86_64/kernel/drivers/net/tg3.ko
[root@mgt1 settings]# 
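
The same check can be run for every driver passed to genimage -n rather than 
just tg3.  A small sketch that reads the lsinitrd listing on stdin; 
missing_drivers is a hypothetical helper, and it assumes each module appears 
in the listing as a path ending in NAME.ko (possibly with a compression 
suffix):

```shell
#!/bin/sh
# missing_drivers LIST
# Read an lsinitrd listing on stdin and print each driver from the
# comma-separated LIST that has no /NAME.ko path in the listing.
# Firmware files such as tg3.bin do not count as a match.
missing_drivers() {
    listing=$(cat)
    for drv in $(printf '%s' "$1" | tr ',' ' '); do
        printf '%s\n' "$listing" | grep -q "/$drv\.ko" || printf '%s\n' "$drv"
    done
}

# Usage with the driver list from the genimage line quoted below:
#   lsinitrd /install/netboot/centos6.5/x86_64/comp/initrd-stateless.gz \
#       | missing_drivers dca,8021q,igb,bnx2,tg3
```

Empty output would mean every listed driver is present in the initrd.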

Thanks for the help.  Should I talk to xCAT support?  (We have a contract.)

 -- ddj

> From: David D Johnson [mailto:[email protected]] 
> Sent: Thursday, June 25, 2015 3:46 PM
> To: xCAT Users Mailing list
> Subject: Re: [xcat-user] NextScale deployment kernel crash
>  
> Well, that got me a shell, and the system got the correct address on eth0.
> Waiting for device with address 40:f2:e9:c5:54:00 to appear..Done
> Acquired IPv4 address on eth0: 172.20.204.56/16
> FATAL: Module ipmi_si not found.
> Dropping to debug shell, exit to check for further action
> [xCAT Genesis running on node855 /]# ***A request variable unknown to the 
> server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
> ***A request variable unknown to the server
>  
> One of these messages every second.
>  
> So several annoying issues: 
> 1) the ASU PXE lines do not show up in any predictable order.
> The second new host has them scrambled, #2 is the shared 1GbE port...
> PXE.NicPortMacAddress.1=E4:1D:2D:73:57:B1
> PXE.NicPortMacAddress.2=40:F2:E9:C5:54:00
> PXE.NicPortMacAddress.3=40:F2:E9:C5:54:01
> PXE.NicPortMacAddress.4=E4:1D:2D:73:57:B2
>  
> I have been used to grabbing Address.1 and stuffing the MAC into tabedit mac.
>  
> 2) view of Lenovo M5 node booting is very different, some lines do not show
> up in the serial redirected rcons view that do show up with a real KVM 
> console.
>  
> But the main issue remaining 3) is why the boot fails.
> I just tried again nodeset node855 osimage=centos6.5-x86_64-netboot-comp
> and rebooted...
>  
> Trying to unpack rootfs image as initramfs...
> Freeing initrd memory: 20283k freed
> dmar: Unsupported device scope
> audit: initializing netlink socket (disabled)
> type=2000 audit(1435290228.438:1): initialized
> HugeTLB registered 2 MB page size, pre-allocated 0 pages
>  
> ...
>  
> usb 2-1.1.5: Manufacturer: IBM
> usb 2-1.1.5: configuration #1 chosen from 2 choices
>  
>  
> dracut Warning: No root device "1" found
>  
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the 
> kernel command line.
>  
>  
> dracut Warning: Signal caught!
>  
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the 
> kernel command line.
> Kernel panic - not syncing: Attempted to kill init!
> Pid: 1, comm: init Tainted: G           --------------- H  
> 2.6.32-358.23.2.el6.x86_64 #1
> Call Trace:
>  [<ffffffff8150daac>] ? panic+0xa7/0x16f
>  [<ffffffff81073be2>] ? do_exit+0x862/0x870
>  [<ffffffff81182c85>] ? fput+0x25/0x30
>  [<ffffffff81073c48>] ? do_group_exit+0x58/0xd0
>  [<ffffffff81073cd7>] ? sys_exit_group+0x17/0x20
>  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
>  
> On Jun 25, 2015, at 2:04 PM, Jarrod Johnson <[email protected]> wrote:
> 
> 
> You can nodeset <nodes> shell
>  
> That'll get you an environment that should boot in them regardless, complete 
> with ssh and all.
>  
> From: David D Johnson [mailto:[email protected]] 
> Sent: Thursday, June 25, 2015 2:00 PM
> To: xCAT Users Mailing list
> Subject: Re: [xcat-user] NextScale deployment kernel crash
>  
> I may have jumped to conclusions about the reason, but in any case the two 
> new M5 machines don't boot.
>  
> This is the line specifying drivers from our build script:
> ./genimage -i eth0 -n dca,8021q,igb,bnx2,tg3 -o centos6.5 -k 
> 2.6.32-358.23.2.el6.x86_64 -p comp
>  
>  
> As to the ethernet interfaces, from M4 machine the relevant ASU output looks 
> like
> PXE.NicPortMacAddress.1=6C:AE:8B:08:94:ED
> PXE.NicPortMacAddress.2=6C:AE:8B:08:94:EE
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.LinkStatus=Connected
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.AlternateMACAddress=6C:AE:8B:08:94:ED
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.LinkSpeed=AutoNeg
> IntelRI350GigabitNetworkConnection-6CAE8B0894ED.WakeonLAN=Enabled
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.LinkStatus=Disconnected
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.AlternateMACAddress=6C:AE:8B:08:94:EE
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.LinkSpeed=AutoNeg
> IntelRI350GigabitNetworkConnection-6CAE8B0894EE.WakeonLAN=Enabled
>  
> On the new M5 machines, there are only two ports on the front (no dedicated 
> IMM port), but there are now four PXE Mac lines, #3 is shared port.
> PXE.NicPortMacAddress.1=E4:1D:2D:73:56:01
> PXE.NicPortMacAddress.2=E4:1D:2D:73:56:02
> PXE.NicPortMacAddress.3=40:F2:E9:C5:51:12
> PXE.NicPortMacAddress.4=40:F2:E9:C5:51:13
>  
> Now I realize the first two are from the dual port FDR IB mezzanine card 
> (ConnectX-3 Pro). They can be used as 10/40/56 GbE, I suppose, but we want to 
> use one of them for FDR IB only, and the other one isn't connected to 
> anything.
> The other two are Broadcom (tg3).
>  
> I wish I could ssh to the machine so I could poke around and see what the 
> NICs are called. Maybe I will have to boot off a USB key.  No disks in these 
> hosts.
>  
>  -- ddj
>  
> On Jun 25, 2015, at 1:33 PM, Jarrod Johnson <[email protected]> wrote:
> 
> 
> 
> What nic driver was built in the initrd?  m4 was igb, m5 uses tg3.
>  
> " extra unusable Ethernet ports on the motherboard that mess up the interface 
> naming. Is there a workaround for this???"
>  
> I'm interested in what this means and if I can help on that.
>  
> From: David Johnson [mailto:[email protected]] 
> Sent: Thursday, June 25, 2015 11:30 AM
> To: xCAT Users Mailing list
> Subject: Re: [xcat-user] NextScale deployment kernel crash
>  
> Yes, we are seeing exactly the same problem. 300 nodes from Nehalem to 
> NextScale M4 all work fine with the same CentOS 6.5 image, but not so for 
> the Lenovo NextScale M5 nodes. They seem to have extra unusable Ethernet 
> ports on the motherboard that mess up the interface naming. Is there a 
> workaround for this?
> 
>   -- ddj
> Dave Johnson
> 
> On Jun 25, 2015, at 10:49 AM, Damir Krstic <[email protected]> wrote:
> 
> We are trying to boot NextScale nodes with our RedHat 6.4 stateless image. 
> They are crashing during the initrd boot process with the following error:
>  
> dracut Warning: No root device "1" found
>  
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the 
> kernel command line.
>  
> dracut Warning: Signal caught!
>  
> dracut Warning: Boot has failed. To debug this issue add "rdshell" to the 
> kernel command line.
> Kernel panic - not syncing: Attempted to kill init!
> Pid: 1, comm: init Tainted: G           --------------- H  
> 2.6.32-358.el6.x86_64 #1
> Call Trace:
>  [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
>  [<ffffffff81073ae2>] ? do_exit+0x862/0x870
>  [<ffffffff81182885>] ? fput+0x25/0x30
>  [<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
>  [<ffffffff81073bd7>] ? sys_exit_group+0x17/0x20
>  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
> ------------[ cut here ]------------
> WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60() 
> (Tainted: G           --------------- H )
> Hardware name: IBM NeXtScale nx360 M5: -[5465AC1]-
> Modules linked in: sd_mod crc_t10dif ahci mlx4_core [last unloaded: 
> scsi_wait_scan]
> Pid: 1, comm: init Tainted: G           --------------- H  
> 2.6.32-358.el6.x86_64 #1
> Call Trace:
>  <IRQ>  [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
>  [<ffffffff8106e33a>] ? warn_slowpath_null+0x1a/0x20
>  [<ffffffff8102dd9c>] ? native_smp_send_reschedule+0x5c/0x60
>  [<ffffffff8105ae28>] ? scheduler_tick+0x208/0x260
>  [<ffffffff810a7fd0>] ? tick_sched_timer+0x0/0xc0
>  [<ffffffff810811de>] ? update_process_times+0x6e/0x90
>  [<ffffffff810a8036>] ? tick_sched_timer+0x66/0xc0
>  [<ffffffff8109b38e>] ? __run_hrtimer+0x8e/0x1a0
>  [<ffffffff810a182f>] ? ktime_get_update_offsets+0x4f/0xd0
>  [<ffffffff8107700f>] ? __do_softirq+0x11f/0x1e0
>  [<ffffffff8109b6f6>] ? hrtimer_interrupt+0xe6/0x260
>  [<ffffffff81516d7b>] ? smp_apic_timer_interrupt+0x6b/0x9b
>  [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
>  <EOI>  [<ffffffff8150d06d>] ? panic+0x14c/0x16f
>  [<ffffffff8150cffa>] ? panic+0xd9/0x16f
>  [<ffffffff81073ae2>] ? do_exit+0x862/0x870
>  [<ffffffff81182885>] ? fput+0x25/0x30
>  [<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
>  [<ffffffff81073bd7>] ? sys_exit_group+0x17/0x20
>  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
>  
> Any help would be appreciated.
>  
> Thanks,
> Damir
> 
> ------------------------------------------------------------------------------
> Monitor 25 network devices or servers for free with OpManager!
> OpManager is web-based network management software that monitors 
> network devices and physical & virtual servers, alerts via email & sms 
> for fault. Monitor 25 devices for free with no restriction. Download now
> http://ad.doubleclick.net/ddm/clk/292181274;119417398;o
> _______________________________________________
> xCAT-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/xcat-user
