Hi,

I do not know much about Supermicro machines.
If you have a serial console configured, do you get any output on it (rcons gn001)?
If so, do you see BIOS messages but no operating system boot messages?

On some of our server models we found that the Linux boot process would hang if
hardware flow control was configured in xCAT, so we turned it off
(we cleared the xCAT nodehm.serialflow setting for those nodes).
Maybe that is worth a try?
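
In case it helps, here is a rough sketch of how that change could be applied to gn001. It assumes that an empty value clears the attribute and that the boot configuration has to be regenerated with nodeset afterwards; please check both against your xCAT version. The `run` wrapper and the `disable_serial_flow` function are just illustrative helpers, and by default they only print the commands.

```shell
# Sketch only: prints the commands by default; set DRY_RUN=0 to really run
# them on the management node.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: clear hardware flow control for one node and reboot it.
disable_serial_flow() {
  node=$1 osimage=$2
  run lsdef "$node" -i serialflow,serialport,serialspeed  # show current settings
  run chdef "$node" serialflow=                           # assumed: empty value clears nodehm.serialflow
  run nodeset "$node" "osimage=$osimage"                  # regenerate the node's boot config
  run rpower "$node" boot                                 # reboot with the new config
}
```

For example, `disable_serial_flow gn001 centos7.6-x86_64-netboot-gpu-compute` prints the four commands so they can be reviewed before anything is changed.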


Regards,

David


From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
Sent: Saturday, March 30, 2019 11:54 AM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] [External] Netboot process stuck

Hi Song,

The status has changed to "netbooting" (see below), but the boot process
still hangs.

# lsdef -t node gn001
Object name: gn001
   addkcmdline=debug ignore_loglevel
   arch=x86_64
   bmc=10.2.2.1
   bmcpassword=admin
   bmcusername=admin
   cons=ipmi
   consoleenabled=1
   currchain=boot
   currstate=netboot centos7.6-x86_64-compute
   groups=all,compute
   ip=10.1.2.1
   mac=00:25:90:6c:a8:a2
   mgt=ipmi
   netboot=pxe
   nicips.ib0=10.3.2.1
   nicnetworks.ib0=10_3_0_0-255_255_0_0
   nictypes.ib0=Infiniband
   os=centos7.6
   postbootscripts=otherpkgs
   postscripts=syslog,remoteshell,syncfiles
   primarynic=mac
   profile=compute
   provmethod=centos7.6-x86_64-netboot-gpu-compute
   serialflow=hard
   serialport=1
   serialspeed=115200
   status=netbooting
   statustime=03-19-2019 11:51:04
   updatestatus=failed
   updatestatustime=03-13-2019 09:35:38


The log file is attached.

Regards,
--
Angelo Cavalcanti
br.linkedin.com/in/angelocr


On Wed, Mar 20, 2019 at 07:34, Song BJ Yang <yang...@cn.ibm.com> wrote:
Hi Angelo,

According to the xcatprobe output, the rootfs tarball was downloaded and
extracted. It is strange that the status is still `powering-off` instead of
`netbooting`, since we can see that the node reported its status to the MN from
inside dracut. Would you please provide the log file `log.txt` generated by the
following command?

```
journalctl -x -u xcatd -l > log.txt
```
------------------------------------------------------------------------------
YANG Song (杨嵩)
IBM China System Technology Laboratory
Tel: 86-10-82452903
Email: yang...@cn.ibm.com
Address: Building 28, ZhongGuanCun Software Park,
No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC

北京市海淀区东北旺西路8号中关村软件园28号楼
邮编: 100193


----- Original message -----
From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] [External] Netboot process stuck
Date: Wed, Mar 20, 2019 1:37 AM

Hi Song,

The xCAT probe session output is below:

# xcatprobe osdeploy -n gn001
The install NIC in current server is p2p2    [INFO]
All nodes to be deployed are valid    [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[gn001] 10:04:13 Receive DHCPDISCOVER via p2p2
[gn001] 10:04:13 Send DHCPOFFER on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 DHCPREQUEST for 10.1.2.1 (10.1.0.254) from 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 Via TFTP download pxelinux.0
[gn001] 10:04:15 Via TFTP download pxelinux.0
[gn001] 10:04:15 Via TFTP download pxelinux.cfg/00000000-0000-0000-0000-0025906ca8a2
[gn001] 10:04:15 Via TFTP download pxelinux.cfg/01-00-25-90-6c-a8-a2
[gn001] 10:04:15 Via TFTP download pxelinux.cfg/0A010201
[gn001] 10:04:15 Via TFTP download xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/kernel
[gn001] 10:04:16 Via TFTP download xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/initrd-stateless.gz
[gn001] 10:20:10 Receive DHCPDISCOVER via p2p2
[gn001] 10:20:10 Send DHCPOFFER on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:20:10 DHCPREQUEST for 10.1.2.1 (10.1.0.254) from 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:20:10 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:20:17 INFO =============deployment starting====================
[gn001] 10:20:17 INFO =============deployment starting====================
[gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting (dracut_33)...
[gn001] 10:20:17 INFO Sending request to 10.1.0.254:3002 for changing status to netbooting...
[gn001] 10:20:18 Node status is changed to netbooting
[gn001] 10:20:18 INFO Downloading rootfs image from http://10.1.0.254:80//install/netboot/centos7.6/x86_64/gpu-...
[gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting (dracut_33)...
[gn001] 10:20:17 INFO Sending request to 10.1.0.254:3002 for changing status to netbooting...
[gn001] 10:20:18 INFO Downloading rootfs image from http://10.1.0.254:80//install/netboot/centos7.6/x86_64/gpu-...
[gn001] 10:20:18 Via HTTP get //install/netboot/centos7.6/x86_64/gpu-compute/rootimg.cpio.gz
[gn001] 10:20:27 INFO Setting up RAM-root tmpfs on downloaded rootimg.cpio.[gz/xz]...
[gn001] 10:20:27 INFO Setting up RAM-root tmpfs on downloaded rootimg.cpio.[gz/xz]...
[gn001] 10:20:47 INFO Exiting xcatroot...
[gn001] 10:20:47 INFO Exiting xcatroot...
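
As a side note on the three pxelinux.cfg requests in the log above: PXELINUX looks for a config file named after the client UUID first, then after the MAC address, then after the IP address in hex. The small sketch below shows how the MAC- and IP-based names are derived, which can help when checking that the files under pxelinux.cfg match the node. The function names are made up for illustration.

```shell
# Derive the MAC- and IP-based config file names that PXELINUX requests.

# 00:25:90:6c:a8:a2 -> 01-00-25-90-6c-a8-a2 ("01" is the Ethernet ARP type)
pxe_name_from_mac() {
  printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F:' 'a-f-')"
}

# 10.1.2.1 -> 0A010201 (each octet as two uppercase hex digits)
# The body runs in a subshell so the IFS change does not leak out.
pxe_name_from_ip() (
  IFS=.
  set -- $1
  printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
)

pxe_name_from_mac 00:25:90:6c:a8:a2
pxe_name_from_ip 10.1.2.1
```

Both results match the names requested by gn001 in the probe output, so the node is at least asking for the expected files.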

Below is the node's entry from the DHCP lease file:

host gn001 {
  dynamic;
  hardware ethernet 00:25:90:6c:a8:a2;
  uid 00:25:90:6c:a8:a2;
  fixed-address 10.1.2.1;
  supersede server.ddns-hostname = "gn001";
  supersede host-name = "gn001";
  if option vendor-class-identifier = "ScaleMP" {
    supersede server.filename = "vsmp/pxelinux.0";
  } else {
    supersede server.filename = "pxelinux.0";
  }
}

Regards

--
Angelo Cavalcanti
br.linkedin.com/in/angelocr


On Sat, Mar 16, 2019 at 09:38, Angelo Cavalcanti <angelo.cavalca...@gmail.com> wrote:
1. The status is "powering-on"

2. Yes, the issue happens in the same node

3. Ok. I will send the xCAT-probe output session

Angelo Cavalcanti
br.linkedin.com/in/angelocr


On Fri, Mar 15, 2019 at 07:37, Song BJ Yang <yang...@cn.ibm.com> wrote:
Hi,

If the console output covers the whole process, it seems the initrd boot
process did not reach the rootimg download phase. There is also this message:
`[2019-03-14T10:20:39-03:00] [   37.280041] systemd-fstab-generator[261]: Could not find a root= entry on the kernel command line.`

Several questions:

1. What is the node status (`lsdef <node> -i status,statustime`)? Has it
changed to "netbooting"?
2. Did you provision a batch of nodes with the same osimage? Does the issue
always appear on the same node?
3. Please install xCAT-probe on your MN and run `xcatprobe xcatmn` to check for
configuration issues. Then watch `xcatprobe osdeploy -n <failing node>` in one
terminal session and kick off the provision. The provision progress will show
up in the xcatprobe session. Please provide that session output.

----- Original message -----
From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
To: Song BJ Yang <yang...@cn.ibm.com>
Cc: xcat-user@lists.sourceforge.net
Subject: Re: [xcat-user] [External] Netboot process stuck
Date: Fri, Mar 15, 2019 10:37 AM

Thanks Song,

I added the following kernel parameters:

debug ignore_loglevel log_buf_len=10M print_fatal_signals=1

The console output file is attached. I noticed that the machine's devices were 
not found in the udev database.

Regards,

Angelo Cavalcanti
br.linkedin.com/in/angelocr

Sent from the Gmail Android App

On Thu, Mar 14, 2019 at 06:41, Song BJ Yang <yang...@cn.ibm.com> wrote:
Hi,

We encountered a similar issue, https://github.com/xcat2/xcat-core/issues/274 ,
but in that case the console output revealed the root cause.

However, your console output does not show why the boot process hangs. I
suggest you enable more verbose output during boot. This reference document
explains how to get more debug information during kernel boot:
https://www.askapache.com/linux/linux-debugging/

To apply kernel options during a diskless boot, you can use the `addkcmdline`
attribute. An example of addkcmdline usage:
chdef mid05tor12cn05 addkcmdline="debug ignore_loglevel log_buf_len=10M print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug"


good luck


----- Original message -----
From: Jarrod Johnson <jjohns...@lenovo.com>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Cc:
Subject: Re: [xcat-user] [External] Netboot process stuck
Date: Thu, Mar 14, 2019 5:33 AM


What does the boot kernel get on its command line (see e.g.
/tftpboot/xcat/xnba/nodes/<nodename>)?
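
A quick way to pull just the kernel arguments out of such a file might look like the sketch below. It assumes the arguments sit on an `APPEND` line (pxelinux-style config) or an `imgargs` line (xNBA-style script); the function name is only illustrative, and the right file path depends on the node's netboot method.

```shell
# Print the lines that carry kernel arguments in a node's boot config file.
# Assumes pxelinux-style "APPEND ..." or xNBA-style "imgargs ..." lines.
show_kernel_args() {
  grep -iE '^[[:space:]]*(append|imgargs)[[:space:]]' "$1"
}
```

For example, `show_kernel_args /tftpboot/xcat/xnba/nodes/<nodename>` would print the argument line so the console= and addkcmdline values can be checked.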



From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
Sent: Wednesday, March 13, 2019 4:23 PM
To: xcat-user@lists.sourceforge.net
Subject: [External] [xcat-user] Netboot process stuck



Hi everyone,



I've set up several nodes to netboot an image, but one of them gets stuck
during the boot process, as shown below:



[  485.127448] systemd[1]: Reached target Sockets.
[  515.134248] systemd[1]: Started Journal Service.
[  635.252091] RPC: Registered named UNIX socket transport module.
[  635.258163] RPC: Registered udp transport module.
[  635.263004] RPC: Registered tcp transport module.
[  635.267847] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  845.308189] pps_core: LinuxPPS API ver. 1 registered
[  845.313333] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giome...@linux.it>
[  845.325629] PTP clock support registered
[  845.332035] dca service started, version 1.12.1
[  845.343852] mlx4_core: Mellanox ConnectX core driver v4.0-0
[  845.349585] mlx4_core: Initializing 0000:04:00.0
[  845.364042] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[  845.368351] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[  845.368355] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst
[  845.380249] scsi host0: ahci
[  845.380938] scsi host1: ahci
[  845.382757] scsi host2: ahci
[  845.383161] scsi host3: ahci
[  845.386805] scsi host4: ahci
[  845.387739] scsi host5: ahci
[  845.387852] ata1: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100100 irq 39
[  845.387855] ata2: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100180 irq 39
[  845.387857] ata3: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100200 irq 39
[  845.387860] ata4: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100280 irq 39
[  845.387862] ata5: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100300 irq 39
[  845.387865] ata6: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100380 irq 39
[  845.451810] igb: Copyright (c) 2007-2014 Intel Corporation.
[  845.511186] igb 0000:81:00.0: added PHC on eth0
[  845.515869] igb 0000:81:00.0: Intel(R) Gigabit Ethernet Network Connection
[  845.522896] igb 0000:81:00.0: eth0: (PCIe:5.0Gb/s:Width x4) 00:25:90:6c:a8:a2
[  845.530236] igb 0000:81:00.0: eth0: PBA No: 104900-000
[  845.535510] igb 0000:81:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[  845.598681] igb 0000:81:00.1: added PHC on eth1
[  845.603379] igb 0000:81:00.1: Intel(R) Gigabit Ethernet Network Connection
[  845.610404] igb 0000:81:00.1: eth1: (PCIe:5.0Gb/s:Width x4) 00:25:90:6c:a8:a3
[  845.617768] igb 0000:81:00.1: eth1: PBA No: 104900-000
[  845.623050] igb 0000:81:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[  845.694180] ata4: SATA link down (SStatus 0 SControl 300)
[  845.699730] ata3: SATA link down (SStatus 0 SControl 300)
[  845.705276] ata2: SATA link down (SStatus 0 SControl 300)
[  845.710813] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  845.717162] ata5: SATA link down (SStatus 0 SControl 300)
[  845.722738] ata6: SATA link down (SStatus 0 SControl 300)
[  845.728523] ata1.00: ATA-8: ST91000640NS, SN03, max UDMA/133
[  845.734330] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[  845.742162] ata1.00: configured for UDMA/133
[  845.746833] scsi 0:0:0:0: Direct-Access     ATA      ST91000640NS     SN03 PQ: 0 ANSI: 5
[  845.791663] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[  845.799600] sd 0:0:0:0: [sda] Write Protect is off
[  845.804545] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  845.820557]  sda: sda1 sda2
[  845.823892] sd 0:0:0:0: [sda] Attached SCSI disk
[  851.888095] mlx4_core 0000:04:00.0: Old device ETS support detected
[  851.894490] mlx4_core 0000:04:00.0: Consider upgrading device FW.
[  852.632012] mlx4_core 0000:04:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[  852.640265] mlx4_core 0000:04:00.0: PCIe link width is x8, device supports x8
[  852.787509] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
[  945.585085] igb 0000:81:00.0: changing MTU from 1500 to 2044
[  949.821377] igb 0000:81:00.0 enp129s0f0: igb: enp129s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  950.903525] random: crng init done



Notice that the boot process is slow.
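
To make the slow spots easier to see, the kernel timestamps can be compared line by line. In the excerpt above, for example, about 120 seconds pass between "Started Journal Service" and the first RPC line, and about 210 seconds between the last RPC line and the first pps_core line. The sketch below is one possible way to list such gaps from a saved console capture; the function name and the default 10-second threshold are arbitrary choices.

```shell
# Print every kernel log line that arrives more than THRESHOLD seconds
# (first argument, default 10) after the previous timestamped line.
# Reads a console capture on stdin; lines without a leading
# "[ seconds.micros]" timestamp are ignored.
find_boot_gaps() {
  awk -v threshold="${1:-10}" '
    match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
      ts = substr($0, 2, RLENGTH - 2) + 0
      if (seen && ts - prev > threshold)
        printf "%.1fs gap before: %s\n", ts - prev, $0
      prev = ts
      seen = 1
    }'
}
```

Usage would be something like `find_boot_gaps 30 < console.log`, where console.log is a hypothetical file holding the captured console output.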



The machine has the following configuration:

2x Intel Xeon E5-2670
256GB RAM
Motherboard Supermicro X9DRG-HF
HDD 1TB
Mellanox Infiniband ConnectX-3 card (MT27500)
GPGPU NVIDIA Tesla M2075



I removed all add-in cards and the HDD, but the boot process still gets stuck
at the same stage. When I installed CentOS 7 from the minimal ISO onto the HDD,
the problem did not occur.



Regards,



--

Angelo Cavalcanti
br.linkedin.com/in/angelocr
_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user


