Hi David,

No problem with the serial console was detected. The console outputs are
attached to my previous messages.

An alternative solution was applied: I installed CentOS 7.6 on the
machine's HDD, then installed the OpenHPC packages (PBSPro, GCC, etc.) and
copied some configuration files (pbs.conf, etc.) from the other nodes to gn001.

Now the node is up, but some xCAT features (the "updatenode" command, for
example) are, as expected, not working.

Regards,

Angelo Cavalcanti
br.linkedin.com/in/angelocr



On Thu, Apr 4, 2019 at 14:31, David Rajendra <drajen...@lenovo.com>
wrote:

> Hi,
>
>
>
> I do not know much about Supermicro machines.
>
> If you have a serial console configured, do you get any output on it (rcons
> gn001)?
> If so, do you see BIOS messages but no operating system boot messages?
>
>
>
> On some of our server models we found that the Linux boot process would hang
> if hardware flow control was configured in xCAT, so we turned it off
> (we disabled the xCAT nodehm.serialflow setting for the nodes).
>
> Maybe that is worth a try?
>
>
>
>
>
> Regards,
>
>
>
> David
>
>
>
>
>
> *From:* Angelo Cavalcanti <angelo.cavalca...@gmail.com>
> *Sent:* Saturday, March 30, 2019 11:54 AM
> *To:* xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
> *Subject:* Re: [xcat-user] [External] Netboot process stuck
>
>
>
> Hi Song,
>
>
>
> The status has been changed to "netbooting" (below), but the boot process
> still hangs.
>
>
>
> # lsdef -t node gn001
> Object name: gn001
>    addkcmdline=debug ignore_loglevel
>    arch=x86_64
>    bmc=10.2.2.1
>    bmcpassword=admin
>    bmcusername=admin
>    cons=ipmi
>    consoleenabled=1
>    currchain=boot
>    currstate=netboot centos7.6-x86_64-compute
>    groups=all,compute
>    ip=10.1.2.1
>    mac=00:25:90:6c:a8:a2
>    mgt=ipmi
>    netboot=pxe
>    nicips.ib0=10.3.2.1
>    nicnetworks.ib0=10_3_0_0-255_255_0_0
>    nictypes.ib0=Infiniband
>    os=centos7.6
>    postbootscripts=otherpkgs
>    postscripts=syslog,remoteshell,syncfiles
>    primarynic=mac
>    profile=compute
>    provmethod=centos7.6-x86_64-netboot-gpu-compute
>    serialflow=hard
>    serialport=1
>    serialspeed=115200
>    status=netbooting
>    statustime=03-19-2019 11:51:04
>    updatestatus=failed
>    updatestatustime=03-13-2019 09:35:38
>
>
>
>
>
> The log file is attached.
>
>
>
> Regards,
>
> --
>
> Angelo Cavalcanti
> br.linkedin.com/in/angelocr
>
>
>
>
>
> On Wed, Mar 20, 2019 at 07:34, Song BJ Yang <yang...@cn.ibm.com>
> wrote:
>
> Hi Angelo,
>
>
>
> From the xcatprobe output, the rootfs tarball has been downloaded and
> extracted. It is strange that the status is still `powering-off` instead of
> "netbooting", since we can see that the node reported its status to the MN
> from inside dracut. Would you please provide the log file `log.txt`
> generated by the following command?
>
>
>
> ```
>
> journalctl -x -u xcatd -l > log.txt
>
> ```
>
>
> ------------------------------------------------------------------------------
> YANG Song (杨嵩)
> IBM China System Technology Laboratory
> Tel: 86-10-82452903
> Email: yang...@cn.ibm.com
> Address: Building 28, ZhongGuanCun Software Park,
> No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC
>
> 北京市海淀区东北旺西路8号中关村软件园28号楼
> 邮编: 100193
>
>
>
>
>
> ----- Original message -----
> From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
> Cc:
> Subject: Re: [xcat-user] [External] Netboot process stuck
> Date: Wed, Mar 20, 2019 1:37 AM
>
>
> Hi Song,
>
>
>
> The xCAT probe session output is below:
>
>
>
> # xcatprobe osdeploy -n gn001
>
> The install NIC in current server is p2p2
>                                        [INFO]
>
> All nodes to be deployed are valid
>                                         [ OK ]
>
> -------------------------------------------------------------
>
> Start capturing every message during OS provision process....
>
> -------------------------------------------------------------
>
>
>
> [gn001] 10:04:13 Receive DHCPDISCOVER via p2p2
>
> [gn001] 10:04:13 Send DHCPOFFER on 10.1.2.1 back to 00:25:90:6c:a8:a2 via
> p2p2
>
> [gn001] 10:04:15 DHCPREQUEST for 10.1.2.1 (10.1.0.254) from
> 00:25:90:6c:a8:a2 via p2p2
>
> [gn001] 10:04:15 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via
> p2p2
>
> [gn001] 10:04:15 Via TFTP download pxelinux.0
>
> [gn001] 10:04:15 Via TFTP download pxelinux.0
>
> [gn001] 10:04:15 Via TFTP download
> pxelinux.cfg/00000000-0000-0000-0000-0025906ca8a2
>
> [gn001] 10:04:15 Via TFTP download pxelinux.cfg/01-00-25-90-6c-a8-a2
>
> [gn001] 10:04:15 Via TFTP download pxelinux.cfg/0A010201
>
> [gn001] 10:04:15 Via TFTP download
> xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/kernel
>
> [gn001] 10:04:16 Via TFTP download
> xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/initrd-stateless.gz
>
> [gn001] 10:20:10 Receive DHCPDISCOVER via p2p2
>
> [gn001] 10:20:10 Send DHCPOFFER on 10.1.2.1 back to 00:25:90:6c:a8:a2 via
> p2p2
>
> [gn001] 10:20:10 DHCPREQUEST for 10.1.2.1 (10.1.0.254) from
> 00:25:90:6c:a8:a2 via p2p2
>
> [gn001] 10:20:10 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via
> p2p2
>
> [gn001] 10:20:17 INFO =============deployment starting====================
>
> [gn001] 10:20:17 INFO =============deployment starting====================
>
> [gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting
> (dracut_33)...
>
> [gn001] 10:20:17 INFO Sending request to 10.1.0.254:3002 for changing
> status to netbooting...
>
> [gn001] 10:20:18 Node status is changed to netbooting
>
> [gn001] 10:20:18 INFO Downloading rootfs image from
> http://10.1.0.254:80//install/netboot/centos7.6/x86_64/gpu-...
>
> [gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting
> (dracut_33)...
>
> [gn001] 10:20:17 INFO Sending request to 10.1.0.254:3002 for changing
> status to netbooting...
>
> [gn001] 10:20:18 INFO Downloading rootfs image from
> http://10.1.0.254:80//install/netboot/centos7.6/x86_64/gpu-...
>
> [gn001] 10:20:18 Via HTTP get
> //install/netboot/centos7.6/x86_64/gpu-compute/rootimg.cpio.gz
>
> [gn001] 10:20:27 INFO Setting up RAM-root tmpfs on downloaded
> rootimg.cpio.[gz/xz]...
>
> [gn001] 10:20:27 INFO Setting up RAM-root tmpfs on downloaded
> rootimg.cpio.[gz/xz]...
>
> [gn001] 10:20:47 INFO Exiting xcatroot...
>
> [gn001] 10:20:47 INFO Exiting xcatroot...
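
For anyone following the TFTP requests in the probe output above: the pxelinux.cfg names that the node asks for can be derived from its MAC and IP address. PXELINUX tries the client UUID first, then the MAC prefixed with the ARP type 01, then the IP address in uppercase hex (shortening the hex string one digit at a time before falling back to "default"). A minimal bash sketch, where the function names pxe_mac_name and pxe_ip_name are made up for illustration:

```shell
#!/usr/bin/env bash
# Print the pxelinux.cfg file names derived from a MAC and an IPv4 address.
# pxe_mac_name and pxe_ip_name are hypothetical helper names, not xCAT tools.

# MAC form: ARP hardware type 01 (Ethernet), then the MAC in lowercase
# with colons replaced by dashes.
pxe_mac_name() {
    local mac
    mac=$(printf '%s' "$1" | tr 'A-F:' 'a-f-')
    printf '01-%s\n' "$mac"
}

# IP form: the four octets written as uppercase hexadecimal.
pxe_ip_name() {
    local IFS=.
    # Word-split the dotted quad into four positional parameters.
    set -- $1
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

pxe_mac_name 00:25:90:6c:a8:a2
pxe_ip_name 10.1.2.1
```

For gn001 this gives 01-00-25-90-6c-a8-a2 and 0A010201, the same names seen in the probe output, which can help when checking which file under pxelinux.cfg/ the node actually picks up.
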
>
>
>
> And below is the part of the DHCP lease file for the node:
>
>
>
> host gn001 {
>  dynamic;
>  hardware ethernet 00:25:90:6c:a8:a2;
>  uid 00:25:90:6c:a8:a2;
>  fixed-address 10.1.2.1;
>        supersede server.ddns-hostname = "gn001";
>        supersede host-name = "gn001";
>        if option vendor-class-identifier = "ScaleMP" {
>          supersede server.filename = "vsmp/pxelinux.0";
>        } else {
>          supersede server.filename = "pxelinux.0";
>        }
> }
>
>
> Regards
>
>
>
> --
>
> Angelo Cavalcanti
> br.linkedin.com/in/angelocr
>
>
>
>
> On Sat, Mar 16, 2019 at 09:38, Angelo Cavalcanti <
> angelo.cavalca...@gmail.com> wrote:
>
> 1. The status is "powering-on"
>
>
>
> 2. Yes, the issue happens in the same node
>
>
>
> 3. Ok. I will send the xCAT-probe output session
>
>
>
> Angelo Cavalcanti
> br.linkedin.com/in/angelocr
>
>
>
>
> On Fri, Mar 15, 2019 at 07:37, Song BJ Yang <yang...@cn.ibm.com>
> wrote:
>
> Hi,
>
>
>
> If the console output covers the whole process, it seems the initrd boot
> process did not reach the rootimg download phase. There is also this message:
> `[2019-03-14T10:20:39-03:00]
> [   37.280041] systemd-fstab-generator[261]: Could not find a root= entry
> on the kernel command line.`
>
>
>
> Several questions:
>
>
>
> 1. What is the node status (`lsdef <node> -i status,statustime`)? Has it
> changed to "netbooting"?
>
> 2. Did you provision a batch of nodes with the same osimage? Does the issue
> always appear on the same node?
>
> 3. Please install xCAT-probe on your MN and run `xcatprobe xcatmn` to check
> for any configuration issues.
>
> Then watch `xcatprobe osdeploy -n <failing node>` in one terminal session
> and kick off the provisioning. You will see the provisioning progress in
> the xcatprobe session. Please provide that session output.
>
> ------------------------------------------------------------------------------
>
> YANG Song (杨嵩)
> IBM China System Technology Laboratory
> Tel: 86-10-82452903
> Email: yang...@cn.ibm.com
> Address: Building 28, ZhongGuanCun Software Park,
> No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC
>
> 北京市海淀区东北旺西路8号中关村软件园28号楼
> 邮编: 100193
>
>
>
>
>
> ----- Original message -----
> From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
> To: Song BJ Yang <yang...@cn.ibm.com>
> Cc: xcat-user@lists.sourceforge.net
> Subject: Re: [xcat-user] [External] Netboot process stuck
> Date: Fri, Mar 15, 2019 10:37 AM
>
>
> Thanks Song,
>
>
>
> I added the following kernel parameters:
>
>
>
> debug ignore_loglevel log_buf_len=10M print_fatal_signals=1
>
>
>
> The console output file is attached. I noticed that the machine's devices
> were not found in the udev database.
>
>
>
> Regards,
>
>
>
> Angelo Cavalcanti
> br.linkedin.com/in/angelocr
>
> Sent from the Gmail Android App
>
>
>
> On Thu, Mar 14, 2019 at 06:41, Song BJ Yang <yang...@cn.ibm.com>
> wrote:
>
> Hi,
>
>
>
> We encountered a similar issue,
> https://github.com/xcat2/xcat-core/issues/274 , but in that case the
> console output revealed the root cause.
>
>
>
>
>
> However, your console output does not show why the boot process hangs.
> I suggest you enable more verbose output during boot. This is a reference
> doc, https://www.askapache.com/linux/linux-debugging/ , on how to get more
> debug info during kernel boot.
>
>
>
> To apply kernel options to a diskless boot, you can
> use the `addkcmdline` attribute. An example of addkcmdline usage:
>
> chdef mid05tor12cn05  addkcmdline="debug ignore_loglevel log_buf_len=10M
> print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug"
>
>
>
>
>
> good luck
>
>
>
>
> ------------------------------------------------------------------------------
> YANG Song (杨嵩)
> IBM China System Technology Laboratory
> Tel: 86-10-82452903
> Email: yang...@cn.ibm.com
> Address: Building 28, ZhongGuanCun Software Park,
> No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC
>
> 北京市海淀区东北旺西路8号中关村软件园28号楼
> 邮编: 100193
>
>
>
>
>
> ----- Original message -----
> From: Jarrod Johnson <jjohns...@lenovo.com>
> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
> Cc:
> Subject: Re: [xcat-user] [External] Netboot process stuck
> Date: Thu, Mar 14, 2019 5:33 AM
>
>
> What does the boot kernel get on its command line (e.g.
> /tftpboot/xcat/xnba/nodes/<nodename>)?
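
One way to inspect the command line once it has been captured (for example from the boot configuration file generated for the node, or from /proc/cmdline if the node reaches a shell) is to split it into one parameter per line and test for specific keys. This is only a sketch: kparams and has_kparam are hypothetical helper names, the sample string contains just the debug options quoted in this thread rather than a real full command line, and quoted values containing spaces are not handled.

```shell
#!/usr/bin/env bash
# List kernel command-line parameters one per line and test for a key.
# kparams and has_kparam are hypothetical helpers, not part of xCAT.

# Print each whitespace-separated parameter on its own line.
kparams() {
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' | sed '/^$/d'
}

# Succeed if the command line has the bare flag KEY or a KEY=value pair.
has_kparam() {
    kparams "$1" | grep -q -E "^$2(=|\$)"
}

# Sample input: only the debug options mentioned in this thread.
cmdline='debug ignore_loglevel log_buf_len=10M print_fatal_signals=1'

kparams "$cmdline"
has_kparam "$cmdline" log_buf_len && echo "log_buf_len is set"
has_kparam "$cmdline" root || echo "no root= entry in this sample"
```

Comparing the result for the failing node against a node that boots correctly with the same osimage should show quickly whether the two receive different parameters.
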
>
>
>
> *From:* Angelo Cavalcanti <angelo.cavalca...@gmail.com>
> *Sent:* Wednesday, March 13, 2019 4:23 PM
> *To:* xcat-user@lists.sourceforge.net
> *Subject:* [External] [xcat-user] Netboot process stuck
>
>
>
> Hi everyone,
>
>
>
> I've set up several nodes to netboot an image, but one of them gets stuck in
> the boot process, as shown below:
>
>
>
> [  485.127448] systemd[1]: Reached target Sockets.
>
> [  515.134248] systemd[1]: Started Journal Service.
>
> [  635.252091] RPC: Registered named UNIX socket transport module.
>
> [  635.258163] RPC: Registered udp transport module.
>
> [  635.263004] RPC: Registered tcp transport module.
>
> [  635.267847] RPC: Registered tcp NFSv4.1 backchannel transport module.
>
> [  845.308189] pps_core: LinuxPPS API ver. 1 registered
>
> [  845.313333] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo
> Giometti <giome...@linux.it>
>
> [  845.325629] PTP clock support registered
>
> [  845.332035] dca service started, version 1.12.1
>
> [  845.343852] mlx4_core: Mellanox ConnectX core driver v4.0-0
>
> [  845.349585] mlx4_core: Initializing 0000:04:00.0
>
> [  845.364042] igb: Intel(R) Gigabit Ethernet Network Driver - version
> 5.4.0-k
>
> [  845.368351] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps
> 0x3f impl SATA mode
>
> [  845.368355] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio
> slum part ems apst
>
> [  845.380249] scsi host0: ahci
>
> [  845.380938] scsi host1: ahci
>
> [  845.382757] scsi host2: ahci
>
> [  845.383161] scsi host3: ahci
>
> [  845.386805] scsi host4: ahci
>
> [  845.387739] scsi host5: ahci
>
> [  845.387852] ata1: SATA max UDMA/133 abar m2048@0xde100000 port
> 0xde100100 irq 39
>
> [  845.387855] ata2: SATA max UDMA/133 abar m2048@0xde100000 port
> 0xde100180 irq 39
>
> [  845.387857] ata3: SATA max UDMA/133 abar m2048@0xde100000 port
> 0xde100200 irq 39
>
> [  845.387860] ata4: SATA max UDMA/133 abar m2048@0xde100000 port
> 0xde100280 irq 39
>
> [  845.387862] ata5: SATA max UDMA/133 abar m2048@0xde100000 port
> 0xde100300 irq 39
>
> [  845.387865] ata6: SATA max UDMA/133 abar m2048@0xde100000 port
> 0xde100380 irq 39
>
> [  845.451810] igb: Copyright (c) 2007-2014 Intel Corporation.
>
> [  845.511186] igb 0000:81:00.0: added PHC on eth0
>
> [  845.515869] igb 0000:81:00.0: Intel(R) Gigabit Ethernet Network
> Connection
>
> [  845.522896] igb 0000:81:00.0: eth0: (PCIe:5.0Gb/s:Width x4)
> 00:25:90:6c:a8:a2
>
> [  845.530236] igb 0000:81:00.0: eth0: PBA No: 104900-000
>
> [  845.535510] igb 0000:81:00.0: Using MSI-X interrupts. 8 rx queue(s), 8
> tx queue(s)
>
> [  845.598681] igb 0000:81:00.1: added PHC on eth1
>
> [  845.603379] igb 0000:81:00.1: Intel(R) Gigabit Ethernet Network
> Connection
>
> [  845.610404] igb 0000:81:00.1: eth1: (PCIe:5.0Gb/s:Width x4)
> 00:25:90:6c:a8:a3
>
> [  845.617768] igb 0000:81:00.1: eth1: PBA No: 104900-000
>
> [  845.623050] igb 0000:81:00.1: Using MSI-X interrupts. 8 rx queue(s), 8
> tx queue(s)
>
> [  845.694180] ata4: SATA link down (SStatus 0 SControl 300)
>
> [  845.699730] ata3: SATA link down (SStatus 0 SControl 300)
>
> [  845.705276] ata2: SATA link down (SStatus 0 SControl 300)
>
> [  845.710813] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>
> [  845.717162] ata5: SATA link down (SStatus 0 SControl 300)
>
> [  845.722738] ata6: SATA link down (SStatus 0 SControl 300)
>
> [  845.728523] ata1.00: ATA-8: ST91000640NS, SN03, max UDMA/133
>
> [  845.734330] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth
> 31/32)
>
> [  845.742162] ata1.00: configured for UDMA/133
>
> [  845.746833] scsi 0:0:0:0: Direct-Access     ATA      ST91000640NS
>  SN03 PQ: 0 ANSI: 5
>
> [  845.791663] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00
> TB/931 GiB)
>
> [  845.799600] sd 0:0:0:0: [sda] Write Protect is off
>
> [  845.804545] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
>
> [  845.820557]  sda: sda1 sda2
>
> [  845.823892] sd 0:0:0:0: [sda] Attached SCSI disk
>
> [  851.888095] mlx4_core 0000:04:00.0: Old device ETS support detected
>
> [  851.894490] mlx4_core 0000:04:00.0: Consider upgrading device FW.
>
> [  852.632012] mlx4_core 0000:04:00.0: PCIe link speed is 8.0GT/s, device
> supports 8.0GT/s
>
> [  852.640265] mlx4_core 0000:04:00.0: PCIe link width is x8, device
> supports x8
>
> [  852.787509] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
>
> [  945.585085] igb 0000:81:00.0: changing MTU from 1500 to 2044
>
> [  949.821377] igb 0000:81:00.0 enp129s0f0: igb: enp129s0f0 NIC Link is Up
> 1000 Mbps Full Duplex, Flow Control: RX
>
> [  950.903525] random: crng init done
>
>
>
> Notice that the boot process is slow.
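
The timestamps in the log above make the slowness visible: there are jumps of minutes between consecutive kernel messages. A rough bash/awk sketch for listing such jumps in a saved console log follows; boot_gaps is a hypothetical helper name and the 60-second threshold is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Report jumps between consecutive kernel timestamps above a threshold.
# boot_gaps is a hypothetical helper; usage: boot_gaps SECONDS < console.log

boot_gaps() {
    awk -v limit="$1" '
        # Find a kernel timestamp such as "[  485.127448]" anywhere in the line.
        match($0, /\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the brackets and convert the remainder to a number.
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (seen && ts - prev > limit)
                printf "%.1f s gap before: %s\n", ts - prev, $0
            prev = ts
            seen = 1
        }
    '
}

# Sample input: the first lines of the console excerpt in this message.
boot_gaps 60 <<'EOF'
[  485.127448] systemd[1]: Reached target Sockets.
[  515.134248] systemd[1]: Started Journal Service.
[  635.252091] RPC: Registered named UNIX socket transport module.
[  635.258163] RPC: Registered udp transport module.
[  845.308189] pps_core: LinuxPPS API ver. 1 registered
EOF
```

Wrapped continuation lines without a timestamp are ignored. On the excerpt above, the largest jumps are before the RPC messages (about 120 s) and before the pps_core line (about 210 s).
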
>
>
>
> The machine has the following configuration:
>
> 2x Intel Xeon E5-2670
>
> 256GB RAM
>
> Motherboard Supermicro X9DRG-HF
>
> HDD 1TB
>
> Mellanox Infiniband ConnectX-3 card (MT27500)
>
> GPGPU nVidia Tesla M2075
>
>
>
> I removed all off-board cards and the HDD; the boot process still gets stuck
> at the same stage. I installed CentOS 7 from the minimal ISO on the HDD and
> the problem did not occur.
>
>
>
> Regards,
>
>
>
> --
>
> Angelo Cavalcanti
> br.linkedin.com/in/angelocr
>
> _______________________________________________
> xCAT-user mailing list
> xCAT-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/xcat-user
