Hi Song,

The xCAT probe session output is below:

# xcatprobe osdeploy -n gn001
The install NIC in current server is p2p2                                [INFO]
All nodes to be deployed are valid                                       [ OK ]
-------------------------------------------------------------
Start capturing every message during OS provision process....
-------------------------------------------------------------

[gn001] 10:04:13 Receive DHCPDISCOVER via p2p2
[gn001] 10:04:13 Send DHCPOFFER on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 DHCPREQUEST for 10.1.2.1 (10.1.0.254) from 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 Via TFTP download pxelinux.0
[gn001] 10:04:15 Via TFTP download pxelinux.0
[gn001] 10:04:15 Via TFTP download pxelinux.cfg/00000000-0000-0000-0000-0025906ca8a2
[gn001] 10:04:15 Via TFTP download pxelinux.cfg/01-00-25-90-6c-a8-a2
[gn001] 10:04:15 Via TFTP download pxelinux.cfg/0A010201
[gn001] 10:04:15 Via TFTP download xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/kernel
[gn001] 10:04:16 Via TFTP download xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/initrd-stateless.gz
[gn001] 10:20:10 Receive DHCPDISCOVER via p2p2
[gn001] 10:20:10 Send DHCPOFFER on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:20:10 DHCPREQUEST for 10.1.2.1 (10.1.0.254) from 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:20:10 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:20:17 INFO =============deployment starting====================
[gn001] 10:20:17 INFO =============deployment starting====================
[gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting (dracut_33)...
[gn001] 10:20:17 INFO Sending request to 10.1.0.254:3002 for changing status to netbooting...
[gn001] 10:20:18 Node status is changed to netbooting
[gn001] 10:20:18 INFO Downloading rootfs image from http://10.1.0.254:80//install/netboot/centos7.6/x86_64/gpu-...
[gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting (dracut_33)...
[gn001] 10:20:17 INFO Sending request to 10.1.0.254:3002 for changing status to netbooting...
[gn001] 10:20:18 INFO Downloading rootfs image from http://10.1.0.254:80//install/netboot/centos7.6/x86_64/gpu-...
[gn001] 10:20:18 Via HTTP get //install/netboot/centos7.6/x86_64/gpu-compute/rootimg.cpio.gz
[gn001] 10:20:27 INFO Setting up RAM-root tmpfs on downloaded rootimg.cpio.[gz/xz]...
[gn001] 10:20:27 INFO Setting up RAM-root tmpfs on downloaded rootimg.cpio.[gz/xz]...
[gn001] 10:20:47 INFO Exiting xcatroot...
[gn001] 10:20:47 INFO Exiting xcatroot...
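
As a rough aid for reading the probe output above, here is a minimal Python sketch that finds the longest pause between consecutive timestamped events. It is not part of xCAT; the function names `parse_events` and `largest_gap` are made up for illustration, and it assumes every event line has the form `[node] HH:MM:SS message`, sits on a single (unwrapped) line, and falls within one day. The sample lines are copied from the session above.

```python
import re
from datetime import datetime

# A few event lines copied from the probe session (unwrapped).
SAMPLE = """\
[gn001] 10:04:15 Send DHCPACK on 10.1.2.1 back to 00:25:90:6c:a8:a2 via p2p2
[gn001] 10:04:15 Via TFTP download pxelinux.0
[gn001] 10:04:16 Via TFTP download xcat/osimage/centos7.6-x86_64-netboot-gpu-compute/initrd-stateless.gz
[gn001] 10:20:10 Receive DHCPDISCOVER via p2p2
[gn001] 10:20:17 INFO Executing xcatroot to prepare for netbooting (dracut_33)...
[gn001] 10:20:47 INFO Exiting xcatroot...
"""

# Assumed line layout: "[node] HH:MM:SS message".
LINE_RE = re.compile(r"^\[[^\]]+\]\s+(\d{2}:\d{2}:\d{2})\s+(.*)$")


def parse_events(text):
    """Return a list of (time, message) pairs for lines matching LINE_RE."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def largest_gap(events):
    """Return (seconds, message_before, message_after) for the longest pause."""
    best = None
    for (t0, m0), (t1, m1) in zip(events, events[1:]):
        seconds = (t1 - t0).total_seconds()
        if best is None or seconds > best[0]:
            best = (seconds, m0, m1)
    return best


seconds, before, after = largest_gap(parse_events(SAMPLE))
print(f"{seconds:.0f}s between '{before}' and '{after}'")
```

On these sample lines the longest pause is the roughly 16 minutes between the initrd TFTP download at 10:04:16 and the second DHCPDISCOVER at 10:20:10.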

Below is the relevant part of the DHCP lease file for the node:

host gn001 {
  dynamic;
  hardware ethernet 00:25:90:6c:a8:a2;
  uid 00:25:90:6c:a8:a2;
  fixed-address 10.1.2.1;
  supersede server.ddns-hostname = "gn001";
  supersede host-name = "gn001";
  if option vendor-class-identifier = "ScaleMP" {
    supersede server.filename = "vsmp/pxelinux.0";
  } else {
    supersede server.filename = "pxelinux.0";
  }
}
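
For reference, the `pxelinux.cfg/...` names in the probe output can be derived from the MAC address and fixed address in this host entry. The sketch below follows the documented PXELINUX lookup order after the UUID-based name: the MAC-based name, then the IPv4 address in uppercase hex with one digit dropped at a time, then `default`. The function name `pxelinux_config_candidates` is hypothetical, and this only lists candidate names; it says nothing about which file actually exists on the TFTP server.

```python
import ipaddress


def pxelinux_config_candidates(mac, ipv4):
    """List config names PXELINUX tries for a client, after any UUID-based name."""
    # "01" is the ARP hardware type for Ethernet; the MAC is lowercase and
    # dash-separated.
    names = ["01-" + mac.lower().replace(":", "-")]
    # The IPv4 address as 8 uppercase hex digits, then progressively shorter
    # prefixes of that string.
    hex_ip = f"{int(ipaddress.IPv4Address(ipv4)):08X}"
    names += [hex_ip[:length] for length in range(8, 0, -1)]
    # Final fallback name.
    names.append("default")
    return names


for name in pxelinux_config_candidates("00:25:90:6c:a8:a2", "10.1.2.1"):
    print("pxelinux.cfg/" + name)
```

The first two names printed, `pxelinux.cfg/01-00-25-90-6c-a8-a2` and `pxelinux.cfg/0A010201`, match the TFTP requests in the probe session.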

Regards

--
Angelo Cavalcanti
br.linkedin.com/in/angelocr



On Sat, Mar 16, 2019 at 09:38, Angelo Cavalcanti <
angelo.cavalca...@gmail.com> wrote:

> 1. The status is "powering-on".
>
> 2. Yes, the issue always happens on the same node.
>
> 3. OK. I will send the xcatprobe session output.
>
> Angelo Cavalcanti
> br.linkedin.com/in/angelocr
>
>
>
> On Fri, Mar 15, 2019 at 07:37, Song BJ Yang <yang...@cn.ibm.com>
> wrote:
>
>> Hi,
>>
>> If the console output covers the whole process, it seems the initrd boot
>> process did not reach the rootimg download phase. There is also this
>> message:
>> `[2019-03-14T10:20:39-03:00] [   37.280041] systemd-fstab-generator[261]: Could not find a root= entry on the kernel command line.`
>>
>> Several questions:
>>
>> 1. What is the node status (`lsdef <node> -i status,statustime`)? Has it
>> changed to "netbooting"?
>> 2. Did you provision a batch of nodes with the same osimage? Does the
>> issue always appear on the same node?
>> 3. Please install xCAT-probe on your MN and run `xcatprobe xcatmn` to
>> check for configuration issues. Then watch
>> `xcatprobe osdeploy -n <failing node>` in one terminal session and kick
>> off provisioning. You will see the provisioning progress in the
>> xcatprobe session. Please provide that session output.
>>
>> ------------------------------------------------------------------------------
>> YANG Song (杨嵩)
>> IBM China System Technology Laboratory
>> Tel: 86-10-82452903
>> Email: yang...@cn.ibm.com
>> Address: Building 28, ZhongGuanCun Software Park,
>> No.8, Dong Bei Wang West Road, Haidian District Beijing 100193, PRC
>>
>> 北京市海淀区东北旺西路8号中关村软件园28号楼
>> 邮编: 100193
>>
>>
>>
>> ----- Original message -----
>> From: Angelo Cavalcanti <angelo.cavalca...@gmail.com>
>> To: Song BJ Yang <yang...@cn.ibm.com>
>> Cc: xcat-user@lists.sourceforge.net
>> Subject: Re: [xcat-user] [External] Netboot process stuck
>> Date: Fri, Mar 15, 2019 10:37 AM
>>
>> Thanks Song,
>>
>> I added the following kernel parameters:
>>
>> debug ignore_loglevel log_buf_len=10M print_fatal_signals=1
>>
>> The console output file is attached. I noticed that the machine's devices
>> were not found in the udev database.
>>
>> Regards,
>>
>> Angelo Cavalcanti
>> br.linkedin.com/in/angelocr
>>
>>
>> On Thu, Mar 14, 2019 at 06:41, Song BJ Yang <yang...@cn.ibm.com>
>> wrote:
>>
>> Hi,
>>
>> We encountered a similar issue,
>> https://github.com/xcat2/xcat-core/issues/274 , but in that case the
>> console output revealed the root cause.
>>
>> However, your console output does not show why the boot process hangs.
>> I suggest enabling more verbose output during boot. This document is a
>> reference on how to get more debug information during kernel boot:
>> https://www.askapache.com/linux/linux-debugging/
>>
>> To apply kernel options during a diskless boot, you can use the
>> `addkcmdline` attribute. An example of addkcmdline usage:
>>
>> chdef mid05tor12cn05 addkcmdline="debug ignore_loglevel log_buf_len=10M print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug"
>>
>>
>> Good luck.
>>
>>
>>
>>
>> ----- Original message -----
>> From: Jarrod Johnson <jjohns...@lenovo.com>
>> To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
>> Cc:
>> Subject: Re: [xcat-user] [External] Netboot process stuck
>> Date: Thu, Mar 14, 2019 5:33 AM
>>
>>
>> What kernel command line does the boot kernel get (e.g. in
>> /tftpboot/xcat/xnba/nodes/<nodename>)?
>>
>>
>>
>> *From:* Angelo Cavalcanti <angelo.cavalca...@gmail.com>
>> *Sent:* Wednesday, March 13, 2019 4:23 PM
>> *To:* xcat-user@lists.sourceforge.net
>> *Subject:* [External] [xcat-user] Netboot process stuck
>>
>>
>>
>> Hi everyone,
>>
>>
>>
>> I've set up several nodes to netboot an image, but one of them gets stuck
>> during the boot process, as shown below:
>>
>>
>>
>> [  485.127448] systemd[1]: Reached target Sockets.
>> [  515.134248] systemd[1]: Started Journal Service.
>> [  635.252091] RPC: Registered named UNIX socket transport module.
>> [  635.258163] RPC: Registered udp transport module.
>> [  635.263004] RPC: Registered tcp transport module.
>> [  635.267847] RPC: Registered tcp NFSv4.1 backchannel transport module.
>> [  845.308189] pps_core: LinuxPPS API ver. 1 registered
>> [  845.313333] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giome...@linux.it>
>> [  845.325629] PTP clock support registered
>> [  845.332035] dca service started, version 1.12.1
>> [  845.343852] mlx4_core: Mellanox ConnectX core driver v4.0-0
>> [  845.349585] mlx4_core: Initializing 0000:04:00.0
>> [  845.364042] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
>> [  845.368351] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
>> [  845.368355] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst
>> [  845.380249] scsi host0: ahci
>> [  845.380938] scsi host1: ahci
>> [  845.382757] scsi host2: ahci
>> [  845.383161] scsi host3: ahci
>> [  845.386805] scsi host4: ahci
>> [  845.387739] scsi host5: ahci
>> [  845.387852] ata1: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100100 irq 39
>> [  845.387855] ata2: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100180 irq 39
>> [  845.387857] ata3: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100200 irq 39
>> [  845.387860] ata4: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100280 irq 39
>> [  845.387862] ata5: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100300 irq 39
>> [  845.387865] ata6: SATA max UDMA/133 abar m2048@0xde100000 port 0xde100380 irq 39
>> [  845.451810] igb: Copyright (c) 2007-2014 Intel Corporation.
>> [  845.511186] igb 0000:81:00.0: added PHC on eth0
>> [  845.515869] igb 0000:81:00.0: Intel(R) Gigabit Ethernet Network Connection
>> [  845.522896] igb 0000:81:00.0: eth0: (PCIe:5.0Gb/s:Width x4) 00:25:90:6c:a8:a2
>> [  845.530236] igb 0000:81:00.0: eth0: PBA No: 104900-000
>> [  845.535510] igb 0000:81:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
>> [  845.598681] igb 0000:81:00.1: added PHC on eth1
>> [  845.603379] igb 0000:81:00.1: Intel(R) Gigabit Ethernet Network Connection
>> [  845.610404] igb 0000:81:00.1: eth1: (PCIe:5.0Gb/s:Width x4) 00:25:90:6c:a8:a3
>> [  845.617768] igb 0000:81:00.1: eth1: PBA No: 104900-000
>> [  845.623050] igb 0000:81:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
>> [  845.694180] ata4: SATA link down (SStatus 0 SControl 300)
>> [  845.699730] ata3: SATA link down (SStatus 0 SControl 300)
>> [  845.705276] ata2: SATA link down (SStatus 0 SControl 300)
>> [  845.710813] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>> [  845.717162] ata5: SATA link down (SStatus 0 SControl 300)
>> [  845.722738] ata6: SATA link down (SStatus 0 SControl 300)
>> [  845.728523] ata1.00: ATA-8: ST91000640NS, SN03, max UDMA/133
>> [  845.734330] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
>> [  845.742162] ata1.00: configured for UDMA/133
>> [  845.746833] scsi 0:0:0:0: Direct-Access     ATA      ST91000640NS  SN03 PQ: 0 ANSI: 5
>> [  845.791663] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
>> [  845.799600] sd 0:0:0:0: [sda] Write Protect is off
>> [  845.804545] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> [  845.820557]  sda: sda1 sda2
>> [  845.823892] sd 0:0:0:0: [sda] Attached SCSI disk
>> [  851.888095] mlx4_core 0000:04:00.0: Old device ETS support detected
>> [  851.894490] mlx4_core 0000:04:00.0: Consider upgrading device FW.
>> [  852.632012] mlx4_core 0000:04:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
>> [  852.640265] mlx4_core 0000:04:00.0: PCIe link width is x8, device supports x8
>> [  852.787509] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
>> [  945.585085] igb 0000:81:00.0: changing MTU from 1500 to 2044
>> [  949.821377] igb 0000:81:00.0 enp129s0f0: igb: enp129s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
>> [  950.903525] random: crng init done
>>
>>
>>
>> Notice that the boot process is slow.
>>
>>
>>
>> The machine has the following configuration:
>>
>> 2x Intel Xeon E5-2670
>> 256 GB RAM
>> Supermicro X9DRG-HF motherboard
>> 1 TB HDD
>> Mellanox ConnectX-3 InfiniBand card (MT27500)
>> NVIDIA Tesla M2075 GPGPU
>>
>>
>>
>> I removed all add-in cards and the HDD, and the boot process still gets
>> stuck at the same stage. I also installed CentOS 7 from the minimal ISO
>> onto the HDD, and the problem did not occur.
>>
>>
>>
>> Regards,
>>
>>
>>
>> --
>>
>> Angelo Cavalcanti
>> br.linkedin.com/in/angelocr
>> _______________________________________________
>> xCAT-user mailing list
>> xCAT-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/xcat-user
>