I’m just following up for posterity’s sake. I was able to set serialflow to “” 
in the nodehm table, which had the effect of wiping the value for that setting 
entirely. After making this change, when our nodes are (re)built they no longer have the ‘n8r’ segment in the Grub config. I made this change several
weeks ago and haven’t noticed any issues related to booting, deployment, SOL or 
anything else.
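
For reference, a rough sketch of the change (the exact command may differ from what I actually ran; we apply nodehm settings via the ‘all’ group, so adjust the object/noderange to your setup):

# clear serialflow at the group level
chdef -t group -o all serialflow=""
# confirm the value is now blank
nodels all nodehm.serialflow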

Thanks again,

Jake

From: "Rundall, Jacob D" <rund...@illinois.edu>
Reply-To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Date: Thursday, July 27, 2017 at 11:58 AM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] stateful nodes won't boot after IMM network settings 
are changed or IMM is put on different network

It seems like your theory is correct. We do often watch deployments with rcons; the Grub config was indeed as you imagined; and when I removed ‘n8r’ I was able to boot as expected.

It sounds like for existing deployments we can edit and remake our Grub config. 
What’s the specific fix for future deployments? Do we set nodehm.serialflow to 
“” or “soft”? Or something else?
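
In case it helps anyone else, here is roughly what I had in mind for the existing deployments (a sketch for CentOS 7 / grub2; the exact console string and the grub.cfg path are assumptions, so check your own /etc/default/grub and whether the node boots BIOS or UEFI):

# drop the trailing 'r' (hardware flow control) from the console option
sed -i 's/console=ttyS0,115200n8r/console=ttyS0,115200n8/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg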

Thanks very much,

Jake

From: Jarrod Johnson <jjohns...@lenovo.com>
Reply-To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Date: Thursday, July 27, 2017 at 10:33 AM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] stateful nodes won't boot after IMM network settings 
are changed or IMM is put on different network

So my theory is that you had SOL connected and serialflow set to hard at the time of deployment, and that’s reflected in your grub config.

When the IMM went away, breaking SOL, hardware flow control would block.

With console=ttyS0,115200n8r (or similar) on the kernel command line, a broken SOL session could then cause the kernel to hang at any point.
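
For clarity, the characters after the comma in that option are baud rate, parity, data bits, and flow control, e.g.:

console=ttyS0,115200n8r
  -> 115200 baud, (n)o parity, 8 data bits, 'r' = RTS/CTS hardware flow control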

The crash cart showing very little may have been related.

If you go into the boot menu to edit the command line and see ‘n8r’, try removing it and see if boot progresses as you would expect.
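
For example, the kernel line in the boot menu would change roughly like this (linux16 on a BIOS install; the kernel version, root device, and remaining options are placeholders):

linux16 /vmlinuz-<version> root=<root device> ro ... console=ttyS0,115200n8r
  becomes
linux16 /vmlinuz-<version> root=<root device> ro ... console=ttyS0,115200n8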

I recommend against hardware flow control for the console, because in exceptional cases it can cause a lot of problems, and the upside is very limited for console-type data (it mattered a bit more back in the modem days…).

From: Rundall, Jacob D [mailto:rund...@illinois.edu]
Sent: Thursday, July 27, 2017 11:03 AM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] stateful nodes won't boot after IMM network settings 
are changed or IMM is put on different network

Thanks, Jarrod.

# nodels object-data[01-04] nodehm.serialflow
object-data01:
object-data02:
object-data03:
object-data04:

It appears that the reason for this is that I pulled these nodes out of the ‘all’ group, which is the only way we are applying settings in the nodehm table. I would have made this change before the other changes listed in my first post in this thread. When they were in the ‘all’ group previously, the value would have been ‘hard’.
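
For reference, I believe the group-level setting can be confirmed with something like this (the lsdef usage here is my assumption):

# show what the 'all' group currently carries for serialflow
lsdef -t group -o all -i serialflow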

How might this be affecting the boot and would you have any recommendations?

Thanks again,

Jake

From: Jarrod Johnson <jjohns...@lenovo.com>
Reply-To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Date: Thursday, July 27, 2017 at 9:45 AM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] stateful nodes won't boot after IMM network settings 
are changed or IMM is put on different network

nodels <noderange> nodehm.serialflow
?


From: Rundall, Jacob D [mailto:rund...@illinois.edu]
Sent: Thursday, July 27, 2017 10:12 AM
To: xCAT Users Mailing list
Subject: [xcat-user] stateful nodes won't boot after IMM network settings are 
changed or IMM is put on different network

I ran into something that seems strange to me. I was working with a few Lenovo
System x3650 M5 nodes yesterday that were deployed with xCAT. They’re running 
CentOS 7. In order to do some testing I need to move them over to some 
different networks:
- remove the public/routed network from the OS;
- move the OS to a new, unrouted management network (for SSH & deployment), with a different IP address, without any xCAT servers;
- move the IMM to a new, unrouted service/IPMI network, with a different IP address, also without any xCAT servers.
Also note that I elected to keep their current stateful OS installation rather 
than redeploying them.

After I did this with the first node, I found that it wouldn’t fully boot anymore. Specifically, it attempted to boot from the on-disk OS and proceeded through kernel selection, but then would get stuck with “Probing EDD”… showing. (It was likely proceeding past this point but just not showing anything else on the crash cart display; more on that later.)

Through some other experimentation I found that any of the following conditions 
would prevent the node from booting in the same way:
1) without reconfiguring the IMM’s network settings, unplugging the network 
cable
2) without reconfiguring the IMM’s network settings, connecting the IMM to a 
switch port on the new service/IPMI network
3) reconfiguring the IP of the IMM (or resetting it to factory defaults and then reconfiguring it with a new IP)
-- this case applies even if I leave the machine connected to our original, production service/IPMI network
But in any case, if I make sure the IMM’s IP is reverted (or doesn’t change) 
and connect it back up to our production service/IPMI network, then the node 
will boot again.

I guess my questions are as follows: Is there something on an xCAT-provisioned node (perhaps specific to our hardware, perhaps generally speaking) that requires the IMM/BMC not to have its IP and/or network changed in order to
complete its boot? Is there some kind of communication between the OS and the 
IMM/BMC that depends on the network connectivity of the IMM/BMC? Is 
communication with the xCAT master involved (remember the xCAT master is on the 
production service/IPMI network but not on the new test service/IPMI network)? 
And are there any ways around this issue?

A few more details:

A) I did disable EDD probing on one of these machines and found that it 
actually still got stuck (this time with a flashing cursor), so in the previous 
cases, “Probing EDD”… was simply the last thing that showed on the (crash cart) 
screen before it got stuck. I also configured tty0 as console on that node to 
get some more verbose output about what was occurring. Unfortunately this 
output didn’t appear to make it into /var/log/messages like it seems to on a 
successful boot (probably because the machine didn’t get far enough along in 
the boot process to pass it on). But here is a photo:
https://uofi.box.com/s/dt16qvigbgtp0gbz2t0b26m41huv7wo7
I checked quite a few of the lines that appear here and they seem to show up in 
/var/log/messages after a successful boot as well, so I haven’t uncovered 
anything that is indicative of the failure I’m seeing. Perhaps there’s something telling that isn’t being shown, but I don’t know what it is.
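
For reference, the tweaks for that test were along these lines, added to the kernel command line (the exact parameters are from memory, so treat them as approximate):

edd=off console=tty0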

B) Rebuilding plain CentOS 7 from a USB drive allows the nodes to boot with new 
IMM network settings, with IMMs on the new service/IPMI network.

Thanks much,

Jake Rundall
