FYI

Something I am noticing, and it seems to be consistent, is that the installation still fails if the VM has more than one NIC.

Even though it PXE boots from the deployment interface, once the install starts it fails to configure the network if a second NIC is enabled (in this case, on the enterprise network). The network does not come up, DHCP does not get an address, and the interfaces have to be configured by hand. I also sometimes have issues where nodediscover adds the right MAC but the node still tries to PXE boot off the wrong interface, because both interfaces are associated with the same UUID.

xCAT would deploy the node, but often the network was still not correct even when the routing table was set up right. Both NICs would set up their own default gateway instead of using a single one (the enterprise one). The 'setroutes' postscript would fix the issue, but the fix didn't persist after a reboot. That was easy enough to fix with a postscript full of nmcli statements. However, if the node won't deploy with two NICs present, and the second NIC has to be added manually as a device in vSphere, that defeats a hands-off deploy.


So far I only see this with the Ubuntu installs; Rocky seems to deploy fine (afterwards I fix the default route and custom DNS with a bash postscript).
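A minimal sketch of the kind of nmcli postscript mentioned above, assuming NetworkManager profiles named `ens192` (deployment network) and `ens224` (enterprise network). Those profile names, the DNS servers, and the search domain are hypothetical placeholders, not values from this thread. It defaults to a dry run that only prints the nmcli commands.

```sh
#!/bin/sh
# Hypothetical nmcli postscript sketch: only the enterprise NIC should supply
# the default route and DNS. Profile names, DNS servers, and the search
# domain are placeholders, not xCAT or Confluent defaults.
set -eu

DEPLOY_CONN="${DEPLOY_CONN:-ens192}"   # hypothetical deployment-network profile
ENT_CONN="${ENT_CONN:-ens224}"         # hypothetical enterprise-network profile
DNS_SERVERS="${DNS_SERVERS:-10.20.0.53 10.20.0.54}"
DNS_SEARCH="${DNS_SEARCH:-corp.example.com}"
DRY_RUN="${DRY_RUN:-1}"                # default: print only; DRY_RUN=0 applies

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_nm_fixes() {
    # Deployment NIC: keep its address, but never install a default route
    # from it, and ignore DNS servers handed out by its DHCP lease.
    run nmcli connection modify "$DEPLOY_CONN" \
        ipv4.never-default yes ipv4.ignore-auto-dns yes
    # Enterprise NIC: set the site DNS servers and search order explicitly.
    run nmcli connection modify "$ENT_CONN" \
        ipv4.dns "$DNS_SERVERS" ipv4.dns-search "$DNS_SEARCH" \
        ipv4.ignore-auto-dns yes
    # "connection modify" saves the profiles to disk, so the settings persist
    # across reboots; they take effect the next time each connection is
    # activated (for example, at the next boot).
}

apply_nm_fixes
```

Because the changes live in the connection profiles rather than in the live routing table, they survive a reboot. Re-activating the deployment connection from a postscript that runs over that same network could briefly drop the session, which is why the sketch leaves activation to the next boot.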

Since the bulk of a cluster is made up of compute nodes, this may not be a huge deal; it mostly affects login, remote visualization, and similar nodes that need connectivity to the corporate network. Still, I thought I would document it here.

So for now, I add only the one NIC and deploy the node. Then I add the second NIC and update the netplan file with the new NIC, the default gateway, and the proper DNS addresses and search order.
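The manual netplan update can be sketched as a script that renders the file. This assumes the deployment NIC gets its address over DHCP and the enterprise NIC is static. The interface names, addresses, gateway, DNS values, and output file name are all hypothetical placeholders. The default route uses netplan's `routes: - to: default` form, which replaces the deprecated `gateway4` key.

```sh
#!/bin/sh
# Sketch only: render a netplan config for a deployment NIC plus a second
# (enterprise) NIC added after deployment. All names and addresses below are
# hypothetical placeholders; adjust them to your site before use.
set -eu

DEPLOY_IF="${DEPLOY_IF:-ens192}"          # hypothetical deployment NIC (DHCP)
ENT_IF="${ENT_IF:-ens224}"                # hypothetical enterprise NIC
ENT_ADDR="${ENT_ADDR:-10.20.0.15/24}"     # hypothetical static address
ENT_GW="${ENT_GW:-10.20.0.1}"             # enterprise default gateway
DNS_ADDRS="${DNS_ADDRS:-10.20.0.53, 10.20.0.54}"
DNS_SEARCH="${DNS_SEARCH:-corp.example.com, cluster.example.com}"
OUTPUT="${OUTPUT:-}"                      # e.g. /etc/netplan/60-enterprise.yaml (hypothetical)

# The deployment NIC keeps its DHCP address but ignores DHCP-supplied routes
# and DNS, so only the enterprise NIC provides the default route and resolvers.
render_netplan() {
    cat <<EOF
network:
  version: 2
  ethernets:
    ${DEPLOY_IF}:
      dhcp4: true
      dhcp4-overrides:
        use-routes: false
        use-dns: false
    ${ENT_IF}:
      addresses: [${ENT_ADDR}]
      routes:
        - to: default
          via: ${ENT_GW}
      nameservers:
        addresses: [${DNS_ADDRS}]
        search: [${DNS_SEARCH}]
EOF
}

if [ -n "$OUTPUT" ]; then
    umask 077                 # netplan warns about world-readable config files
    render_netplan > "$OUTPUT"
    netplan generate          # validate the configuration
    netplan apply             # `netplan try` would instead roll back unless confirmed
else
    render_netplan            # preview only
fi
```

Running it with `OUTPUT` unset just prints the YAML for review. Setting `OUTPUT` (as root) writes the file with mode 600 and applies it.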

Hope it helps,

Brian J


On 4/25/25 05:48, Jarrod Johnson wrote:
Ok, changes were made to the Ubuntu deployment bootstrap to make it a bit more tenacious. Previously there was a chance of falling through the automated setup into manual setup. Now it should never do that; it will either hang or error out instead. I'll look at a hanging scenario I saw yesterday and make it clearer what is missing, but that shouldn't apply to a normal PXE install.

Also, as an aside, I've been working on "hardware" control that targets VCSA. I can't get "setboot" to work yet, but it can provide power, inventory, and text console. The API for setting the boot device doesn't seem to work as I would expect. I hope to push libvirt, VCSA, and Proxmox support in the near future.

------------------------------------------------------------------------
*From:* Brian Joiner <martinitime1...@gmail.com>
*Sent:* Thursday, April 24, 2025 8:24 PM
*To:* xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
*Cc:* Jarrod Johnson <jjohns...@lenovo.com>
*Subject:* Re: [xcat-user] [External] Confluent: Anyone get Ubuntu to deploy?
Ok so today I had some time to try again:

Ubuntu Server 22.04: PXE boots and starts the installer, but becomes interactive and will not detect the network.
Added Ubuntu Server 24.04: same result.

Then I saw the email about the new release. I installed the updates normally via yum and retried 24.04, with no other changes.

It worked!  Fully hands off, hostname correct.

I may not have mentioned that my environment is ESXi 7 VMs all the way, so I don't know whether there was a problem with the NIC firmware or something else, but the Confluent update fixed it.

Thanks for all you devs do!

Brian Joiner


On Wed, Apr 9, 2025 at 11:25 AM Brian Joiner <martinitime1...@gmail.com> wrote:

    Awesome, thanks. Just knowing that a post-18.04 deployment should
    work as expected is a good start. I'll double-check paths,
    permissions, logs, and the curl test in the other reply and report
    back.




    On 4/1/25 12:57, Jarrod Johnson via xCAT-user wrote:
    The 'password not accepted' error can happen if you reboot a
    deployment and retry without running 'nodedeploy' again. In
    confluent you have to explicitly say you want to deploy, as there's
    a security mechanism that locks down after a node API token is
    claimed.

    [root@r3u20 ~]# nodedeploy r3u24
    r3u24: pending: ubuntu-22.04.5-x86_64-default (node authentication armed)
    [root@r3u20 ~]#


    Note 'node authentication armed'. This means it is configured to
    allow a single more weakly authenticated request for a node token.

    At some point during an install attempt, you get to:
    [root@r3u20 ~]# nodedeploy r3u24
    r3u24: pending: ubuntu-22.04.5-x86_64-default
    [root@r3u20 ~]#

    In this case, a new attempt must be accompanied by another
    nodedeploy; a reboot without completing deployment discards
    the node token that was granted.
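A small helper for the check described above: before rebooting a node for another attempt, it looks for the "(node authentication armed)" marker in the `nodedeploy <node>` output. The matching relies only on the output format shown in this message; the script and its function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: confirm a node is armed before retrying an install.
# It only parses the "nodedeploy <node>" status line shown in this thread.
set -eu

# Succeed if a nodedeploy status line shows node authentication is armed.
is_armed() {
    case "$1" in
        *"(node authentication armed)"*) return 0 ;;
        *) return 1 ;;
    esac
}

check_node() {
    node="$1"
    status="$(nodedeploy "$node")"
    echo "$status"
    if is_armed "$status"; then
        echo "$node: armed; OK to reboot into the installer"
    else
        echo "$node: not armed; run nodedeploy for this node again first" >&2
        return 1
    fi
}

# Usage on the confluent server: sh check-armed.sh r3u24
if [ "$#" -gt 0 ]; then
    check_node "$1"
fi
```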

    However, I assume at least the first attempt failed for other
    reasons, and I may need to know more about that. I just did a
    22.04.5 deployment myself and had no issues, so at least in
    theory it should be workable, but I will probably need to see
    some files when you get to a manual interaction point.

    Files like /conf/param.conf, and files in
    /custom-installation/confluent/..
    ------------------------------------------------------------------------
    *From:* Brian Joiner <martinitime1...@gmail.com>
    *Sent:* Tuesday, April 1, 2025 12:00 PM
    *To:* xcat-user@lists.sourceforge.net
    *Subject:* [External] [xcat-user] Confluent: Anyone get Ubuntu to
    deploy?
    I have attempted to deploy an Ubuntu 22.04 server via Confluent,
    and I've run into all kinds of issues. Either the process dies
    after trying to mount the CD, I get the option to enter an
    emergency command prompt, or I get sent into an interactive setup
    screen where I'm prompted to enter account/network/disk info by
    hand. I've had the same experience in my home lab and with my
    office Confluent instance.

    However, if I deploy Rocky 9.x to the same node (with no attribute
    changes), it deploys as expected. Has anyone gotten Ubuntu Server
    to deploy without intervention? Have I missed some kind of setup
    step for Ubuntu-based installs?

    My Confluent server has all packages updated, and I didn't see
    anything of interest in /var/log/confluent.

    Attached is a screenshot of one of the failures, if that helps.

    Thanks,

    Brian Joiner


    _______________________________________________
    xCAT-user mailing list
    xCAT-user@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/xcat-user