Lenovo’s provisioner:

<https://hpc.lenovo.com/users/downloads/22b/>



On 1 Apr 2023, at 10:41, Tomer Shachaf <tomers...@matrix.co.il> wrote:


Can anybody explain to me what confluent is?

Regards,

Tomer Shachaf | Integration and Infrastructure Engineer | Integration and Infrastructure Division | Matrix | Mobile 054-2686841 |
tomers...@matrix.co.il<mailto:tomers...@matrix.co.il> | 
www.matrix.co.il<http://www.matrix.co.il/>


On 29 Mar 2023, at 20:40, Jarrod Johnson <jjohns...@lenovo.com> wrote:


For reference, I did a couple of BitTorrent-style diskless boot implementations
as a project years ago, but never mainstreamed them. In the end the performance
uplift wasn't as noticeable as one might have guessed in an environment where
the boot servers had at least 10G.

Note that nowadays I've moved my development attention to confluent. Also
note that confluent never pushes private SSH keys (node-to-node SSH, when
enabled, is facilitated through an SSH certificate authority and a helper that
generates shosts.equiv).
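For readers unfamiliar with that mechanism: with OpenSSH host certificates, a node trusts any host key signed by a common CA through an `@cert-authority` line in ssh_known_hosts, and /etc/ssh/shosts.equiv lists the hosts permitted to log in via host-based authentication (which also needs `HostbasedAuthentication yes` on both ends and `EnableSSHKeysign yes` on the client). No private user key ever has to be copied around. The sketch below is a hypothetical illustration of the two generated files, not confluent's actual helper; the function names and paths are assumptions.

```shell
#!/bin/sh
# Hypothetical helpers (not confluent's code) that emit the two trust files
# used for node-to-node host-based SSH with a host certificate authority.

# Print an ssh_known_hosts line trusting every host key signed by the CA
# whose public key is stored in file $1.
cert_authority_line() {
    printf '@cert-authority * %s\n' "$(cat "$1")"
}

# Print shosts.equiv content: one permitted host name per line.
shosts_equiv() {
    printf '%s\n' "$@"
}

# Example (as root on each node; CA path and node names are placeholders):
#   cert_authority_line /etc/ssh/ca_key.pub >> /etc/ssh/ssh_known_hosts
#   shosts_equiv n001 n002 n003 > /etc/ssh/shosts.equiv
```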

On confluent diskless, there is an interesting benefit that becomes a challenge
for BitTorrent: a typical diskless node never downloads the whole diskless
image. This means less RAM consumed by the diskless image, and also that the
diskless image can be large without pruning. Further, even the bits
'downloaded' may be evicted as needed by the kernel's memory management, so
the current expectation is that a diskless node expends no resources retaining
the image unless it absolutely needs to. A typical BitTorrent flow, which
downloads and keeps the whole image, would erode this benefit.

One could imagine a BitTorrent scenario that would erode less of the value but
would still come at a price. If the same trick were used to torrent only the
parts needed locally, the portion critical for boot would still be memory
resident on each node. We would lose the ability for the kernel to free up
that memory (either as needed or via drop_caches), and since much of the
boot-up content does not need to be read again, dropping its cache after boot
can offer benefit.
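The RAM trade-off can be observed in /proc/meminfo: pages read on demand from the image are ordinary page cache (counted under Cached) that the kernel may reclaim, or that `echo 1 > /proc/sys/vm/drop_caches` drops, whereas a full copy held in tmpfs (as in 'untethered' mode or a naive torrent download) is counted under Shmem and is not freed that way. A minimal helper for comparing the counters before and after dropping caches; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the value (in kB) of one /proc/meminfo field, e.g. Cached or Shmem.
# The optional second argument names an alternative file (handy for testing).
meminfo_kb() {
    awk -v f="$1:" '$1 == f { print $2; exit }' "${2:-/proc/meminfo}"
}

# Example, as root on a diskless node:
#   meminfo_kb Cached; meminfo_kb Shmem
#   sync; echo 1 > /proc/sys/vm/drop_caches
#   meminfo_kb Cached; meminfo_kb Shmem
# Cached should shrink; an image copy held in tmpfs stays in Shmem.
```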

Incidentally, another difference between xCAT and confluent diskless images:
confluent's diskless images are now encrypted. This affords protection in case
your diskless image contains sensitive material. The decryption key is
available through the confluent API, and access is generally authenticated by
the node's TPM, so a diskless node persists trust by having the same TPM that
had been previously authenticated. This allows transport security to matter
less, though our security policies are pretty insistent that HTTPS be used at
all times.

I would be interested in developing a torrent-style boot design with confluent,
with the lower-hanging fruit being 'untethered' mode, which is still available
and does download the whole image (at the expense of RAM usage). Interestingly,
the logic is no longer inside the packed initramfs but lives loose in the
profile. The link to the Red Hat 9-style diskless bootstrap is:
https://github.com/lenovo/confluent/blob/master/confluent_osdeploy/el9-diskless/profiles/default/scripts/imageboot.sh
Notably:

if [ "untethered" = "$(getarg confluent_imagemethod)" ]; then
    mount -t tmpfs untethered /mnt/remoteimg
    curl https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs -o /mnt/remoteimg/rootimg.sfs
else
    confluent_urls="$confluent_urls https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs";
    /opt/confluent/bin/urlmount $confluent_urls /mnt/remoteimg
fi

That is the logic for fetching the image. One thing to note is that in a
typical confluent diskless boot, the booted system does not see rootimg.sfs,
so the torrent client would have to stay in the 'initramfs' world (which
does persist after boot, as a separate mount namespace).
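To make the idea concrete, here is a hypothetical third branch next to the two above, written as a shell function. It assumes a method value `torrent`, a `rootimg.torrent` published alongside rootimg.sfs, and a ctorrent binary in the initramfs (as in the xCAT approach from Thomas's original message below); none of these exist in confluent today. Like untethered mode it keeps the whole image in tmpfs, and it covers only the download: seeding afterwards would have to run in the initramfs mount namespace, since the booted system does not see rootimg.sfs.

```shell
#!/bin/sh
# Hypothetical variant of the image-fetch logic in imageboot.sh.
# Assumed (not part of confluent): the "torrent" method name, the
# rootimg.torrent URL, and a ctorrent binary present in the initramfs.
fetch_rootimg() {
    base="https://$confluent_whost/confluent-public/os/$confluent_profile"
    case "$1" in
    untethered)
        mount -t tmpfs untethered /mnt/remoteimg
        curl "$base/rootimg.sfs" -o /mnt/remoteimg/rootimg.sfs
        ;;
    torrent)
        # Full copy in tmpfs, as with untethered, but fetched from peers.
        mount -t tmpfs torrent /mnt/remoteimg
        curl "$base/rootimg.torrent" -o /tmp/rootimg.torrent
        # -s: save-as path; -e 0: exit as soon as the download completes
        ctorrent -s /mnt/remoteimg/rootimg.sfs -e 0 /tmp/rootimg.torrent
        ;;
    *)
        # Default: on-demand access; $confluent_urls is split on purpose.
        confluent_urls="$confluent_urls $base/rootimg.sfs"
        /opt/confluent/bin/urlmount $confluent_urls /mnt/remoteimg
        ;;
    esac
}

# Usage inside the initramfs (getarg is dracut's command-line helper):
#   fetch_rootimg "$(getarg confluent_imagemethod)"
```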





________________________________
From: Dr. Thomas Orgis <thomas.or...@uni-hamburg.de>
Sent: Wednesday, March 29, 2023 11:37 AM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: [External] [xcat-user] BitTorrent distribution of stateless images with xCAT interesting to anyone?

Hi,

I first got into contact with xCAT through our HPC system installed in 2015,
with xCAT version … hm …

# nodels --version
Version 2.9.1 (git commit 7f6043fffd62d482931b17b60f9488eb5754fdc1, built Thu Mar 19 03:25:35 EDT 2015)

2.9.1 seems to be it. The base system is CentOS 7.x. Since the system
was an en bloc purchase, we never updated xCAT, but I adapted it
to our needs and then let it do its thing over the years. I made some
small changes, like fixing up /etc/hostname in the initrd (not sure if
that was a specific mix-up in our setup with long and short hostnames),
and recently the fix for CVE-2023-27486 (being rather annoyed that
/root/.ssh/id_rsa would _ever_ be delivered to cluster nodes; it
should always have been a separate directory where I consciously copied
a key or had one generated). But nothing to rock the boat. CentOS
upgrades up to 7.9 didn't hurt things. We did stick to a certain
vanilla kernel build with our patches, though.

The system will be out of production in the near future, and we do not
know what the next installation will use. I intended to share a
main point of my local hacking, but somehow never got around to it, and I
figured that the obvious stuff would appear upstream anyway.
Example: I enabled squashfs+overlayfs for us with a few lines, and I
gather that is a standard thing now.
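For reference, a generic squashfs+overlayfs root in an initrd takes roughly three mounts: the read-only squashfs image, a tmpfs for writes, and an overlay combining them (overlayfs requires upperdir and workdir on the same filesystem, hence both live on the tmpfs). This is a sketch, not Thomas's actual patch; the /ro and /rw mount points mirror the xcatroot excerpt in step 3, and the function name is made up.

```shell
#!/bin/sh
# Sketch: build a writable root at $2 from the read-only squashfs image $1.
mount_overlay_root() {
    img=$1
    newroot=$2
    mkdir -p /ro /rw
    mount -t squashfs -o ro,loop "$img" /ro     # immutable image via loop device
    mount -t tmpfs rw /rw                       # RAM-backed writable layer
    mkdir -p /rw/upper /rw/work                 # upper and work share one fs
    mount -t overlay overlay \
        -o lowerdir=/ro,upperdir=/rw/upper,workdir=/rw/work "$newroot"
}

# Usage in an initrd hook (dracut sets $NEWROOT):
#   mount_overlay_root /rootimg.sfs "$NEWROOT"
```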

What was obvious to me back in 2015 was that distribution of stateless
filesystem images was slowed down unnecessarily by serving them
via HTTP from the admin node over the 1GbE interface. Booting a cluster
of 400 nodes took ages because of that (well, a quarter to half an hour or
so).

Is this still the current mechanism? While you could make the admin
node part of the high-speed network (InfiniBand in our case), or just
use 10GbE as the baseline today, it just feels right to me to scale out
the distribution capacity with the number of compute nodes.
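The scaling argument can be put in numbers with the usual idealized lower bounds on distribution time. Let F be the image size, N the number of nodes, u_s the admin node's upload rate, u_i the upload rate of node i, and d_min the slowest node's download rate. Protocol overhead, tracker traffic and RAM limits are ignored, so these are bounds, not predictions.

```latex
D_{\mathrm{HTTP}} \ge \max\left( \frac{N F}{u_s},\; \frac{F}{d_{\min}} \right)
\qquad
D_{\mathrm{swarm}} \ge \max\left( \frac{F}{u_s},\; \frac{F}{d_{\min}},\; \frac{N F}{u_s + \sum_{i=1}^{N} u_i} \right)

% N = 400 nodes, all links at the same rate u (e.g. 1 Gbit/s):
D_{\mathrm{HTTP}} \ge \frac{400\,F}{u}
\qquad
D_{\mathrm{swarm}} \ge \max\left( \frac{F}{u},\; \frac{400\,F}{401\,u} \right) = \frac{F}{u}
```

The HTTP bound grows linearly with N, while the swarm's aggregate upload capacity grows with N as well, so its bound stays near the time to move one copy over one link. A faster admin link (10G, 100G) only shrinks the N F / u_s term by a constant factor.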

Is anyone interested in that? Should I propose a formal change to xCAT
for that feature? Did I miss an equivalent option that exists now in
current xCAT? I only found some consulting company boasting about them
having implemented torrents with xCAT for a customer, but nothing
official.

I'll describe what I did, anyway. 6 steps follow.

1. I got hold of a minimal torrent program: ctorrent from

        https://sourceforge.net/p/dtorrent/

2. I wrote the first of the two attached patches to support the cluster
use-case with /dev/loop0 for reading the rootimg (see also
https://sourceforge.net/p/dtorrent/patches/5/). The second patch then
followed to fix a memory issue (see also
https://sourceforge.net/p/dtorrent/patches/7/).

3. I applied a rather small change to the xcatroot dracut script to
   download the image via ctorrent in initrd and prepare seeding later.

---------------8<---------------------
Index: share/xcat/netboot/rh/dracut_033/xcatroot
===================================================================
--- share/xcat/netboot/rh/dracut_033/xcatroot   (Revision 833)
+++ share/xcat/netboot/rh/dracut_033/xcatroot   (Revision 834)
@@ -21,6 +21,12 @@
 /tmp/updateflag $MASTER $XCATIPORT "installstatus netbooting"
 fi

+if [ -e /rootimg.torrent ]; then
+
+  ctorrent -s /rootimg.sfs -e 0 /rootimg.torrent
+
+else
+
 if [ ! -z "$imgurl" ]; then
         if [ xhttp = x${imgurl%%:*} ]; then
                 NFS=0
@@ -43,6 +49,9 @@
                 ROOTDIR=/${ROOTDIR#*/}
         fi
 fi
+
+fi # torrent
+
 #echo 0 > /proc/sys/vm/zone_reclaim_mode #Avoid kernel bug

 if [ -r /rootimg.sfs ]; then
@@ -61,6 +70,15 @@
   mkdir -p $NEWROOT/rw
   mount --move /ro $NEWROOT/ro
   mount --move /rw $NEWROOT/rw
+  if [ -e /rootimg.torrent ]; then
+    # Prepare for seeding the rootimg.
+    # Note that this demands the patched dnh3.2.2thor1 ctorrent binary.
+    mkdir $NEWROOT/.sysdist
+    cp /usr/bin/ctorrent /rootimg.torrent $NEWROOT/.sysdist
+    rrz_distfile=$(ctorrent -x /rootimg.torrent | grep rootimg.sfs | cut -f 2 -d ' ')
+    mkdir -p $NEWROOT/.sysdist/$(dirname $rrz_distfile)
+    ln -s /dev/loop0 $NEWROOT/.sysdist/$rrz_distfile
+  fi
 elif [ -r /rootimg.gz ]; then
   echo Setting up RAM-root tmpfs.
   if [ -z $rootlimit ];then
--------------->8---------------------
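The patch above relies on the image being attached to /dev/loop0, and the seed service's ExecStartPre lines in step 5 assume the same; if another loop device happened to be set up first, the node would seed the wrong data. A hedged alternative, assuming util-linux losetup is included in the initrd (its -j option lists the loop devices associated with a file), is to look the device up instead of hardcoding it. The function name is hypothetical.

```shell
#!/bin/sh
# Print the first loop device backing file $1, e.g. /dev/loop0.
# util-linux "losetup -j FILE" prints lines like:
#   /dev/loop0: [0045]:12 (/rootimg.sfs)
loopdev_for() {
    losetup -j "$1" | head -n 1 | cut -d: -f1
}

# In the xcatroot patch, instead of the fixed /dev/loop0:
#   ln -s "$(loopdev_for /rootimg.sfs)" $NEWROOT/.sysdist/$rrz_distfile
```

The seed service would then need to chmod/chown the resolved device rather than /dev/loop0.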

4. Include the torrent stuff in the image generation script.

---------------8<--------------------
#!/bin/sh

scriptdir=$(cd $(dirname $0) && pwd)
PATH=$scriptdir:$PATH
sysbase=centos79
osimage=$sysbase-x86_64-stateless-gpu
imgdir=/install/netboot/$sysbase/x86_64/gpu
xcatinitrd=$imgdir/initrd-stateless.gz

# normal image generation
# packimage, etc.
# ($timecode, $admin_ip and $bootinitrd are set elsewhere; not shown here)

# stop main seeding service on admin node
service rrz-dist-mainseed stop

# Create torrent file for efficient distribution.
torrfile=gpu-$sysbase-$timecode-rootimg.torrent
cd /install/dist
ctorrent -t \
  -s $torrfile  \
  -u http://$admin_ip:81/announce \
  os/gpu-$sysbase-$timecode/rootimg.sfs

# start seeding again, picking up added .torrent
service rrz-dist-mainseed start

# Disable this in case of weird boot trouble.
# It strips out lots of drivers/firmware that are not
# obviously needed for booting.
# Initrd loading (not torrent-accelerated) is the new bottleneck.
rrz-initrd-reduce $xcatinitrd

# Insert torrent client and torrent file into initrd.
# If that is disabled, standard HTTP download using the
# URL from pxelinux config is done.
initrdir=$(rrz-initrd-unpack $xcatinitrd)
cp -v $scriptdir/ctorrent $initrdir/usr/bin
cp -v /install/dist/$torrfile $initrdir/rootimg.torrent
rrz-initrd-pack $xcatinitrd $initrdir
rrz-initrd-rmdir "$initrdir"

rrz-initrd-ucode $xcatinitrd

# Yes, update the actual copy of the initrd that is used
# during netboot.
cp -v $xcatinitrd $bootinitrd
--------------->8--------------------


5. Added a seed service to syncfiles:

cat /usr/lib/systemd/system/rrz-dist-seed.service
[Unit]
Description=ctorrent node seed for image distribution
After=network.target

[Service]
# It might be that the fresh torrent file is not available right away
# inside /work (?!), so restarting may be needed to really get
# an instance up.
Restart=always
RestartSec=10
WorkingDirectory=/.sysdist
# Not starting as user yet, because root perm needed for preparation
# User=sysdist
ExecStartPre=/bin/chmod 0640 /dev/loop0
ExecStartPre=/bin/chown :sysdist /dev/loop0
ExecStart=/bin/su sysdist -c '/.sysdist/ctorrent -q -m 5 -M 20 -U 102400 -e -1 rootimg.torrent'

[Install]
WantedBy=multi-user.target


6. Before all that … have the main seeder service:

[root@adm1 xcat]# cat /install/rrz/rrz-dist-mainseed.sh
#!/bin/bash
# called as a system service
pids=
for torr in /install/dist/*.torrent
do
  /install/rrz/ctorrent -q -m 5 -M 20 -U 60000 -e -1 "$torr" &
  pid=$!   # PID of the background ctorrent ($? would be an exit status)
  pids+=" $pid"
  echo "torrent for $torr with PID $pid"
done

trap "kill $pids" EXIT

wait
# end

As the tracker, I built an instance of opentracker (below 90K binary;
ctorrent is around 310K, not stripped) from

        http://erdgeist.org/arts/software/opentracker/

(snapshot: http://src.rrz.uni-hamburg.de/files/src/_unsorted/opentracker-20151001.tar.bz2)
with the config file boiling down to three lines:

listen.udp.workers 6
listen.tcp_udp $admin_ip:81
tracker.user    nobody

and this simple call as systemd service:

[Service]
ExecStart=/install/rrz/opentracker -f /install/rrz/opentracker.conf


Now this is a long mail, but a rather complete description of the steps
I took to make booting of my stateless nodes so fast that I haven't
worried about the image distribution part since mid/late 2015. Now, at
the end of the system's lifetime, I am starting to worry a bit about what
will come next …

Is there interest in the xCAT community in picking this up? One might have
to adopt/fork ctorrent, while opentracker seems to be alive, although
the author hasn't bothered to tag a release yet. In the closed loop
where we use ctorrent as the only client/server with this tracker
server, this might be acceptable. To me, it is more acceptable than bloating
the initrd again with some other torrent software larger than a few hundred KB.

Should this be supported in xCAT upstream? Having 100G networking in
the admin node might make this obsolete, but this just means that we
could scale to a few thousand nodes more without impacting a single
network link. In clusters, when you can distribute a load, you should
think twice before _not_ doing it, right?


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg


_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user