Lenovo’s provisioner: <https://hpc.lenovo.com/users/downloads/22b/>
Sent from my iPhone

On 1 Apr 2023, at 10:41, Tomer Shachaf <tomers...@matrix.co.il> wrote:

Can anybody explain to me what confluent is?

Best regards,
Tomer Shachaf | Integration and Infrastructure Engineer | Integration and Infrastructure Division | Matrix | Mobile 054-2686841 | tomers...@matrix.co.il | www.matrix.co.il

On 29 Mar 2023, at 20:40, Jarrod Johnson <jjohns...@lenovo.com> wrote:

For reference, I did a couple of BitTorrent-style diskless boot implementations as a project years ago. I never mainstreamed it, though. In the end, the performance uplift wasn't as noticeable as one might have guessed in an environment where the boot servers had at least 10G.

Note that nowadays I've moved my development attention to confluent. Also note that confluent never pushes private SSH keys (node-to-node SSH, when enabled, is facilitated through an SSH certificate authority and a helper that generates shosts.equiv).

On confluent diskless, there is an interesting benefit that becomes a challenge for BitTorrent: a typical diskless node never downloads the whole diskless image. This means less RAM is sucked up by the diskless image, and also that the diskless image can be large without pruning. Further, even the bits that were 'downloaded' may be evicted as needed by the kernel's memory management, so the current expectation is that we don't expend resources on a diskless node to retain the image unless we absolutely need it. A typical BitTorrent flow would erode this benefit.

One could imagine a BitTorrent scenario that would erode less of the value but would still come at a price. If a similar trick were used to torrent only the parts needed locally, then the portion critical for boot would be memory resident on each node. We would still lose the ability for the kernel to free up that memory (either as needed or via drop_caches), and much of the boot-time content does not need to be read again, so dropping its cache after boot can offer a benefit.
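To make the page-cache point concrete, here is a minimal sketch of generic Linux behaviour (this is not confluent code). It assumes a Linux node, and root privileges for the drop itself; the function names cached_kb and drop_page_cache are made up for this illustration.

```shell
#!/bin/sh
# Print the page-cache size in kB, reading /proc/meminfo-style text on stdin.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }'
}

# Ask the kernel to release clean page cache (needs root).
# Writing 1 drops page cache only; 2 drops dentries and inodes; 3 drops both.
drop_page_cache() {
    sync    # flush dirty pages first so that more of the cache is clean
    echo 1 > /proc/sys/vm/drop_caches
}

# Example use on a running node (as root):
#   cached_kb < /proc/meminfo
#   drop_page_cache
#   cached_kb < /proc/meminfo
```

Blocks of an image that is only mounted remotely live in this reclaimable cache, whereas a copy held in tmpfs (or pinned for seeding) cannot be released this way.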

Incidentally, another difference between xCAT and confluent diskless images: in confluent, the diskless images are now encrypted. This affords protection in case your diskless image contains sensitive material. The decryption key is available through the confluent API, and access is generally authenticated by the node TPM, so a diskless node retains trust by having the same TPM that was previously authenticated. This allows the transport security to matter less, though our security policies are pretty insistent that https be used at all times.

I would be interested in developing a torrent-style boot design with confluent, with the lower-hanging fruit being 'untethered' mode, which is still available and does download the whole image (at the expense of RAM usage). Interestingly, the logic is no longer inside the packed initramfs, but is loose in the profile. The link to the Red Hat 9 style diskless bootstrap is:

https://github.com/lenovo/confluent/blob/master/confluent_osdeploy/el9-diskless/profiles/default/scripts/imageboot.sh

Notably:

    if [ "untethered" = "$(getarg confluent_imagemethod)" ]; then
        mount -t tmpfs untethered /mnt/remoteimg
        curl https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs -o /mnt/remoteimg/rootimg.sfs
    else
        confluent_urls="$confluent_urls https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs"
        /opt/confluent/bin/urlmount $confluent_urls /mnt/remoteimg
    fi

is the logic for getting the image. One thing to note is that in a typical diskless image boot in confluent, the booted system does not see rootimg.sfs, so the torrent execution would have to stay in the 'initramfs' world (which does persist after boot, as a separate mount namespace).

________________________________
From: Dr. Thomas Orgis <thomas.or...@uni-hamburg.de>
Sent: Wednesday, March 29, 2023 11:37 AM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: [External] [xcat-user] BitTorrent distribution of stateless images with xCAT interesting to anyone?

Hi,

I first got into contact with xCAT through our HPC system installed in 2015, with xCAT version … hm …

    # nodels --version
    Version 2.9.1 (git commit 7f6043fffd62d482931b17b60f9488eb5754fdc1, built Thu Mar 19 03:25:35 EDT 2015)

2.9.1 seems to be it. The base system is CentOS 7.x. Since the system was an en bloc purchase, we never updated xCAT; I just adapted it to our needs and then let it do its thing over the years. I made some small changes, like fixing up /etc/hostname in the initrd (not sure if that was a specific mixup in our setup with long and short hostnames) and recently the fix for CVE-2023-27486 (being rather annoyed that /root/.ssh/id_rsa would _ever_ be delivered out to cluster nodes; it should always have been a separate directory where I consciously copied a key or had it generated). But nothing to rock the boat. CentOS upgrades up to 7.9 didn't hurt things. We did stick to a certain vanilla kernel build with our patches, though.

The system will be out of production in the near future and we do not know what the next installation will be using. I intended to share a main point of my local hacking, but somehow never got around to it, and I somehow figured that the obvious stuff would appear upstream anyway. Example: I enabled squashfs+overlayfs for us with a few lines, and I gather that is a standard thing now.

What was obvious to me back in 2015 was that the distribution of stateless filesystem images was slowed down unnecessarily by serving them via HTTP from the admin node over its 1GbE interface. Booting a cluster of 400 nodes took ages because of that (well, a quarter to half an hour or so). Is this still the current mechanism?

While you could make the admin node part of the high-speed network (InfiniBand in our case), or just use 10GbE as the baseline today, it just feels right to me to scale out the distribution capacity with the number of compute nodes.

Is anyone interested in that? Should I propose a formal change to xCAT for that feature? Did I miss an equivalent option that exists now in current xCAT? I only found some consulting company boasting about having implemented torrents with xCAT for a customer, but nothing official.

I'll describe what I did, anyway. 6 steps follow.

1. I got hold of a minimal torrent program: ctorrent from https://sourceforge.net/p/dtorrent/

2. I wrote the first of the two attached patches to support the cluster use-case with /dev/loop0 for reading the rootimg (see also https://sourceforge.net/p/dtorrent/patches/5/ ); the second patch then followed to fix a memory issue (see also https://sourceforge.net/p/dtorrent/patches/7/ ).

3. I applied a rather small change to the xcatroot dracut script to download the image via ctorrent in the initrd and prepare seeding later.

---------------8<---------------------
Index: share/xcat/netboot/rh/dracut_033/xcatroot
===================================================================
--- share/xcat/netboot/rh/dracut_033/xcatroot (Revision 833)
+++ share/xcat/netboot/rh/dracut_033/xcatroot (Revision 834)
@@ -21,6 +21,12 @@
 /tmp/updateflag $MASTER $XCATIPORT "installstatus netbooting"
 fi

+if [ -e /rootimg.torrent ]; then
+
+  ctorrent -s /rootimg.sfs -e 0 /rootimg.torrent
+
+else
+
 if [ ! -z "$imgurl" ]; then
 if [ xhttp = x${imgurl%%:*} ]; then
 NFS=0
@@ -43,6 +49,9 @@
 ROOTDIR=/${ROOTDIR#*/}
 fi
 fi
+
+fi # torrent
+
 #echo 0 > /proc/sys/vm/zone_reclaim_mode #Avoid kernel bug

 if [ -r /rootimg.sfs ]; then
@@ -61,6 +70,15 @@
 mkdir -p $NEWROOT/rw
 mount --move /ro $NEWROOT/ro
 mount --move /rw $NEWROOT/rw
+ if [ -e /rootimg.torrent ]; then
+ # Prepare for seeding the rootimg.
+ # Note that this demands the patched dnh3.2.2thor1 ctorrent binary.
+ mkdir $NEWROOT/.sysdist
+ cp /usr/bin/ctorrent /rootimg.torrent $NEWROOT/.sysdist
+ rrz_distfile=$(ctorrent -x /rootimg.torrent | grep rootimg.sfs | cut -f 2 -d ' ')
+ mkdir -p $NEWROOT/.sysdist/$(dirname $rrz_distfile)
+ ln -s /dev/loop0 $NEWROOT/.sysdist/$rrz_distfile
+ fi
 elif [ -r /rootimg.gz ]; then
 echo Setting up RAM-root tmpfs.
 if [ -z $rootlimit ];then
--------------->8---------------------

4. Include the torrent stuff in the image generation script.

---------------8<--------------------
#!/bin/sh

scriptdir=$(cd "$(dirname "$0")" && pwd)
PATH=$scriptdir:$PATH

sysbase=centos79
osimage=$sysbase-x86_64-stateless-gpu
imgdir=/install/netboot/$sysbase/x86_64/gpu
xcatinitrd=$imgdir/initrd-stateless.gz

# normal image generation
# packimage, etc.

# stop main seeding service on admin node
service rrz-dist-mainseed stop

# Create torrent file for efficient distribution.
torrfile=gpu-$sysbase-$timecode-rootimg.torrent
cd /install/dist
ctorrent -t \
  -s $torrfile \
  -u http://$admin_ip:81/announce \
  os/gpu-$sysbase-$timecode/rootimg.sfs

# start seeding again, picking up added .torrent
service rrz-dist-mainseed start

# Disable that in case of weird boot trouble.
# It pulls out lots of drivers/firmware that is not
# obviously needed for booting.
# Initrd loading without torrent is the new bottleneck.
rrz-initrd-reduce $xcatinitrd

# Insert torrent client and torrent file into initrd.
# If that is disabled, standard HTTP download using the
# URL from pxelinux config is done.
initrdir=$(rrz-initrd-unpack $xcatinitrd)
cp -v $scriptdir/ctorrent $initrdir/usr/bin
cp -v /install/dist/$torrfile $initrdir/rootimg.torrent
rrz-initrd-pack $xcatinitrd $initrdir
rrz-initrd-rmdir "$initrdir"
rrz-initrd-ucode $xcatinitrd

# Yes, update the actual copy of the initrd that is used
# during netboot.
cp -v $xcatinitrd $bootinitrd
--------------->8--------------------

5. Added a seed service to syncfiles:

cat /usr/lib/systemd/system/rrz-dist-seed.service
[Unit]
Description=ctorrent node seed for image distribution
After=network.target

[Service]
# It might be that the fresh torrent file is not available right away
# inside /work (?!), so restarting may be needed to really get
# an instance up.
Restart=always
RestartSec=10
WorkingDirectory=/.sysdist
# Not starting as user yet, because root perm needed for preparation
# User=sysdist
ExecStartPre=/bin/chmod 0640 /dev/loop0
ExecStartPre=/bin/chown :sysdist /dev/loop0
ExecStart=/bin/su sysdist -c '/.sysdist/ctorrent -q -m 5 -M 20 -U 102400 -e -1 rootimg.torrent'

[Install]
WantedBy=multi-user.target

6. Before all that … have the main seeder service:

[root@adm1 xcat]# cat /install/rrz/rrz-dist-mainseed.sh
#!/bin/bash
# called as a system service

pids=
for torr in /install/dist/*.torrent
do
  /install/rrz/ctorrent -q -m 5 -M 20 -U 60000 -e -1 "$torr" &
  # $! is the PID of the background job just started
  # ($? would only be the exit status of launching it, i.e. 0).
  pid=$!
  pids+=" $pid"
  echo "torrent for $torr with PID $pid"
done
trap "kill $pids" EXIT
wait
# end

And as tracker, I built an instance of opentracker (below 90K binary; ctorrent is around 310K, not stripped) from http://erdgeist.org/arts/software/opentracker/ (snapshot: http://src.rrz.uni-hamburg.de/files/src/_unsorted/opentracker-20151001.tar.bz2) with the config file boiling down to three lines:

listen.udp.workers 6
listen.tcp_udp $admin_ip:81
tracker.user nobody

and this simple call as a systemd service:

[Service]
ExecStart=/install/rrz/opentracker -f /install/rrz/opentracker.conf

Now this is a long mail, but a rather complete description of the steps I took to make booting of my stateless nodes so fast that I haven't worried about the image distribution part since mid/end of 2015. Now, at the end of the system's lifetime, I am starting to worry a bit about what will come next …

Is there interest in the xCAT community to pick this up? One might have to adopt/fork ctorrent, while opentracker seems to be alive, although the author hasn't bothered to name a release yet. In the closed loop where we use ctorrent as the only client/server with this tracker server, this might be acceptable.
To me, that is more acceptable than bloating the initrd again with some other torrent software that is more than a few 100K in size.

Should this be supported in xCAT upstream? Having 100G networking in the admin node might make this obsolete, but that just means that we could scale to a few thousand more nodes without impacting a single network link. In clusters, when you can distribute a load, you should think twice before _not_ doing it, right?


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg
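
A back-of-the-envelope check of the figures in the original mail, as a sketch only. The 400 nodes and the 1GbE admin link are taken from the mail; the 500 MB image size is purely an assumption for illustration, and so is the idealized doubling model for peer-to-peer distribution (a real torrent swarm exchanges pieces rather than whole images, and all protocol overhead is ignored). The function names are made up for this example.

```shell
#!/bin/sh
# Seconds for one server to push the same image to every node over one link.
# Arguments: node count, image size in MB (assumed value), link speed in Mbit/s.
serial_seconds() {
    echo $(( $1 * $2 * 8 / $3 ))
}

# Idealized peer-to-peer bound: in every round, each host that already holds
# the image passes one full copy on, so the number of holders doubles.
# Arguments: node count, image size in MB (assumed value), link speed in Mbit/s.
p2p_seconds() {
    holders=1      # the admin node starts as the only holder
    rounds=0
    while [ "$holders" -lt $(( $1 + 1 )) ]; do
        holders=$(( holders * 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo $(( rounds * $2 * 8 / $3 ))
}

serial_seconds 400 500 1000
p2p_seconds 400 500 1000
```

With the assumed 500 MB image this prints 1600 (about 27 minutes, in line with the quarter to half an hour reported above) and 36. The absolute numbers depend entirely on the assumed image size; the point is only that the serial figure grows linearly with the node count while the idealized peer-to-peer figure grows with its logarithm.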
_______________________________________________ xCAT-user mailing list xCAT-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/xcat-user