> A new filesystem has been written to enable multipath-https on-demand access
> for stateless mount, with full certificate validation.

I am very interested in this. Is it part of confluent, or a separate project at Lenovo? Fuse-based? Publicly available yet?

Thanks,
~Matt

--
Matt Ezell
HPC Systems Engineer
Oak Ridge National Laboratory

From: Jarrod Johnson <jjohns...@lenovo.com>
Reply-To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Date: Tuesday, May 18, 2021 at 4:38 PM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: [EXTERNAL] [xcat-user] Upcoming Confluent diskless update

As an FYI, with confluent we have begun work on its native diskless OS support. I wanted to describe what we are thinking.

First as a baseline, my thoughts on xCAT diskless (stateless/statelite).

xCAT stateless manages to be both diskless and completely untethered from storage by booting the OS in RAM. Significant downsides are high RAM consumption, and the exclusion list along with other factors making the actual booted system 'weird' in some ways for users and software (e.g. a lot of omitted locale data, NetworkManager being incompatible, a lot of expected documentation not present on compute nodes). It's a lot of work to tune that exclusion list, particularly for custom content. It also can take a long time to boot as the image scales up in size.

xCAT statelite was initially intended to segregate the read-only common root filesystem from per-node state, persisting each one in nfs. In practice, it is usually desired for its low memory use, achieved by mostly using a read-only nfs filesystem plus specific read-write content. This requires more work and is usually weird in different ways (no capabilities bits, having to curate the exclusion list, not being able to do a normal rpm install on a node if you really need to try something out or you want an onboot process with an rpm install for some reason). Finally, it is frequently tethered to a single nfs server, though with some work you can have an appropriately HA nfs configuration to cope with this.

Both have some common issues too: difficulty debugging over the network and inadequate security handling (ssh keys being reshared without adequate vetting, nfs statelite having no integrity assurance and no authenticity).

So for confluent, we wanted to try to simplify and enhance as we went for diskless. For initial release, it will be a single design that blends the most desirable facets of both above, as well as enhanced security and performance. To that end:

- The server TPM2 is now used as the mechanism to persist trust of booting nodes. SSH host keys are regenerated by the host every boot, and the TPM2 is used to authenticate a request for a new certificate, which will automatically grant the system the same trust an install from disk gets.
- The diskless bootstrap image can either be PXE booted, HTTP booted, or, for Lenovo systems, installed onto the XCC using HTTPS for the maximum security benefit.
- A new filesystem has been written to enable multipath-https on-demand access for stateless mount, with full certificate validation. Rather than downloading the entire image at boot, the image is mounted over https, and read requests are relayed to the server. The normal kernel caching mechanism will cause the relevant bits of the filesystem to stay memory resident, but they are also evictable if need be, since they can be downloaded again at any time.
- Rather than requiring a specific set of files to be designated read-write, a tmpfs is overlay mounted over the diskless image. This resembles statelite, but uses overlay for a simpler 'everything is read-write' model. It is distinct from statelite in that, at least for now, there is no support for per-node state persistence.
- Extra work has been invested to be consistent with things like NetworkManager and in other ways generally looking like a normal system installed from disk.
- More care has been taken to segregate 'distribution' content from 'confluent add-on'.
In xCAT, genimage put a lot of xCAT specific material into the image and complicated the build and maintenance process. In confluent, one small dracut module is added to make an appropriately sized initramfs, but all the scripting and executables use the same 'addons' process to keep confluent mostly out of the built image and add the confluent requirements at node boot time rather than image build time.

This is what the prototype looks like booting up:

Automatic console configured for ttyS0,115200
Initializng confluent diskless environment
udevd: starting version 239 (239-45.el8)
Loading drivers...done
Scanning for network configuration...Setting up eno1 as static at 172.30.1.5/16
Initializing ssh...ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519
Registering mount path: https://[fe80::38f7:56ff:fe06:8945%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Registering mount path: https://[fe80::10cc:61ff:fe0f:8f58%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Registering mount path: https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Connecting to https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Successfully connected to https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs

Here's what happens if the currently in-use connection breaks in the middle of IO (failover):

urlmount: error while communicating with https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs: Failed to connect to 172.30.254.1 port 443: Connection refused
urlmount: Connecting to https://[fe80::38f7:56ff:fe06:8945%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
urlmount: error while communicating with https://[fe80::38f7:56ff:fe06:8945%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs: Failed to connect to fe80::38f7:56ff:fe06:8945 port 443: Connection refused
urlmount: Connecting to https://[fe80::10cc:61ff:fe0f:8f58%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
urlmount: Successfully connected to https://[fe80::10cc:61ff:fe0f:8f58%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs

The memory consumption resembles a system installed from disk:

[root@d5 bin]# free -m
              total        used        free      shared  buff/cache   available
Mem:          23519        3884       19414         126         220       19250
Swap:             0           0           0

NetworkManager acts normal:

[root@d5 bin]# nmcli c
NAME                UUID                                  TYPE      DEVICE
Wired connection 1  6c3c9e3d-5e56-3242-b35e-b5331f09767f  ethernet  eno2
eno1                348b636e-154f-421c-99ea-2899f363a960  ethernet  eno1
Wired connection 2  79e23780-0d7b-326d-bc4c-5acf140bdea2  ethernet  enp0s20f0u1u6

Boot is faster, as it only downloads sectors of the root filesystem as needed.

Further, after the system is booted, it persists ssh access to the initramfs environment, for truly untethered debug:

[root@mgt1 diskless]# ssh d5 mount|grep ' / '
disklessroot on / type overlay (rw,relatime,lowerdir=/mnt/remote,upperdir=/mnt/overlay/upper,workdir=/mnt/overlay/work)
[root@mgt1 diskless]# ssh -p 2222 d5 mount|grep ' / '
rootfs on / type rootfs (rw,size=12024424k,nr_inodes=3006106)

Some things are ultimately gone in this design for now:

- Truly untethered operational filesystem: The intent is that the new multipath https filesystem mitigates this concern. Worst case scenario is IOs are halted until the webserver reboots.
- No per-node persistence. My impression is this hasn't been highly desired anyway, and we are using the TPM2 and APIs to recreate likely relevant information.
- There is now explicitly no way to try to centrally update an 'nfs' image and have changes appear on nodes. For various reasons, a limitation of overlay mount is that the lower directory is expected to be unchanging. So an update would be done as a new profile to reboot into next time, and optionally running the update on the nodes themselves if wanting to update without a reboot.
Attempting to replace the squashfs in place will just cause I/O errors and corrupt the state of attached nodes.
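
To make the multipath https idea above a bit more concrete, here is a rough Python sketch of on-demand reads with failover between mount paths. It is not the urlmount code, which is not shown in this message. The names `MultipathReader` and `https_range_fetch` and the example hostnames are made up for illustration, and it assumes the web server honors HTTP Range requests. Certificate validation relies on Python's default TLS context.

```python
import ssl
import urllib.request
from typing import Callable, Optional, Sequence

# A fetch function takes (url, offset, length) and returns the bytes read.
Fetch = Callable[[str, int, int], bytes]


def https_range_fetch(url: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at `offset` with an HTTP Range request."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    request = urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{offset + length - 1}"}
    )
    with urllib.request.urlopen(request, timeout=10, context=context) as resp:
        if resp.status != 206:
            # Assumption: a server that ignores Range is treated as a failure.
            raise OSError(f"{url}: expected 206 Partial Content, got {resp.status}")
        return resp.read()


class MultipathReader:
    """Hypothetical helper: relay reads to one of several equivalent URLs."""

    def __init__(self, urls: Sequence[str], fetch: Fetch = https_range_fetch):
        if not urls:
            raise ValueError("at least one mount path is required")
        self.urls = list(urls)
        self.fetch = fetch
        self.current = 0  # index of the path currently in use

    def read(self, offset: int, length: int) -> bytes:
        last_error: Optional[OSError] = None
        # Try each registered path once, starting with the one in use.
        for _ in range(len(self.urls)):
            url = self.urls[self.current]
            try:
                return self.fetch(url, offset, length)
            except OSError as exc:  # URLError and ssl.SSLError are OSErrors
                last_error = exc
                # Fail over to the next path and retry the same read.
                self.current = (self.current + 1) % len(self.urls)
        raise OSError("all mount paths failed") from last_error
```

A caller would build the reader from the list of registered mount paths and issue reads by offset, for example `MultipathReader(["https://mgt1.example/rootimg.sfs", "https://mgt2.example/rootimg.sfs"]).read(0, 4096)`. The sketch gives up once every path has failed, whereas the behavior described above is that IOs halt until a server comes back, so a closer imitation would keep retrying instead of raising.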

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user