> A new filesystem has been written to enable multipath-https on-demand access 
> for stateless mount, with full certificate validation.

I am very interested in this. Is it part of confluent, or a separate project at 
Lenovo? Fuse-based? Publicly available yet?

Thanks,
~Matt

--
Matt Ezell
HPC Systems Engineer
Oak Ridge National Laboratory

From: Jarrod Johnson <jjohns...@lenovo.com>
Reply-To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Date: Tuesday, May 18, 2021 at 4:38 PM
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: [EXTERNAL] [xcat-user] Upcoming Confluent diskless update

As an FYI, with confluent we have begun work on its native diskless OS support. 
I wanted to describe what we are thinking.

First, as a baseline, my thoughts on xCAT diskless (stateless/statelite).

xCAT stateless manages to be both diskless and completely untethered from
storage by booting the OS in RAM. Significant downsides are high RAM
consumption, and the exclusion list along with other factors making the
actual booted system 'weird' in some ways for users and software (e.g. a lot
of omitted locale data, NetworkManager being incompatible, a lot of expected
documentation not present on compute nodes).  It's a lot of work to tune that
exclusion list, particularly for custom content. It can also take a long time
to boot as the image scales up in size.

xCAT statelite was initially intended to segregate the read-only common root
filesystem from per-node state, persisting each in NFS. In practice, it is
usually desired for its low memory use, achieved by mostly using a read-only
NFS filesystem plus specific read-write content. This requires more work and
is usually weird in different ways (no capabilities bits, having to curate the
exclusion list, not being able to do a normal rpm install on a node if you
really need to try something out or want an onboot process with an rpm install
for some reason). Finally, it is frequently tethered to a single NFS server,
though with some work you can have an appropriately HA NFS configuration to
cope with this.

Both also share some issues: difficulty debugging over the network and
inadequate security handling (ssh keys being reshared without adequate vetting,
NFS statelite having no integrity assurance and no authenticity).

So for confluent, we wanted to simplify and enhance diskless as we went. For
the initial release, it will be a single design that blends the most desirable
facets of both approaches above, with enhanced security and performance.
To that end:
-The server TPM2 is now used as the mechanism to persist trust of booting 
nodes.  SSH host keys are regenerated by the host every boot, and the TPM2 is 
used to authenticate a request for a new certificate, which will automatically 
grant the system the same trust an install from disk gets
-The diskless bootstrap image can either be PXE booted, HTTP booted, or, for 
Lenovo systems, installed onto the XCC using HTTPS for the maximum security 
benefit
-A new filesystem has been written to enable multipath-https on-demand access
for stateless mount, with full certificate validation.  Rather than downloading
the entire image at boot, the image is mounted over https, and read requests
are relayed to the server.  The normal kernel caching mechanism will keep the
relevant bits of the filesystem memory resident, but they remain evictable if
need be, since they can be downloaded again at any time.
-Rather than requiring a specific set of files to be designated read-write, a
tmpfs is overlay mounted over the diskless image. This resembles statelite,
but uses overlay for a simpler 'everything is read-write' model.  Unlike
statelite, there is, at least for now, no support for per-node state
persistence.
-Extra work has been invested to be consistent with things like NetworkManager 
and in other ways generally looking like a normal system installed from disk.
-More care has been taken to segregate 'distribution' content from 'confluent 
add-on'. In xCAT, genimage put a lot of xCAT specific material into the image 
and complicated the build and maintenance process. In confluent, one small 
dracut module is added to make an appropriately sized initramfs, but all the 
scripting and executables use the same 'addons' process to keep confluent 
mostly out of the built image and add the confluent requirements at node boot 
time rather than image build time.
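
To make the on-demand read idea from the list above a bit more concrete, here
is a minimal Python sketch of fetching a single byte range of a remote image
over HTTPS with certificate validation. It is only an illustration under
assumptions, not the filesystem described in this mail: the names
`range_header` and `read_remote` and the `cafile` parameter are invented for
the example, and it assumes the web server honors HTTP Range requests.

```python
import ssl
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Return the HTTP Range header value for `length` bytes at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


def read_remote(url: str, offset: int, length: int, cafile=None) -> bytes:
    """Fetch one byte range of a remote image over verified HTTPS.

    Hypothetical helper: `cafile` would point at the CA certificate the node
    trusts; None falls back to the system trust store.
    """
    # create_default_context() turns on certificate and hostname checks.
    ctx = ssl.create_default_context(cafile=cafile)
    req = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return resp.read()
```

In this sketch each read maps to one request, e.g.
`read_remote(url, 4096, 512)` asks the server for bytes 4096 through 4607 only,
rather than downloading the whole image.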

This is what the prototype looks like booting up:
Automatic console configured for ttyS0,115200
Initializng confluent diskless environment
udevd: starting version 239 (239-45.el8)
Loading drivers...done
Scanning for network configuration...Setting up eno1 as static at 172.30.1.5/16
Initializing ssh...ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519
Registering mount path: https://[fe80::38f7:56ff:fe06:8945%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Registering mount path: https://[fe80::10cc:61ff:fe0f:8f58%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Registering mount path: https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Connecting to https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
Successfully connected to https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs

Here's what happens if the currently in-use connection breaks in the middle of 
IO (failover):
urlmount: error while communicating with https://172.30.254.1/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs: Failed to connect to 172.30.254.1 port 443: Connection refused
urlmount: Connecting to https://[fe80::38f7:56ff:fe06:8945%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
urlmount: error while communicating with https://[fe80::38f7:56ff:fe06:8945%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs: Failed to connect to fe80::38f7:56ff:fe06:8945 port 443: Connection refused
urlmount: Connecting to https://[fe80::10cc:61ff:fe0f:8f58%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
urlmount: Successfully connected to https://[fe80::10cc:61ff:fe0f:8f58%253]/confluent-public/os/centos_stream-8.5-x86_64-diskless/rootimg.sfs
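
The failover behaviour in the log above can be sketched roughly as follows.
This is a hedged illustration, not the urlmount code: the class name
`MultipathReader`, the injected `fetch` callable, and the choice to walk the
registered paths in wrap-around order are all assumptions. It also gives up
once every path has failed, whereas the design described in this mail halts
I/O until a server is reachable again.

```python
class MultipathReader:
    """Read from whichever of several equivalent URLs currently works."""

    def __init__(self, urls, fetch):
        # `fetch(url, offset, length)` is a caller-supplied function that
        # returns bytes or raises OSError on a connection problem.
        self.urls = list(urls)
        self.fetch = fetch
        self.current = 0  # index of the path currently in use

    def read(self, offset, length):
        # Start with the path in use, then try the others in order,
        # wrapping around the list (an assumed ordering).
        for attempt in range(len(self.urls)):
            idx = (self.current + attempt) % len(self.urls)
            url = self.urls[idx]
            try:
                data = self.fetch(url, offset, length)
            except OSError as exc:
                print(f"error while communicating with {url}: {exc}")
                continue
            self.current = idx  # stick with the path that worked
            return data
        raise OSError("all registered mount paths failed")
```

Because the working path is remembered in `current`, later reads go straight
to it instead of retrying the failed servers first.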

The memory consumption resembles a system installed from disk:
[root@d5 bin]# free -m
              total        used        free      shared  buff/cache   available
Mem:          23519        3884       19414         126         220       19250
Swap:             0           0           0

NetworkManager acts normal:
[root@d5 bin]# nmcli c
NAME                UUID                                  TYPE      DEVICE
Wired connection 1  6c3c9e3d-5e56-3242-b35e-b5331f09767f  ethernet  eno2
eno1                348b636e-154f-421c-99ea-2899f363a960  ethernet  eno1
Wired connection 2  79e23780-0d7b-326d-bc4c-5acf140bdea2  ethernet  enp0s20f0u1u6

Boot is faster, as it only downloads sectors of the root filesystem as needed.

Further, after the system is booted, it persists ssh access to the initramfs 
environment, for truly untethered debug:
[root@mgt1 diskless]# ssh d5 mount|grep ' / '
disklessroot on / type overlay (rw,relatime,lowerdir=/mnt/remote,upperdir=/mnt/overlay/upper,workdir=/mnt/overlay/work)
[root@mgt1 diskless]# ssh -p 2222 d5 mount|grep ' / '
rootfs on / type rootfs (rw,size=12024424k,nr_inodes=3006106)
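
For readers less familiar with overlay mounts, the layout shown in the first
mount output above could be assembled along these lines. This is a sketch
under assumptions rather than what confluent actually runs: the function name
`overlay_root_commands`, the idea that the tmpfs sits at `/mnt/overlay`, and
the `/sysroot` target in the usage note are guesses; only the `lowerdir`,
`upperdir`, and `workdir` paths and the `disklessroot` name come from the
output.

```python
def overlay_root_commands(lower, tmpfs_dir, target):
    """Build the commands for a tmpfs-backed overlay over a read-only root.

    Returns argument lists (suitable for subprocess.run) instead of running
    anything, so the sketch is safe to try on a normal machine.
    """
    # overlayfs requires upperdir and workdir to be on the same filesystem,
    # so both are placed inside the one tmpfs.
    upper = f"{tmpfs_dir}/upper"
    work = f"{tmpfs_dir}/work"
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        ["mount", "-t", "tmpfs", "tmpfs", tmpfs_dir],
        ["mkdir", "-p", upper, work],
        ["mount", "-t", "overlay", "disklessroot", "-o", opts, target],
    ]
```

Calling `overlay_root_commands("/mnt/remote", "/mnt/overlay", "/sysroot")`
yields an overlay option string matching the one in the mount output, with
every write landing in the tmpfs and the remote image left untouched.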

Some things are ultimately gone in this design for now:
-Truly untethered operational filesystem: The intent is that the new multipath
https filesystem mitigates this concern. The worst-case scenario is that I/O
is halted until the webserver reboots.
-No per-node persistence. My impression is this hasn't been highly desired 
anyway, and we are using the TPM2 and APIs to recreate likely relevant 
information.
-There is now explicitly no way to centrally update an 'nfs' image and have
changes appear on nodes. For various reasons, a limitation of overlay mounts
is that the lower directory is expected to be unchanging. So an update would
be done as a new profile to reboot into next time, optionally also running the
update on the nodes themselves if you want to update without a reboot.
Attempting to replace the squashfs will just cause I/O errors and corrupt the
state of attached nodes.
_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
