Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 27.11.20 21:20, Jann Horn wrote:
> On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin wrote:
>> On 27/11/2020 20:22, Jann Horn wrote:
>>> On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote:
>>>> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin wrote:
>>>>> This patch is a driver that exposes a monotonic incremental Virtual
>>>>> Machine Generation u32 counter via a char-dev FS interface that
>>>>> provides sync and async VmGen counter updates notifications. It also
>>>>> provides VmGen counter retrieval and confirmation mechanisms.
>>>>>
>>>>> The hw provided UUID is not exposed to userspace, it is internally
>>>>> used by the driver to keep accounting for the exposed VmGen counter.
>>>>> The counter starts from zero when the driver is initialized and
>>>>> monotonically increments every time the hw UUID changes (the VM
>>>>> generation changes).
>>>>>
>>>>> On each hw UUID change, the new hypervisor-provided UUID is also fed
>>>>> to the kernel RNG.
>>>> As for v1:
>>>>
>>>> Is there a reasonable usecase for the "confirmation" mechanism? It
>>>> doesn't seem very useful to me.
>>
>> I think it adds value in complex scenarios with multiple users of the
>> mechanism, potentially at varying layers of the stack, different
>> processes and/or runtime libraries.
>>
>> The driver offers a natural place to handle minimal orchestration
>> support and offer visibility in system-wide status.
>>
>> A high-level service that trusts all system components to properly use
>> the confirmation mechanism can actually block and wait patiently for the
>> system to adjust to the new world. Even if it doesn't trust all
>> components it can still do a best-effort, timeout block.
>
> What concrete action would that high-level service be able to take
> after waiting for such an event?

For us, it would only allow incoming requests to the target container
after the container has successfully adjusted.

You can think of other models too. Your container orchestration engine
could prevent network traffic to reach the container applications until
the full container is adjusted for example.
> My model of the vmgenid mechanism is that RNGs and cryptographic
> libraries that use it need to be fundamentally written such that it is
> guaranteed that a VM fork can not cause the same random number /
> counter / ... to be reused for two different cryptographic operations
> in a way visible to an attacker. This means that e.g. TLS libraries
> need to, between accepting unencrypted input and sending out encrypted
> data, check whether the vmgenid changed since the connection was set
> up, and if a vmgenid change occurred, kill the connection.
>
> Can you give a concrete example of a usecase where the vmgenid
> mechanism is used securely and the confirmation mechanism is necessary
> as part of that?

The main crux here is that we have 2 fundamental approaches:

1) Transactional

For an RNG, the natural place to adjust yourself to a resumed snapshot
is in the random number retrieval. You just check if your generation is
still identical when you fetch the next random number.

Ideally, you also do the same for anything consuming such a random
number. So random number retrieval would no longer just return
( number ), but instead ( number, generation ). That way you could check
at every consumer side of the random number that it's actually still
random.

That would need to cascade down. So every key you derive from a random
number, every uuid you generate, they all would need to store the
generation as well and compare if the current generation is still the
same as when they were generated.

That means you need to convert every data access method to a function
call that checks if the value is still consumable and if not, able to
regenerate it.

The same applies for global values, such as a system global UUID that is
shared among multiple processes.

If you slowly move away from super integrated environments like a TLS
library and start thinking of samba system UUIDs or SSH host keys,
you'll quickly see how that approach reaches its limits.
2) Event based

Let's take a look at the complicated things to implement with the
transactional approach: samba system UUIDs, SSH host keys, global
variables in a random Java application that get initialized on
application start.

All of these are very easy to resolve through an event based mechanism.
Based on the "new generation" event, you can just generate a new UUID.
Or a new host key. All you would need to know for this to be non-racy is
that before you actually use the target services, you know they are
self-adjusted.

In most container workloads, that can be achieved by not letting network
traffic go in before the event is fully processed.

What this patch set does is provide both: We allow the transactional
approach through mmap() of a shared page to be implemented for stacks
where that's easiest. You can use that when your logic is realistically
convertible to transactional.

We also allow for an asynchronous event, which can be used in
environments where the transactional approach is hard because of
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 27.11.20 18:17, Catangiu, Adrian Costin wrote:
> On 18/11/2020 12:30, Alexander Graf wrote:
>> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>>> - Future improvements
>>>
>>> Ideally we would want the driver to register itself based on devices'
>>> _CID and not _HID, but unfortunately I couldn't find a way to do that.
>>> The problem is that ACPI device matching is done by
>>> '__acpi_match_device()' which exclusively looks at
>>> 'acpi_hardware_id *hwid'.
>>>
>>> There is a path for platform devices to match on _CID when _HID is
>>> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>>>
>>> Guidance and help here would be greatly appreciated.
>>
>> That one is pretty important IMHO. How about the following (probably
>> pretty mangled) patch? That seems to work for me. The ACPI change
>> would obviously need to be its own stand alone change and needs proper
>> assessment whether it could possibly break any existing systems.
>>
>> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
>> index 1682f8b454a2..452443d79d87 100644
>> --- a/drivers/acpi/bus.c
>> +++ b/drivers/acpi/bus.c
>> @@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device *device,
>>      /* First, check the ACPI/PNP IDs provided by the caller. */
>>      if (acpi_ids) {
>>          for (id = acpi_ids; id->id[0] || id->cls; id++) {
>> -            if (id->id[0] && !strcmp((char *)id->id, hwid->id))
>> +            if (id->id[0] && !strncmp((char *)id->id, hwid->id, ACPI_ID_LEN - 1))
>>                  goto out_acpi_match;
>>              if (id->cls && __acpi_match_device_cls(id, hwid))
>>                  goto out_acpi_match;
>> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
>> index 75a787da8aad..0bfa422cf094 100644
>> --- a/drivers/virt/vmgenid.c
>> +++ b/drivers/virt/vmgenid.c
>> @@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
>>  }
>>
>>  static const struct acpi_device_id vmgenid_ids[] = {
>> -    {"QEMUVGID", 0},
>> +    /* This really is VM_Gen_Counter, but we can only match 8 characters */
>> +    {"VM_GEN_C", 0},
>>      {"", 0},
>>  };
>
> Looks legit. I can propose a patch with it, but how do we validate it
> doesn't break any devices?
Mainly by proposing it and seeing what the ACPI maintainers say. Maybe
they have a better idea even. At least this explicitly nudges them.

>>> +2) ASYNC simplified example::
>>> +
>>> +    void handle_io_on_vmgenfd(int vmgenfd)
>>> +    {
>>> +        unsigned genid;
>>> +
>>> +        // because of VM generation change, we need to rebuild world
>>> +        reseed_app_env();
>>> +
>>> +        // read new gen ID - we need it to confirm we've handled update
>>> +        read(fd, &genid, sizeof(genid));
>>
>> This is racy in case two consecutive snapshots happen. The read needs
>> to go before the reseed.
>
> Switched them around like you suggest to avoid confusion.
>
> But I don't see a problem with this race. The idea here is to trigger
> reseed_app_env() which doesn't depend on the generation counter value.
> Whether it gets incremented once or N times is irrelevant, we're just
> interested that we pause execution and reseed before resuming (in
> between these, whether N or M generation changes is the same thing).

>>> +3) Mapped memory polling simplified example::
>>> +
>>> +    /*
>>> +     * app/library function that provides cached secrets
>>> +     */
>>> +    char * safe_cached_secret(app_data_t *app)
>>> +    {
>>> +        char *secret;
>>> +        volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
>>> +    again:
>>> +        secret = __cached_secret(app);
>>> +

*genid_ptr = 1
cached_genid = 1

>>> +        if (unlikely(*genid_ptr != app->cached_genid)) {

*genid_ptr = 2
cached_genid = 1

>>> +            // rebuild world then confirm the genid update (thru write)
>>> +            rebuild_caches(app);

hypervisor takes another snapshot during rebuild_caches(). Resume path
bumps genid

>>> +            app->cached_genid = *genid_ptr;

*genid_ptr = 3
cached_genid = 3

>> This is racy again. You need to read the genid before rebuild and set
>> it here.
>
> I don't see the race. Gen counter is read from volatile mapped mem, on
> detected change we rebuild world, confirm the update back to the driver
> then restart the loop. Loop will break when no more changes happen.

See above. After the outlined course of things, the snapshot will
contain data that will be identical between 2 snapshots.
Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin wrote: > On 27/11/2020 20:22, Jann Horn wrote: > > On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote: > >> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin > >> wrote: > >>> This patch is a driver that exposes a monotonic incremental Virtual > >>> Machine Generation u32 counter via a char-dev FS interface that > >>> provides sync and async VmGen counter updates notifications. It also > >>> provides VmGen counter retrieval and confirmation mechanisms. > >>> > >>> The hw provided UUID is not exposed to userspace, it is internally > >>> used by the driver to keep accounting for the exposed VmGen counter. > >>> The counter starts from zero when the driver is initialized and > >>> monotonically increments every time the hw UUID changes (the VM > >>> generation changes). > >>> > >>> On each hw UUID change, the new hypervisor-provided UUID is also fed > >>> to the kernel RNG. > >> As for v1: > >> > >> Is there a reasonable usecase for the "confirmation" mechanism? It > >> doesn't seem very useful to me. > > I think it adds value in complex scenarios with multiple users of the > mechanism, potentially at varying layers of the stack, different > processes and/or runtime libraries. > > The driver offers a natural place to handle minimal orchestration > support and offer visibility in system-wide status. > > A high-level service that trusts all system components to properly use > the confirmation mechanism can actually block and wait patiently for the > system to adjust to the new world. Even if it doesn't trust all > components it can still do a best-effort, timeout block. What concrete action would that high-level service be able to take after waiting for such an event? My model of the vmgenid mechanism is that RNGs and cryptographic libraries that use it need to be fundamentally written such that it is guaranteed that a VM fork can not cause the same random number / counter / ... 
to be reused for two different cryptographic operations in a way visible to an attacker. This means that e.g. TLS libraries need to, between accepting unencrypted input and sending out encrypted data, check whether the vmgenid changed since the connection was set up, and if a vmgenid change occurred, kill the connection. Can you give a concrete example of a usecase where the vmgenid mechanism is used securely and the confirmation mechanism is necessary as part of that? > >> How do you envision integrating this with libraries that have to work > >> in restrictive seccomp sandboxes? If this was in the vDSO, that would > >> be much easier. > > Since this mechanism targets all of userspace stack, the usecase greatly > vary. I doubt we can have a single silver bullet interface. > > For example, the mmap interface targets user space RNGs, where as fast > and as race free as possible is key. But there also higher level > applications that don't manage their own memory or don't have access to > low-level primitives so they can't use the mmap or even vDSO interfaces. > That's what the rest of the logic is there for, the read+poll interface > and all of the orchestration logic. Are you saying that, because people might not want to write proper bindings for this interface while also being unwilling to take the performance hit of calling read() in every place where they would have to do so to be fully correct, you want to build a "best-effort" mechanism that is deliberately designed to allow some cryptographic state reuse in a limited time window? > Like you correctly point out, there are also scenarios like tight > seccomp jails where even the FS interfaces is inaccessible. For cases > like this and others, I believe we will have to work incrementally to > build up the interface diversity to cater to all the user scenarios > diversity. It would be much nicer if we could have one simple interface that lets everyone correctly do what they need to, though...
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
Sorry Jann for missing your original email.

On 27/11/2020 20:22, Jann Horn wrote:
> [resend in the hope that amazon will accept my mail this time instead
> of replying "550 Too many invalid recipients" again]
>
> On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote:
>> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin wrote:
>>> This patch is a driver that exposes a monotonic incremental Virtual
>>> Machine Generation u32 counter via a char-dev FS interface that
>>> provides sync and async VmGen counter updates notifications. It also
>>> provides VmGen counter retrieval and confirmation mechanisms.
>>>
>>> The hw provided UUID is not exposed to userspace, it is internally
>>> used by the driver to keep accounting for the exposed VmGen counter.
>>> The counter starts from zero when the driver is initialized and
>>> monotonically increments every time the hw UUID changes (the VM
>>> generation changes).
>>>
>>> On each hw UUID change, the new hypervisor-provided UUID is also fed
>>> to the kernel RNG.
>> As for v1:
>>
>> Is there a reasonable usecase for the "confirmation" mechanism? It
>> doesn't seem very useful to me.

I think it adds value in complex scenarios with multiple users of the
mechanism, potentially at varying layers of the stack, different
processes and/or runtime libraries.

The driver offers a natural place to handle minimal orchestration
support and offer visibility in system-wide status.

A high-level service that trusts all system components to properly use
the confirmation mechanism can actually block and wait patiently for the
system to adjust to the new world. Even if it doesn't trust all
components it can still do a best-effort, timeout block.

>>
>> How do you envision integrating this with libraries that have to work
>> in restrictive seccomp sandboxes? If this was in the vDSO, that would
>> be much easier.

Since this mechanism targets all of userspace stack, the usecase greatly
vary. I doubt we can have a single silver bullet interface.

For example, the mmap interface targets user space RNGs, where as fast
and as race free as possible is key. But there also higher level
applications that don't manage their own memory or don't have access to
low-level primitives so they can't use the mmap or even vDSO interfaces.
That's what the rest of the logic is there for, the read+poll interface
and all of the orchestration logic.

Like you correctly point out, there are also scenarios like tight
seccomp jails where even the FS interfaces is inaccessible. For cases
like this and others, I believe we will have to work incrementally to
build up the interface diversity to cater to all the user scenarios
diversity.

Thanks,
Adrian.

Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
[resend in the hope that amazon will accept my mail this time instead of replying "550 Too many invalid recipients" again] On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote: > On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin > wrote: > > This patch is a driver that exposes a monotonic incremental Virtual > > Machine Generation u32 counter via a char-dev FS interface that > > provides sync and async VmGen counter updates notifications. It also > > provides VmGen counter retrieval and confirmation mechanisms. > > > > The hw provided UUID is not exposed to userspace, it is internally > > used by the driver to keep accounting for the exposed VmGen counter. > > The counter starts from zero when the driver is initialized and > > monotonically increments every time the hw UUID changes (the VM > > generation changes). > > > > On each hw UUID change, the new hypervisor-provided UUID is also fed > > to the kernel RNG. > > As for v1: > > Is there a reasonable usecase for the "confirmation" mechanism? It > doesn't seem very useful to me. > > How do you envision integrating this with libraries that have to work > in restrictive seccomp sandboxes? If this was in the vDSO, that would > be much easier.
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 18/11/2020 12:30, Alexander Graf wrote: > > > On 16.11.20 16:34, Catangiu, Adrian Costin wrote: >> - Future improvements >> >> Ideally we would want the driver to register itself based on devices' >> _CID and not _HID, but unfortunately I couldn't find a way to do that. >> The problem is that ACPI device matching is done by >> '__acpi_match_device()' which exclusively looks at >> 'acpi_hardware_id *hwid'. >> >> There is a path for platform devices to match on _CID when _HID is >> 'PRP0001' - but this is not the case for the Qemu vmgenid device. >> >> Guidance and help here would be greatly appreciated. > > That one is pretty important IMHO. How about the following (probably > pretty mangled) patch? That seems to work for me. The ACPI change > would obviously need to be its own stand alone change and needs proper > assessment whether it could possibly break any existing systems. > > diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c > index 1682f8b454a2..452443d79d87 100644 > --- a/drivers/acpi/bus.c > +++ b/drivers/acpi/bus.c > @@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device > *device, > /* First, check the ACPI/PNP IDs provided by the caller. */ > if (acpi_ids) { > for (id = acpi_ids; id->id[0] || id->cls; id++) { > - if (id->id[0] && !strcmp((char *)id->id, hwid->id)) > + if (id->id[0] && !strncmp((char *)id->id, hwid->id, > ACPI_ID_LEN - 1)) > goto out_acpi_match; > if (id->cls && __acpi_match_device_cls(id, hwid)) > goto out_acpi_match; > diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c > index 75a787da8aad..0bfa422cf094 100644 > --- a/drivers/virt/vmgenid.c > +++ b/drivers/virt/vmgenid.c > @@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device > *device, u32 event) > } > > static const struct acpi_device_id vmgenid_ids[] = { > - {"QEMUVGID", 0}, > + /* This really is VM_Gen_Counter, but we can only match 8 > characters */ > + {"VM_GEN_C", 0}, > {"", 0}, > }; > Looks legit. 
I can propose a patch with it, but how do we validate it doesn't break any devices? >> +2) ASYNC simplified example:: >> + >> + void handle_io_on_vmgenfd(int vmgenfd) >> + { >> + unsigned genid; >> + >> + // because of VM generation change, we need to rebuild world >> + reseed_app_env(); >> + >> + // read new gen ID - we need it to confirm we've handled update >> + read(fd, , sizeof(genid)); > > This is racy in case two consecutive snapshots happen. The read needs > to go before the reseed. > Switched them around like you suggest to avoid confusion. But I don't see a problem with this race. The idea here is to trigger reseed_app_env() which doesn't depend on the generation counter value. Whether it gets incremented once or N times is irrelevant, we're just interested that we pause execution and reseed before resuming (in between these, whether N or M generation changes is the same thing). >> +3) Mapped memory polling simplified example:: >> + >> + /* >> + * app/library function that provides cached secrets >> + */ >> + char * safe_cached_secret(app_data_t *app) >> + { >> + char *secret; >> + volatile unsigned *const genid_ptr = get_vmgenid_mapping(app); >> + again: >> + secret = __cached_secret(app); >> + >> + if (unlikely(*genid_ptr != app->cached_genid)) { >> + // rebuild world then confirm the genid update (thru write) >> + rebuild_caches(app); >> + >> + app->cached_genid = *genid_ptr; > > This is racy again. You need to read the genid before rebuild and set > it here. > I don't see the race. Gen counter is read from volatile mapped mem, on detected change we rebuild world, confirm the update back to the driver then restart the loop. Loop will break when no more changes happen. 
>> +            ack_vmgenid_update(app);
>> +
>> +            goto again;
>> +        }
>> +
>> +        return secret;
>> +    }
>> +
>> +
>> +static int vmgenid_close(struct inode *inode, struct file *file)
>> +{
>> +    struct file_data *file_data = file->private_data;
>> +    struct dev_data *priv = file_data->dev_data;
>> +
>> +    if (file_data->acked_gen_counter != priv->generation_counter)
>> +        vmgenid_put_outdated_watchers(priv);
>
> Is this racy? Could there be a snapshot notification coming between
> the branch and the put?
>
This is indeed racy, will fix it in patch v3.

>> +    atomic_dec(&priv->watchers);
>> +    kfree(file_data);
>> +
>> +    return 0;
>> +}
>> +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
>> +                size_t count, loff_t *ppos)
>> +{
>> +    struct file_data *file_data = file->private_data;
>> +    struct dev_data *priv = file_data->dev_data;
>> +    unsigned int acked_gen_count;
>> +
>> +    /* disallow partial writes */
>> +    if (count != sizeof(acked_gen_count))
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin wrote: > This patch is a driver that exposes a monotonic incremental Virtual > Machine Generation u32 counter via a char-dev FS interface that > provides sync and async VmGen counter updates notifications. It also > provides VmGen counter retrieval and confirmation mechanisms. > > The hw provided UUID is not exposed to userspace, it is internally > used by the driver to keep accounting for the exposed VmGen counter. > The counter starts from zero when the driver is initialized and > monotonically increments every time the hw UUID changes (the VM > generation changes). > > On each hw UUID change, the new hypervisor-provided UUID is also fed > to the kernel RNG. As for v1: Is there a reasonable usecase for the "confirmation" mechanism? It doesn't seem very useful to me. How do you envision integrating this with libraries that have to work in restrictive seccomp sandboxes? If this was in the vDSO, that would be much easier.
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
Hello, +Cc Eric, Adrian On 11/19/20 6:36 PM, Alexander Graf wrote: > On 19.11.20 18:38, Mike Rapoport wrote: >> On Thu, Nov 19, 2020 at 01:51:18PM +0100, Alexander Graf wrote: >>> On 19.11.20 13:02, Christian Borntraeger wrote: On 16.11.20 16:34, Catangiu, Adrian Costin wrote: > - Background > > The VM Generation ID is a feature defined by Microsoft (paper: > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > multiple hypervisor vendors. > > The feature is required in virtualized environments by apps that work > with local copies/caches of world-unique data such as random values, > uuids, monotonically increasing counters, etc. > Such apps can be negatively affected by VM snapshotting when the VM > is either cloned or returned to an earlier point in time. > > The VM Generation ID is a simple concept meant to alleviate the issue > by providing a unique ID that changes each time the VM is restored > from a snapshot. The hw provided UUID value can be used to > differentiate between VMs or different generations of the same VM. > > - Problem > > The VM Generation ID is exposed through an ACPI device by multiple > hypervisor vendors but neither the vendors or upstream Linux have no > default driver for it leaving users to fend for themselves. [..] >>> The only piece where I'm unsure is how this will interact with CRIU. >> >> To C/R applications that use /dev/vmgenid CRIU need to be aware of it. >> Checkpointing and restoring withing the same "VM generation" shouldn't be >> a problem, but IMHO, making restore work after genid bump could be >> challenging. >> >> Alex, what scenario involving CRIU did you have in mind? > > You can in theory run into the same situation with containers that this > patch is solving for virtual machines. You could for example do a > snapshot of a prewarmed Java runtime with CRIU to get full JIT speeds > starting from the first request. > > That however means you run into the problem of predictable randomness > again. 
> >> >>> Can containers emulate ioctls and device nodes? >> >> Containers do not emulate ioctls but they can have /dev/vmgenid inside >> the container, so applications can use it the same way as outside the >> container. > > Hm. I suppose we could add a CAP_ADMIN ioctl interface to /dev/vmgenid > (when container people get to the point of needing it) that sets the > generation to "at least X". That way on restore, you could just call > that with "generation at snapshot"+1. > > That also means we need to have this interface available without virtual > machines then though, right? Sounds like a good idea. I guess, genvmid can be global on host, rather than per-userns or per-process for simplicity. Later if somebody will have a bottleneck on restore when every process on the machine wakes up from read() it could be virtualized, but doing it now sounds too early. ioctl() probably should go under checkpoint_restore_ns_capable(current_user_ns()), rather than CAP_SYS_ADMIN (I believe it should be safe from DOS as only CRIU should run with this capability, but worth to document this). Thanks, Dmitry
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 19.11.20 18:38, Mike Rapoport wrote: On Thu, Nov 19, 2020 at 01:51:18PM +0100, Alexander Graf wrote: On 19.11.20 13:02, Christian Borntraeger wrote: On 16.11.20 16:34, Catangiu, Adrian Costin wrote: - Background The VM Generation ID is a feature defined by Microsoft (paper: http://go.microsoft.com/fwlink/?LinkId=260709) and supported by multiple hypervisor vendors. The feature is required in virtualized environments by apps that work with local copies/caches of world-unique data such as random values, uuids, monotonically increasing counters, etc. Such apps can be negatively affected by VM snapshotting when the VM is either cloned or returned to an earlier point in time. The VM Generation ID is a simple concept meant to alleviate the issue by providing a unique ID that changes each time the VM is restored from a snapshot. The hw provided UUID value can be used to differentiate between VMs or different generations of the same VM. - Problem The VM Generation ID is exposed through an ACPI device by multiple hypervisor vendors but neither the vendors or upstream Linux have no default driver for it leaving users to fend for themselves. I see that the qemu implementation is still under discussion. What is Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :). the status of the other existing implementations. Do they already exist? In other words is ACPI a given? I think the majority of this driver could be used with just a different backend for platforms without ACPI so in any case we could factor out the backend (acpi, virtio, whatever) but if we are open we could maybe start with something else. I agree 100%. I don't think we really need a new framework in the kernel for that. We can just have for example an s390x specific driver that also provides the same notification mechanism through a device node that is also named "/dev/vmgenid", no? 
>> Or alternatively we can split the generic part of this driver as soon
>> as a second one comes along and then have both driver include that
>> generic logic.
>>
>> The only piece where I'm unsure is how this will interact with CRIU.
>
> To C/R applications that use /dev/vmgenid CRIU need to be aware of it.
> Checkpointing and restoring withing the same "VM generation" shouldn't be
> a problem, but IMHO, making restore work after genid bump could be
> challenging.
>
> Alex, what scenario involving CRIU did you have in mind?

You can in theory run into the same situation with containers that this
patch is solving for virtual machines. You could for example do a
snapshot of a prewarmed Java runtime with CRIU to get full JIT speeds
starting from the first request.

That however means you run into the problem of predictable randomness
again.

>> Can containers emulate ioctls and device nodes?
>
> Containers do not emulate ioctls but they can have /dev/vmgenid inside
> the container, so applications can use it the same way as outside the
> container.

Hm. I suppose we could add a CAP_ADMIN ioctl interface to /dev/vmgenid
(when container people get to the point of needing it) that sets the
generation to "at least X". That way on restore, you could just call
that with "generation at snapshot"+1.

That also means we need to have this interface available without virtual
machines then though, right?

Alex
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On Thu, Nov 19, 2020 at 01:51:18PM +0100, Alexander Graf wrote:
> On 19.11.20 13:02, Christian Borntraeger wrote:
> > On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
> > > - Background
> > >
> > > The VM Generation ID is a feature defined by Microsoft (paper:
> > > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> > > multiple hypervisor vendors.
> > >
> > > The feature is required in virtualized environments by apps that work
> > > with local copies/caches of world-unique data such as random values,
> > > uuids, monotonically increasing counters, etc.
> > > Such apps can be negatively affected by VM snapshotting when the VM
> > > is either cloned or returned to an earlier point in time.
> > >
> > > The VM Generation ID is a simple concept meant to alleviate the issue
> > > by providing a unique ID that changes each time the VM is restored
> > > from a snapshot. The hw provided UUID value can be used to
> > > differentiate between VMs or different generations of the same VM.
> > >
> > > - Problem
> > >
> > > The VM Generation ID is exposed through an ACPI device by multiple
> > > hypervisor vendors but neither the vendors nor upstream Linux have a
> > > default driver for it, leaving users to fend for themselves.
> >
> > I see that the qemu implementation is still under discussion. What is
>
> Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).
>
> > the status of the other existing implementations. Do they already exist?
> > In other words is ACPI a given?
> > I think the majority of this driver could be used with just a different
> > backend for platforms without ACPI so in any case we could factor out
> > the backend (acpi, virtio, whatever) but if we are open we could maybe
> > start with something else.
>
> I agree 100%. I don't think we really need a new framework in the kernel for
> that. We can just have for example an s390x specific driver that also
> provides the same notification mechanism through a device node that is also
> named "/dev/vmgenid", no?
>
> Or alternatively we can split the generic part of this driver as soon as a
> second one comes along and then have both drivers include that generic logic.
>
> The only piece where I'm unsure is how this will interact with CRIU.

To C/R applications that use /dev/vmgenid, CRIU needs to be aware of it. Checkpointing and restoring within the same "VM generation" shouldn't be a problem, but IMHO, making restore work after a genid bump could be challenging.

Alex, what scenario involving CRIU did you have in mind?

> Can containers emulate ioctls and device nodes?

Containers do not emulate ioctls but they can have /dev/vmgenid inside the container, so applications can use it the same way as outside the container.

> Alex

--
Sincerely yours,
Mike.
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 19.11.20 13:51, Alexander Graf wrote:
> On 19.11.20 13:02, Christian Borntraeger wrote:
>> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>>> - Background
>>>
>>> The VM Generation ID is a feature defined by Microsoft (paper:
>>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>>> multiple hypervisor vendors.
>>>
>>> The feature is required in virtualized environments by apps that work
>>> with local copies/caches of world-unique data such as random values,
>>> uuids, monotonically increasing counters, etc.
>>> Such apps can be negatively affected by VM snapshotting when the VM
>>> is either cloned or returned to an earlier point in time.
>>>
>>> The VM Generation ID is a simple concept meant to alleviate the issue
>>> by providing a unique ID that changes each time the VM is restored
>>> from a snapshot. The hw provided UUID value can be used to
>>> differentiate between VMs or different generations of the same VM.
>>>
>>> - Problem
>>>
>>> The VM Generation ID is exposed through an ACPI device by multiple
>>> hypervisor vendors but neither the vendors nor upstream Linux have a
>>> default driver for it, leaving users to fend for themselves.
>>
>> I see that the qemu implementation is still under discussion. What is
>
> Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).

Ah right. Found it.

>> the status of the other existing implementations. Do they already exist?
>> In other words is ACPI a given?
>> I think the majority of this driver could be used with just a different
>> backend for platforms without ACPI so in any case we could factor out
>> the backend (acpi, virtio, whatever) but if we are open we could maybe
>> start with something else.
>
> I agree 100%. I don't think we really need a new framework in the kernel for
> that. We can just have for example an s390x specific driver that also
> provides the same notification mechanism through a device node that is also
> named "/dev/vmgenid", no?
>
> Or alternatively we can split the generic part of this driver as soon as a
> second one comes along and then have both drivers include that generic logic.

Yes. I think the best option is probably to decide whether we split this into a front end/back end or provide a new driver once we have something.

> The only piece where I'm unsure is how this will interact with CRIU. Can
> containers emulate ioctls and device nodes?
>
> Alex
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 19.11.20 13:02, Christian Borntraeger wrote:
> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>> - Background
>>
>> The VM Generation ID is a feature defined by Microsoft (paper:
>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>> multiple hypervisor vendors.
>>
>> The feature is required in virtualized environments by apps that work
>> with local copies/caches of world-unique data such as random values,
>> uuids, monotonically increasing counters, etc.
>> Such apps can be negatively affected by VM snapshotting when the VM
>> is either cloned or returned to an earlier point in time.
>>
>> The VM Generation ID is a simple concept meant to alleviate the issue
>> by providing a unique ID that changes each time the VM is restored
>> from a snapshot. The hw provided UUID value can be used to
>> differentiate between VMs or different generations of the same VM.
>>
>> - Problem
>>
>> The VM Generation ID is exposed through an ACPI device by multiple
>> hypervisor vendors but neither the vendors nor upstream Linux have a
>> default driver for it, leaving users to fend for themselves.
>
> I see that the qemu implementation is still under discussion. What is

Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).

> the status of the other existing implementations. Do they already exist?
> In other words is ACPI a given?
> I think the majority of this driver could be used with just a different
> backend for platforms without ACPI so in any case we could factor out
> the backend (acpi, virtio, whatever) but if we are open we could maybe
> start with something else.

I agree 100%. I don't think we really need a new framework in the kernel for that. We can just have for example an s390x specific driver that also provides the same notification mechanism through a device node that is also named "/dev/vmgenid", no?

Or alternatively we can split the generic part of this driver as soon as a second one comes along and then have both drivers include that generic logic.

The only piece where I'm unsure is how this will interact with CRIU. Can containers emulate ioctls and device nodes?

Alex
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors nor upstream Linux have a
> default driver for it, leaving users to fend for themselves.

I see that the qemu implementation is still under discussion. What is the status of the other existing implementations? Do they already exist? In other words, is ACPI a given?

I think the majority of this driver could be used with just a different backend for platforms without ACPI, so in any case we could factor out the backend (acpi, virtio, whatever), but if we are open we could maybe start with something else.
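The backend factoring suggested here could look roughly like the sketch below. It is a userspace model, not code from the patch: `struct vmgen_backend`, `struct vmgen_frontend`, and the `fake_*` backend are hypothetical names made up for illustration. The generic part keeps the last seen 128-bit ID and the u32 counter, and a backend (ACPI, virtio, or something s390x specific) only has to supply the current ID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical backend interface; all names here are illustrative only. */
struct vmgen_backend {
	const char *name;
	/* Copy the current 128-bit generation ID into out; 0 or negative error. */
	int (*read_id)(uint8_t out[16]);
};

/* Generic part: remembers the last ID and maintains the exposed counter. */
struct vmgen_frontend {
	const struct vmgen_backend *be;
	uint8_t last_id[16];
	uint32_t counter;
};

static int vmgen_frontend_init(struct vmgen_frontend *fe,
			       const struct vmgen_backend *be)
{
	assert(fe && be && be->read_id);
	fe->be = be;
	fe->counter = 0;	/* the counter starts from zero */
	return be->read_id(fe->last_id);
}

/*
 * Called when the backend signals a possible change.
 * Returns 1 if the ID changed (counter incremented), 0 if it did not,
 * or the backend's negative error code.
 */
static int vmgen_frontend_poll(struct vmgen_frontend *fe)
{
	uint8_t cur[16];
	int ret = fe->be->read_id(cur);

	if (ret < 0)
		return ret;
	if (!memcmp(cur, fe->last_id, sizeof(cur)))
		return 0;
	memcpy(fe->last_id, cur, sizeof(cur));
	fe->counter++;
	return 1;
}

/* Stand-in backend used only to exercise the sketch. */
static uint8_t fake_id[16];

static int fake_read_id(uint8_t out[16])
{
	memcpy(out, fake_id, sizeof(fake_id));
	return 0;
}

static const struct vmgen_backend fake_backend = {
	.name = "fake",
	.read_id = fake_read_id,
};
```

Whether such a split is worth doing before a second backend exists is exactly the open question in this thread; the sketch only shows that the counter logic does not depend on ACPI.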
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors nor upstream Linux have a
> default driver for it, leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface that
> provides sync and async VmGen counter updates notifications. It also
> provides VmGen counter retrieval and confirmation mechanisms.
>
> The hw provided UUID is not exposed to userspace, it is internally
> used by the driver to keep accounting for the exposed VmGen counter.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the hw UUID changes (the VM
> generation changes).
>
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.
>
> This patch builds on top of Or Idgar's proposal
> https://lkml.org/lkml/2018/3/1/498
>
> - Future improvements
>
> Ideally we would want the driver to register itself based on devices'
> _CID and not _HID, but unfortunately I couldn't find a way to do that.
> The problem is that ACPI device matching is done by
> '__acpi_match_device()' which exclusively looks at
> 'acpi_hardware_id *hwid'.
>
> There is a path for platform devices to match on _CID when _HID is
> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>
> Guidance and help here would be greatly appreciated.

That one is pretty important IMHO. How about the following (probably pretty mangled) patch? That seems to work for me. The ACPI change would obviously need to be its own stand alone change and needs proper assessment whether it could possibly break any existing systems.

diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 1682f8b454a2..452443d79d87 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device *device,
 	/* First, check the ACPI/PNP IDs provided by the caller. */
 	if (acpi_ids) {
 		for (id = acpi_ids; id->id[0] || id->cls; id++) {
-			if (id->id[0] && !strcmp((char *)id->id, hwid->id))
+			if (id->id[0] && !strncmp((char *)id->id, hwid->id, ACPI_ID_LEN - 1))
 				goto out_acpi_match;
 			if (id->cls && __acpi_match_device_cls(id, hwid))
 				goto out_acpi_match;
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
index 75a787da8aad..0bfa422cf094 100644
--- a/drivers/virt/vmgenid.c
+++ b/drivers/virt/vmgenid.c
@@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
 }

 static const struct acpi_device_id vmgenid_ids[] = {
-	{"QEMUVGID", 0},
+	/* This really is VM_Gen_Counter, but we can only match 8 characters */
+	{"VM_GEN_C", 0},
 	{"", 0},
 };

> - v1 -> v2:

Please put the change log below your Signed-off-by line and separate it with a "---" line. That way, git am will ignore the change log on apply.

> - expose to userspace a monotonically increasing u32 Vm Gen Counter
>   instead of the hw VmGen UUID
> - since the hw/hypervisor-provided 128-bit UUID is not public
>   anymore, add it to the kernel RNG as device randomness
> - insert driver page containing Vm Gen Counter in the user vma in
>   the driver's mmap handler instead of using a fault handler
> - turn driver into a misc device driver to auto-create /dev/vmgenid
> - change ioctl arg to avoid leaking kernel structs to userspace
> - update documentation
> - various nits (license, unnecessary casting, Kconfig, others)
> - rebase on top of linus latest
>
> Signed-off-by: Adrian Catangiu
> ---
> Documentation/virt/vmgenid.rst | 228
> drivers/virt/Kconfig | 17
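The effect of the strcmp() to strncmp() change proposed in this message can be tried out in isolation. The sketch below is a userspace model only: `SKETCH_ACPI_ID_LEN` is assumed to be 9 (8 characters plus NUL, consistent with the "we can only match 8 characters" comment), and the longer hardware ID strings are assumed upper-case forms chosen for illustration, not values read from a real device.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: 8 ID characters plus the terminating NUL. */
#define SKETCH_ACPI_ID_LEN 9

/* Exact comparison, as in the unpatched matching loop. */
static bool match_exact(const char *table_id, const char *hw_id)
{
	assert(table_id && hw_id);
	return table_id[0] && !strcmp(table_id, hw_id);
}

/* Comparison limited to the first SKETCH_ACPI_ID_LEN - 1 characters. */
static bool match_prefix(const char *table_id, const char *hw_id)
{
	assert(table_id && hw_id);
	return table_id[0] &&
	       !strncmp(table_id, hw_id, SKETCH_ACPI_ID_LEN - 1);
}
```

With these helpers, `match_exact("VM_GEN_C", "VM_GEN_COUNTER")` is false while `match_prefix("VM_GEN_C", "VM_GEN_COUNTER")` is true. The prefix form also accepts any other ID that shares the same first eight characters, which is one reason the ACPI change would need the careful assessment mentioned above.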
[PATCH v2] drivers/virt: vmgenid: add vm generation id driver
- Background

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.

The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.

- Problem

The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors but neither the vendors nor upstream Linux have a
default driver for it, leaving users to fend for themselves.

Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.

- Solution

This patch is a driver that exposes a monotonic incremental Virtual
Machine Generation u32 counter via a char-dev FS interface that
provides sync and async VmGen counter updates notifications. It also
provides VmGen counter retrieval and confirmation mechanisms.

The hw provided UUID is not exposed to userspace, it is internally
used by the driver to keep accounting for the exposed VmGen counter.
The counter starts from zero when the driver is initialized and
monotonically increments every time the hw UUID changes (the VM
generation changes).

On each hw UUID change, the new hypervisor-provided UUID is also fed
to the kernel RNG.

This patch builds on top of Or Idgar's proposal
https://lkml.org/lkml/2018/3/1/498

- Future improvements

Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.

There is a path for platform devices to match on _CID when _HID is
'PRP0001' - but this is not the case for the Qemu vmgenid device.

Guidance and help here would be greatly appreciated.

- v1 -> v2:

  - expose to userspace a monotonically increasing u32 Vm Gen Counter
    instead of the hw VmGen UUID
  - since the hw/hypervisor-provided 128-bit UUID is not public
    anymore, add it to the kernel RNG as device randomness
  - insert driver page containing Vm Gen Counter in the user vma in
    the driver's mmap handler instead of using a fault handler
  - turn driver into a misc device driver to auto-create /dev/vmgenid
  - change ioctl arg to avoid leaking kernel structs to userspace
  - update documentation
  - various nits (license, unnecessary casting, Kconfig, others)
  - rebase on top of linus latest

Signed-off-by: Adrian Catangiu
---
 Documentation/virt/vmgenid.rst | 228
 drivers/virt/Kconfig | 17 ++
 drivers/virt/Makefile | 1 +
 drivers/virt/vmgenid.c | 390 +
 include/uapi/linux/vmgenid.h | 13 ++
 5 files changed, 649 insertions(+)
 create mode 100644 Documentation/virt/vmgenid.rst
 create mode 100644 drivers/virt/vmgenid.c
 create mode 100644 include/uapi/linux/vmgenid.h

diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 000..603e8a5
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,228 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+VMGENID
+
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by apps that work
+with local copies/caches of world-unique data such as random values,
+uuids, monotonically increasing counters, etc.
+Such apps can be negatively affected by VM snapshotting when the VM
+is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hw provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The driver exposes a monotonic incremental Virtual Machine Generation
+u32 counter via a char-dev FS interface that provides sync and async
+VmGen counter updates notifications. It also provides VmGen counter
+retrieval and confirmation mechanisms.
+