Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin wrote: > On 27/11/2020 20:22, Jann Horn wrote: > > On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote: > >> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin > >> wrote: > >>> This patch is a driver that exposes a monotonic incremental Virtual > >>> Machine Generation u32 counter via a char-dev FS interface that > >>> provides sync and async VmGen counter updates notifications. It also > >>> provides VmGen counter retrieval and confirmation mechanisms. > >>> > >>> The hw provided UUID is not exposed to userspace, it is internally > >>> used by the driver to keep accounting for the exposed VmGen counter. > >>> The counter starts from zero when the driver is initialized and > >>> monotonically increments every time the hw UUID changes (the VM > >>> generation changes). > >>> > >>> On each hw UUID change, the new hypervisor-provided UUID is also fed > >>> to the kernel RNG. > >> As for v1: > >> > >> Is there a reasonable usecase for the "confirmation" mechanism? It > >> doesn't seem very useful to me. > > I think it adds value in complex scenarios with multiple users of the > mechanism, potentially at varying layers of the stack, different > processes and/or runtime libraries. > > The driver offers a natural place to handle minimal orchestration > support and offer visibility in system-wide status. > > A high-level service that trusts all system components to properly use > the confirmation mechanism can actually block and wait patiently for the > system to adjust to the new world. Even if it doesn't trust all > components it can still do a best-effort, timeout block. What concrete action would that high-level service be able to take after waiting for such an event? My model of the vmgenid mechanism is that RNGs and cryptographic libraries that use it need to be fundamentally written such that it is guaranteed that a VM fork can not cause the same random number / counter / ... 
to be reused for two different cryptographic operations in a way visible to an attacker. This means that e.g. TLS libraries need to, between accepting unencrypted input and sending out encrypted data, check whether the vmgenid changed since the connection was set up, and if a vmgenid change occurred, kill the connection. Can you give a concrete example of a usecase where the vmgenid mechanism is used securely and the confirmation mechanism is necessary as part of that? > >> How do you envision integrating this with libraries that have to work > >> in restrictive seccomp sandboxes? If this was in the vDSO, that would > >> be much easier. > > Since this mechanism targets all of the userspace stack, the usecases > vary greatly. I doubt we can have a single silver bullet interface. > > For example, the mmap interface targets user space RNGs, where being as fast > and as race-free as possible is key. But there are also higher-level > applications that don't manage their own memory or don't have access to > low-level primitives so they can't use the mmap or even vDSO interfaces. > That's what the rest of the logic is there for, the read+poll interface > and all of the orchestration logic. Are you saying that, because people might not want to write proper bindings for this interface while also being unwilling to take the performance hit of calling read() in every place where they would have to do so to be fully correct, you want to build a "best-effort" mechanism that is deliberately designed to allow some cryptographic state reuse in a limited time window? > Like you correctly point out, there are also scenarios like tight > seccomp jails where even the FS interface is inaccessible. For cases > like this and others, I believe we will have to work incrementally to > build up the interface diversity to cater to the full diversity of > user scenarios. It would be much nicer if we could have one simple interface that lets everyone correctly do what they need to, though...
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
[resend in the hope that amazon will accept my mail this time instead of replying "550 Too many invalid recipients" again] On Fri, Nov 20, 2020 at 11:29 PM Jann Horn wrote: > On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin > wrote: > > This patch is a driver that exposes a monotonic incremental Virtual > > Machine Generation u32 counter via a char-dev FS interface that > > provides sync and async VmGen counter updates notifications. It also > > provides VmGen counter retrieval and confirmation mechanisms. > > > > The hw provided UUID is not exposed to userspace, it is internally > > used by the driver to keep accounting for the exposed VmGen counter. > > The counter starts from zero when the driver is initialized and > > monotonically increments every time the hw UUID changes (the VM > > generation changes). > > > > On each hw UUID change, the new hypervisor-provided UUID is also fed > > to the kernel RNG. > > As for v1: > > Is there a reasonable usecase for the "confirmation" mechanism? It > doesn't seem very useful to me. > > How do you envision integrating this with libraries that have to work > in restrictive seccomp sandboxes? If this was in the vDSO, that would > be much easier.
Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin wrote: > This patch is a driver that exposes a monotonic incremental Virtual > Machine Generation u32 counter via a char-dev FS interface that > provides sync and async VmGen counter updates notifications. It also > provides VmGen counter retrieval and confirmation mechanisms. > > The hw provided UUID is not exposed to userspace, it is internally > used by the driver to keep accounting for the exposed VmGen counter. > The counter starts from zero when the driver is initialized and > monotonically increments every time the hw UUID changes (the VM > generation changes). > > On each hw UUID change, the new hypervisor-provided UUID is also fed > to the kernel RNG. As for v1: Is there a reasonable usecase for the "confirmation" mechanism? It doesn't seem very useful to me. How do you envision integrating this with libraries that have to work in restrictive seccomp sandboxes? If this was in the vDSO, that would be much easier.
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sat, Oct 17, 2020 at 8:09 PM Alexander Graf wrote: > There are applications way beyond that though. What do you do with > applications that already consumed randomness? For example a cached pool > of SSL keys. Or a higher level language primitive that consumes > randomness and caches its seed somewhere in an internal data structure. For deterministic protection, those would also have to poll some memory location that tells them whether the VmGenID changed: 1. between reading entropy from their RNG pool and using it 2. between collecting data from external sources (user input, clock, ...) and encrypting it and synchronously shoot down the connection if a change happened. If e.g. an application inside the VM has an AES-GCM-encrypted TLS connection and, directly after the VM is restored, triggers an application-level timeout that sends some fixed message across the connection, then the TLS library must guarantee that either the VM was already committed to sending exactly that message before the VM was forked or the message will be blocked. If we don't do that, an attacker who captures both a single packet from the forked VM and traffic from the old VM can decrypt the next message from the old VM after the fork (because AES-GCM is like AES-CTR plus an authenticator, and CTR means that when keystream reuse occurs and one of the plaintexts is known, the attacker can simply recover the other plaintext using XOR). (Or maybe, in disaster failover environments, TLS 1.3 servers could get away with rekeying the connection instead of shooting it down? Ask your resident friendly cryptographer whether that would be secure, I am not one.) I don't think a mechanism based around asynchronously telling the application and waiting for it to confirm the rotation at a later point is going to cut it; we should have some hard semantics on when an application needs to poll this value. > Or even worse: your system's host ssh key. Mmmh... 
I think I normally would not want a VM to reset its host ssh key after merely restoring a snapshot though? And more importantly, Microsoft's docs say that they also change the VmGenID on disaster failover. I think you very much wouldn't want your server to lose its host key every time disaster failover happens. On the other hand, after importing a public VM image, it might be a good idea. I guess you could push that responsibility on the user, by adding an option to the sshd_config that tells OpenSSH whether the host key should be rotated on an ID change or not... but that still would not be particularly pretty. Ideally we would have the host tell us what type of events happened to the VM, or something like that... or maybe even get the host VM management software to ask the user whether they're importing a public image... I really feel like with Microsoft's current protocol, we don't get enough information to figure out what we should do about private long-term authentication keys.
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sat, Oct 17, 2020 at 8:44 AM Willy Tarreau wrote: > On Sat, Oct 17, 2020 at 07:52:48AM +0200, Jann Horn wrote: > > On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau wrote: > > > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote: > > > > Microsoft's documentation > > > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM > > > > Generation ID that we get after a fork "is a 128-bit, > > > > cryptographically random integer value". If multiple people use the > > > > same image, it guarantees that each use of the image gets its own, > > > > fresh ID: > > > > > > No. It cannot be more unique than the source that feeds that cryptographic > > > transformation. All it guarantees is that the entropy source is protected > > > from being guessed based on the output. Applying cryptography on a simple > > > counter provides apparently random numbers that will be unique for a long > > > period for the same source, but as soon as you duplicate that code between > > > users and they start from the same counter they'll get the same IDs. > > > > > > This is why I think that using a counter is better if you really need > > > something > > > unique. Randoms only reduce predictability which helps avoiding > > > collisions. > > > > Microsoft's spec tells us that they're giving us cryptographically > > random numbers. Where they're getting those from is not our problem. > > (And if even the hypervisor is not able to collect enough entropy to > > securely generate random numbers, worrying about RNG reseeding in the > > guest would be kinda pointless, we'd be fairly screwed anyway.) > > Sorry if I sound annoying, but it's a matter of terminology and needs. > > Cryptographically random means safe for use with cryptography in that it > is unguessable enough so that you can use it for encryption keys that > nobody will be able to guess. 
It in no way guarantees uniqueness, just > like you don't really care if the symmetric crypto key of your VPN has > already been used once somewhere else as long as there's no way to know. > However with the good enough distribution that a CSPRNG provides, > collisions within a *same* generator are bound to a very low, predictable > rate which is generally considered acceptable for all use cases. Yes. > Something random (cryptographically or not) *cannot* be unique by > definition, otherwise it's not random anymore, since each draw has an > influence on the remaining list of possible draws, which is contrary to > randomness. And conversely something unique cannot be completely random > because if you know it's unique, you can already rule out all other known > values from the candidates, thus it's more predictable than random. Yes. > With this in mind, picking randoms from a same RNG is often highly > sufficient to consider they're highly likely unique within a long > period. But it's not a guarantee. And it's even less one between two > RNGs (e.g. if uniqueness is required between multiple hypervisors in > case VMs are migrated or centrally managed, which I don't know). > > If what is sought here is a strong guarantee of uniqueness, using a > counter as you first suggested is better. My suggestion is to use a counter *in the UAPI*, not in the hypervisor protocol. (And as long as that counter can only miss increments in a cryptographically negligible fraction of cases, everything's fine.) > If what is sought is pure > randomness (in the sense that it's unpredictable, which I don't think > is needed here), then randoms are better. And this is what *the hypervisor protocol* gives us (which could be very useful for reseeding the kernel RNG). > If both are required, just > concatenate a counter and a random. And if you need them to be spatially > unique, just include a node identifier. 
> > Now the initial needs in the forwarded message are not entirely clear > to me but I wanted to rule out the apparent mismatch between the expressed > needs for uniqueness and the proposed solutions solely based on randomness. Sure, from a theoretical standpoint, it would be a little bit nicer if the hypervisor protocol included a generation number along with the 128-bit random value. But AFAIU it doesn't, so if we want this to just work under Microsoft's existing hypervisor, we'll have to make do with checking whether the random value changed. :P
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau wrote: > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote: > > Microsoft's documentation > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM > > Generation ID that we get after a fork "is a 128-bit, > > cryptographically random integer value". If multiple people use the > > same image, it guarantees that each use of the image gets its own, > > fresh ID: > > No. It cannot be more unique than the source that feeds that cryptographic > transformation. All it guarantees is that the entropy source is protected > from being guessed based on the output. Applying cryptography on a simple > counter provides apparently random numbers that will be unique for a long > period for the same source, but as soon as you duplicate that code between > users and they start from the same counter they'll get the same IDs. > > This is why I think that using a counter is better if you really need > something > unique. Randoms only reduce predictability which helps avoiding collisions. Microsoft's spec tells us that they're giving us cryptographically random numbers. Where they're getting those from is not our problem. (And if even the hypervisor is not able to collect enough entropy to securely generate random numbers, worrying about RNG reseeding in the guest would be kinda pointless, we'd be fairly screwed anyway.) Also note that we don't actually need to *always* reinitialize RNG state on forks for functional correctness; it is fine if that fails with a probability of 2^-128, because functionally everything will be fine, and an attacker who is that lucky could also just guess an AES key (which has the same probability of being successful). (And also 2^-128 is such a tiny number that it doesn't matter anyway.) > And I'm saying this as someone who had on his external gateway the same SSH > host key as 89 other hosts on the net, each of them using randoms to provide > a universally unique one... 
If your SSH host key was shared with 89 other hosts, it evidently wasn't generated from cryptographically random numbers. :P Either because the key generator was not properly hooked up to the system's entropy pool (if you're talking about the Debian fiasco), or because the system simply did not have enough entropy available. (Or because the key generator is broken, but I don't think that ever happened with OpenSSH?)
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sat, Oct 17, 2020 at 6:34 AM Colm MacCarthaigh wrote: > On 16 Oct 2020, at 21:02, Jann Horn wrote: > > On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau wrote: > > But in userspace, we just need a simple counter. There's no need for > > us to worry about anything else, like timestamps or whatever. If we > > repeatedly fork a paused VM, the forked VMs will see the same counter > > value, but that's totally fine, because the only thing that matters to > > userspace is that the counter changes when the VM is forked. > > For user-space, even a single bit would do. We added MADV_WIPEONFORK > so that userspace libraries can detect fork()/clone() robustly, for the > same reasons. It just wipes a page as the indicator, which is > effectively a single-bit signal, and it works well. On the user-space > side of this, I’m keen to find a solution like that that we can use > fairly easily inside of portable libraries and applications. The “have > I forked” checks do end up in hot paths, so it’s nice if they can be > CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my > favorite. I'm pretty sure a single bit is not enough if you want to have a single page, shared across the entire system, that stores the VM forking state; you need a counter for that. > > And actually, since the value is a cryptographically random 128-bit > > value, I think that we should definitely use it to help reseed the > > kernel's RNG, and keep it secret from userspace. That way, even if the > > VM image is public, we can ensure that going forward, the kernel RNG > > will return securely random data. > > If the image is public, you need some extra new raw entropy from > somewhere. The gen-id could be mixed in, that can’t do any harm as > long as rigorous cryptographic mixing with the prior state is used, but > if that’s all you do then the final state is still deterministic and > non-secret. 
Microsoft's documentation (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM Generation ID that we get after a fork "is a 128-bit, cryptographically random integer value". If multiple people use the same image, it guarantees that each use of the image gets its own, fresh ID: The table in section "How to implement virtual machine generation ID support in a virtualization platform" says that (among other things) "Virtual machine is imported, copied, or cloned" generates a new generation ID. So the RNG state after mixing in the new VM Generation ID would contain 128 bits of secret entropy not known to anyone else, including people with access to the VM image. Now, 128 bits of cryptographically random data aren't _optimal_; I think something on the order of 256 bits would be nicer from a theoretical standpoint. But in practice I think we'll be good with the 128 bits we're getting (since the number of users who fork a VM image is probably not going to be so large that worst-case collision probabilities matter). > The kernel would need to use the change as a trigger to > measure some entropy (e.g. interrupts and RDRAND, or whatever). Or just > define the machine contract as “this has to be unique random data and > if it’s not unique, or if it’s public, you’re toast”. As far as I can tell from Microsoft's spec, that is a guarantee we're already getting.
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau wrote: > On Sat, Oct 17, 2020 at 03:40:08AM +0200, Jann Horn wrote: > > [adding some more people who are interested in RNG stuff: Andy, Jason, > > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > > concerns some pretty fundamental API stuff related to RNG usage] > > > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > > wrote: > > > This patch is a driver which exposes the Virtual Machine Generation ID > > > via a char-dev FS interface that provides ID update sync and async > > > notification, retrieval and confirmation mechanisms: > > > > > > When the device is 'open()'ed a copy of the current vm UUID is > > > associated with the file handle. 'read()' operations block until the > > > associated UUID is no longer up to date - until HW vm gen id changes - > > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > > updates result in 'EPOLLIN' events. > > > > > > Subsequent read()s following a UUID update no longer block, but return > > > the updated UUID. The application needs to acknowledge the UUID update > > > by confirming it through a 'write()'. > > > Only on writing back to the driver the right/latest UUID, will the > > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > > > 'mmap()' support allows mapping a single read-only shared page which > > > will always contain the latest UUID value at offset 0. > > > > It would be nicer if that page just contained an incrementing counter, > > instead of a UUID. It's not like the application cares *what* the UUID > > changed to, just that it *did* change and all RNGs state now needs to > > be reseeded from the kernel, right? 
And an application can't reliably > > read the entire UUID from the memory mapping anyway, because the VM > > might be forked in the middle. > > > > So I think your kernel driver should detect UUID changes and then turn > > those into a monotonically incrementing counter. (Probably 64 bits > > wide?) (That's probably also a little bit faster than comparing an > > entire UUID.) > > I agree with this. Further, I'm observing there is a very common > confusion between "universally unique" and "random". Randoms are > needed when seeking unpredictability. A random number generator > *must* be able to return the same value multiple times in a row > (though this is rare), otherwise it's not random. [...] > If the UUIDs used there are real UUIDs, it could be as simple as > updating them according to their format, i.e. updating the timestamp, > and if the timestamp is already the same, just increase the seq counter. > Doing this doesn't require entropy, doesn't need to block and doesn't > needlessly leak randoms that sometimes make people feel nervous. Those UUIDs are supplied by existing hypervisor code; in that regard, this is almost like a driver for a hardware device. It is written against a fixed API provided by the underlying machine. Making sure that the sequence of UUIDs, as seen from inside the machine, never changes back to a previous one is the responsibility of the hypervisor and out of scope for this driver. Microsoft's spec document (which is a .docx file for reasons I don't understand) actually promises us that it is a cryptographically random 128-bit integer value, which means that even if you fork a VM 2^64 times, the probability that a given one of those VMs ends up with the same ID as any other is at most 2^-64. That should be good enough. But in userspace, we just need a simple counter. There's no need for us to worry about anything else, like timestamps or whatever. 
If we repeatedly fork a paused VM, the forked VMs will see the same counter value, but that's totally fine, because the only thing that matters to userspace is that the counter changes when the VM is forked. And actually, since the value is a cryptographically random 128-bit value, I think that we should definitely use it to help reseed the kernel's RNG, and keep it secret from userspace. That way, even if the VM image is public, we can ensure that going forward, the kernel RNG will return securely random data.
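The collision probabilities being waved at here follow from the standard birthday-style estimates for n forks drawing independent uniform 128-bit IDs (a textbook approximation, not something from the spec):

```latex
P(\text{some pair of forks collides}) \;\approx\; \frac{n(n-1)}{2 \cdot 2^{128}}
\;\le\; \frac{n^{2}}{2^{129}},
\qquad
P(\text{a given fork collides with any other}) \;\le\; \frac{n}{2^{128}}.
% For n = 2^{64}: a given fork collides with probability at most 2^{-64}.
```

In other words, a missed counter increment requires a 128-bit collision, which is in the same ballpark as guessing an AES-128 key.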
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
[adding some more people who are interested in RNG stuff: Andy, Jason, Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this concerns some pretty fundamental API stuff related to RNG usage] On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin wrote: > - Background > > The VM Generation ID is a feature defined by Microsoft (paper: > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > multiple hypervisor vendors. > > The feature is required in virtualized environments by apps that work > with local copies/caches of world-unique data such as random values, > uuids, monotonically increasing counters, etc. > Such apps can be negatively affected by VM snapshotting when the VM > is either cloned or returned to an earlier point in time. > > The VM Generation ID is a simple concept meant to alleviate the issue > by providing a unique ID that changes each time the VM is restored > from a snapshot. The hw provided UUID value can be used to > differentiate between VMs or different generations of the same VM. > > - Problem > > The VM Generation ID is exposed through an ACPI device by multiple > hypervisor vendors, but neither the vendors nor upstream Linux has a > default driver for it, leaving users to fend for themselves. > > Furthermore, simply finding out about a VM generation change is only > the starting point of a process to renew internal states of possibly > multiple applications across the system. This process could benefit > from a driver that provides an interface through which orchestration > can be easily done. > > - Solution > > This patch is a driver which exposes the Virtual Machine Generation ID > via a char-dev FS interface that provides ID update sync and async > notification, retrieval and confirmation mechanisms: > > When the device is 'open()'ed a copy of the current vm UUID is > associated with the file handle. 
'read()' operations block until the > associated UUID is no longer up to date - until HW vm gen id changes - > at which point the new UUID is provided/returned. Nonblocking 'read()' > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > 'poll()' is implemented to allow polling for UUID updates. Such > updates result in 'EPOLLIN' events. > > Subsequent read()s following a UUID update no longer block, but return > the updated UUID. The application needs to acknowledge the UUID update > by confirming it through a 'write()'. > Only on writing back to the driver the right/latest UUID, will the > driver mark this "watcher" as up to date and remove EPOLLIN status. > > 'mmap()' support allows mapping a single read-only shared page which > will always contain the latest UUID value at offset 0. It would be nicer if that page just contained an incrementing counter, instead of a UUID. It's not like the application cares *what* the UUID changed to, just that it *did* change and all RNGs state now needs to be reseeded from the kernel, right? And an application can't reliably read the entire UUID from the memory mapping anyway, because the VM might be forked in the middle. So I think your kernel driver should detect UUID changes and then turn those into a monotonically incrementing counter. (Probably 64 bits wide?) (That's probably also a little bit faster than comparing an entire UUID.) An option might be to put that counter into the vDSO, instead of a separate VMA; but I don't know how the other folks feel about that. Andy, do you have opinions on this? That way, normal userspace code that uses this infrastructure wouldn't have to mess around with a special device at all. And it'd be usable in seccomp sandboxes and so on without needing special plumbing. And libraries wouldn't have to call open() and mess with file descriptor numbers. > The driver also adds support for tracking count of open file handles > that haven't acknowledged an UUID update. 
This is exposed through > two IOCTLs: > * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of > _outdated_ watchers - number of file handles that were open during > a VM generation change, and which have not yet confirmed the new > Vm-Gen-Id. > * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_ > watchers, or if a 'timeout' argument is provided, until the timeout > expires. Does this mean that code that uses the memory mapping to detect changes is still supposed to confirm generation IDs? What about multithreaded processes, especially ones that use libraries - if a library opens a single file descriptor that is used from multiple threads, is the library required to synchronize all its threads before confirming the change? That seems like it could get messy. And it again means that this interface can't easily be used from inside seccomp sandboxes. [...] > diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst [...] > +``close()``: > + Removes the file handle as a Vm-Gen-Id watcher. (Linux doesn't have "file handles". Technically
[Qemu-devel] seccomp blacklist is not applied to all threads
Hi! I have noticed that when a QEMU build from git master is started with "-seccomp on", the seccomp policy is only applied to the main thread, the vcpu worker thread and the VNC thread (I'm using VNC in my config); the seccomp policy is not applied to e.g. the RCU thread because it is created before the seccomp policy is applied and SECCOMP_FILTER_FLAG_TSYNC isn't used. In practice, this makes the seccomp policy essentially useless. You'll probably want to add something like seccomp_attr_set(ctx, SCMP_FLTATR_CTL_TSYNC, 1) for systems with kernel >=3.17; if you have to support seccomp on older kernels, you'll have to either construct the other threads after initializing seccomp or manually activate seccomp on those threads. This shows up in strace as follows: $ strace -f -e trace=seccomp,clone,prctl x86_64-softmmu/qemu-system-x86_64 -enable-kvm -drive file=isos/debian-live-8.7.1-amd64-standard.iso,format=raw -m 4G -sandbox on -name testvm,debug-threads=on clone(child_stack=0x7f199af49bf0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f199af4a9d0, tls=0x7f199af4a700, child_tidptr=0x7f199af4a9d0) = 10948 strace: Process 10948 attached [pid 10947] prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) = 0 [pid 10947] clone(child_stack=0x7f199a748bf0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f199a7499d0, tls=0x7f199a749700, child_tidptr=0x7f199a7499d0) = 10949 strace: Process 10949 attached [pid 10947] prctl(PR_SET_TIMERSLACK, 1) = 0 [pid 10947] prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0 [pid 10947] seccomp(SECCOMP_SET_MODE_STRICT, 1, NULL) = -1 EINVAL (Invalid argument) [pid 10947] seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=39, filter=0x55a968927830}) = 0 [pid 10947] clone(strace: Process 10950 attached child_stack=0x7f1999d65bf0, 
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f1999d669d0, tls=0x7f1999d66700, child_tidptr=0x7f1999d669d0) = 10950 [pid 10950] prctl(PR_SET_NAME, "CPU 0/KVM") = 0 [pid 10947] clone(strace: Process 10952 attached child_stack=0x7f1993bfebf0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f1993bff9d0, tls=0x7f1993bff700, child_tidptr=0x7f1993bff9d0) = 10952 [pid 10952] prctl(PR_SET_NAME, "worker") = 0 [pid 10947] clone(strace: Process 10953 attached child_stack=0x7f19933fdbf0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f19933fe9d0, tls=0x7f19933fe700, child_tidptr=0x7f19933fe9d0) = 10953 [pid 10953] prctl(PR_SET_NAME, "vnc_worker") = 0 VNC server running on ::1:5900 [pid 10952] +++ exited with 0 +++ Checking in procfs confirms that two of the threads aren't sandboxed: $ for task in /proc/10947/task/*; do cat $task/comm; grep Seccomp $task/status; done qemu-system-x86 Seccomp: 2 qemu-system-x86 Seccomp: 0 qemu-system-x86 Seccomp: 0 CPU 0/KVM Seccomp: 2 vnc_worker Seccomp: 2
Re: [Qemu-devel] insecure git submodule URLs
On Sun, Jul 15, 2018 at 11:18 PM Peter Maydell wrote:
> On 15 July 2018 at 20:50, Jann Horn via Qemu-devel wrote:
> > I noticed that when I build QEMU from git for the first time, it pulls
> > in submodules over the insecure git:// protocol - in other words, as
> > far as I can tell, if I'm e.g. on an open wifi network while building
> > QEMU for the first time, even if I cloned the main repository over
> > https, anyone could smuggle in malicious code as part of e.g. a
> > submodule's makefile.
>
> Yes, this came up the other week.
>
> > I'm not sure what your preferred fix for this is, so I'm not sending a
> > patch yet. As far as I can tell, the two options are:
> >
> > - change .gitmodules to use https for everything
>
> We should probably do this...
>
> > - change .gitmodules to use relative URLs
> >
> > If you want, I'll send a patch that does one of these, although it's
> > probably faster if you just do it yourselves.
> >
> > Relative URLs would have the advantage that if someone is cloning from
> > a mirror (in other words, github), the submodules will also
> > automatically come from the same mirror.
>
> Do we mirror all our submodules to github?

Looks that way - I cloned from github, patched my .gitmodules to use relative URLs for everything, then tested:

$ git submodule init
Submodule 'capstone' (capstone.git) registered for path 'capstone'
Submodule 'dtc' (dtc.git) registered for path 'dtc'
Submodule 'roms/QemuMacDrivers' (QemuMacDrivers.git) registered for path 'roms/QemuMacDrivers'
Submodule 'roms/SLOF' (SLOF.git) registered for path 'roms/SLOF'
Submodule 'roms/ipxe' (ipxe.git) registered for path 'roms/ipxe'
Submodule 'roms/openbios' (openbios.git) registered for path 'roms/openbios'
Submodule 'roms/openhackware' (openhackware.git) registered for path 'roms/openhackware'
Submodule 'roms/qemu-palcode' (qemu-palcode.git) registered for path 'roms/qemu-palcode'
Submodule 'roms/seabios' (seabios.git) registered for path 'roms/seabios'
Submodule 'roms/seabios-hppa' (https://github.com/hdeller/seabios-hppa.git) registered for path 'roms/seabios-hppa'
Submodule 'roms/sgabios' (sgabios.git) registered for path 'roms/sgabios'
Submodule 'roms/skiboot' (skiboot.git) registered for path 'roms/skiboot'
Submodule 'roms/u-boot' (u-boot.git) registered for path 'roms/u-boot'
Submodule 'roms/u-boot-sam460ex' (u-boot-sam460ex.git) registered for path 'roms/u-boot-sam460ex'
Submodule 'ui/keycodemapdb' (keycodemapdb.git) registered for path 'ui/keycodemapdb'
$ git submodule update
Submodule path 'capstone': checked out '22ead3e0bfdb87516656453336160e0a37b066bf'
Submodule path 'dtc': checked out 'e54388015af1fb4bf04d0bca99caba1074d9cc42'
Submodule path 'roms/QemuMacDrivers': checked out 'd4e7d7ac663fcb55f1b93575445fcbca372f17a7'
Submodule path 'roms/SLOF': checked out '7d37babcfa48a6eb08e726a8d13b745cb2eebe1c'
Submodule path 'roms/ipxe': checked out '0600d3ae94f93efd10fc6b3c7420a9557a3a1670'
Submodule path 'roms/openbios': checked out '8fe6f5f96f6ca39f1f62200be7fa130e929f13f2'
Submodule path 'roms/openhackware': checked out 'c559da7c8eec5e45ef1f67978827af6f0b9546f5'
Submodule path 'roms/qemu-palcode': checked out 'f3c7e44c70254975df2a00af39701eafbac4d471'
Submodule path 'roms/seabios': checked out 'f9626ccb91e771f990fbb2da92e427a399d7d918'
Submodule path 'roms/seabios-hppa': checked out '1ef99a01572c2581c30e16e6fe69e9ea2ef92ce0'
Submodule path 'roms/sgabios': checked out 'cbaee52287e5f32373181cff50a00b6c4ac9015a'
Submodule path 'roms/skiboot': checked out 'e0ee24c27a172bcf482f6f2bc905e6211c134bcc'
Submodule path 'roms/u-boot': checked out 'd85ca029f257b53a96da6c2fb421e78a003a9943'
Submodule path 'roms/u-boot-sam460ex': checked out '60b3916f33e617a815973c5a6df77055b2e3a588'
Submodule path 'ui/keycodemapdb': checked out '6b3d716e2b6472eb7189d3220552280ef3d832ce'

> > As far as I can tell, the QEMU git server only supports the "dumb" git
> > protocol when accessed over HTTPS, not the "smart" protocol. I'm not
> > sure whether that might be why QEMU is currently still using the
> > insecure git protocol instead of git over HTTPS?
>
> This is why we haven't switched over the submodules yet, yes.
> It's on Jeff's todo list for the server, though.
>
> thanks
> -- PMM
[Qemu-devel] insecure git submodule URLs
Hi! I noticed that when I build QEMU from git for the first time, it pulls in submodules over the insecure git:// protocol - in other words, as far as I can tell, if I'm e.g. on an open wifi network while building QEMU for the first time, even if I cloned the main repository over https, anyone could smuggle in malicious code as part of e.g. a submodule's makefile.

I'm not sure what your preferred fix for this is, so I'm not sending a patch yet. As far as I can tell, the two options are:

- change .gitmodules to use https for everything
- change .gitmodules to use relative URLs

If you want, I'll send a patch that does one of these, although it's probably faster if you just do it yourselves.

Relative URLs would have the advantage that if someone is cloning from a mirror (in other words, github), the submodules will also automatically come from the same mirror.

As far as I can tell, the QEMU git server only supports the "dumb" git protocol when accessed over HTTPS, not the "smart" protocol. I'm not sure whether that might be why QEMU is currently still using the insecure git protocol instead of git over HTTPS?
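To illustrate the relative-URL option: git resolves a relative submodule URL against the URL the superproject was cloned from, so a clone from github.com/qemu/qemu would fetch the submodule from github.com/qemu/dtc.git, while a clone from the project's own server would stay on that server. A hypothetical entry (the real .gitmodules may differ):

```ini
; Hypothetical .gitmodules entry using a relative URL.
[submodule "dtc"]
	path = dtc
	url = ../dtc.git
```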
[Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching
This is an issue in QEMU's system emulation for X86 in TCG mode. The issue permits an attacker who can execute code in guest ring 3 with normal user privileges to inject code into other processes that are running in guest ring 3, in particular root-owned processes.

== reproduction steps ==

- Create an x86-64 VM and install Debian Jessie in it. The following steps should all be executed inside the VM.
- Verify that procmail is installed and the correct version:

root@qemuvm:~# apt-cache show procmail | egrep 'Version|SHA'
Version: 3.22-24
SHA1: 54ed2d51db0e76f027f06068ab5371048c13434c
SHA256: 4488cf6975af9134a9b5238d5d70e8be277f70caa45a840dfbefd2dc444bfe7f

- Install build-essential and nasm ("apt install build-essential nasm").
- Unpack the exploit, compile it and run it:

user@qemuvm:~$ tar xvf procmail_cache_attack.tar
procmail_cache_attack/
procmail_cache_attack/shellcode.asm
procmail_cache_attack/xp.c
procmail_cache_attack/compile.sh
procmail_cache_attack/attack.c
user@qemuvm:~$ cd procmail_cache_attack
user@qemuvm:~/procmail_cache_attack$ ./compile.sh
user@qemuvm:~/procmail_cache_attack$ ./attack
memory mappings set up
child is dead, codegen should be complete
executing code as root! :)
root@qemuvm:~/procmail_cache_attack# id
uid=0(root) gid=0(root) groups=0(root),[...]

Note: While the exploit depends on the precise version of procmail, the actual vulnerability is in QEMU, not in procmail. procmail merely serves as a seldom-executed setuid root binary into which code can be injected.

== detailed issue description ==

QEMU caches translated basic blocks. To look up a translated basic block, the function tb_find() is used, which uses tb_htable_lookup() in its slowpath, which in turn compares translated basic blocks (TranslationBlock) to the lookup information (struct tb_desc) using tb_cmp(). tb_cmp() attempts to ensure (among other things) that both the virtual start address of the basic block and the physical addresses that the basic block covers match.
When checking the physical addresses, it assumes that a basic block can span at most two pages. gen_intermediate_code() attempts to enforce this by stopping the translation of a basic block if nearly one page of instructions has been translated already:

        /* if too long translation, stop generation too */
        if (tcg_op_buf_full() ||
            (pc_ptr - pc_start) >= (TARGET_PAGE_SIZE - 32) ||
            num_insns >= max_insns) {
            gen_jmp_im(pc_ptr - dc->cs_base);
            gen_eob(dc);
            break;
        }

However, while real X86 processors have a maximum instruction length of 15 bytes, QEMU's instruction decoder for X86 does not place any limit on the instruction length or the number of instruction prefixes. Therefore, it is possible to create an arbitrarily long instruction by e.g. prepending an arbitrary number of LOCK prefixes to a normal instruction. This permits creating a basic block that spans three pages by simply appending an approximately page-sized instruction to the end of a normal basic block that starts close to the end of a page.

Such an overlong basic block causes the basic block caching to fail as follows: If code is generated and cached for a basic block that spans the physical pages (A,E,B), this basic block will be returned by lookups in a process in which the physical pages (A,B,C) are mapped in the same virtual address range (assuming that all other lookup parameters match).

This behavior can be abused by an attacker e.g. as follows: If a non-relocatable world-readable setuid executable legitimately contains the pages (A,B,C), an attacker can map (A,E,B) into his own process, at the normal load address of A, where E is an attacker-controlled page. If a legitimate basic block spans the pages A and B, an attacker can write arbitrary non-branch instructions at the start of E, then append an overlong instruction that ends behind the start of C, yielding a modified basic block that spans all three pages.
If the attacker then executes the modified basic block in his process, the modified basic block is cached. Next, the attacker can execute the setuid binary, which will reuse the cached modified basic block, executing attacker-controlled instructions in the context of the privileged process.

I am sending this to qemu-devel because a QEMU security contact told me that QEMU does not consider privilege escalation inside a TCG VM to be a security concern.

Attachment: procmail_cache_attack.tar (Unix tar archive)
Re: [Qemu-devel] [PATCH] linux-user: limit number of arguments to execve
On Fri, Mar 3, 2017 at 4:56 PM, Peter Maydell wrote:
> On 3 March 2017 at 14:54, Eric Blake wrote:
>>> +        ret = -TARGET_EFAULT;
>>> +        break;
>>> +    }
>>>     argp = alloca((argc + 1) * sizeof(void *));
>>>     envp = alloca((envc + 1) * sizeof(void *));
>>
>> ...Uggh. You're using alloca() but allowing an allocation of way more
>> than 4k. That means a guest can cause corruption of the stack (or, with
>> large enough arguments, even escape out of the stack) before you even
>> get to the execve() call to even worry about E2BIG issues.
>
> Yeah, linux-user is shot through with that kind of alloca() usage.
>
> (It's not great, but it's not a security hole because we already
> give the guest binary complete control to do anything it likes.
> Worth fixing bugs if we run into them, though.)

It could be a security hole if a benign guest userspace process decides to allow a remote client to specify a bunch of environment variables or so. E.g. HTTP servers with CGI support pass HTTP headers on as environment variables whose names start with "HTTP_"; however, since HTTP servers usually limit the request size, this isn't actually exploitable in that case. But there could theoretically be similar scenarios.
Re: [Qemu-devel] [PATCH] linux-user: limit number of arguments to execve
On Fri, Mar 3, 2017 at 4:55 PM, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 3 March 2017 at 11:25, P J P <ppan...@redhat.com> wrote:
>> From: Prasad J Pandit <p...@fedoraproject.org>
>>
>> Limit the number of arguments passed to execve(2) call from
>> a user program, as large number of them could lead to a bad
>> guest address error.
>>
>> Reported-by: Jann Horn <ja...@google.com>
>> Signed-off-by: Prasad J Pandit <p...@fedoraproject.org>
>> ---
>>  linux-user/syscall.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
>> index 9be8e95..c545c12 100644
>> --- a/linux-user/syscall.c
>> +++ b/linux-user/syscall.c
>> @@ -7766,6 +7766,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
>>  #endif
>>      case TARGET_NR_execve:
>>          {
>> +#define ARG_MAX 65535
>>              char **argp, **envp;
>>              int argc, envc;
>>              abi_ulong gp;
>> @@ -7794,6 +7795,12 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
>>                  envc++;
>>              }
>>
>> +            if (argc > ARG_MAX || envc > ARG_MAX) {
>> +                fprintf(stderr,
>> +                        "argc(%d), envc(%d) exceed %d\n", argc, envc, ARG_MAX);
>> +                ret = -TARGET_EFAULT;
>> +                break;
>> +            }
>>              argp = alloca((argc + 1) * sizeof(void *));
>>              envp = alloca((envc + 1) * sizeof(void *));
>
> This code is already supposed to handle "argument string too big",
> see commit a6f79cc9a5e.
>
> What's the actual bug case we're trying to handle here?

commit a6f79cc9a5e doesn't help here, it only bails out after the two alloca() calls.
Re: [Qemu-devel] [PATCH 23/29] 9pfs: local: chmod: don't follow symlinks
On Fri, Feb 24, 2017 at 4:23 PM, Eric Blake wrote:
> On 02/24/2017 04:34 AM, Greg Kurz wrote:
>> On Thu, 23 Feb 2017 15:10:42 +, Stefan Hajnoczi wrote:
>>
>>> On Mon, Feb 20, 2017 at 03:42:19PM +0100, Greg Kurz wrote:
>>>> The local_chmod() callback is vulnerable to symlink attacks because
>>>> it calls:
>>>>
>>>> (1) chmod() which follows symbolic links for all path elements
>>>> (2) local_set_xattr()->setxattr() which follows symbolic links for
>>>>     all path elements
>>>> (3) local_set_mapped_file_attr() which calls in turn local_fopen()
>>>>     and mkdir(), both functions following symbolic links for all
>>>>     path elements but the rightmost one
>>>>
>>>> +    fd = local_open_nofollow(fs_ctx, fs_path->data, O_RDONLY, 0);
>>>> +    if (fd == -1) {
>>>> +        return -1;
>>>> +    }
>>>> +    ret = fchmod(fd, credp->fc_mode);
>>>> +    close_preserve_errno(fd);
>>>
>>> As mentioned on IRC, not sure this is workable since chmod(2) must not
>>> require read permission on the file. This patch introduces failures
>>> when the file isn't readable.
>>
>> Yeah :-\ This one is the more painful issue to address. I need a
>> chmod() variant that would fail on symlinks instead of following
>> them... and I'm kinda suffering to implement this in a race-free
>> manner.
>
> BSD has lchmod(), but Linux does not. And Linux does not yet properly
> implement AT_SYMLINK_NOFOLLOW. (The BSD semantics are pretty cool:
> restricting mode bits on a symlink can create scenarios where some users
> can dereference the symlink but others get ENOENT failures - but I don't
> know of any effort to add those semantics to the Linux kernel)
>
> POSIX says support for AT_SYMLINK_NOFOLLOW is optional, precisely
> because POSIX does not require permissions on symlinks to matter, but
> the way it requires it is that AT_SYMLINK_NOFOLLOW is supposed to be a
> no-op for regular files, and cause an EOPNOTSUPP failure for symlinks.
> Unfortunately, the Linux man page says that Linux fails with ENOTSUP
> (which is identical to EOPNOTSUPP) any time AT_SYMLINK_NOFOLLOW is set,
> and not just if it is attempted on a symlink.

And unfortunately, that flags argument is not actually present in the real syscall. See this glibc code:

int
fchmodat (int fd, const char *file, mode_t mode, int flag)
{
  if (flag & ~AT_SYMLINK_NOFOLLOW)
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
#ifndef __NR_lchmod
  /* Linux so far has no lchmod syscall.  */
  if (flag & AT_SYMLINK_NOFOLLOW)
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP);
#endif
  return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
}

and this kernel code:

SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
                umode_t, mode)
{
    [...]
}

So to fix this, you'll probably have to add a new syscall fchmodat2() to the kernel, wire it up for all the architectures and get the various libc implementations to adopt that. That's going to be quite tedious. :(
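Short of a new syscall, there is a known race-free userspace approximation on Linux when /proc is mounted (this is essentially what musl's fchmodat() does for AT_SYMLINK_NOFOLLOW): open the target with O_PATH|O_NOFOLLOW, which never dereferences a trailing symlink and needs no read permission, then chmod() the corresponding /proc/self/fd magic link. A sketch; the function name is mine:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* chmod() that refuses to follow a trailing symlink, without requiring
 * read permission on the target.  Requires /proc to be mounted. */
static int chmod_nofollow(const char *path, mode_t mode) {
    char proc_path[64];
    struct stat st;
    int fd = open(path, O_PATH | O_NOFOLLOW);
    if (fd < 0)
        return -1;
    /* O_PATH|O_NOFOLLOW opens the symlink itself; mirror the POSIX
     * behavior Eric describes by failing with EOPNOTSUPP in that case. */
    if (fstat(fd, &st) == 0 && S_ISLNK(st.st_mode)) {
        close(fd);
        errno = EOPNOTSUPP;
        return -1;
    }
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    int ret = chmod(proc_path, mode);
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret;
}
```

Because the mode change is applied through the already-opened fd's magic link, no attacker can swap a symlink into the path between the check and the chmod.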
Re: [Qemu-devel] [PATCH 07/29] 9pfs: local: introduce symlink-attack safe xattr helpers
On Thu, Feb 23, 2017 at 4:02 PM, Eric Blake wrote:
> On 02/20/2017 08:40 AM, Greg Kurz wrote:
>> All operations dealing with extended attributes are vulnerable to symlink
>> attacks because they use path-based syscalls which can traverse symbolic
>> links while walking through the dirname part of the path.
>>
>> The solution is to introduce helpers based on opendir_nofollow(). This
>> calls for "at" versions of the extended attribute syscalls, which don't
>> exist unfortunately. This patch implements them by simulating the "at"
>> behavior with fchdir(). Since the current working directory is process
>> wide, and we don't want to confuse another thread in QEMU, all the work
>> is done in a separate process.
>
> Can you emulate *at using /proc/fd/nnn/xyz?

I don't know much about QEMU internals, but QEMU supports running in a chroot using the -chroot option, right? Does that already require procfs to be mounted inside the chroot?
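For reference, a minimal sketch of the /proc-based emulation Eric suggests, using listxattr() as the example (the function name is mine; the catch raised above applies: this only works if /proc is mounted, which a -chroot setup may not guarantee):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Emulate an "at"-style, symlink-safe xattr syscall: pin the target with
 * O_PATH|O_NOFOLLOW (no trailing-symlink traversal, no fchdir(), no extra
 * process), then run the path-based xattr call on the /proc magic link. */
static ssize_t listxattr_nofollow(const char *path, char *list, size_t size) {
    char proc_path[64];
    int fd = open(path, O_PATH | O_NOFOLLOW);
    if (fd < 0)
        return -1;
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    ssize_t ret = listxattr(proc_path, list, size);
    close(fd);
    return ret;
}
```

The same shape works for getxattr/setxattr/removexattr, which is why it can replace the fork()+fchdir() dance, subject to the procfs-in-chroot question.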