Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

2020-11-27 Thread Jann Horn
On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin
 wrote:
> On 27/11/2020 20:22, Jann Horn wrote:
> > On Fri, Nov 20, 2020 at 11:29 PM Jann Horn  wrote:
> >> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
> >>  wrote:
> >>> This patch is a driver that exposes a monotonic incremental Virtual
> >>> Machine Generation u32 counter via a char-dev FS interface that
> >>> provides sync and async VmGen counter updates notifications. It also
> >>> provides VmGen counter retrieval and confirmation mechanisms.
> >>>
> >>> The hw provided UUID is not exposed to userspace, it is internally
> >>> used by the driver to keep accounting for the exposed VmGen counter.
> >>> The counter starts from zero when the driver is initialized and
> >>> monotonically increments every time the hw UUID changes (the VM
> >>> generation changes).
> >>>
> >>> On each hw UUID change, the new hypervisor-provided UUID is also fed
> >>> to the kernel RNG.
> >> As for v1:
> >>
> >> Is there a reasonable usecase for the "confirmation" mechanism? It
> >> doesn't seem very useful to me.
>
> I think it adds value in complex scenarios with multiple users of the
> mechanism, potentially at varying layers of the stack, different
> processes and/or runtime libraries.
>
> The driver offers a natural place to handle minimal orchestration
> support and offer visibility into system-wide status.
>
> A high-level service that trusts all system components to properly use
> the confirmation mechanism can actually block and wait patiently for the
> system to adjust to the new world. Even if it doesn't trust all
> components, it can still do a best-effort block with a timeout.

What concrete action would that high-level service be able to take
after waiting for such an event?

My model of the vmgenid mechanism is that RNGs and cryptographic
libraries that use it need to be fundamentally written such that it is
guaranteed that a VM fork can not cause the same random number /
counter / ... to be reused for two different cryptographic operations
in a way visible to an attacker. This means that e.g. TLS libraries
need to, between accepting unencrypted input and sending out encrypted
data, check whether the vmgenid changed since the connection was set
up, and if a vmgenid change occurred, kill the connection.
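
Concretely, such a check could look roughly like this. This is a minimal
sketch that assumes the counter-in-a-shared-page interface proposed
elsewhere in this thread; the device path, page layout and function names
are made up for illustration:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static const volatile uint64_t *vm_generation;  /* counter at offset 0 of the shared page */

static int vmgen_init(void)
{
    int fd = open("/dev/vmgenid", O_RDONLY);    /* hypothetical device path */
    if (fd < 0)
        return -1;
    void *page = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (page == MAP_FAILED)
        return -1;
    vm_generation = page;
    return 0;
}

struct tls_conn {
    uint64_t gen_at_setup;  /* snapshot of *vm_generation taken at handshake time */
    /* ... key material, sequence numbers, ... */
};

/* Call this between accepting plaintext and emitting ciphertext: if the VM
 * was forked since the connection was set up, the connection must be killed
 * instead of reusing its cryptographic state. */
static int tls_conn_may_send(const struct tls_conn *c)
{
    return *vm_generation == c->gen_at_setup;
}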

Can you give a concrete example of a usecase where the vmgenid
mechanism is used securely and the confirmation mechanism is necessary
as part of that?

> >> How do you envision integrating this with libraries that have to work
> >> in restrictive seccomp sandboxes? If this was in the vDSO, that would
> >> be much easier.
>
> Since this mechanism targets the whole userspace stack, the use cases
> vary greatly. I doubt we can have a single silver-bullet interface.
>
> For example, the mmap interface targets userspace RNGs, where being as
> fast and as race-free as possible is key. But there are also higher-level
> applications that don't manage their own memory or don't have access to
> low-level primitives, so they can't use the mmap or even vDSO interfaces.
> That's what the rest of the logic is there for: the read+poll interface
> and all of the orchestration logic.

Are you saying that, because people might not want to write proper
bindings for this interface while also being unwilling to take the
performance hit of calling read() in every place where they would have
to do so to be fully correct, you want to build a "best-effort"
mechanism that is deliberately designed to allow some cryptographic
state reuse in a limited time window?

> Like you correctly point out, there are also scenarios like tight
> seccomp jails where even the FS interface is inaccessible. For cases
> like this and others, I believe we will have to work incrementally to
> build up the interface diversity to cater to the diversity of user
> scenarios.

It would be much nicer if we could have one simple interface that lets
everyone correctly do what they need to, though...



Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

2020-11-27 Thread Jann Horn
[resend in the hope that amazon will accept my mail this time instead
of replying "550 Too many invalid recipients" again]

On Fri, Nov 20, 2020 at 11:29 PM Jann Horn  wrote:
> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
>  wrote:
> > This patch is a driver that exposes a monotonic incremental Virtual
> > Machine Generation u32 counter via a char-dev FS interface that
> > provides sync and async VmGen counter updates notifications. It also
> > provides VmGen counter retrieval and confirmation mechanisms.
> >
> > The hw provided UUID is not exposed to userspace, it is internally
> > used by the driver to keep accounting for the exposed VmGen counter.
> > The counter starts from zero when the driver is initialized and
> > monotonically increments every time the hw UUID changes (the VM
> > generation changes).
> >
> > On each hw UUID change, the new hypervisor-provided UUID is also fed
> > to the kernel RNG.
>
> As for v1:
>
> Is there a reasonable usecase for the "confirmation" mechanism? It
> doesn't seem very useful to me.
>
> How do you envision integrating this with libraries that have to work
> in restrictive seccomp sandboxes? If this was in the vDSO, that would
> be much easier.



Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

2020-11-20 Thread Jann Horn
On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
 wrote:
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface that
> provides sync and async VmGen counter updates notifications. It also
> provides VmGen counter retrieval and confirmation mechanisms.
>
> The hw provided UUID is not exposed to userspace, it is internally
> used by the driver to keep accounting for the exposed VmGen counter.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the hw UUID changes (the VM
> generation changes).
>
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.

As for v1:

Is there a reasonable usecase for the "confirmation" mechanism? It
doesn't seem very useful to me.

How do you envision integrating this with libraries that have to work
in restrictive seccomp sandboxes? If this was in the vDSO, that would
be much easier.



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-17 Thread Jann Horn
On Sat, Oct 17, 2020 at 8:09 PM Alexander Graf  wrote:
> There are applications way beyond that though. What do you do with
> applications that already consumed randomness? For example a cached pool
> of SSL keys. Or a higher level language primitive that consumes
> randomness and caches its seed somewhere in an internal data structure.

For deterministic protection, those would also have to poll some
memory location that tells them whether the VmGenID changed:

1. between reading entropy from their RNG pool and using it
2. between collecting data from external sources (user input, clock,
...) and encrypting it

and synchronously shoot down the connection if a change happened. If
e.g. an application inside the VM has an AES-GCM-encrypted TLS
connection and, directly after the VM is restored, triggers an
application-level timeout that sends some fixed message across the
connection, then the TLS library must guarantee that either the VM was
already committed to sending exactly that message before the VM was
forked or the message will be blocked. If we don't do that, an
attacker who captures both a single packet from the forked VM and
traffic from the old VM can decrypt the next message from the old VM
after the fork (because AES-GCM is like AES-CTR plus an authenticator,
and CTR means that when keystream reuse occurs and one of the
plaintexts is known, the attacker can simply recover the other
plaintext using XOR).
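
To make the keystream-reuse point concrete, here is a toy illustration;
plain XOR stands in for the AES-CTR keystream, so this is not real
cryptography, just the arithmetic of the attack:

#include <stdio.h>

int main(void)
{
    const unsigned char keystream[8] = { 0x13, 0x37, 0xc0, 0xde, 0xba, 0xad, 0xf0, 0x0d };
    const char *p1 = "known!!";   /* message captured from the forked VM */
    const char *p2 = "secret!";   /* next message from the original VM   */
    unsigned char c1[8], c2[8], recovered[8];

    for (int i = 0; i < 8; i++) {
        c1[i] = p1[i] ^ keystream[i];          /* both VMs reuse the same   */
        c2[i] = p2[i] ^ keystream[i];          /* counter/keystream value   */
        recovered[i] = c1[i] ^ c2[i] ^ p1[i];  /* equals p2[i], no key used */
    }
    printf("recovered: %.7s\n", recovered);    /* prints "secret!" */
    return 0;
}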

(Or maybe, in disaster failover environments, TLS 1.3 servers could
get away with rekeying the connection instead of shooting it down? Ask
your resident friendly cryptographer whether that would be secure, I
am not one.)

I don't think a mechanism based around asynchronously telling the
application and waiting for it to confirm the rotation at a later
point is going to cut it; we should have some hard semantics on when
an application needs to poll this value.

> Or even worse: your system's host ssh key.

Mmmh... I think I normally would not want a VM to reset its host ssh
key after merely restoring a snapshot though? And more importantly,
Microsoft's docs say that they also change the VmGenID on disaster
failover. I think you very much wouldn't want your server to lose its
host key every time disaster failover happens. On the other hand,
after importing a public VM image, it might be a good idea.

I guess you could push that responsibility on the user, by adding an
option to the sshd_config that tells OpenSSH whether the host key
should be rotated on an ID change or not... but that still would not
be particularly pretty.

Ideally we would have the host tell us what type of events happened to
the VM, or something like that... or maybe even get the host VM
management software to ask the user whether they're importing a public
image... I really feel like with Microsoft's current protocol, we
don't get enough information to figure out what we should do about
private long-term authentication keys.



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-17 Thread Jann Horn
On Sat, Oct 17, 2020 at 8:44 AM Willy Tarreau  wrote:
> On Sat, Oct 17, 2020 at 07:52:48AM +0200, Jann Horn wrote:
> > On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau  wrote:
> > > On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote:
> > > > Microsoft's documentation
> > > > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
> > > > Generation ID that we get after a fork "is a 128-bit,
> > > > cryptographically random integer value". If multiple people use the
> > > > same image, it guarantees that each use of the image gets its own,
> > > > fresh ID:
> > >
> > > No. It cannot be more unique than the source that feeds that cryptographic
> > > transformation. All it guarantees is that the entropy source is protected
> > > from being guessed based on the output. Applying cryptography on a simple
> > > counter provides apparently random numbers that will be unique for a long
> > > period for the same source, but as soon as you duplicate that code between
> > > users and they start from the same counter they'll get the same IDs.
> > >
> > > This is why I think that using a counter is better if you really
> > > need something unique. Randoms only reduce predictability which
> > > helps avoiding collisions.
> >
> > Microsoft's spec tells us that they're giving us cryptographically
> > random numbers. Where they're getting those from is not our problem.
> > (And if even the hypervisor is not able to collect enough entropy to
> > securely generate random numbers, worrying about RNG reseeding in the
> > guest would be kinda pointless, we'd be fairly screwed anyway.)
>
> Sorry if I sound annoying, but it's a matter of terminology and needs.
>
> Cryptographically random means safe for use with cryptography in that it
> is unguessable enough so that you can use it for encryption keys that
> nobody will be able to guess. It in no way guarantees uniqueness, just
> like you don't really care if the symmetric crypto key of your VPN has
> already been used once somewhere else as long as there's no way to know.
> However, with the good enough distribution that a CSPRNG provides,
> collisions within a *same* generator are bound to a very low, predictable
> rate which is generally considered acceptable for all use cases.

Yes.

> Something random (cryptographically or not) *cannot* be unique by
> definition, otherwise it's not random anymore, since each draw has an
> influence on the remaining list of possible draws, which is contrary to
> randomness. And conversely something unique cannot be completely random
> because if you know it's unique, you can already rule out all other known
> values from the candidates, thus it's more predictable than random.

Yes.

> With this in mind, picking randoms from the same RNG is often highly
> sufficient to consider they're highly likely unique within a long
> period. But it's not a guarantee. And it's even less one between two
> RNGs (e.g. if uniqueness is required between multiple hypervisors in
> case VMs are migrated or centrally managed, which I don't know).
>
> If what is sought here is a strong guarantee of uniqueness, using a
> counter as you first suggested is better.

My suggestion is to use a counter *in the UAPI*, not in the hypervisor
protocol. (And as long as that counter can only miss increments in a
cryptographically negligible fraction of cases, everything's fine.)

> If what is sought is pure
> randomness (in the sense that it's unpredictable, which I don't think
> is needed here), then randoms are better.

And this is what *the hypervisor protocol* gives us (which could be
very useful for reseeding the kernel RNG).

> If both are required, just
> concatenate a counter and a random. And if you need them to be spatially
> unique, just include a node identifier.
>
> Now the initial needs in the forwarded message are not entirely clear
> to me but I wanted to rule out the apparent mismatch between the expressed
> needs for uniqueness and the proposed solutions solely based on randomness.

Sure, from a theoretical standpoint, it would be a little bit nicer if
the hypervisor protocol included a generation number along with the
128-bit random value. But AFAIU it doesn't, so if we want this to just
work under Microsoft's existing hypervisor, we'll have to make do with
checking whether the random value changed. :P



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-16 Thread Jann Horn
On Sat, Oct 17, 2020 at 7:37 AM Willy Tarreau  wrote:
> On Sat, Oct 17, 2020 at 07:01:31AM +0200, Jann Horn wrote:
> > Microsoft's documentation
> > (http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
> > Generation ID that we get after a fork "is a 128-bit,
> > cryptographically random integer value". If multiple people use the
> > same image, it guarantees that each use of the image gets its own,
> > fresh ID:
>
> No. It cannot be more unique than the source that feeds that cryptographic
> transformation. All it guarantees is that the entropy source is protected
> from being guessed based on the output. Applying cryptography on a simple
> counter provides apparently random numbers that will be unique for a long
> period for the same source, but as soon as you duplicate that code between
> users and they start from the same counter they'll get the same IDs.
>
> This is why I think that using a counter is better if you really need
> something unique. Randoms only reduce predictability which helps
> avoiding collisions.

Microsoft's spec tells us that they're giving us cryptographically
random numbers. Where they're getting those from is not our problem.
(And if even the hypervisor is not able to collect enough entropy to
securely generate random numbers, worrying about RNG reseeding in the
guest would be kinda pointless, we'd be fairly screwed anyway.)

Also note that we don't actually need to *always* reinitialize RNG
state on forks for functional correctness; it is fine if that fails
with a probability of 2^-128, because functionally everything will be
fine, and an attacker who is that lucky could also just guess an AES
key (which has the same probability of being successful). (And also
2^-128 is such a tiny number that it doesn't matter anyway.)

> And I'm saying this as someone who had on his external gateway the same SSH
> host key as 89 other hosts on the net, each of them using randoms to provide
> a universally unique one...

If your SSH host key was shared with 89 other hosts, it evidently
wasn't generated from cryptographically random numbers. :P Either
because the key generator was not properly hooked up to the system's
entropy pool (if you're talking about the Debian fiasco), or because
the system simply did not have enough entropy available. (Or because
the key generator is broken, but I don't think that ever happened with
OpenSSH?)



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-16 Thread Jann Horn
On Sat, Oct 17, 2020 at 6:34 AM Colm MacCarthaigh  wrote:
> On 16 Oct 2020, at 21:02, Jann Horn wrote:
> > On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau  wrote:
> > But in userspace, we just need a simple counter. There's no need for
> > us to worry about anything else, like timestamps or whatever. If we
> > repeatedly fork a paused VM, the forked VMs will see the same counter
> > value, but that's totally fine, because the only thing that matters to
> > userspace is that the counter changes when the VM is forked.
>
> For user-space, even a single bit would do. We added MADV_WIPEONFORK
> so that userspace libraries can detect fork()/clone() robustly, for the
> same reasons. It just wipes a page as the indicator, which is
> effectively a single-bit signal, and it works well. On the user-space
> side of this, I’m keen to find a solution like that that we can use
> fairly easily inside of portable libraries and applications. The “have
> I forked” checks do end up in hot paths, so it’s nice if they can be
> CPU cache friendly. Comparing a whole 128-bit value wouldn’t be my
> favorite.

I'm pretty sure a single bit is not enough if you want to have a
single page, shared across the entire system, that stores the VM
forking state; you need a counter for that.
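
For reference, the MADV_WIPEONFORK mechanism mentioned above works roughly
like this (a minimal sketch; it detects process forks rather than VM forks,
and needs Linux 4.14 or newer):

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    uint64_t *marker = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (marker == MAP_FAILED || madvise(marker, 4096, MADV_WIPEONFORK))
        return 1;

    *marker = 1;                  /* parent marks the page as "seen" */
    if (fork() == 0) {
        /* child: the page was wiped, so the marker reads 0 -> reseed RNGs */
        printf("child sees marker = %llu\n", (unsigned long long)*marker);
        _exit(0);
    }
    wait(NULL);
    printf("parent sees marker = %llu\n", (unsigned long long)*marker);
    return 0;
}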

> > And actually, since the value is a cryptographically random 128-bit
> > value, I think that we should definitely use it to help reseed the
> > kernel's RNG, and keep it secret from userspace. That way, even if the
> > VM image is public, we can ensure that going forward, the kernel RNG
> > will return securely random data.
>
> If the image is public, you need some extra new raw entropy from
> somewhere. The gen-id could be mixed in, that can’t do any harm as
> long as rigorous cryptographic mixing with the prior state is used, but
> if that’s all you do then the final state is still deterministic and
> non-secret.

Microsoft's documentation
(http://go.microsoft.com/fwlink/?LinkId=260709) says that the VM
Generation ID that we get after a fork "is a 128-bit,
cryptographically random integer value". If multiple people use the
same image, it guarantees that each use of the image gets its own,
fresh ID: The table in section "How to implement virtual machine
generation ID support in a virtualization platform" says that (among
other things) "Virtual machine is imported, copied, or cloned"
generates a new generation ID.

So the RNG state after mixing in the new VM Generation ID would
contain 128 bits of secret entropy not known to anyone else, including
people with access to the VM image.

Now, 128 bits of cryptographically random data aren't _optimal_; I
think something on the order of 256 bits would be nicer from a
theoretical standpoint. But in practice I think we'll be good with the
128 bits we're getting (since the number of users who fork a VM image
is probably not going to be so large that worst-case collision
probabilities matter).
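
For reference, the back-of-envelope behind that remark: with n forks each
receiving an independent, uniformly random 128-bit value, the usual
birthday approximation gives

    P(some two forks see the same value) ≈ n*(n-1) / 2^129

so even n = 2^32 uses of the same image only reach a collision probability
of roughly 2^-65. (This is an approximation, not a figure from Microsoft's
spec.)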

> The kernel would need to use the change as a trigger to
> measure some entropy (e.g. interrupts and RDRAND, or whatever). Or just
> define the machine contract as “this has to be unique random data and
> if it’s not unique, or if it’s public, you’re toast”.

As far as I can tell from Microsoft's spec, that is a guarantee we're
already getting.



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-16 Thread Jann Horn
On Sat, Oct 17, 2020 at 5:36 AM Willy Tarreau  wrote:
> On Sat, Oct 17, 2020 at 03:40:08AM +0200, Jann Horn wrote:
> > [adding some more people who are interested in RNG stuff: Andy, Jason,
> > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
> > concerns some pretty fundamental API stuff related to RNG usage]
> >
> > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
> >  wrote:
> > > This patch is a driver which exposes the Virtual Machine Generation ID
> > > via a char-dev FS interface that provides ID update sync and async
> > > notification, retrieval and confirmation mechanisms:
> > >
> > > When the device is 'open()'ed a copy of the current vm UUID is
> > > associated with the file handle. 'read()' operations block until the
> > > associated UUID is no longer up to date - until HW vm gen id changes -
> > > at which point the new UUID is provided/returned. Nonblocking 'read()'
> > > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
> > >
> > > 'poll()' is implemented to allow polling for UUID updates. Such
> > > updates result in 'EPOLLIN' events.
> > >
> > > Subsequent read()s following a UUID update no longer block, but return
> > > the updated UUID. The application needs to acknowledge the UUID update
> > > by confirming it through a 'write()'.
> > > Only on writing back to the driver the right/latest UUID, will the
> > > driver mark this "watcher" as up to date and remove EPOLLIN status.
> > >
> > > 'mmap()' support allows mapping a single read-only shared page which
> > > will always contain the latest UUID value at offset 0.
> >
> > It would be nicer if that page just contained an incrementing counter,
> > instead of a UUID. It's not like the application cares *what* the UUID
> > changed to, just that it *did* change and all RNG state now needs to
> > be reseeded from the kernel, right? And an application can't reliably
> > read the entire UUID from the memory mapping anyway, because the VM
> > might be forked in the middle.
> >
> > So I think your kernel driver should detect UUID changes and then turn
> > those into a monotonically incrementing counter. (Probably 64 bits
> > wide?) (That's probably also a little bit faster than comparing an
> > entire UUID.)
>
> I agree with this. Further, I'm observing there is a very common
> confusion between "universally unique" and "random". Randoms are
> needed when seeking unpredictability. A random number generator
> *must* be able to return the same value multiple times in a row
> (though this is rare), otherwise it's not random.
[...]
> If the UUIDs used there are real UUIDs, it could be as simple as
> updating them according to their format, i.e. updating the timestamp,
> and if the timestamp is already the same, just increase the seq counter.
> Doing this doesn't require entropy, doesn't need to block and doesn't
> needlessly leak randoms that sometimes make people feel nervous.

Those UUIDs are supplied by existing hypervisor code; in that regard,
this is almost like a driver for a hardware device. It is written
against a fixed API provided by the underlying machine. Making sure
that the sequence of UUIDs, as seen from inside the machine, never
changes back to a previous one is the responsibility of the hypervisor
and out of scope for this driver.

Microsoft's spec document (which is a .docx file for reasons I don't
understand) actually promises us that it is a cryptographically random
128-bit integer value, which means that if you fork a VM 2^64 times,
the probability that any two of those VMs have the same counter is
2^-64. That should be good enough.

But in userspace, we just need a simple counter. There's no need for
us to worry about anything else, like timestamps or whatever. If we
repeatedly fork a paused VM, the forked VMs will see the same counter
value, but that's totally fine, because the only thing that matters to
userspace is that the counter changes when the VM is forked.

And actually, since the value is a cryptographically random 128-bit
value, I think that we should definitely use it to help reseed the
kernel's RNG, and keep it secret from userspace. That way, even if the
VM image is public, we can ensure that going forward, the kernel RNG
will return securely random data.



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-16 Thread Jann Horn
[adding some more people who are interested in RNG stuff: Andy, Jason,
Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
concerns some pretty fundamental API stuff related to RNG usage]

On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
 wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors, but neither the vendors nor upstream Linux provide a
> default driver for it, leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver which exposes the Virtual Machine Generation ID
> via a char-dev FS interface that provides ID update sync and async
> notification, retrieval and confirmation mechanisms:
>
> When the device is 'open()'ed a copy of the current vm UUID is
> associated with the file handle. 'read()' operations block until the
> associated UUID is no longer up to date - until HW vm gen id changes -
> at which point the new UUID is provided/returned. Nonblocking 'read()'
> uses EWOULDBLOCK to signal that there is no _new_ UUID available.
>
> 'poll()' is implemented to allow polling for UUID updates. Such
> updates result in 'EPOLLIN' events.
>
> Subsequent read()s following a UUID update no longer block, but return
> the updated UUID. The application needs to acknowledge the UUID update
> by confirming it through a 'write()'.
> Only on writing back to the driver the right/latest UUID, will the
> driver mark this "watcher" as up to date and remove EPOLLIN status.
>
> 'mmap()' support allows mapping a single read-only shared page which
> will always contain the latest UUID value at offset 0.

It would be nicer if that page just contained an incrementing counter,
instead of a UUID. It's not like the application cares *what* the UUID
changed to, just that it *did* change and all RNG state now needs to
be reseeded from the kernel, right? And an application can't reliably
read the entire UUID from the memory mapping anyway, because the VM
might be forked in the middle.

So I think your kernel driver should detect UUID changes and then turn
those into a monotonically incrementing counter. (Probably 64 bits
wide?) (That's probably also a little bit faster than comparing an
entire UUID.)
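
A rough sketch of that driver-side flow, written as a kernel-module
fragment rather than a runnable program; apart from add_device_randomness()
and the usual kernel helpers, the names here are illustrative, not the
actual driver:

#include <linux/compiler.h>
#include <linux/random.h>
#include <linux/string.h>
#include <linux/types.h>

struct vmgenid_state {
    u8 last_uuid[16];
    u64 *mapped_counter;   /* points into the page that userspace maps */
};

/* Called whenever the hypervisor signals that the hw UUID may have changed. */
static void vmgenid_handle_update(struct vmgenid_state *s, const u8 new_uuid[16])
{
    if (memcmp(s->last_uuid, new_uuid, 16) == 0)
        return;                              /* no generation change */

    memcpy(s->last_uuid, new_uuid, 16);
    add_device_randomness(new_uuid, 16);     /* feed the UUID to the RNG, keep it secret */
    WRITE_ONCE(*s->mapped_counter, READ_ONCE(*s->mapped_counter) + 1);
}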

An option might be to put that counter into the vDSO, instead of a
separate VMA; but I don't know how the other folks feel about that.
Andy, do you have opinions on this? That way, normal userspace code
that uses this infrastructure wouldn't have to mess around with a
special device at all. And it'd be usable in seccomp sandboxes and so
on without needing special plumbing. And libraries wouldn't have to
call open() and mess with file descriptor numbers.

> The driver also adds support for tracking count of open file handles
> that haven't acknowledged an UUID update. This is exposed through
> two IOCTLs:
>  * VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
>_outdated_ watchers - number of file handles that were open during
>a VM generation change, and which have not yet confirmed the new
>Vm-Gen-Id.
>  * VMGENID_WAIT_WATCHERS: blocks until there are no more _outdated_
>watchers, or if a 'timeout' argument is provided, until the timeout
>expires.

Does this mean that code that uses the memory mapping to detect
changes is still supposed to confirm generation IDs? What about
multithreaded processes, especially ones that use libraries - if a
library opens a single file descriptor that is used from multiple
threads, is the library required to synchronize all its threads before
confirming the change? That seems like it could get messy. And it
again means that this interface can't easily be used from inside
seccomp sandboxes.

[...]
> diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
[...]
> +``close()``:
> +  Removes the file handle as a Vm-Gen-Id watcher.

(Linux doesn't have "file handles". Technically 

[Qemu-devel] seccomp blacklist is not applied to all threads

2018-08-13 Thread Jann Horn via Qemu-devel
Hi!

I have noticed that when a QEMU build from git master is started with
"-seccomp on", the seccomp policy is only applied to the main thread,
the vcpu worker thread and the VNC thread (I'm using VNC in my
config); the seccomp policy is not applied to e.g. the RCU thread
because it is created before the seccomp policy is applied and
SECCOMP_FILTER_FLAG_TSYNC isn't used. In practice, this makes the
seccomp policy essentially useless. You'll probably want to add
something like seccomp_attr_set(ctx, SCMP_FLTATR_CTL_TSYNC, 1) for
systems with kernel >=3.17; if you have to support seccomp on older
kernels, you'll have to either construct the other threads after
initializing seccomp or manually activate seccomp on those threads.


This shows up in strace as follows:
$ strace -f -e trace=seccomp,clone,prctl
x86_64-softmmu/qemu-system-x86_64 -enable-kvm -drive
file=isos/debian-live-8.7.1-amd64-standard.iso,format=raw -m 4G
-sandbox on -name testvm,debug-threads=on
clone(child_stack=0x7f199af49bf0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f199af4a9d0, tls=0x7f199af4a700,
child_tidptr=0x7f199af4a9d0) = 10948
strace: Process 10948 attached
[pid 10947] prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) = 0
[pid 10947] clone(child_stack=0x7f199a748bf0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f199a7499d0, tls=0x7f199a749700,
child_tidptr=0x7f199a7499d0) = 10949
strace: Process 10949 attached
[pid 10947] prctl(PR_SET_TIMERSLACK, 1) = 0
[pid 10947] prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
[pid 10947] seccomp(SECCOMP_SET_MODE_STRICT, 1, NULL) = -1 EINVAL
(Invalid argument)
[pid 10947] seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=39,
filter=0x55a968927830}) = 0
[pid 10947] clone(strace: Process 10950 attached
child_stack=0x7f1999d65bf0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f1999d669d0, tls=0x7f1999d66700,
child_tidptr=0x7f1999d669d0) = 10950
[pid 10950] prctl(PR_SET_NAME, "CPU 0/KVM") = 0
[pid 10947] clone(strace: Process 10952 attached
child_stack=0x7f1993bfebf0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f1993bff9d0, tls=0x7f1993bff700,
child_tidptr=0x7f1993bff9d0) = 10952
[pid 10952] prctl(PR_SET_NAME, "worker") = 0
[pid 10947] clone(strace: Process 10953 attached
child_stack=0x7f19933fdbf0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f19933fe9d0, tls=0x7f19933fe700,
child_tidptr=0x7f19933fe9d0) = 10953
[pid 10953] prctl(PR_SET_NAME, "vnc_worker") = 0
VNC server running on ::1:5900
[pid 10952] +++ exited with 0 +++


Checking in procfs confirms that two of the threads aren't sandboxed:

$ for task in /proc/10947/task/*; do cat $task/comm; grep Seccomp $task/status; done
qemu-system-x86
Seccomp: 2
qemu-system-x86
Seccomp: 0
qemu-system-x86
Seccomp: 0
CPU 0/KVM
Seccomp: 2
vnc_worker
Seccomp: 2



Re: [Qemu-devel] insecure git submodule URLs

2018-07-15 Thread Jann Horn via Qemu-devel
On Sun, Jul 15, 2018 at 11:18 PM Peter Maydell  wrote:
>
> On 15 July 2018 at 20:50, Jann Horn via Qemu-devel
>  wrote:
> > I noticed that when I build QEMU from git for the first time, it pulls
> > in submodules over the insecure git:// protocol - in other words, as
> > far as I can tell, if I'm e.g. on an open wifi network while building
> > QEMU for the first time, even if I cloned the main repository over
> > https, anyone could smuggle in malicious code as part of e.g. a
> > submodule's makefile.
>
> Yes, this came up the other week.
>
> > I'm not sure what your preferred fix for this is, so I'm not sending a
> > patch yet. As far as I can tell, the two options are:
> >
> >  - change .gitmodules to use https for everything
>
> We should probably do this...
>
> >  - change .gitmodules to use relative URLs
> >
> > If you want, I'll send a patch that does one of these, although it's
> > probably faster if you just do it yourselves.
> >
> > Relative URLs would have the advantage that if someone is cloning from
> > a mirror (in other words, github), the submodules will also
> > automatically come from the same mirror.
>
> Do we mirror all our submodules to github?

Looks that way - I cloned from github, patched my .gitmodules to use
relative URLs for everything, then tested:

$ git submodule init
Submodule 'capstone' (capstone.git) registered for path 'capstone'
Submodule 'dtc' (dtc.git) registered for path 'dtc'
Submodule 'roms/QemuMacDrivers' (QemuMacDrivers.git) registered for
path 'roms/QemuMacDrivers'
Submodule 'roms/SLOF' (SLOF.git) registered for path 'roms/SLOF'
Submodule 'roms/ipxe' (ipxe.git) registered for path 'roms/ipxe'
Submodule 'roms/openbios' (openbios.git) registered for path 'roms/openbios'
Submodule 'roms/openhackware' (openhackware.git) registered for path
'roms/openhackware'
Submodule 'roms/qemu-palcode' (qemu-palcode.git) registered for path
'roms/qemu-palcode'
Submodule 'roms/seabios' (seabios.git) registered for path 'roms/seabios'
Submodule 'roms/seabios-hppa'
(https://github.com/hdeller/seabios-hppa.git) registered for path
'roms/seabios-hppa'
Submodule 'roms/sgabios' (sgabios.git) registered for path 'roms/sgabios'
Submodule 'roms/skiboot' (skiboot.git) registered for path 'roms/skiboot'
Submodule 'roms/u-boot' (u-boot.git) registered for path 'roms/u-boot'
Submodule 'roms/u-boot-sam460ex' (u-boot-sam460ex.git) registered for
path 'roms/u-boot-sam460ex'
Submodule 'ui/keycodemapdb' (keycodemapdb.git) registered for path
'ui/keycodemapdb'
$ git submodule update
Submodule path 'capstone': checked out
'22ead3e0bfdb87516656453336160e0a37b066bf'
Submodule path 'dtc': checked out 'e54388015af1fb4bf04d0bca99caba1074d9cc42'
Submodule path 'roms/QemuMacDrivers': checked out
'd4e7d7ac663fcb55f1b93575445fcbca372f17a7'
Submodule path 'roms/SLOF': checked out
'7d37babcfa48a6eb08e726a8d13b745cb2eebe1c'
Submodule path 'roms/ipxe': checked out
'0600d3ae94f93efd10fc6b3c7420a9557a3a1670'
Submodule path 'roms/openbios': checked out
'8fe6f5f96f6ca39f1f62200be7fa130e929f13f2'
Submodule path 'roms/openhackware': checked out
'c559da7c8eec5e45ef1f67978827af6f0b9546f5'
Submodule path 'roms/qemu-palcode': checked out
'f3c7e44c70254975df2a00af39701eafbac4d471'
Submodule path 'roms/seabios': checked out
'f9626ccb91e771f990fbb2da92e427a399d7d918'
Submodule path 'roms/seabios-hppa': checked out
'1ef99a01572c2581c30e16e6fe69e9ea2ef92ce0'
Submodule path 'roms/sgabios': checked out
'cbaee52287e5f32373181cff50a00b6c4ac9015a'
Submodule path 'roms/skiboot': checked out
'e0ee24c27a172bcf482f6f2bc905e6211c134bcc'
Submodule path 'roms/u-boot': checked out
'd85ca029f257b53a96da6c2fb421e78a003a9943'
Submodule path 'roms/u-boot-sam460ex': checked out
'60b3916f33e617a815973c5a6df77055b2e3a588'
Submodule path 'ui/keycodemapdb': checked out
'6b3d716e2b6472eb7189d3220552280ef3d832ce'

> > As far as I can tell, the QEMU git server only supports the "dumb" git
> > protocol when accessed over HTTPS, not the "smart" protocol. I'm not
> > sure whether that might be why QEMU is currently still using the
> > insecure git protocol instead of git over HTTPS?
>
> This is why we haven't switched over the submodules yet, yes.
> It's on Jeff's todo list for the server, though.
>
> thanks
> -- PMM



[Qemu-devel] insecure git submodule URLs

2018-07-15 Thread Jann Horn via Qemu-devel
Hi!

I noticed that when I build QEMU from git for the first time, it pulls
in submodules over the insecure git:// protocol - in other words, as
far as I can tell, if I'm e.g. on an open wifi network while building
QEMU for the first time, even if I cloned the main repository over
https, anyone could smuggle in malicious code as part of e.g. a
submodule's makefile.

I'm not sure what your preferred fix for this is, so I'm not sending a
patch yet. As far as I can tell, the two options are:

 - change .gitmodules to use https for everything
 - change .gitmodules to use relative URLs

If you want, I'll send a patch that does one of these, although it's
probably faster if you just do it yourselves.

Relative URLs would have the advantage that if someone is cloning from
a mirror (in other words, github), the submodules will also
automatically come from the same mirror.
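
For illustration, a relative-URL entry in .gitmodules would look roughly
like this (dtc used as an example; the exact relative layout on the server
is an assumption):

[submodule "dtc"]
	path = dtc
	url = ../dtc.git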

As far as I can tell, the QEMU git server only supports the "dumb" git
protocol when accessed over HTTPS, not the "smart" protocol. I'm not
sure whether that might be why QEMU is currently still using the
insecure git protocol instead of git over HTTPS?



[Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching

2017-03-20 Thread Jann Horn
This is an issue in QEMU's system emulation for X86 in TCG mode.
The issue permits an attacker who can execute code in guest ring 3
with normal user privileges to inject code into other processes that
are running in guest ring 3, in particular root-owned processes.

== reproduction steps ==

 - Create an x86-64 VM and install Debian Jessie in it. The following
   steps should all be executed inside the VM.
 - Verify that procmail is installed and the correct version:
   root@qemuvm:~# apt-cache show procmail | egrep 'Version|SHA'
   Version: 3.22-24
   SHA1: 54ed2d51db0e76f027f06068ab5371048c13434c
   SHA256: 4488cf6975af9134a9b5238d5d70e8be277f70caa45a840dfbefd2dc444bfe7f
 - Install build-essential and nasm ("apt install build-essential nasm").
 - Unpack the exploit, compile it and run it:
   user@qemuvm:~$ tar xvf procmail_cache_attack.tar
   procmail_cache_attack/
   procmail_cache_attack/shellcode.asm
   procmail_cache_attack/xp.c
   procmail_cache_attack/compile.sh
   procmail_cache_attack/attack.c
   user@qemuvm:~$ cd procmail_cache_attack
   user@qemuvm:~/procmail_cache_attack$ ./compile.sh
   user@qemuvm:~/procmail_cache_attack$ ./attack
   memory mappings set up
   child is dead, codegen should be complete
   executing code as root! :)
   root@qemuvm:~/procmail_cache_attack# id
   uid=0(root) gid=0(root) groups=0(root),[...]

Note: While the exploit depends on the precise version of procmail,
the actual vulnerability is in QEMU, not in procmail. procmail merely
serves as a seldom-executed setuid root binary into which code can
be injected.


== detailed issue description ==
QEMU caches translated basic blocks. To look up a translated basic
block, the function tb_find() is used, which uses tb_htable_lookup()
in its slowpath, which in turn compares translated basic blocks
(TranslationBlock) to the lookup information (struct tb_desc) using
tb_cmp().

tb_cmp() attempts to ensure (among other things) that both the virtual
start address of the basic block and the physical addresses that the
basic block covers match. When checking the physical addresses, it
assumes that a basic block can span at most two pages.

gen_intermediate_code() attempts to enforce this by stopping the
translation of a basic block if nearly one page of instructions has
been translated already:

        /* if too long translation, stop generation too */
        if (tcg_op_buf_full() ||
            (pc_ptr - pc_start) >= (TARGET_PAGE_SIZE - 32) ||
            num_insns >= max_insns) {
            gen_jmp_im(pc_ptr - dc->cs_base);
            gen_eob(dc);
            break;
        }

However, while real X86 processors have a maximum instruction length
of 15 bytes, QEMU's instruction decoder for X86 does not place any
limit on the instruction length or the number of instruction prefixes.
Therefore, it is possible to create an arbitrarily long instruction
by e.g. prepending an arbitrary number of LOCK prefixes to a normal
instruction. This permits creating a basic block that spans three
pages by simply appending an approximately page-sized instruction to
the end of a normal basic block that starts close to the end of a
page.
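
To make that concrete, the byte pattern being described has roughly this
shape; this sketch only assembles bytes into a buffer and is not the
exploit from the attached tarball:

#include <stdint.h>
#include <string.h>

/* Build "lock add dword ptr [rax], 1" (f0 83 00 01) with prefix_count extra
 * LOCK (0xf0) prefixes in front. Each extra prefix makes the instruction one
 * byte longer in QEMU's decoder, while a real x86 CPU would fault once the
 * total length exceeds 15 bytes. */
static size_t build_overlong_insn(uint8_t *buf, size_t prefix_count)
{
    static const uint8_t tail[] = { 0xf0, 0x83, 0x00, 0x01 };

    memset(buf, 0xf0, prefix_count);
    memcpy(buf + prefix_count, tail, sizeof(tail));
    return prefix_count + sizeof(tail);
}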

Such an overlong basic block causes the basic block caching to fail as
follows: If code is generated and cached for a basic block that spans
the physical pages (A,E,B), this basic block will be returned by
lookups in a process in which the physical pages (A,B,C) are mapped
in the same virtual address range (assuming that all other lookup
parameters match).

This behavior can be abused by an attacker e.g. as follows: If a
non-relocatable world-readable setuid executable legitimately contains
the pages (A,B,C), an attacker can map (A,E,B) into his own process,
at the normal load address of A, where E is an attacker-controlled
page. If a legitimate basic block spans the pages A and B, an attacker
can write arbitrary non-branch instructions at the start of E, then
append an overlong instruction
that ends behind the start of C, yielding a modified basic block that
spans all three pages. If the attacker then executes the modified
basic block in his process, the modified basic block is cached.
Next, the attacker can execute the setuid binary, which will reuse the
cached modified basic block, executing attacker-controlled
instructions in the context of the privileged process.

I am sending this to qemu-devel because a QEMU security contact
told me that QEMU does not consider privilege escalation inside a
TCG VM to be a security concern.


[Attachment: procmail_cache_attack.tar (Unix tar archive)]


Re: [Qemu-devel] [PATCH] linux-user: limit number of arguments to execve

2017-03-03 Thread Jann Horn
On Fri, Mar 3, 2017 at 4:56 PM, Peter Maydell  wrote:
> On 3 March 2017 at 14:54, Eric Blake  wrote:
>>> +                ret = -TARGET_EFAULT;
>>> +                break;
>>> +            }
>>>              argp = alloca((argc + 1) * sizeof(void *));
>>>              envp = alloca((envc + 1) * sizeof(void *));
>>
>> ...Uggh. You're using alloca() but allowing an allocation of way more
>> than 4k.  That means a guest can cause corruption of the stack (or, with
>> large enough arguments, even escape out of the stack) before you even
>> get to the execve() call to even worry about E2BIG issues.
>
> Yeah, linux-user is shot through with that kind of alloca() usage.
>
> (It's not great, but it's not a security hole because we already
> give the guest binary complete control to do anything it likes.
> Worth fixing bugs if we run into them, though.)

It could be a security hole if a benign guest userspace process decides to
allow a remote client to specify a bunch of environment variables or so.
E.g. HTTP servers with CGI support pass HTTP headers on as
environment variables whose names start with "HTTP_"; however, since
HTTP servers usually limit the request size, this isn't actually exploitable
in that case. But there could theoretically be similar scenarios.



Re: [Qemu-devel] [PATCH] linux-user: limit number of arguments to execve

2017-03-03 Thread Jann Horn
On Fri, Mar 3, 2017 at 4:55 PM, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 3 March 2017 at 11:25, P J P <ppan...@redhat.com> wrote:
>> From: Prasad J Pandit <p...@fedoraproject.org>
>>
>> Limit the number of arguments passed to execve(2) call from
>> a user program, as large number of them could lead to a bad
>> guest address error.
>>
>> Reported-by: Jann Horn <ja...@google.com>
>> Signed-off-by: Prasad J Pandit <p...@fedoraproject.org>
>> ---
>>  linux-user/syscall.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
>> index 9be8e95..c545c12 100644
>> --- a/linux-user/syscall.c
>> +++ b/linux-user/syscall.c
>> @@ -7766,6 +7766,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
>>  #endif
>>      case TARGET_NR_execve:
>>          {
>> +#define ARG_MAX 65535
>>              char **argp, **envp;
>>              int argc, envc;
>>              abi_ulong gp;
>> @@ -7794,6 +7795,12 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
>>                  envc++;
>>              }
>>
>> +            if (argc > ARG_MAX || envc > ARG_MAX) {
>> +                fprintf(stderr,
>> +                        "argc(%d), envc(%d) exceed %d\n", argc, envc, ARG_MAX);
>> +                ret = -TARGET_EFAULT;
>> +                break;
>> +            }
>>              argp = alloca((argc + 1) * sizeof(void *));
>>              envp = alloca((envc + 1) * sizeof(void *));
>
> This code is already supposed to handle "argument string too big",
> see commit a6f79cc9a5e.
>
> What's the actual bug case we're trying to handle here?

commit a6f79cc9a5e doesn't help here, it only bails out after the two
alloca() calls



Re: [Qemu-devel] [PATCH 23/29] 9pfs: local: chmod: don't follow symlinks

2017-02-24 Thread Jann Horn
On Fri, Feb 24, 2017 at 4:23 PM, Eric Blake  wrote:
> On 02/24/2017 04:34 AM, Greg Kurz wrote:
>> On Thu, 23 Feb 2017 15:10:42 +
>> Stefan Hajnoczi  wrote:
>>
>>> On Mon, Feb 20, 2017 at 03:42:19PM +0100, Greg Kurz wrote:
>>>> The local_chmod() callback is vulnerable to symlink attacks because it
>>>> calls:
>>>>
>>>> (1) chmod() which follows symbolic links for all path elements
>>>> (2) local_set_xattr()->setxattr() which follows symbolic links for all
>>>>     path elements
>>>> (3) local_set_mapped_file_attr() which calls in turn local_fopen() and
>>>>     mkdir(), both functions following symbolic links for all path
>>>>     elements but the rightmost one

>
>>>> +    fd = local_open_nofollow(fs_ctx, fs_path->data, O_RDONLY, 0);
>>>> +    if (fd == -1) {
>>>> +        return -1;
>>>> +    }
>>>> +    ret = fchmod(fd, credp->fc_mode);
>>>> +    close_preserve_errno(fd);
>>>
>>> As mentioned on IRC, not sure this is workable since chmod(2) must not
>>> require read permission on the file.  This patch introduces failures
>>> when the file isn't readable.
>>>
>>
>> Yeah :-\ This one is the more painful issue to address. I need a
>> chmod() variant that would fail on symlinks instead of following
>> them... and I'm kinda suffering to implement this in a race-free
>> manner.
>
> BSD has lchmod(), but Linux does not.  And Linux does not yet properly
> implement AT_SYMLINK_NOFOLLOW.  (The BSD semantics are pretty cool:
> restricting mode bits on a symlink can create scenarios where some users
> can dereference the symlink but others get ENOENT failures - but I don't
> know of any effort to add those semantics to the Linux kernel)
>
> POSIX says support for AT_SYMLINK_NOFOLLOW is optional, precisely
> because POSIX does not require permissions on symlinks to matter, but
> the way it requires it is that AT_SYMLINK_NOFOLLOW is supposed to be a
> no-op for regular files, and cause an EOPNOTSUPP failure for symlinks.
>
> Unfortunately, the Linux man page says that Linux fails with ENOTSUP
> (which is identical to EOPNOTSUPP) any time AT_SYMLINK_NOFOLLOW is set,
> and not just if it is attempted on a symlink.

And unfortunately, that flags argument is not actually present in the
real syscall.
See this glibc code:

int
fchmodat (int fd, const char *file, mode_t mode, int flag)
{
  if (flag & ~AT_SYMLINK_NOFOLLOW)
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
#ifndef __NR_lchmod /* Linux so far has no lchmod syscall.  */
  if (flag & AT_SYMLINK_NOFOLLOW)
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP);
#endif

  return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
}

and this kernel code:

SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
                umode_t, mode)
{
  [...]
}

So to fix this, you'll probably have to add a new syscall fchmodat2()
to the kernel, wire it up for all the architectures and get the various
libc implementations to adopt that. That's going to be quite tedious. :(
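
In the meantime, one workaround that avoids both a new syscall and
following symlinks is to pin the file with O_PATH|O_NOFOLLOW and go through
/proc/self/fd. A sketch, assuming /proc is mounted; the helper name is made
up:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int chmod_nofollow(int dirfd, const char *name, mode_t mode)
{
    char proc_path[64];
    struct stat st;
    int ret;

    int fd = openat(dirfd, name, O_PATH | O_NOFOLLOW);
    if (fd < 0)
        return -1;

    /* With O_PATH|O_NOFOLLOW a symlink still yields an fd referring to the
     * link itself, so reject that case explicitly instead of following it. */
    if (fstat(fd, &st) < 0 || S_ISLNK(st.st_mode)) {
        close(fd);
        errno = ELOOP;
        return -1;
    }

    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    ret = chmod(proc_path, mode);   /* applies to the already-resolved inode */
    close(fd);                      /* note: may clobber errno on failure    */
    return ret;
}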



Re: [Qemu-devel] [PATCH 07/29] 9pfs: local: introduce symlink-attack safe xattr helpers

2017-02-23 Thread Jann Horn
On Thu, Feb 23, 2017 at 4:02 PM, Eric Blake  wrote:
> On 02/20/2017 08:40 AM, Greg Kurz wrote:
>> All operations dealing with extended attributes are vulnerable to symlink
>> attacks because they use path-based syscalls which can traverse symbolic
>> links while walking through the dirname part of the path.
>>
>> The solution is to introduce helpers based on opendir_nofollow(). This
>> calls for "at" versions of the extended attribute syscalls, which don't
>> exist unfortunately. This patch implement them by simulating the "at"
>> behavior with fchdir(). Since the current working directory is process
>> wide, and we don't want to confuse another thread in QEMU, all the work
>> is done in a separate process.
>
> Can you emulate *at using /proc/fd/nnn/xyz?

I don't know much about QEMU internals, but QEMU supports running in a
chroot using the -chroot option, right? Does that already require procfs to be
mounted inside the chroot?