Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?

2017-07-26 Thread Alex Bennée

Andrew Baumann  writes:

>> From: Richard Henderson [mailto:rth7...@gmail.com] On Behalf Of Richard
>> Henderson
>> Sent: Monday, 24 July 2017 15:03
>>
>> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
>> > (Adding some Cc's)
>> >
>> > On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel
>> wrote:
>> >> Hi all,
>> >>
>> >> I'm trying to track down what appears to be a translation bug in either
>> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The
>> symptoms
>>
>> I assume this is really x86_64 and not i686 as host.
>
> Yes.
>
>> >> are entirely consistent with a torn read/write -- that is, a 64-bit
>> >> load or store that was translated to two 32-bit loads and stores --
>> >> but that's obviously not what happens in the common path through the
>> >> translation for this code, so I'm wondering: are there any cases in
>> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
>> >
>> > That would be a bug in MTTCG.
>> >
>> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
>> >> location that was previously 0, using a regular store; e.g.:
>> >>   f934 str x20,[x1]
>> >>
>> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
>> >>   f9400280 ldr x0,[x20]
>> >>
>> >> The symptom: CPU B loads a value that is neither NULL nor the value
>> >> written. Instead, x0 gets only the low 32-bits of the value written
>> >> (high bits are all zero). By the time this value is dereferenced (a
>> >> few instructions later) and the exception handlers run, the memory
>> >> location from which it was loaded has the correct 64-bit value with
>> >> a non-zero upper half.
>> >>
>> >> Obviously on a real ARM memory barriers are critical, and indeed
>> >> the code has such barriers in it, but I'm assuming that any possible
>> >> mistranslation of the barriers is irrelevant because for a 64-bit load
>> >> and a 64-bit store you should get all or nothing. Other clues that may
>> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
>> >> is used to resolve a race when updating another variable), and the
>> >> busy-wait loop has a yield instruction in it (although those appear
>> >> to be no-ops with MTTCG).
>> >
>> > This might have to do with how ldrex/strex is emulated; are you relying
>> > on the exclusive pair detecting ABA? If so, your code won't work in
>> > QEMU since it uses cmpxchg to emulate ldrex/strex.
>>
>> The ABA problem has nothing to do with tearing.  And cmpxchg will definitely
>> not create tearing problems.
>
> In fact, the ldrex/strex are there implementing a cmpxchg. :)
>
>> I don't know how we would manage 64-bit tearing on a 64-bit host, at least
>> for
>> the aarch64 guest, for which I believe we have a good emulation.
>>
>> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
>>
>> A good suggestion.
>
> Thanks for the suggestions. I've been running this configuration all day, and 
> haven't seen a repro yet in hundreds of attempts. I'll report if that changes.
>
> The problem is that this test isn't very conclusive... I don't know
> how often the VCPU threads will context switch, but I suspect it's
> pretty rare relative to the ability to expose races when they run
> directly on different host cores. And if the race really doesn't exist
> when running on a single core, that to me suggests a hardware problem,
> which is even less likely.

They will likely context switch as they drop/clear locks coming out of
the translated code.

>
>> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
>> >
>> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
>> > index 87f673e..771effe5 100644
>> > --- a/tcg/tcg-op.c
>> > +++ b/tcg/tcg-op.c
>> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64
>> retv, TCGv addr, TCGv_i64 cmpv,
>> >   }
>> >   tcg_temp_free_i64(t1);
>> >   } else if ((memop & MO_SIZE) == MO_64) {
>> > -#ifdef CONFIG_ATOMIC64
>> > +#if 0
>>
>> I suspect this will simply alter the timing.  However, give it a go by all 
>> means.
>
> I haven't tried this yet. If it behaves as you describe, it seems likely to 
> make the race harder to hit.
>
>> If there's a test case that you can share, that would be awesome.
>
> I wish we could :(
>
>> Especially if you can prod it to happen with a standalone minimal binary.  
>> With
>> luck you can reproduce via aarch64-linux-user too, and simply signal an error
>> via branch to __builtin_trap.
>
> Initial attempts to repro this with a tight loop failed, but I'll take
> another stab at producing a standalone test program that we can share.

You could try using the counter value as a way to synchronise threads if
there is a certain race you are trying to engineer. I did something
similar in the barrier tests I wrote while doing the MTTCG work:

  
https://github.com/stsquad/kvm-unit-tests/blob/mttcg/current-tests-v8/arm/barrier-litmus-test.c#L61

>
> Andrew


--
Alex Bennée



Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?

2017-07-25 Thread Andrew Baumann via Qemu-devel
> From: Richard Henderson [mailto:rth7...@gmail.com] On Behalf Of Richard
> Henderson
> Sent: Monday, 24 July 2017 15:03
> 
> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
> > (Adding some Cc's)
> >
> > On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel
> wrote:
> >> Hi all,
> >>
> >> I'm trying to track down what appears to be a translation bug in either
> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The
> symptoms
> 
> I assume this is really x86_64 and not i686 as host.

Yes.

> >> are entirely consistent with a torn read/write -- that is, a 64-bit
> >> load or store that was translated to two 32-bit loads and stores --
> >> but that's obviously not what happens in the common path through the
> >> translation for this code, so I'm wondering: are there any cases in
> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
> >
> > That would be a bug in MTTCG.
> >
> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
> >> location that was previously 0, using a regular store; e.g.:
> >>f934 str x20,[x1]
> >>
> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
> >>f9400280 ldr x0,[x20]
> >>
> >> The symptom: CPU B loads a value that is neither NULL nor the value
> >> written. Instead, x0 gets only the low 32-bits of the value written
> >> (high bits are all zero). By the time this value is dereferenced (a
> >> few instructions later) and the exception handlers run, the memory
> >> location from which it was loaded has the correct 64-bit value with
> >> a non-zero upper half.
> >>
> >> Obviously on a real ARM memory barriers are critical, and indeed
> >> the code has such barriers in it, but I'm assuming that any possible
> >> mistranslation of the barriers is irrelevant because for a 64-bit load
> >> and a 64-bit store you should get all or nothing. Other clues that may
> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> >> is used to resolve a race when updating another variable), and the
> >> busy-wait loop has a yield instruction in it (although those appear
> >> to be no-ops with MTTCG).
> >
> > This might have to do with how ldrex/strex is emulated; are you relying
> > on the exclusive pair detecting ABA? If so, your code won't work in
> > QEMU since it uses cmpxchg to emulate ldrex/strex.
> 
> The ABA problem has nothing to do with tearing.  And cmpxchg will definitely
> not create tearing problems.

In fact, the ldrex/strex are there implementing a cmpxchg. :)

> I don't know how we would manage 64-bit tearing on a 64-bit host, at least
> for
> the aarch64 guest, for which I believe we have a good emulation.
> 
> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
> 
> A good suggestion.

Thanks for the suggestions. I've been running this configuration all day, and 
haven't seen a repro yet in hundreds of attempts. I'll report if that changes.

The problem is that this test isn't very conclusive... I don't know how often 
the VCPU threads will context switch, but I suspect it's pretty rare relative 
to the ability to expose races when they run directly on different host cores. 
And if the race really doesn't exist when running on a single core, that to me 
suggests a hardware problem, which is even less likely.

> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
> >
> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> > index 87f673e..771effe5 100644
> > --- a/tcg/tcg-op.c
> > +++ b/tcg/tcg-op.c
> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64
> retv, TCGv addr, TCGv_i64 cmpv,
> >   }
> >   tcg_temp_free_i64(t1);
> >   } else if ((memop & MO_SIZE) == MO_64) {
> > -#ifdef CONFIG_ATOMIC64
> > +#if 0
> 
> I suspect this will simply alter the timing.  However, give it a go by all 
> means.

I haven't tried this yet. If it behaves as you describe, it seems likely to 
make the race harder to hit.

> If there's a test case that you can share, that would be awesome.

I wish we could :(

> Especially if you can prod it to happen with a standalone minimal binary.  
> With
> luck you can reproduce via aarch64-linux-user too, and simply signal an error
> via branch to __builtin_trap.

Initial attempts to repro this with a tight loop failed, but I'll take another 
stab at producing a standalone test program that we can share.

Andrew


Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?

2017-07-24 Thread Richard Henderson

On 07/24/2017 02:23 PM, Emilio G. Cota wrote:

(Adding some Cc's)

On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel wrote:

Hi all,

I'm trying to track down what appears to be a translation bug in either
the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms


I assume this is really x86_64 and not i686 as host.


are entirely consistent with a torn read/write -- that is, a 64-bit
load or store that was translated to two 32-bit loads and stores --
but that's obviously not what happens in the common path through the
translation for this code, so I'm wondering: are there any cases in
which qemu will split a 64-bit memory access into two 32-bit accesses?


That would be a bug in MTTCG.


The code: Guest CPU A writes a 64-bit value to an aligned memory
location that was previously 0, using a regular store; e.g.:
f934 str x20,[x1]

Guest CPU B (who is busy-waiting) reads a value from the same location:
f9400280 ldr x0,[x20]

The symptom: CPU B loads a value that is neither NULL nor the value
written. Instead, x0 gets only the low 32-bits of the value written
(high bits are all zero). By the time this value is dereferenced (a
few instructions later) and the exception handlers run, the memory
location from which it was loaded has the correct 64-bit value with
a non-zero upper half.

Obviously on a real ARM memory barriers are critical, and indeed
the code has such barriers in it, but I'm assuming that any possible
mistranslation of the barriers is irrelevant because for a 64-bit load
and a 64-bit store you should get all or nothing. Other clues that may
be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
is used to resolve a race when updating another variable), and the
busy-wait loop has a yield instruction in it (although those appear
to be no-ops with MTTCG).


This might have to do with how ldrex/strex is emulated; are you relying
on the exclusive pair detecting ABA? If so, your code won't work in
QEMU since it uses cmpxchg to emulate ldrex/strex.


The ABA problem has nothing to do with tearing.  And cmpxchg will definitely not 
create tearing problems.


I don't know how we would manage 64-bit tearing on a 64-bit host, at least for 
the aarch64 guest, for which I believe we have a good emulation.



- Pin the QEMU-MTTCG process to a single CPU. Can you repro then?


A good suggestion.


- Force the emulation of cmpxchg via EXCP_ATOMIC with:

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..771effe5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, 
TCGv_i64 cmpv,
  }
  tcg_temp_free_i64(t1);
  } else if ((memop & MO_SIZE) == MO_64) {
-#ifdef CONFIG_ATOMIC64
+#if 0


I suspect this will simply alter the timing.  However, give it a go by all 
means.

If there's a test case that you can share, that would be awesome.

Especially if you can prod it to happen with a standalone minimal binary.  With 
luck you can reproduce via aarch64-linux-user too, and simply signal an error 
via branch to __builtin_trap.



r~



Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?

2017-07-24 Thread Emilio G. Cota
(Adding some Cc's)

On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel wrote:
> Hi all,
>
> I'm trying to track down what appears to be a translation bug in either
> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
> are entirely consistent with a torn read/write -- that is, a 64-bit
> load or store that was translated to two 32-bit loads and stores --
> but that's obviously not what happens in the common path through the
> translation for this code, so I'm wondering: are there any cases in
> which qemu will split a 64-bit memory access into two 32-bit accesses?

That would be a bug in MTTCG.

> The code: Guest CPU A writes a 64-bit value to an aligned memory
> location that was previously 0, using a regular store; e.g.:
>   f934 str x20,[x1]
>
> Guest CPU B (who is busy-waiting) reads a value from the same location:
>   f9400280 ldr x0,[x20]
>
> The symptom: CPU B loads a value that is neither NULL nor the value
> written. Instead, x0 gets only the low 32-bits of the value written
> (high bits are all zero). By the time this value is dereferenced (a
> few instructions later) and the exception handlers run, the memory
> location from which it was loaded has the correct 64-bit value with
> a non-zero upper half.
>
> Obviously on a real ARM memory barriers are critical, and indeed
> the code has such barriers in it, but I'm assuming that any possible
> mistranslation of the barriers is irrelevant because for a 64-bit load
> and a 64-bit store you should get all or nothing. Other clues that may
> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> is used to resolve a race when updating another variable), and the
> busy-wait loop has a yield instruction in it (although those appear
> to be no-ops with MTTCG).

This might have to do with how ldrex/strex is emulated; are you relying
on the exclusive pair detecting ABA? If so, your code won't work in
QEMU since it uses cmpxchg to emulate ldrex/strex.

> The bug repros more easily with more guest VCPUs, and more load on the
> host (i.e. more context switching to expose the race). It doesn't repro
> for the single-threaded TCG. Unfortunately it's hard to get detailed
> trace information, because the bug only repros roughly every one in 40
> attempts, and it's a long way into the guest OS boot before it arises.
>
> I'm not yet 100% convinced this is a qemu bug -- the obvious path
> through the translator for those instructions does 64-bit memory
> accesses on the host -- but at the same time, it has never been seen
> outside qemu, and after staring long and hard at the guest code, we're
> pretty sure it's correct. It's also extremely unlikely to be a wild
> write, given that it occurs on a wide variety of guest call-stacks,
> and the memory is later inconsistent with what was loaded.
>
> Any clues or debugging suggestions appreciated!

- Pin the QEMU-MTTCG process to a single CPU. Can you repro then?

- Force the emulation of cmpxchg via EXCP_ATOMIC with:

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..771effe5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, 
TCGv_i64 cmpv,
 }
 tcg_temp_free_i64(t1);
 } else if ((memop & MO_SIZE) == MO_64) {
-#ifdef CONFIG_ATOMIC64
+#if 0
 gen_atomic_cx_i64 gen;

 gen = table_cmpxchg[memop & (MO_SIZE | MO_BSWAP)];

This will halt all other vCPUs before the calling vCPU performs
the cmpxchg. Can you reproduce then?

Emilio



[Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?

2017-07-24 Thread Andrew Baumann via Qemu-devel
Hi all,

I'm trying to track down what appears to be a translation bug in either the 
aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms are entirely 
consistent with a torn read/write -- that is, a 64-bit load or store that was 
translated to two 32-bit loads and stores -- but that's obviously not what 
happens in the common path through the translation for this code, so I'm 
wondering: are there any cases in which qemu will split a 64-bit memory access 
into two 32-bit accesses?

The code:
Guest CPU A writes a 64-bit value to an aligned memory location that was 
previously 0, using a regular store; e.g.:
f934 str x20,[x1]

Guest CPU B (who is busy-waiting) reads a value from the same location:
f9400280 ldr x0,[x20]

The symptom:
CPU B loads a value that is neither NULL nor the value written. Instead, x0 
gets only the low 32-bits of the value written (high bits are all zero). By the 
time this value is dereferenced (a few instructions later) and the exception 
handlers run, the memory location from which it was loaded has the correct 
64-bit value with a non-zero upper half.

Obviously on a real ARM memory barriers are critical, and indeed the code has 
such barriers in it, but I'm assuming that any possible mistranslation of the 
barriers is irrelevant because for a 64-bit load and a 64-bit store you should 
get all or nothing. Other clues that may be relevant: the code is _near_ a 
LDREX/STREX pair (the busy-waiting is used to resolve a race when updating 
another variable), and the busy-wait loop has a yield instruction in it 
(although those appear to be no-ops with MTTCG).

The bug repros more easily with more guest VCPUs, and more load on the host 
(i.e. more context switching to expose the race). It doesn't repro for the 
single-threaded TCG. Unfortunately it's hard to get detailed trace information, 
because the bug only repros roughly every one in 40 attempts, and it's a long 
way into the guest OS boot before it arises.

I'm not yet 100% convinced this is a qemu bug -- the obvious path through the 
translator for those instructions does 64-bit memory accesses on the host -- 
but at the same time, it has never been seen outside qemu, and after staring 
long and hard at the guest code, we're pretty sure it's correct. It's also 
extremely unlikely to be a wild write, given that it occurs on a wide variety 
of guest call-stacks, and the memory is later inconsistent with what was loaded.

Any clues or debugging suggestions appreciated!

Thanks,
Andrew