Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?
Andrew Baumann writes:

>> From: Richard Henderson [mailto:rth7...@gmail.com] On Behalf Of Richard Henderson
>> Sent: Monday, 24 July 2017 15:03
>>
>> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
>> > (Adding some Cc's)
>> >
>> > On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel wrote:
>> >> Hi all,
>> >>
>> >> I'm trying to track down what appears to be a translation bug in either
>> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
>>
>> I assume this is really x86_64 and not i686 as host.
>
> Yes.
>
>> >> are entirely consistent with a torn read/write -- that is, a 64-bit
>> >> load or store that was translated to two 32-bit loads and stores --
>> >> but that's obviously not what happens in the common path through the
>> >> translation for this code, so I'm wondering: are there any cases in
>> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
>> >
>> > That would be a bug in MTTCG.
>> >
>> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
>> >> location that was previously 0, using a regular store; e.g.:
>> >>     f934        str x20,[x1]
>> >>
>> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
>> >>     f9400280    ldr x0,[x20]
>> >>
>> >> The symptom: CPU B loads a value that is neither NULL nor the value
>> >> written. Instead, x0 gets only the low 32-bits of the value written
>> >> (high bits are all zero). By the time this value is dereferenced (a
>> >> few instructions later) and the exception handlers run, the memory
>> >> location from which it was loaded has the correct 64-bit value with
>> >> a non-zero upper half.
>> >>
>> >> Obviously on a real ARM memory barriers are critical, and indeed
>> >> the code has such barriers in it, but I'm assuming that any possible
>> >> mistranslation of the barriers is irrelevant because for a 64-bit load
>> >> and a 64-bit store you should get all or nothing. Other clues that may
>> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
>> >> is used to resolve a race when updating another variable), and the
>> >> busy-wait loop has a yield instruction in it (although those appear
>> >> to be no-ops with MTTCG).
>> >
>> > This might have to do with how ldrex/strex is emulated; are you relying
>> > on the exclusive pair detecting ABA? If so, your code won't work in
>> > QEMU since it uses cmpxchg to emulate ldrex/strex.
>>
>> ABA problem is nothing to do with tearing. And cmpxchg will definitely not
>> create tearing problems.
>
> In fact, the ldrex/strex are there implementing a cmpxchg. :)
>
>> I don't know how we would manage 64-bit tearing on a 64-bit host, at least for
>> the aarch64 guest, for which I believe we have a good emulation.
>>
>> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
>>
>> A good suggestion.
>
> Thanks for the suggestions. I've been running this configuration all day, and
> haven't seen a repro yet in hundreds of attempts. I'll report if that changes.
>
> The problem is that this test isn't very conclusive... I don't know
> how often the VCPU threads will context switch, but I suspect it's
> pretty rare relative to the ability to expose races when they run
> directly on different host cores. And if the race really doesn't exist
> when running on a single core, that to me suggests a hardware problem,
> which is even less likely.

They will likely context switch as they drop/clear locks coming out of
the translated code.

>> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
>> >
>> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
>> > index 87f673e..771effe5 100644
>> > --- a/tcg/tcg-op.c
>> > +++ b/tcg/tcg-op.c
>> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
>> >          }
>> >          tcg_temp_free_i64(t1);
>> >      } else if ((memop & MO_SIZE) == MO_64) {
>> > -#ifdef CONFIG_ATOMIC64
>> > +#if 0
>>
>> I suspect this will simply alter the timing. However, give it a go by all
>> means.
>
> I haven't tried this yet. If it behaves as you describe, it seems likely to
> make the race harder to hit.
>
>> If there's a test case that you can share, that would be awesome.
>
> I wish we could :(
>
>> Especially if you can prod it to happen with a standalone minimal binary. With
>> luck you can reproduce via aarch64-linux-user too, and simply signal an error
>> via branch to __builtin_trap.
>
> Initial attempts to repro this with a tight loop failed, but I'll take
> another stab at producing a standalone test program that we can share.

You could try using the counter value as a way to synchronise threads if
there is a certain race you are trying to engineer. I did something
similar in the barrier tests I wrote while doing the MTTCG work:

https://github.com/stsquad/kvm-unit-tests/blob/mttcg/current-tests-v8/arm/barrier-litmus-test.c#L61

> Andrew

--
Alex Bennée
Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?
> From: Richard Henderson [mailto:rth7...@gmail.com] On Behalf Of Richard Henderson
> Sent: Monday, 24 July 2017 15:03
>
> On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
> > (Adding some Cc's)
> >
> > On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel wrote:
> >> Hi all,
> >>
> >> I'm trying to track down what appears to be a translation bug in either
> >> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
>
> I assume this is really x86_64 and not i686 as host.

Yes.

> >> are entirely consistent with a torn read/write -- that is, a 64-bit
> >> load or store that was translated to two 32-bit loads and stores --
> >> but that's obviously not what happens in the common path through the
> >> translation for this code, so I'm wondering: are there any cases in
> >> which qemu will split a 64-bit memory access into two 32-bit accesses?
> >
> > That would be a bug in MTTCG.
> >
> >> The code: Guest CPU A writes a 64-bit value to an aligned memory
> >> location that was previously 0, using a regular store; e.g.:
> >>     f934        str x20,[x1]
> >>
> >> Guest CPU B (who is busy-waiting) reads a value from the same location:
> >>     f9400280    ldr x0,[x20]
> >>
> >> The symptom: CPU B loads a value that is neither NULL nor the value
> >> written. Instead, x0 gets only the low 32-bits of the value written
> >> (high bits are all zero). By the time this value is dereferenced (a
> >> few instructions later) and the exception handlers run, the memory
> >> location from which it was loaded has the correct 64-bit value with
> >> a non-zero upper half.
> >>
> >> Obviously on a real ARM memory barriers are critical, and indeed
> >> the code has such barriers in it, but I'm assuming that any possible
> >> mistranslation of the barriers is irrelevant because for a 64-bit load
> >> and a 64-bit store you should get all or nothing. Other clues that may
> >> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> >> is used to resolve a race when updating another variable), and the
> >> busy-wait loop has a yield instruction in it (although those appear
> >> to be no-ops with MTTCG).
> >
> > This might have to do with how ldrex/strex is emulated; are you relying
> > on the exclusive pair detecting ABA? If so, your code won't work in
> > QEMU since it uses cmpxchg to emulate ldrex/strex.
>
> ABA problem is nothing to do with tearing. And cmpxchg will definitely not
> create tearing problems.

In fact, the ldrex/strex are there implementing a cmpxchg. :)

> I don't know how we would manage 64-bit tearing on a 64-bit host, at least for
> the aarch64 guest, for which I believe we have a good emulation.
>
> > - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?
>
> A good suggestion.

Thanks for the suggestions. I've been running this configuration all day, and
haven't seen a repro yet in hundreds of attempts. I'll report if that changes.

The problem is that this test isn't very conclusive... I don't know
how often the VCPU threads will context switch, but I suspect it's
pretty rare relative to the ability to expose races when they run
directly on different host cores. And if the race really doesn't exist
when running on a single core, that to me suggests a hardware problem,
which is even less likely.

> > - Force the emulation of cmpxchg via EXCP_ATOMIC with:
> >
> > diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> > index 87f673e..771effe5 100644
> > --- a/tcg/tcg-op.c
> > +++ b/tcg/tcg-op.c
> > @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
> >          }
> >          tcg_temp_free_i64(t1);
> >      } else if ((memop & MO_SIZE) == MO_64) {
> > -#ifdef CONFIG_ATOMIC64
> > +#if 0
>
> I suspect this will simply alter the timing. However, give it a go by all
> means.

I haven't tried this yet. If it behaves as you describe, it seems likely to
make the race harder to hit.

> If there's a test case that you can share, that would be awesome.

I wish we could :(

> Especially if you can prod it to happen with a standalone minimal binary. With
> luck you can reproduce via aarch64-linux-user too, and simply signal an error
> via branch to __builtin_trap.

Initial attempts to repro this with a tight loop failed, but I'll take
another stab at producing a standalone test program that we can share.

Andrew
Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?
On 07/24/2017 02:23 PM, Emilio G. Cota wrote:
> (Adding some Cc's)
>
> On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel wrote:
>> Hi all,
>>
>> I'm trying to track down what appears to be a translation bug in either
>> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms

I assume this is really x86_64 and not i686 as host.

>> are entirely consistent with a torn read/write -- that is, a 64-bit
>> load or store that was translated to two 32-bit loads and stores --
>> but that's obviously not what happens in the common path through the
>> translation for this code, so I'm wondering: are there any cases in
>> which qemu will split a 64-bit memory access into two 32-bit accesses?
>
> That would be a bug in MTTCG.
>
>> The code: Guest CPU A writes a 64-bit value to an aligned memory
>> location that was previously 0, using a regular store; e.g.:
>>     f934        str x20,[x1]
>>
>> Guest CPU B (who is busy-waiting) reads a value from the same location:
>>     f9400280    ldr x0,[x20]
>>
>> The symptom: CPU B loads a value that is neither NULL nor the value
>> written. Instead, x0 gets only the low 32-bits of the value written
>> (high bits are all zero). By the time this value is dereferenced (a
>> few instructions later) and the exception handlers run, the memory
>> location from which it was loaded has the correct 64-bit value with
>> a non-zero upper half.
>>
>> Obviously on a real ARM memory barriers are critical, and indeed
>> the code has such barriers in it, but I'm assuming that any possible
>> mistranslation of the barriers is irrelevant because for a 64-bit load
>> and a 64-bit store you should get all or nothing. Other clues that may
>> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
>> is used to resolve a race when updating another variable), and the
>> busy-wait loop has a yield instruction in it (although those appear
>> to be no-ops with MTTCG).
>
> This might have to do with how ldrex/strex is emulated; are you relying
> on the exclusive pair detecting ABA? If so, your code won't work in
> QEMU since it uses cmpxchg to emulate ldrex/strex.

ABA problem is nothing to do with tearing. And cmpxchg will definitely not
create tearing problems.

I don't know how we would manage 64-bit tearing on a 64-bit host, at least
for the aarch64 guest, for which I believe we have a good emulation.

> - Pin the QEMU-MTTCG process to a single CPU. Can you repro then?

A good suggestion.

> - Force the emulation of cmpxchg via EXCP_ATOMIC with:
>
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index 87f673e..771effe5 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
>          }
>          tcg_temp_free_i64(t1);
>      } else if ((memop & MO_SIZE) == MO_64) {
> -#ifdef CONFIG_ATOMIC64
> +#if 0

I suspect this will simply alter the timing. However, give it a go by all
means.

If there's a test case that you can share, that would be awesome.
Especially if you can prod it to happen with a standalone minimal binary.
With luck you can reproduce via aarch64-linux-user too, and simply signal
an error via branch to __builtin_trap.

r~
Re: [Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?
(Adding some Cc's)

On Mon, Jul 24, 2017 at 19:05:33 +, Andrew Baumann via Qemu-devel wrote:
> Hi all,
>
> I'm trying to track down what appears to be a translation bug in either
> the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
> are entirely consistent with a torn read/write -- that is, a 64-bit
> load or store that was translated to two 32-bit loads and stores --
> but that's obviously not what happens in the common path through the
> translation for this code, so I'm wondering: are there any cases in
> which qemu will split a 64-bit memory access into two 32-bit accesses?

That would be a bug in MTTCG.

> The code: Guest CPU A writes a 64-bit value to an aligned memory
> location that was previously 0, using a regular store; e.g.:
>     f934        str x20,[x1]
>
> Guest CPU B (who is busy-waiting) reads a value from the same location:
>     f9400280    ldr x0,[x20]
>
> The symptom: CPU B loads a value that is neither NULL nor the value
> written. Instead, x0 gets only the low 32-bits of the value written
> (high bits are all zero). By the time this value is dereferenced (a
> few instructions later) and the exception handlers run, the memory
> location from which it was loaded has the correct 64-bit value with
> a non-zero upper half.
>
> Obviously on a real ARM memory barriers are critical, and indeed
> the code has such barriers in it, but I'm assuming that any possible
> mistranslation of the barriers is irrelevant because for a 64-bit load
> and a 64-bit store you should get all or nothing. Other clues that may
> be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
> is used to resolve a race when updating another variable), and the
> busy-wait loop has a yield instruction in it (although those appear
> to be no-ops with MTTCG).

This might have to do with how ldrex/strex is emulated; are you relying
on the exclusive pair detecting ABA? If so, your code won't work in
QEMU since it uses cmpxchg to emulate ldrex/strex.

> The bug repros more easily with more guest VCPUs, and more load on the
> host (i.e. more context switching to expose the race). It doesn't repro
> for the single-threaded TCG. Unfortunately it's hard to get detailed
> trace information, because the bug only repros roughly every one in 40
> attempts, and it's a long way into the guest OS boot before it arises.
>
> I'm not yet 100% convinced this is a qemu bug -- the obvious path
> through the translator for those instructions does 64-bit memory
> accesses on the host -- but at the same time, it has never been seen
> outside qemu, and after staring long and hard at the guest code, we're
> pretty sure it's correct. It's also extremely unlikely to be a wild
> write, given that it occurs on a wide variety of guest call-stacks,
> and the memory is later inconsistent with what was loaded.
>
> Any clues or debugging suggestions appreciated!

- Pin the QEMU-MTTCG process to a single CPU. Can you repro then?

- Force the emulation of cmpxchg via EXCP_ATOMIC with:

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..771effe5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2856,7 +2856,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
         }
         tcg_temp_free_i64(t1);
     } else if ((memop & MO_SIZE) == MO_64) {
-#ifdef CONFIG_ATOMIC64
+#if 0
         gen_atomic_cx_i64 gen;

         gen = table_cmpxchg[memop & (MO_SIZE | MO_BSWAP)];

This will halt all other vCPUs before the calling vCPU performs the
cmpxchg. Can you reproduce then?

Emilio
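On a Linux host, the first suggestion can be tried with `taskset`, which confines every thread of the process (vCPU threads included) to the given CPUs. The QEMU command line below is only a placeholder: the machine, CPU, SMP count, and image arguments are assumptions, not taken from this thread.

```shell
# Pin the whole QEMU process to host CPU 0 and see whether the
# torn read still reproduces (arguments are illustrative).
taskset -c 0 qemu-system-aarch64 \
    -machine virt -cpu cortex-a57 -smp 4 \
    -accel tcg,thread=multi \
    -kernel guest-image.bin
```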
[Qemu-devel] Torn read/write possible on aarch64/x86-64 MTTCG?
Hi all,

I'm trying to track down what appears to be a translation bug in either
the aarch64 target or x86_64 TCG (in multithreaded mode). The symptoms
are entirely consistent with a torn read/write -- that is, a 64-bit
load or store that was translated to two 32-bit loads and stores --
but that's obviously not what happens in the common path through the
translation for this code, so I'm wondering: are there any cases in
which qemu will split a 64-bit memory access into two 32-bit accesses?

The code: Guest CPU A writes a 64-bit value to an aligned memory
location that was previously 0, using a regular store; e.g.:
    f934        str x20,[x1]

Guest CPU B (who is busy-waiting) reads a value from the same location:
    f9400280    ldr x0,[x20]

The symptom: CPU B loads a value that is neither NULL nor the value
written. Instead, x0 gets only the low 32-bits of the value written
(high bits are all zero). By the time this value is dereferenced (a
few instructions later) and the exception handlers run, the memory
location from which it was loaded has the correct 64-bit value with
a non-zero upper half.

Obviously on a real ARM memory barriers are critical, and indeed
the code has such barriers in it, but I'm assuming that any possible
mistranslation of the barriers is irrelevant because for a 64-bit load
and a 64-bit store you should get all or nothing. Other clues that may
be relevant: the code is _near_ a LDREX/STREX pair (the busy-waiting
is used to resolve a race when updating another variable), and the
busy-wait loop has a yield instruction in it (although those appear
to be no-ops with MTTCG).

The bug repros more easily with more guest VCPUs, and more load on the
host (i.e. more context switching to expose the race). It doesn't repro
for the single-threaded TCG. Unfortunately it's hard to get detailed
trace information, because the bug only repros roughly every one in 40
attempts, and it's a long way into the guest OS boot before it arises.

I'm not yet 100% convinced this is a qemu bug -- the obvious path
through the translator for those instructions does 64-bit memory
accesses on the host -- but at the same time, it has never been seen
outside qemu, and after staring long and hard at the guest code, we're
pretty sure it's correct. It's also extremely unlikely to be a wild
write, given that it occurs on a wide variety of guest call-stacks,
and the memory is later inconsistent with what was loaded.

Any clues or debugging suggestions appreciated!

Thanks,
Andrew