Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
> > So there are effectively three reasons we want a delay:
> > 
> >  1. To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before
> >     KVM can enter the guest so that the guest doesn't need an
> >     arch-specific VM-Exit source.
> > 
> >  2. To let ioctl(KVM_RUN) make its way back to the test before the next
> >     round of migration.
> > 
> >  3. To ensure the read-side can make forward progress, e.g. if
> >     sched_getcpu() involves a syscall.
> > 
> > After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is
> > the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test
> > could be tweaked to be overtly x86-specific.  But since a delay is needed
> > for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that
> > this test provides coverage for arm64 and/or s390 if they're ever converted
> > to use the common xfer_to_guest_mode_work().
> 
> Now that we have this understanding of why we need the delay, it would be
> good to write this down in a comment within the test.

Ya, I'll get a new version out next week.
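Roughly something like this, perhaps (sketch only, the three points are the
ones listed above and the wording would get polished in the next version):

	/*
	 * Use a delay to (1) give sched_setaffinity() a chance to coincide
	 * with ioctl(KVM_RUN) before KVM can enter the guest, so that no
	 * arch-specific VM-Exit source is needed, (2) let ioctl(KVM_RUN)
	 * make its way back to the test before the next round of migration,
	 * and (3) let the read-side make forward progress, e.g. if
	 * sched_getcpu() involves a syscall.
	 */
	usleep(1);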

> Does it reproduce if we randomize the delay to have it picked randomly from
> 0us to 100us (with 1us step)?  It would remove a lot of the need for an
> arch-specific magic delay value.

My less-than-scientific testing shows that it can reproduce at delays up to
~500us, but above ~10us the reproducibility starts to drop.  The bug still
reproduces reliably, it just takes more iterations, and obviously the test
runs a bit slower.

Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?
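In context, the tail of the migration loop would then look something like this
(sketch of the proposal, not the final patch):

		CPU_CLR(cpu, &allowed_mask);

		/*
		 * Sweep the delay across 1..10us instead of relying on a
		 * single arch-specific magic value.
		 */
		usleep((i % 10) + 1);
	}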


Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Mathieu Desnoyers
- On Aug 27, 2021, at 7:23 PM, Sean Christopherson sea...@google.com wrote:

> On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
[...]
>> Does it reproduce if we randomize the delay to have it picked randomly from
>> 0us to 100us (with 1us step)?  It would remove a lot of the need for an
>> arch-specific magic delay value.
> 
> My less-than-scientific testing shows that it can reproduce at delays up to
> ~500us, but above ~10us the reproducibility starts to drop.  The bug still
> reproduces reliably, it just takes more iterations, and obviously the test
> runs a bit slower.
> 
> Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?

Works for me, thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Mathieu Desnoyers



- On Aug 26, 2021, at 7:54 PM, Sean Christopherson sea...@google.com wrote:

> On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
>> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com 
>> wrote:
>> >> >> +  r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> >> +  TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> >> +  errno, strerror(errno));
>> >> >> +  atomic_inc(&seq_cnt);
>> >> >> +
>> >> >> +  CPU_CLR(cpu, &allowed_mask);
>> >> >> +
>> >> >> +  /*
>> >> >> +   * Let the read-side get back into KVM_RUN to improve the odds
>> >> >> +   * of task migration coinciding with KVM's run loop.
>> >> > 
>> >> > This comment should be about increasing the odds of letting the seqlock
>> >> > read-side complete. Otherwise, the delay between the two back-to-back
>> >> > atomic_inc is so small that the seqlock read-side may never have time to
>> >> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> >> > retry forever.
>> > 
>> > Hmm, but that's not why there's a delay.  I'm not arguing that a livelock
>> > isn't possible (though that syscall would have to be screaming fast),
>> 
>> I don't think we have the same understanding of the livelock scenario. AFAIU
>> the livelock would be caused by a too-small delay between the two consecutive
>> atomic_inc() from one loop iteration to the next compared to the time it
>> takes to perform a read-side critical section of the seqlock. Back-to-back
>> atomic_inc can be performed very quickly, so I doubt that the sched_getcpu
>> implementation has good odds of being fast enough to complete in that narrow
>> window, leading to lots of read seqlock retries.
> 
> Ooooh, yeah, brain fart on my side.  I was thinking of the two atomic_inc()
> in the same loop iteration and completely ignoring the next iteration.  Yes,
> I 100% agree that a delay to ensure forward progress is needed.  An assertion
> in main() that the reader completes at least some reasonable number of
> KVM_RUNs is also probably a good idea, e.g. to rule out a false pass due to
> the reader never making forward progress.

Agreed.

> 
> FWIW, the do-while loop does make forward progress without a delay, but at
> ~50% throughput, give or take.

I did not expect absolutely no progress, but a significant slow down of
the read-side.

> 
>> > but the primary motivation is very much to allow the read-side enough time
>> > to get back into KVM proper.
>> 
>> I'm puzzled by your statement. AFAIU, let's say we don't have the delay,
>> then we have:
>> 
>> migration thread                   KVM_RUN/read-side thread
>> -----------------------------------------------------------------------
>>                                    - ioctl(KVM_RUN)
>> - atomic_inc_seq_cst()
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>>                                    - a = atomic_load() & ~1
>>                                    - smp_rmb()
>>                                    - b = LOAD_ONCE(__rseq_abi->cpu_id);
>>                                    - sched_getcpu()
>>                                    - smp_rmb()
>>                                    - re-load seqcnt/compare (succeeds)
>>                                      - Can only succeed if entire read-side
>>                                        happens while the seqcnt is in an
>>                                        even numbered state.
>>                                    - if (a != b) abort()
>>   /* no delay. Even counter state is
>>      very short. */
>> - atomic_inc_seq_cst()
>>   /* Let's suppose the lack of delay
>>      causes the setaffinity to complete
>>      too early compared with KVM_RUN
>>      ioctl */
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>>   /* no delay. Even counter state is
>>      very short. */
>> - atomic_inc_seq_cst()
>>   /* Then a setaffinity from a following
>>      migration thread loop will run
>>      concurrently with KVM_RUN */
>>                                    - ioctl(KVM_RUN)
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>> 
>> As pointed out here, if the first setaffinity runs too early compared with
>> KVM_RUN, a following setaffinity will run concurrently with it. However,
>> the fact that the even counter state is very short may very well hurt
>> progress of the read seqlock.
> 
> *sigh*
> 
> Several hours later, I think I finally have my head wrapped around everything.
> 
> Due to the way the test is written and because of how KVM's run loop works,
> TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM
> actually enters the guest, otherwise KVM will exit to userspace without
> touching the flag, i.e. it will be handled by the 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-26 Thread Sean Christopherson
On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com 
> wrote:
> >> >> +   r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> >> >> +   TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
> >> >> +   errno, strerror(errno));
> >> >> +   atomic_inc(&seq_cnt);
> >> >> +
> >> >> +   CPU_CLR(cpu, &allowed_mask);
> >> >> +
> >> >> +   /*
> >> >> +* Let the read-side get back into KVM_RUN to improve the odds
> >> >> +* of task migration coinciding with KVM's run loop.
> >> > 
> >> > This comment should be about increasing the odds of letting the seqlock
> >> > read-side complete. Otherwise, the delay between the two back-to-back
> >> > atomic_inc is so small that the seqlock read-side may never have time to
> >> > complete reading the rseq cpu id and the sched_getcpu() call, and can
> >> > retry forever.
> > 
> > Hmm, but that's not why there's a delay.  I'm not arguing that a livelock
> > isn't possible (though that syscall would have to be screaming fast),
> 
> I don't think we have the same understanding of the livelock scenario. AFAIU
> the livelock would be caused by a too-small delay between the two consecutive
> atomic_inc() from one loop iteration to the next compared to the time it
> takes to perform a read-side critical section of the seqlock. Back-to-back
> atomic_inc can be performed very quickly, so I doubt that the sched_getcpu
> implementation has good odds of being fast enough to complete in that narrow
> window, leading to lots of read seqlock retries.

Ooooh, yeah, brain fart on my side.  I was thinking of the two atomic_inc() in
the same loop iteration and completely ignoring the next iteration.  Yes, I
100% agree that a delay to ensure forward progress is needed.  An assertion in
main() that the reader completes at least some reasonable number of KVM_RUNs
is also probably a good idea, e.g. to rule out a false pass due to the reader
never making forward progress.

FWIW, the do-while loop does make forward progress without a delay, but at ~50% 
throughput, give or take.
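For reference, the read-side under discussion is essentially the following
(reconstructed sketch, not a verbatim quote of the patch; "runs", the final
assertion, and its threshold are the hypothetical addition mentioned above,
and cpu/rseq_cpu/snapshot are the variables already declared in main()):

	unsigned long runs = 0;

	while (!done) {
		vcpu_run(vm, VCPU_ID);
		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
			    "Guest failed?");

		/*
		 * Seqlock read-side: retry if the migration thread bumped
		 * seq_cnt while rseq's cpu_id and sched_getcpu() were being
		 * sampled, or if the count is odd (migration in-progress).
		 */
		do {
			snapshot = atomic_read(&seq_cnt) & ~1;
			smp_rmb();
			cpu = sched_getcpu();
			rseq_cpu = READ_ONCE(__rseq.cpu_id);
			smp_rmb();
		} while (snapshot != atomic_read(&seq_cnt));

		TEST_ASSERT(rseq_cpu == cpu,
			    "rseq CPU = %d, sched CPU = %d", rseq_cpu, cpu);
		runs++;
	}

	/* Hypothetical: fail instead of silently passing if the reader stalled. */
	TEST_ASSERT(runs > 1000, "Reader only completed %lu KVM_RUNs", runs);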

> > but the primary motivation is very much to allow the read-side enough time
> > to get back into KVM proper.
> 
> I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then
> we have:
> 
> migration thread                   KVM_RUN/read-side thread
> -----------------------------------------------------------------------
>                                    - ioctl(KVM_RUN)
> - atomic_inc_seq_cst()
> - sched_setaffinity
> - atomic_inc_seq_cst()
>                                    - a = atomic_load() & ~1
>                                    - smp_rmb()
>                                    - b = LOAD_ONCE(__rseq_abi->cpu_id);
>                                    - sched_getcpu()
>                                    - smp_rmb()
>                                    - re-load seqcnt/compare (succeeds)
>                                      - Can only succeed if entire read-side
>                                        happens while the seqcnt is in an
>                                        even numbered state.
>                                    - if (a != b) abort()
>   /* no delay. Even counter state is
>      very short. */
> - atomic_inc_seq_cst()
>   /* Let's suppose the lack of delay
>      causes the setaffinity to complete
>      too early compared with KVM_RUN
>      ioctl */
> - sched_setaffinity
> - atomic_inc_seq_cst()
>   /* no delay. Even counter state is
>      very short. */
> - atomic_inc_seq_cst()
>   /* Then a setaffinity from a following
>      migration thread loop will run
>      concurrently with KVM_RUN */
>                                    - ioctl(KVM_RUN)
> - sched_setaffinity
> - atomic_inc_seq_cst()
> 
> As pointed out here, if the first setaffinity runs too early compared with
> KVM_RUN, a following setaffinity will run concurrently with it. However, the
> fact that the even counter state is very short may very well hurt progress of
> the read seqlock.

*sigh*

Several hours later, I think I finally have my head wrapped around everything.

Due to the way the test is written and because of how KVM's run loop works,
TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM
actually enters the guest, otherwise KVM will exit to userspace without
touching the flag, i.e. it will be handled by the normal
exit_to_user_mode_loop().
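Paraphrasing (not quoting) the relevant handlers, the difference the test is
after is roughly the following; this is a from-memory sketch of the two paths,
not verbatim kernel code:

	/* exit_to_user_mode_loop(), kernel/entry/common.c (paraphrased) */
	if (ti_work & _TIF_NOTIFY_RESUME) {
		tracehook_notify_resume(regs);
		rseq_handle_notify_resume(NULL, regs);	/* rseq cpu_id refreshed */
	}

	/* xfer_to_guest_mode_work(), kernel/entry/kvm.c, pre-fix (paraphrased) */
	if (ti_work & _TIF_NOTIFY_RESUME)
		tracehook_notify_resume(NULL);		/* clears the flag, no rseq update */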

Where I got lost was trying to figure out why I couldn't make the bug
reproduce by causing the guest to exit to KVM, but not userspace, in which
case KVM should easily trigger the problematic flow as the window for
sched_getcpu() to collide with KVM would be enormous.  The reason I didn't go
down this 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-26 Thread Mathieu Desnoyers
- On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>> 
>> - On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com 
>> > wrote:
>> > 
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >> 
>> > 
>> > [...]
>> > 
>> > +#define RSEQ_SIG 0xdeadbeef
>> > 
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> > 
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
> 
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that the test doesn't actually do any rseq critical sections.

It might be good to add a comment near this define stating this then, so nobody
attempts to copy this as an example.

> 
>> > [...]
>> > 
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> + cpu_set_t allowed_mask;
>> >> + int r, i, nr_cpus, cpu;
>> >> +
>> >> + CPU_ZERO(&allowed_mask);
>> >> +
>> >> + nr_cpus = CPU_COUNT(&possible_mask);
>> >> +
>> >> + for (i = 0; i < 2; i++) {
>> >> + cpu = i % nr_cpus;
>> >> + if (!CPU_ISSET(cpu, &possible_mask))
>> >> + continue;
>> >> +
>> >> + CPU_SET(cpu, &allowed_mask);
>> >> +
>> >> + /*
>> >> +  * Bump the sequence count twice to allow the reader to detect
>> >> +  * that a migration may have occurred in between rseq and sched
>> >> +  * CPU ID reads.  An odd sequence count indicates a migration
>> >> +  * is in-progress, while a completely different count indicates
>> >> +  * a migration occurred since the count was last read.
>> >> +  */
>> >> + atomic_inc(&seq_cnt);
>> > 
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(&v->val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
> 
> Yeah, I got quite lost trying to figure out what atomics the test would
> actually end up with.

At the very least, until things are clarified in the selftests atomic header,
I would recommend adding a comment stating which memory barriers are required
alongside each use of atomic_inc here. I would even prefer that we add extra
(currently unneeded) write barriers to make extra sure that this stays
documented. Performance of the write-side does not matter much here.
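Concretely, something along these lines at the write-side would do (sketch
only; it assumes smp_wmb() is available via tools/include/asm/barrier.h, and
the extra barriers are redundant as long as atomic_inc() stays a full barrier):

	atomic_inc(&seq_cnt);	/* odd count: migration may be in progress */
	smp_wmb();		/* documentation only, atomic_inc() is a full barrier */
	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
	TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
		    errno, strerror(errno));
	smp_wmb();		/* documentation only, see above */
	atomic_inc(&seq_cnt);	/* even count: reader may sample again */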

> 
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> > 
>> > 
>> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> + errno, strerror(errno));
>> >> + atomic_inc(&seq_cnt);
>> >> +
>> >> + CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> + /*
>> >> +  * Let the read-side get back into KVM_RUN to improve the odds
>> >> +  * of task migration coinciding with KVM's run loop.
>> > 
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
> 
> Hmm, but that's not why there's a delay.  I'm not arguing that a livelock
> isn't possible (though that syscall would have to be screaming fast),

I don't think we have the same understanding of the livelock scenario. AFAIU
the livelock would be caused by a too-small delay between the two consecutive
atomic_inc() from one loop iteration to the next compared to the time it takes
to perform a read-side critical section of the seqlock. Back-to-back
atomic_inc can be performed very quickly, so I doubt that the sched_getcpu
implementation has good odds of being fast enough to complete
in that narrow 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-25 Thread Sean Christopherson
On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
> [ re-send to Darren Hart ]
> 
> - On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers 
> mathieu.desnoy...@efficios.com wrote:
> 
> > - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com 
> > wrote:
> > 
> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> >> without updating rseq, leading to a stale CPU ID and other badness.
> >> 
> > 
> > [...]
> > 
> > +#define RSEQ_SIG 0xdeadbeef
> > 
> > Is there any reason for defining a custom signature rather than including
> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
> > the proper architecture header which will define the appropriate signature.
> > 
> > Arguably you don't define rseq critical sections in this test per se, but
> > I'm wondering why the custom signature here.

Partly to avoid taking a dependency on rseq.h, and partly to try to call out
that the test doesn't actually do any rseq critical sections.

> > [...]
> > 
> >> +
> >> +static void *migration_worker(void *ign)
> >> +{
> >> +  cpu_set_t allowed_mask;
> >> +  int r, i, nr_cpus, cpu;
> >> +
> >> +  CPU_ZERO(&allowed_mask);
> >> +
> >> +  nr_cpus = CPU_COUNT(&possible_mask);
> >> +
> >> +  for (i = 0; i < 2; i++) {
> >> +  cpu = i % nr_cpus;
> >> +  if (!CPU_ISSET(cpu, &possible_mask))
> >> +  continue;
> >> +
> >> +  CPU_SET(cpu, &allowed_mask);
> >> +
> >> +  /*
> >> +   * Bump the sequence count twice to allow the reader to detect
> >> +   * that a migration may have occurred in between rseq and sched
> >> +   * CPU ID reads.  An odd sequence count indicates a migration
> >> +   * is in-progress, while a completely different count indicates
> >> +   * a migration occurred since the count was last read.
> >> +   */
> >> +  atomic_inc(&seq_cnt);
> > 
> > So technically this atomic_inc contains the required barriers because the
> > selftests implementation uses "__sync_add_and_fetch(&v->val, 1)". But
> > it's rather odd that the semantic differs from the kernel implementation in
> > terms of memory barriers: the kernel implementation of atomic_inc
> > guarantees no memory barriers, but this one happens to provide full
> > barriers pretty much by accident (selftests futex/include/atomic.h
> > documents no such guarantee).

Yeah, I got quite lost trying to figure out what atomics the test would actually
end up with.

> > If this full barrier guarantee is indeed provided by the selftests atomic.h
> > header, I would really like a comment stating that in the atomic.h header
> > so the carpet is not pulled from under our feet by a future optimization.
> > 
> > 
> >> +  r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> >> +  TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
> >> +  errno, strerror(errno));
> >> +  atomic_inc(&seq_cnt);
> >> +
> >> +  CPU_CLR(cpu, &allowed_mask);
> >> +
> >> +  /*
> >> +   * Let the read-side get back into KVM_RUN to improve the odds
> >> +   * of task migration coinciding with KVM's run loop.
> > 
> > This comment should be about increasing the odds of letting the seqlock
> > read-side complete. Otherwise, the delay between the two back-to-back
> > atomic_inc is so small that the seqlock read-side may never have time to
> > complete reading the rseq cpu id and the sched_getcpu() call, and can
> > retry forever.

Hmm, but that's not why there's a delay.  I'm not arguing that a livelock isn't
possible (though that syscall would have to be screaming fast), but the primary
motivation is very much to allow the read-side enough time to get back into KVM
proper.

To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
loop, i.e. sched_setaffinity() must induce task migration after the read-side
has invoked ioctl(KVM_RUN).

> > I'm wondering if 1 microsecond is sufficient on other architectures as
> > well.

I'm definitely wondering that as well :-)

> > One alternative way to make this depend less on the architecture's
> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
> > away the value read. This would ensure the delay is appropriate on all
> > architectures.

As above, I think an arbitrary delay is required regardless of how fast
sched_getcpu() can execute.  One thought would be to do sched_getcpu() _and_
usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
but I don't know that that adds meaningful value.
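I.e. something like this in the migration loop (sketch only, not a concrete
proposal):

		/* Burn roughly the cost of one read-side sched_getcpu()... */
		(void)sched_getcpu();
		/* ...and still give the reader time to get back into KVM_RUN. */
		usleep(1);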

The real test is if someone 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-23 Thread Mathieu Desnoyers
[ re-send to Darren Hart ]

- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com 
> wrote:
> 
>> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> without updating rseq, leading to a stale CPU ID and other badness.
>> 
> 
> [...]
> 
> +#define RSEQ_SIG 0xdeadbeef
> 
> Is there any reason for defining a custom signature rather than including
> tools/testing/selftests/rseq/rseq.h ? This should take care of including
> the proper architecture header which will define the appropriate signature.
> 
> Arguably you don't define rseq critical sections in this test per se, but
> I'm wondering why the custom signature here.
> 
> [...]
> 
>> +
>> +static void *migration_worker(void *ign)
>> +{
>> +cpu_set_t allowed_mask;
>> +int r, i, nr_cpus, cpu;
>> +
>> +CPU_ZERO(&allowed_mask);
>> +
>> +nr_cpus = CPU_COUNT(&possible_mask);
>> +
>> +for (i = 0; i < 2; i++) {
>> +cpu = i % nr_cpus;
>> +if (!CPU_ISSET(cpu, &possible_mask))
>> +continue;
>> +
>> +CPU_SET(cpu, &allowed_mask);
>> +
>> +/*
>> + * Bump the sequence count twice to allow the reader to detect
>> + * that a migration may have occurred in between rseq and sched
>> + * CPU ID reads.  An odd sequence count indicates a migration
>> + * is in-progress, while a completely different count indicates
>> + * a migration occurred since the count was last read.
>> + */
>> +atomic_inc(&seq_cnt);
> 
> So technically this atomic_inc contains the required barriers because the
> selftests implementation uses "__sync_add_and_fetch(&v->val, 1)". But it's
> rather odd that the semantic differs from the kernel implementation in terms
> of memory barriers: the kernel implementation of atomic_inc guarantees no
> memory barriers, but this one happens to provide full barriers pretty much by
> accident (selftests futex/include/atomic.h documents no such guarantee).
> 
> If this full barrier guarantee is indeed provided by the selftests atomic.h
> header, I would really like a comment stating that in the atomic.h header so
> the carpet is not pulled from under our feet by a future optimization.
> 
> 
>> +r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> +TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> +errno, strerror(errno));
>> +atomic_inc(&seq_cnt);
>> +
>> +CPU_CLR(cpu, &allowed_mask);
>> +
>> +/*
>> + * Let the read-side get back into KVM_RUN to improve the odds
>> + * of task migration coinciding with KVM's run loop.
> 
> This comment should be about increasing the odds of letting the seqlock
> read-side complete. Otherwise, the delay between the two back-to-back
> atomic_inc is so small that the seqlock read-side may never have time to
> complete reading the rseq cpu id and the sched_getcpu() call, and can retry
> forever.
> 
> I'm wondering if 1 microsecond is sufficient on other architectures as well.
> One alternative way to make this depend less on the architecture's
> implementation of sched_getcpu (whether it's a vDSO, or goes through a
> syscall) would be to read the rseq cpu id and call sched_getcpu a few times
> (e.g. 3 times) in the migration thread rather than use usleep, and throw away
> the value read. This would ensure the delay is appropriate on all
> architectures.
> 
> Thanks!
> 
> Mathieu
> 
>> + */
>> +usleep(1);
>> +}
>> +done = true;
>> +return NULL;
>> +}
>> +
>> +int main(int argc, char *argv[])
>> +{
>> +struct kvm_vm *vm;
>> +u32 cpu, rseq_cpu;
>> +int r, snapshot;
>> +
>> +/* Tell stdout not to buffer its content */
>> +setbuf(stdout, NULL);
>> +
>> +r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
>> +TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
>> +strerror(errno));
>> +
>> +if (CPU_COUNT(&possible_mask) < 2) {
>> +print_skip("Only one CPU, task migration not possible\n");
>> +exit(KSFT_SKIP);
>> +}
>> +
>> +sys_rseq(0);
>> +
>> +/*
>> + * Create and run a dummy VM that immediately exits to userspace via
>> + * GUEST_SYNC, while concurrently migrating the process by setting its
>> + * CPU affinity.
>> + */
>> +vm = vm_create_default(VCPU_ID, 0, guest_code);
>> +
>> +pthread_create(&migration_thread, NULL, migration_worker, 0);
>> +
>> +while (!done) {
>> +vcpu_run(vm, VCPU_ID);
>> +TEST_ASSERT(get_ucall(vm, VCPU_ID, 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-23 Thread Mathieu Desnoyers
- On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com wrote:

> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> migrated while the kernel is handling KVM_RUN.  This is a regression test
> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> without updating rseq, leading to a stale CPU ID and other badness.
> 

[...]

+#define RSEQ_SIG 0xdeadbeef

Is there any reason for defining a custom signature rather than including
tools/testing/selftests/rseq/rseq.h ? This should take care of including
the proper architecture header which will define the appropriate signature.

Arguably you don't define rseq critical sections in this test per se, but
I'm wondering why the custom signature here.

[...]

> +
> +static void *migration_worker(void *ign)
> +{
> + cpu_set_t allowed_mask;
> + int r, i, nr_cpus, cpu;
> +
> + CPU_ZERO(&allowed_mask);
> +
> + nr_cpus = CPU_COUNT(&possible_mask);
> +
> + for (i = 0; i < 2; i++) {
> + cpu = i % nr_cpus;
> + if (!CPU_ISSET(cpu, &possible_mask))
> + continue;
> +
> + CPU_SET(cpu, &allowed_mask);
> +
> + /*
> +  * Bump the sequence count twice to allow the reader to detect
> +  * that a migration may have occurred in between rseq and sched
> +  * CPU ID reads.  An odd sequence count indicates a migration
> +  * is in-progress, while a completely different count indicates
> +  * a migration occurred since the count was last read.
> +  */
> + atomic_inc(&seq_cnt);

So technically this atomic_inc contains the required barriers because the
selftests implementation uses "__sync_add_and_fetch(&v->val, 1)". But it's
rather odd that the semantic differs from the kernel implementation in terms
of memory barriers: the kernel implementation of atomic_inc guarantees no
memory barriers, but this one happens to provide full barriers pretty much by
accident (selftests futex/include/atomic.h documents no such guarantee).

If this full barrier guarantee is indeed provided by the selftests atomic.h
header, I would really like a comment stating that in the atomic.h header so
the carpet is not pulled from under our feet by a future optimization.
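For reference, the helper in question plus the kind of comment being asked for
would look something like this (sketch; the exact signature in
tools/testing/selftests/futex/include/atomic.h may differ slightly, and the
comment wording is only a suggestion):

/*
 * atomic_inc() - increment the counter
 *
 * NOTE: __sync_add_and_fetch() implies a full memory barrier.  Users such as
 * the KVM rseq selftest rely on that ordering, so do not relax this to a
 * plain increment without auditing callers.
 */
static inline int atomic_inc(atomic_t *v)
{
	return __sync_add_and_fetch(&v->val, 1);
}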


> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
> + errno, strerror(errno));
> + atomic_inc(&seq_cnt);
> +
> + CPU_CLR(cpu, &allowed_mask);
> +
> + /*
> +  * Let the read-side get back into KVM_RUN to improve the odds
> +  * of task migration coinciding with KVM's run loop.

This comment should be about increasing the odds of letting the seqlock
read-side complete. Otherwise, the delay between the two back-to-back
atomic_inc is so small that the seqlock read-side may never have time to
complete reading the rseq cpu id and the sched_getcpu() call, and can retry
forever.

I'm wondering if 1 microsecond is sufficient on other architectures as well.
One alternative way to make this depend less on the architecture's
implementation of sched_getcpu (whether it's a vDSO, or goes through a
syscall) would be to read the rseq cpu id and call sched_getcpu a few times
(e.g. 3 times) in the migration thread rather than use usleep, and throw away
the value read. This would ensure the delay is appropriate on all
architectures.
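Something like the following in the migration thread, in place of the
usleep() (sketch of the suggestion above; "j" would need to be declared, the
iteration count is arbitrary, and the values read are simply discarded):

		/*
		 * Instead of a fixed delay, burn roughly the cost of a few
		 * read-side critical sections so the window self-scales with
		 * the architecture's sched_getcpu() implementation.
		 */
		for (j = 0; j < 3; j++) {
			(void)sched_getcpu();
			(void)READ_ONCE(__rseq.cpu_id);
		}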

Thanks!

Mathieu

> +  */
> + usleep(1);
> + }
> + done = true;
> + return NULL;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + struct kvm_vm *vm;
> + u32 cpu, rseq_cpu;
> + int r, snapshot;
> +
> + /* Tell stdout not to buffer its content */
> + setbuf(stdout, NULL);
> +
> + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
> + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
> + strerror(errno));
> +
> + if (CPU_COUNT(&possible_mask) < 2) {
> + print_skip("Only one CPU, task migration not possible\n");
> + exit(KSFT_SKIP);
> + }
> +
> + sys_rseq(0);
> +
> + /*
> +  * Create and run a dummy VM that immediately exits to userspace via
> +  * GUEST_SYNC, while concurrently migrating the process by setting its
> +  * CPU affinity.
> +  */
> + vm = vm_create_default(VCPU_ID, 0, guest_code);
> +
> + pthread_create(&migration_thread, NULL, migration_worker, 0);
> +
> + while (!done) {
> + vcpu_run(vm, VCPU_ID);
> + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> + "Guest failed?");
> +
> + /*
> +  * Verify rseq's CPU matches sched's CPU.  Ensure migration
> +  * doesn't occur between sched_getcpu() and reading the rseq
> +  * cpu_id by 

[PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-20 Thread Sean Christopherson
Add a test to verify an rseq's CPU ID is updated correctly if the task is
migrated while the kernel is handling KVM_RUN.  This is a regression test
for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
without updating rseq, leading to a stale CPU ID and other badness.

Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/.gitignore  |   1 +
 tools/testing/selftests/kvm/Makefile|   3 +
 tools/testing/selftests/kvm/rseq_test.c | 154 
 3 files changed, 158 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/rseq_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore 
b/tools/testing/selftests/kvm/.gitignore
index 0709af0144c8..6d031ff6b68e 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -47,6 +47,7 @@
 /kvm_page_table_test
 /memslot_modification_stress_test
 /memslot_perf_test
+/rseq_test
 /set_memory_region_test
 /steal_time
 /kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index 5832f510a16c..0756e79cb513 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += kvm_page_table_test
 TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86_64 += memslot_perf_test
+TEST_GEN_PROGS_x86_64 += rseq_test
 TEST_GEN_PROGS_x86_64 += set_memory_region_test
 TEST_GEN_PROGS_x86_64 += steal_time
 TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
@@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_aarch64 += kvm_page_table_test
+TEST_GEN_PROGS_aarch64 += rseq_test
 TEST_GEN_PROGS_aarch64 += set_memory_region_test
 TEST_GEN_PROGS_aarch64 += steal_time
 TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test
@@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
 TEST_GEN_PROGS_s390x += dirty_log_test
 TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
 TEST_GEN_PROGS_s390x += kvm_page_table_test
+TEST_GEN_PROGS_s390x += rseq_test
 TEST_GEN_PROGS_s390x += set_memory_region_test
 TEST_GEN_PROGS_s390x += kvm_binary_stats_test
 
diff --git a/tools/testing/selftests/kvm/rseq_test.c 
b/tools/testing/selftests/kvm/rseq_test.c
new file mode 100644
index ..d28d7ba1a64a
--- /dev/null
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+#define VCPU_ID 0
+
+static __thread volatile struct rseq __rseq = {
+   .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+#define RSEQ_SIG 0xdeadbeef
+
+static pthread_t migration_thread;
+static cpu_set_t possible_mask;
+static bool done;
+
+static atomic_t seq_cnt;
+
+static void guest_code(void)
+{
+   for (;;)
+   GUEST_SYNC(0);
+}
+
+static void sys_rseq(int flags)
+{
+   int r;
+
+   r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG);
+   TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno));
+}
+
+static void *migration_worker(void *ign)
+{
+   cpu_set_t allowed_mask;
+   int r, i, nr_cpus, cpu;
+
+   CPU_ZERO(&allowed_mask);
+
+   nr_cpus = CPU_COUNT(&possible_mask);
+
+   for (i = 0; i < 2; i++) {
+   cpu = i % nr_cpus;
+   if (!CPU_ISSET(cpu, &possible_mask))
+   continue;
+
+   CPU_SET(cpu, &allowed_mask);
+
+   /*
+* Bump the sequence count twice to allow the reader to detect
+* that a migration may have occurred in between rseq and sched
+* CPU ID reads.  An odd sequence count indicates a migration
+* is in-progress, while a completely different count indicates
+* a migration occurred since the count was last read.
+*/
+   atomic_inc(&seq_cnt);
+   r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+   TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
+   errno, strerror(errno));
+   atomic_inc(&seq_cnt);
+
+   CPU_CLR(cpu, &allowed_mask);
+
+   /*
+* Let the read-side get back into KVM_RUN to improve the odds
+* of task migration coinciding with KVM's run loop.
+*/
+   usleep(1);
+   }
+   done = true;
+   return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+   struct kvm_vm *vm;
+   u32 cpu, rseq_cpu;
+   int r, snapshot;
+
+   /*