Re: proposed smp_rendezvous change
on 15/05/2011 19:24 Andriy Gapon said the following:

on 15/05/2011 19:09 Max Laier said the following:

I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us. It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

No reason. And I even haven't said that I prefer it :-) I just wanted to show and explain it as apparently there was some misunderstanding about it. I think that the generation count approach could even have slightly better performance while perhaps being a tiny bit less obvious.

Resurrecting this conversation. Here's a link to the thread on gmane for your convenience:
http://article.gmane.org/gmane.os.freebsd.current/132682

It seems that we haven't fully accounted for the special case of master CPU behavior in either my proposed fix that added a new wait count or in the committed fix that added a generation count. This has been pointed out by kib at the recent mini-DevSummit in Kyiv.
The problem can be illustrated by the following example:
- the teardown function is a real function; that is, not NULL and not smp_no_rendevous_barrier (sic)
- the arg parameter is a non-NULL pointer
- the teardown function actually accesses memory pointed to by arg
- the master CPU deallocates the memory pointed to by arg right after the smp_rendezvous() call returns (either via free(9) or by stack management, etc.)

Quoting the code:

	MPASS(generation == smp_rv_generation);
	atomic_add_int(&smp_rv_waiters[2], 1);
	if (local_teardown_func != smp_no_rendevous_barrier) {
		while (smp_rv_waiters[2] < smp_rv_ncpus &&
		    generation == smp_rv_generation)
			cpu_spinwait();

		if (local_teardown_func != NULL)
			local_teardown_func(local_func_arg);
	}

The generation count helps slave CPUs avoid getting stuck in the while loop, but it doesn't prevent a stale/different arg from being passed to the teardown function.

So here is a patch, which is a variation of my earlier proposal, that should fix this issue:
http://people.freebsd.org/~avg/smp_rendezvous.diff
The patch is not tested yet.

The patch undoes the smp_rv_generation change and introduces a new smp_rv_waiters[] element. Perhaps the code could be optimized by making waiting on smp_rv_waiters[3] conditional on some combination of values for the teardown function and the arg, but I decided to follow the KISS principle here to make the code a little bit more robust.

-- Andriy Gapon

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: proposed smp_rendezvous change
on 17/05/2011 18:51 John Baldwin said the following:

On Tuesday, May 17, 2011 10:34:41 am Andriy Gapon wrote:

on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way. It wouldn't work on Alpha, for example. All load_acq is is a load with a memory barrier to order other loads after it. It is still free to load stale data. Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok. How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-) Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any. Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)? You see what I am getting at?

Yeah, either of those would work. At this point just leaving the atomic_add_int() as-is would be the smallest diff. :)

Yes, atomic_add_int() for smp_rv_waiters[0] is completely OK; it's the spinwait loop on it that is a bit wasteful. And probably the extra smp_rv_waiters[] element used only for this purpose. Do you agree?

Convenience link to the gmane archive of this thread:
http://article.gmane.org/gmane.os.freebsd.current/132730

-- Andriy Gapon
Re: proposed smp_rendezvous change
On Monday, July 18, 2011 7:34:11 am Andriy Gapon wrote:

on 15/05/2011 19:24 Andriy Gapon said the following:

on 15/05/2011 19:09 Max Laier said the following:

I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us. It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

No reason. And I even haven't said that I prefer it :-) I just wanted to show and explain it as apparently there was some misunderstanding about it. I think that the generation count approach could even have slightly better performance while perhaps being a tiny bit less obvious.

Resurrecting this conversation. Here's a link to the thread on gmane for your convenience:
http://article.gmane.org/gmane.os.freebsd.current/132682

It seems that we haven't fully accounted for the special case of master CPU behavior in either my proposed fix that added a new wait count or in the committed fix that added a generation count. This has been pointed out by kib at the recent mini-DevSummit in Kyiv.
The problem can be illustrated by the following example:
- the teardown function is a real function; that is, not NULL and not smp_no_rendevous_barrier (sic)
- the arg parameter is a non-NULL pointer
- the teardown function actually accesses memory pointed to by arg
- the master CPU deallocates the memory pointed to by arg right after the smp_rendezvous() call returns (either via free(9) or by stack management, etc.)

Quoting the code:

	MPASS(generation == smp_rv_generation);
	atomic_add_int(&smp_rv_waiters[2], 1);
	if (local_teardown_func != smp_no_rendevous_barrier) {
		while (smp_rv_waiters[2] < smp_rv_ncpus &&
		    generation == smp_rv_generation)
			cpu_spinwait();

		if (local_teardown_func != NULL)
			local_teardown_func(local_func_arg);
	}

The generation count helps slave CPUs avoid getting stuck in the while loop, but it doesn't prevent a stale/different arg from being passed to the teardown function.

So here is a patch, which is a variation of my earlier proposal, that should fix this issue:
http://people.freebsd.org/~avg/smp_rendezvous.diff
The patch is not tested yet.

The patch undoes the smp_rv_generation change and introduces a new smp_rv_waiters[] element. Perhaps the code could be optimized by making waiting on smp_rv_waiters[3] conditional on some combination of values for the teardown function and the arg, but I decided to follow the KISS principle here to make the code a little bit more robust.

Yes, I actually prefer the more robust case as it is clearer. This is fine with me. It would be good to get Max's feedback and possibly Neel's as well.

-- John Baldwin
Re: proposed smp_rendezvous change
On Wednesday, May 18, 2011 4:27:50 pm Max Laier wrote:

On 05/17/2011 01:35 PM, John Baldwin wrote:
...
Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit(). But it also wasn't as fancy in the critical_exit() case. Here is what I have, and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe). Hmm, actually, it will need to use the read and clear:

Looks good to me. I was slightly surprised by this part of the patch:

Index: kern/kern_synch.c
===
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));

But that is in fact correct and much more expressive and safe than the version we had before.

Ok, I need to stress test this some first.

Thanks, Max

P.S. I'd like to see this and the rendezvous changes in stable/7 in the not too distant future. Mind if I MFH these when you are done - unless you are planning to do it already, anyway.

I will certainly merge them. These are critical fixes. :(

-- John Baldwin
Re: proposed smp_rendezvous change
On Wednesday, May 18, 2011 4:27:50 pm Max Laier wrote:

On 05/17/2011 01:35 PM, John Baldwin wrote:
...
Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit(). But it also wasn't as fancy in the critical_exit() case. Here is what I have, and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe). Hmm, actually, it will need to use the read and clear:

Looks good to me. I was slightly surprised by this part of the patch:

Index: kern/kern_synch.c
===
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));

But that is in fact correct and much more expressive and safe than the version we had before.

Actually, this is something that was made obsolete by ups's change to critical_exit() several years ago. However, after sleeping on this last night, I think that critical_enter() and critical_exit() are a very hot path and that it may be worth adding some additional complexity to keep them as cheap as possible.

Also, your original patch for rm locks is actually correct. They will not miss a preemption, because rm_cleanIPI will not schedule a thread to run while it holds the spin lock. However, I'd like to be a bit more future proof so that other IPI handler authors don't have to deal with this obscure race. Given that, I've generalized your fix and moved it up into the SMP rendezvous code itself, along with a big comment to explain why it works and a KASSERT().
I think this is also MFC'able back to 7 as well:

Index: kern/subr_smp.c
===
--- kern/subr_smp.c	(revision 222032)
+++ kern/subr_smp.c	(working copy)
@@ -316,6 +316,9 @@
 	void (*local_action_func)(void*);
 	void (*local_teardown_func)(void*);
 	int generation;
+#ifdef INVARIANTS
+	int owepreempt;
+#endif
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
@@ -330,6 +333,33 @@
 	generation = smp_rv_generation;
 
 	/*
+	 * Use a nested critical section to prevent any preemptions
+	 * from occurring during a rendezvous action routine.
+	 * Specifically, if a rendezvous handler is invoked via an IPI
+	 * and the interrupted thread was in the critical_exit()
+	 * function after setting td_critnest to 0 but before
+	 * performing a deferred preemption, this routine can be
+	 * invoked with td_critnest set to 0 and td_owepreempt true.
+	 * In that case, a critical_exit() during the rendezvous
+	 * action would trigger a preemption which is not permitted in
+	 * a rendezvous action.  To fix this, wrap all of the
+	 * rendezvous action handlers in a critical section.  We
+	 * cannot use a regular critical section, however, as having
+	 * critical_exit() preempt from this routine would also be
+	 * problematic (the preemption must not occur before the IPI
+	 * has been acknowledged via an EOI).  Instead, we
+	 * intentionally ignore td_owepreempt when leaving the
+	 * critical section.  This should be harmless because we do
+	 * not permit rendezvous action routines to schedule threads,
+	 * and thus td_owepreempt should never transition from 0 to 1
+	 * during this routine.
+	 */
+	td->td_critnest++;
+#ifdef INVARIANTS
+	owepreempt = td->td_owepreempt;
+#endif
+
+	/*
 	 * If requested, run a setup function before the main action
 	 * function.  Ensure all CPUs have completed the setup
 	 * function before moving on to the action function.
@@ -362,14 +392,18 @@
 	 */
 	MPASS(generation == smp_rv_generation);
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
-		return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus &&
-	    generation == smp_rv_generation)
-		cpu_spinwait();
+	if (local_teardown_func != smp_no_rendevous_barrier) {
+		while (smp_rv_waiters[2] < smp_rv_ncpus &&
+		    generation == smp_rv_generation)
+			cpu_spinwait();
 
-	if (local_teardown_func != NULL)
-		local_teardown_func(local_func_arg);
+		if (local_teardown_func != NULL)
+
Re: proposed smp_rendezvous change
On 05/17/2011 01:35 PM, John Baldwin wrote:
...
Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit(). But it also wasn't as fancy in the critical_exit() case. Here is what I have, and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe). Hmm, actually, it will need to use the read and clear:

Looks good to me. I was slightly surprised by this part of the patch:

Index: kern/kern_synch.c
===
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));

But that is in fact correct and much more expressive and safe than the version we had before.

Thanks, Max

P.S. I'd like to see this and the rendezvous changes in stable/7 in the not too distant future. Mind if I MFH these when you are done - unless you are planning to do it already, anyway.
Re: proposed smp_rendezvous change
on 16/05/2011 23:09 John Baldwin said the following:

On Monday, May 16, 2011 3:27:47 pm Andriy Gapon wrote:

on 16/05/2011 21:21 John Baldwin said the following:

How about this:
...
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;

I want to ask once again - please pretty please convince me that the above cpu_spinwait() loop is really needed and, by extension, that the assignments should be moved behind it. Please :)

Well, moving the assignments down is a style fix for one, and we can always remove the first rendezvous as a follow-up if desired.

OK, makes sense.

However, suppose you have an arch where sending an IPI is not a barrier (this seems unlikely). In that case, the atomic_add_acq_int() will not succeed (and return) until it has seen the earlier write by the CPU initiating the rendezvous to clear smp_rv_waiters[0] to zero. The actual spin on the smp_rv_waiters[] value is not strictly necessary, however, and

On this below.

is probably just cut and pasted to match the other uses of values in the smp_rv_waiters[] array. (atomic_add_acq_int() could spin on architectures where it is implemented using compare-and-swap (e.g.
sparc64) or locked-load conditional-store (e.g. Alpha).)

When you say "not strictly necessary", do you mean "not necessary"? If you do not mean that, then when could it be (non-strictly) necessary? :)

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

-- Andriy Gapon
Re: proposed smp_rendezvous change
On 5/17/11 4:03 AM, Andriy Gapon wrote:

on 16/05/2011 23:09 John Baldwin said the following:

is probably just cut and pasted to match the other uses of values in the smp_rv_waiters[] array. (atomic_add_acq_int() could spin on architectures where it is implemented using compare-and-swap (e.g. sparc64) or locked-load conditional-store (e.g. Alpha).)

When you say "not strictly necessary", do you mean "not necessary"? If you do not mean that, then when could it be (non-strictly) necessary? :)

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

No, you could replace it with:

	atomic_add_acq_int(&smp_rv_waiters[0], 1);

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

-- John Baldwin
Re: proposed smp_rendezvous change
On 5/16/11 6:05 PM, Max Laier wrote:

On Monday 16 May 2011 17:54:54 Attilio Rao wrote:

2011/5/16 Max Laier <m...@love2party.net>:

On Monday 16 May 2011 16:46:03 John Baldwin wrote:

On Monday, May 16, 2011 4:30:44 pm Max Laier wrote:

On Monday 16 May 2011 14:21:27 John Baldwin wrote:

Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption?

Right now I just did a manual td_critnest++/--, but I guess ...

Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler.

... the earliest point where it is safe to preempt is after doing the

	atomic_add_int(&smp_rv_waiters[2], 1);

so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!?

Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do).

The spin unlock boils down to a critical_exit(), and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU.

No, it is only set on curthread by curthread.

This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that, however.
I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?

I found the stack that led me to this patch in the first place:

#0 sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1 0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2 0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3 0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4 0x8027adea in rm_cleanIPI (arg=value optimized out) at src/sys/kern/kern_rmlock.c:180
#5 0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6 0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7 0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8 0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that yet ... but this:

	if (td->td_critnest == 1) {
		td->td_critnest = 0;
		if (td->td_owepreempt) {
			td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do.

When an interrupt fires, it goes on the old interrupt/kernel context, which means it has not a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context.

Preempting just means another thread will be scheduled in the middle of another thread's execution path. This code is perfectly fine if you consider that curthread won't be descheduled while it is executing.

Well, no - it is not.
With this you can end up with a curthread that has td_critnest=0 and td_owepreempt=1 in interrupt context. If you use a spinlock on such a thread, it will do the preemption at the point where you drop the spinlock, and this is bad in some circumstances. One example is the smp_rendezvous case we are discussing here.

Yes, the owepreempt flag can "leak" because we allow it to be set while td_critnest is 0. That is the problem, and I think we added this as an optimization in the last round of changes to this routine to avoid excessive context switches. However, I think we will just have to revert that optimization and prevent that state from occurring.

Hmm, the log message for the change said it was to avoid races.

http://svnweb.freebsd.org/base?view=revision&revision=144777

That is what introduced the bug of mixing those two states.

Hmmm, the reason for moving the td_critnest == 0 up
Re: proposed smp_rendezvous change
on 17/05/2011 14:56 John Baldwin said the following:

On 5/17/11 4:03 AM, Andriy Gapon wrote:

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

No, you could replace it with:

	atomic_add_acq_int(&smp_rv_waiters[0], 1);

What about

	(void)atomic_load_acq(&smp_rv_waiters[0]);

In my opinion that should ensure that the hardware must post the latest value from a master CPU to memory of smp_rv_waiters[0] and a slave CPU gets it from there. And also, because of the memory barriers inserted by store_rel on the master CPU and load_acq on the slave CPU, the latest values of all other smp_rv_* fields should become visible to the slave CPU.

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

Thank you for the great explanation!
Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas call, which should take care of hardware synchronization.

-- Andriy Gapon
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 9:46:34 am Andriy Gapon wrote:

on 17/05/2011 14:56 John Baldwin said the following:

On 5/17/11 4:03 AM, Andriy Gapon wrote:

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

No, you could replace it with:

	atomic_add_acq_int(&smp_rv_waiters[0], 1);

What about

	(void)atomic_load_acq(&smp_rv_waiters[0]);

In my opinion that should ensure that the hardware must post the latest value from a master CPU to memory of smp_rv_waiters[0] and a slave CPU gets it from there. And also, because of the memory barriers inserted by store_rel on the master CPU and load_acq on the slave CPU, the latest values of all other smp_rv_* fields should become visible to the slave CPU.

No, it doesn't quite work that way. It wouldn't work on Alpha, for example. All load_acq is is a load with a memory barrier to order other loads after it. It is still free to load stale data. Only a read-modify-write operation would actually block until it could access an up-to-date value.

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

Thank you for the great explanation!
Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas call, which should take care of hardware synchronization.

sparc64's load_acq() is stronger than the MI effect of load_acq(). On ia64, which uses ld.acq, or Alpha (originally), which uses a membar and a simple load, the guarantees are only what I stated above (and would not be sufficient). Note that Alpha borrowed heavily from MIPS, and the MIPS atomic implementation is mostly identical to the old Alpha one (using conditional stores, etc.). The MIPS atomic_load_acq():

#define	ATOMIC_STORE_LOAD(WIDTH)			\
static __inline uint##WIDTH##_t				\
atomic_load_acq_##WIDTH(__volatile uint##WIDTH##_t *p)	\
{							\
	uint##WIDTH##_t v;				\
							\
	v = *p;						\
	mips_sync();					\
	return (v);					\
}							\

-- John Baldwin
Re: proposed smp_rendezvous change
on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way. It wouldn't work on Alpha, for example. All load_acq is is a load with a memory barrier to order other loads after it. It is still free to load stale data. Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok. How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-) Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any. Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)? You see what I am getting at?

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

Thank you for the great explanation!
Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas call, which should take care of hardware synchronization.

sparc64's load_acq() is stronger than the MI effect of load_acq(). On ia64

Oh, well, my expectation was that the MI effect of atomic_load (emphasis on atomic_) was to get a non-stale value.

which uses ld.acq, or Alpha (originally), which uses a membar and a simple load, the guarantees are only what I stated above (and would not be sufficient). Note that Alpha borrowed heavily from MIPS, and the MIPS atomic implementation is mostly identical to the old Alpha one (using conditional stores, etc.).
The MIPS atomic_load_acq():

#define	ATOMIC_STORE_LOAD(WIDTH)			\
static __inline uint##WIDTH##_t				\
atomic_load_acq_##WIDTH(__volatile uint##WIDTH##_t *p)	\
{							\
	uint##WIDTH##_t v;				\
							\
	v = *p;						\
	mips_sync();					\
	return (v);					\
}							\

I should have checked this myself. Thank you for patiently explaining these things to me.

-- Andriy Gapon
Re: proposed smp_rendezvous change
On 05/17/2011 05:16 AM, John Baldwin wrote:
...
Index: kern/kern_switch.c
===
--- kern/kern_switch.c	(revision 221536)
+++ kern/kern_switch.c	(working copy)
@@ -192,15 +192,22 @@
 critical_exit(void)
 {
 	struct thread *td;
-	int flags;
+	int flags, owepreempt;
 
 	td = curthread;
 	KASSERT(td->td_critnest != 0,
 	    ("critical_exit: td_critnest == 0"));
 	if (td->td_critnest == 1) {
+		owepreempt = td->td_owepreempt;
+		td->td_owepreempt = 0;
+		/*
+		 * XXX: Should move compiler_memory_barrier() from
+		 * rmlock to a header.
+		 */

XXX: If we get an interrupt at this point and td_owepreempt was zero, the new interrupt will re-set it, because td_critnest is still non-zero. So we still end up with a thread that is leaking an owepreempt *and* lose a preemption.

We really need an atomic_readandclear() which gives us a local copy of td_owepreempt *and* clears critnest in the same operation. Sadly, that is rather expensive. It is possible to implement with a flag for owepreempt, but that means that all writes to critnest must then be atomic. Either because we know we have interrupts disabled (i.e. setting owepreempt can be a RMW), or with a proper atomic_add/set/...

I'm not sure what the performance impact of this will be. One would hope that an atomic_add without a memory barrier isn't much more expensive than a compiler-generated read-modify-write, though. Especially since this cacheline should be local and exclusive to us anyway.

+		__asm __volatile("":::"memory");
 		td->td_critnest = 0;
-		if (td->td_owepreempt) {
+		if (owepreempt) {
 			td->td_critnest = 1;
 			thread_lock(td);
 			td->td_critnest--;
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 10:34:41 am Andriy Gapon wrote:

on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way.  It wouldn't work on Alpha for example.  load_acq is just a load with a memory barrier to order other loads after it.  It is still free to load stale data.  Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok.  How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-)  Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any.  Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)?  You see what I am getting at?

Yeah, either of those would work.  At this point just leaving the atomic_add_int() as-is would be the smallest diff. :)

-- 
John Baldwin
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 12:20:40 pm Max Laier wrote:

On 05/17/2011 05:16 AM, John Baldwin wrote:
...
Index: kern/kern_switch.c
...

XXX: If we get an interrupt at this point and td_owepreempt was zero, the new interrupt will re-set it, because td_critnest is still non-zero.  So we still end up with a thread that is leaking an owepreempt *and* lose a preemption.

I don't see how this can still leak owepreempt.  The nested interrupt should do nothing (except for possibly set owepreempt) until td_critnest is 0.  However, we can certainly lose preemptions.

I wonder if we can abuse the high bit of td_critnest for the owepreempt flag so it is all stored in one cookie.  We only set owepreempt while holding thread_lock() (so interrupts are disabled), so I think we would be ok and not need atomic ops.

Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let me think some more.

-- 
John Baldwin
Re: proposed smp_rendezvous change
On 05/17/2011 09:56 AM, John Baldwin wrote:

On Tuesday, May 17, 2011 12:20:40 pm Max Laier wrote:

On 05/17/2011 05:16 AM, John Baldwin wrote:
...
Index: kern/kern_switch.c
...

XXX: If we get an interrupt at this point and td_owepreempt was zero, the new interrupt will re-set it, because td_critnest is still non-zero.  So we still end up with a thread that is leaking an owepreempt *and* lose a preemption.

I don't see how this can still leak owepreempt.  The nested interrupt should do nothing (except for possibly set owepreempt) until td_critnest is 0.

Exactly.  The interrupt sets owepreempt, and after we return here we set td_critnest to 0 and exit without clearing owepreempt.  Hence we leak the owepreempt.

However, we can certainly lose preemptions.  I wonder if we can abuse the high bit of td_critnest for the owepreempt flag so it is all stored in one cookie.  We only set owepreempt while holding thread_lock() (so interrupts are disabled), so I think we would be ok and not need atomic ops.  Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let me think some more.

I think these two really belong into one single variable.  Setting the owepreempt flag can be a normal RMW.  Increasing and decreasing critnest must be atomic (otherwise we could lose the flag), and dropping the final reference would work like this:

	if ((curthread->td_critnest & TD_CRITNEST_MASK) == 1) {
		unsigned int owe;

		owe = atomic_readandclear(&curthread->td_critnest);
		if (owe & TD_OWEPREEMPT_FLAG) {
			/* do the switch */
		}
	}

That should do it ...
I can put that into a patch, if we agree that's the right thing to do.

Thanks,
  Max
Re: proposed smp_rendezvous change
on 17/05/2011 18:51 John Baldwin said the following:

On Tuesday, May 17, 2011 10:34:41 am Andriy Gapon wrote:

on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way.  It wouldn't work on Alpha for example.  load_acq is just a load with a memory barrier to order other loads after it.  It is still free to load stale data.  Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok.  How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-)  Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any.  Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)?  You see what I am getting at?

Yeah, either of those would work.  At this point just leaving the atomic_add_int() as-is would be the smallest diff. :)

Yes, I agree about the smallest diff.  But if we are going to remove that first spin-loop, then we would effectively be wasting one volatile variable.  Only a tiny waste, but it could be confusing.

Whatever we decide, this was very educational for me.

-- 
Andriy Gapon
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 3:16:45 pm Max Laier wrote:

On 05/17/2011 09:56 AM, John Baldwin wrote:
...

I don't see how this can still leak owepreempt.  The nested interrupt should do nothing (except for possibly set owepreempt) until td_critnest is 0.

Exactly.  The interrupt sets owepreempt, and after we return here we set td_critnest to 0 and exit without clearing owepreempt.  Hence we leak the owepreempt.

However, we can certainly lose preemptions.  I wonder if we can abuse the high bit of td_critnest for the owepreempt flag so it is all stored in one cookie.  We only set owepreempt while holding thread_lock() (so interrupts are disabled), so I think we would be ok and not need atomic ops.  Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let me think some more.

I think these two really belong into one single variable.  Setting the owepreempt flag can be a normal RMW.  Increasing and decreasing critnest must be atomic (otherwise we could lose the flag), and dropping the final reference would work like this:

	if ((curthread->td_critnest & TD_CRITNEST_MASK) == 1) {
		unsigned int owe;

		owe = atomic_readandclear(&curthread->td_critnest);
		if (owe & TD_OWEPREEMPT_FLAG) {
			/* do the switch */
		}
	}

That should do it ...  I can put that into a patch, if we agree that's the right thing to do.

Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit().  But it also wasn't as fancy in the critical_exit() case.  Here is what I have and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe).  Hmm, actually, it will need to use the read and clear:

Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c	(revision 222024)
+++ kern/sched_ule.c	(working copy)
@@ -1785,7 +1785,6 @@
 	td->td_oncpu = NOCPU;
 	if (!(flags & SW_PREEMPT))
 		td->td_flags &= ~TDF_NEEDRESCHED;
-	td->td_owepreempt = 0;
 	tdq->tdq_switchcnt++;
 	/*
 	 * The lock pointer in an idle thread should never change.  Reset it
@@ -2066,7 +2065,7 @@
 	flags = SW_INVOL | SW_PREEMPT;
 	if (td->td_critnest > 1)
-		td->td_owepreempt = 1;
+		td->td_critnest |= TD_OWEPREEMPT;
 	else if (TD_IS_IDLETHREAD(td))
 		mi_switch(flags | SWT_REMOTEWAKEIDLE, NULL);
 	else
@@ -2261,7 +2260,7 @@
 		return;
 	if (!sched_shouldpreempt(pri, cpri, 0))
 		return;
-	ctd->td_owepreempt = 1;
+	ctd->td_critnest |= TD_OWEPREEMPT;
 }
 
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c	(revision 222024)
+++ kern/kern_switch.c	(working copy)
@@ -183,7 +183,7 @@
 	struct thread *td;
 
 	td = curthread;
-	td->td_critnest++;
+	atomic_add_int(&td->td_critnest, 1);
 	CTR4(KTR_CRITICAL, "critical_enter by thread %p (%ld, %s) to %d", td,
 	    (long)td->td_proc->p_pid, td->td_name, td->td_critnest);
 }
@@ -192,18 +192,16 @@
 critical_exit(void)
 {
 	struct thread *td;
-	int flags;
+	int flags, owepreempt;
 
 	td = curthread;
 	KASSERT(td->td_critnest != 0, ("critical_exit: td_critnest == 0"));
-	if (td->td_critnest == 1) {
-		td->td_critnest = 0;
-		if (td->td_owepreempt) {
-			td->td_critnest = 1;
+	if ((td->td_critnest & ~TD_OWEPREEMPT) == 1) {
+		owepreempt = atomic_readandclear_int(&td->td_owepreempt);
+		if (owepreempt & TD_OWEPREEMPT) {
 			thread_lock(td);
-			td->td_critnest--;
 			flags = SW_INVOL | SW_PREEMPT;
 			if (TD_IS_IDLETHREAD(td))
 				flags |= SWT_IDLE;
@@ -213,7 +211,7 @@
 			thread_unlock(td);
Re: proposed smp_rendezvous change
On Sunday, May 15, 2011 12:36:45 pm Andriy Gapon wrote:

on 15/05/2011 18:16 John Baldwin said the following:
On 5/15/11 10:53 AM, Andriy Gapon wrote:
on 15/05/2011 10:12 Andriy Gapon said the following:
on 14/05/2011 18:25 John Baldwin said the following:

Hmmm, so this is not actually sufficient.  NetApp ran into a very similar race with virtual CPUs in BHyVe.  In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race.  The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]?  It's always only the master CPU (and under smp_ipi_mtx).

Here's a cleaner version of my approach to the fix.  This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action(), and thus does not renumber all the smp_rv_waiters[] members, so hopefully it should be clearer.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *ar
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change.  I do think this will still be ok and I probably just didn't explain it well to Neel.  I wonder though if the bump shouldn't happen until after the call of the local teardown function?

That should not be necessary according to my understanding, unless we have a reason to care about potentially disappearing function text.  The teardown function pointer and argument are locally cached, and I don't see any problems with a new rendezvous being set up and run while teardowns of the previous one are still running.

On the other hand, you might be onto something here.  If the argument is allocated in temporary storage (e.g. on the stack) and the teardown function actually uses the argument, then it's not safe to deallocate the argument until all of the teardowns are done.  In this case the master CPU should wait not only until all actions are done, but also until all teardowns are done.  The current code doesn't do that.  And I am not sure if we actually want to add this safety net or just document the behavior and put the burden on smp_rendezvous API users.

As we chatted about on IRC, I think we should err on the side of making this API less complex (so that all of the actions for a given rendezvous finish before the next rendezvous begins), as this stuff is already fairly complex.  We should only add more complexity if we really need it.

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e. with non-trivial setup, action and teardown (except perhaps for jkim's stress test).  OpenSolaris seems to be doing just fine with the following three types of simpler CPU cross-calls (as they call them):

1. async - a master CPU requests an action to be executed on target CPUs but doesn't wait for anything - we have no explicit equivalent for this

2. call - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it - this would
Re: proposed smp_rendezvous change
I'd like to propose that we move forward with fixing the race before redesigning the API - please.

We agree that there is a problem with the exit rendezvous.  Let's fix that.  We agree that the generation approach proposed by NetApp is the right way to go?  Can we check it in, then?  Again, I'd like some comments in the code if at all possible.

A second commit should take care of the entry sync - which may or may not be required on some architectures.  If we decide that we don't need it, we should remove it.  Otherwise, we must move all the assignments from the global smp_rv_* variables to the stack in smp_rendezvous_action to after the sync.  Otherwise we should just kill it.  In any case, it would be nice to clean this up.

After that, I have one more bugfix for rmlock(9) - one of the three consumers of this API (that I am aware of).  As it uses a spinlock inside its IPI action, we have to mark smp_rendezvous_action a critical section, otherwise the spin unlock might yield ... with the expected consequences.  It is arguable whether you should be allowed to use spinlocks in the rendezvous at all, but that's beside the point.

On Monday 16 May 2011 11:52:10 John Baldwin wrote:

On Sunday, May 15, 2011 12:36:45 pm Andriy Gapon wrote:
...
As we chatted about on IRC, I think we should err on the side of making this API less complex (so that all of the actions for a given rendezvous finish before the next rendezvous begins), as this stuff is already fairly complex.  We should only add more complexity if we really need it.

+1

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e. with non-trivial setup, action and teardown (except perhaps for jkim's stress test).  OpenSolaris seems to be doing just fine with the following three types of simpler CPU cross-calls (as they call them):

1. async - a master CPU requests an action to be executed on target CPUs but doesn't wait for anything - we have no explicit equivalent for this

I'd like to see something like this.  Otherwise we end up with hacks like callouts that sched_bind and stuff like that.  I can think of plenty of use cases for this API.

2. call - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, no_barrier)

3. synch - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it, and the slave CPUs are also waiting for their siblings to be done - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, NULL)

Seems that our current code doesn't use/need more than that as well.  But I really like the capability that smp_rendezvous_cpus() potentially provides.

I would be fine with stripping down our rendezvous API a bit.  They could all still use a single IPI and handler, but perhaps provide a few wrapper APIs that give simpler semantics like this.  Back to my statement above, we've had this uber-generic rendezvous API for a while, but if we don't really need it to be quite so complex I'd be happy to dumb down the API to what we really need so the implementation and use can be simpler.

Again, agreed.
Re: proposed smp_rendezvous change
On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:

I'd like to propose that we move forward with fixing the race, before redesigning the API - please.  We agree that there is a problem with the exit rendezvous.  Let's fix that.  We agree that the generation approach proposed by NetApp is the right way to go?  Can we check it in, then?  Again, I'd like some comments in the code if at all possible.

How about this:

Index: subr_smp.c
===================================================================
--- subr_smp.c	(revision 221939)
+++ subr_smp.c	(working copy)
@@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;
+	generation = smp_rv_generation;
+
+	/*
+	 * If requested, run a setup function before the main action
+	 * function.  Ensure all CPUs have completed the setup
+	 * function before moving on to the action function.
+	 */
 	if (local_setup_func != smp_no_rendevous_barrier) {
 		if (smp_rv_setup_func != NULL)
 			smp_rv_setup_func(smp_rv_func_arg);
-
-		/* spin on entry rendezvous */
 		atomic_add_int(&smp_rv_waiters[1], 1);
 		while (smp_rv_waiters[1] < smp_rv_ncpus)
 			cpu_spinwait();
 	}
 
-	/* action function */
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
-	/* spin on exit rendezvous */
+	/*
+	 * Signal that the main action has been completed.  If a
+	 * full exit rendezvous is requested, then all CPUs will
+	 * wait here until all CPUs have finished the main action.
+	 *
+	 * Note that the write by the last CPU to finish the action
+	 * may become visible to different CPUs at different times.
+	 * As a result, the CPU that initiated the rendezvous may
+	 * exit the rendezvous and drop the lock allowing another
+	 * rendezvous to be initiated on the same CPU or a different
+	 * CPU.  In that case the exit sentinel may be cleared before
+	 * all CPUs have noticed causing those CPUs to hang forever.
+	 * Workaround this by using a generation count to notice when
+	 * this race occurs and to exit the rendezvous in that case.
+	 */
 	atomic_add_int(&smp_rv_waiters[2], 1);
 	if (local_teardown_func == smp_no_rendevous_barrier)
 		return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus)
+	while (smp_rv_waiters[2] < smp_rv_ncpus &&
+	    generation == smp_rv_generation)
 		cpu_spinwait();
 
-	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
 }
@@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
 	if (ncpus == 0)
 		panic("ncpus is 0 with map=0x%x", map);
 
-	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
-	/* set static function pointers */
+	atomic_add_acq_int(&smp_rv_generation, 1);
+
+	/* Pass rendezvous parameters via global variables. */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
 	smp_rv_action_func = action_func;
@@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_waiters[2] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
-	/* signal other processors, which will enter the IPI with interrupts off */
+	/*
+	 * Signal other processors, which will enter the IPI with
+	 * interrupts off.
+	 */
 	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
 
 	/* Check if the current CPU is in the map */
 	if ((map & (1 << curcpu)) != 0)
 		smp_rendezvous_action();
 
+	/*
+	 *
Re: proposed smp_rendezvous change
on 16/05/2011 21:21 John Baldwin said the following:

How about this:
...
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;

I want to ask once again - please pretty please convince me that the above cpu_spinwait() loop is really needed and, by extension, that the assignments should be moved behind it.  Please :)

-- 
Andriy Gapon
Re: proposed smp_rendezvous change
On Monday 16 May 2011 14:21:27 John Baldwin wrote:

On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
I'd like to propose that we move forward with fixing the race, before redesigning the API - please.  We agree that there is a problem with the exit rendezvous.  Let's fix that.  We agree that the generation approach proposed by NetApp is the right way to go?  Can we check it in, then?  Again, I'd like some comments in the code if at all possible.

How about this:

Index: subr_smp.c
===================================================================
--- subr_smp.c	(revision 221939)
+++ subr_smp.c	(working copy)
@@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;
+	generation = smp_rv_generation;
+
+	/*
+	 * If requested, run a setup function before the main action
+	 * function.  Ensure all CPUs have completed the setup
+	 * function before moving on to the action function.
+	 */
 	if (local_setup_func != smp_no_rendevous_barrier) {
 		if (smp_rv_setup_func != NULL)
 			smp_rv_setup_func(smp_rv_func_arg);
-
-		/* spin on entry rendezvous */
 		atomic_add_int(&smp_rv_waiters[1], 1);
 		while (smp_rv_waiters[1] < smp_rv_ncpus)
 			cpu_spinwait();
 	}
 
-	/* action function */
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
-	/* spin on exit rendezvous */
+	/*
+	 * Signal that the main action has been completed.  If a
+	 * full exit rendezvous is requested, then all CPUs will
+	 * wait here until all CPUs have finished the main action.
+	 *
+	 * Note that the write by the last CPU to finish the action
+	 * may become visible to different CPUs at different times.
+	 * As a result, the CPU that initiated the rendezvous may
+	 * exit the rendezvous and drop the lock allowing another
+	 * rendezvous to be initiated on the same CPU or a different
+	 * CPU.  In that case the exit sentinel may be cleared before
+	 * all CPUs have noticed causing those CPUs to hang forever.
+	 * Workaround this by using a generation count to notice when
+	 * this race occurs and to exit the rendezvous in that case.
+	 */
+	MPASS(generation == smp_rv_generation);
 	atomic_add_int(&smp_rv_waiters[2], 1);
 	if (local_teardown_func == smp_no_rendevous_barrier)
 		return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus)
+	while (smp_rv_waiters[2] < smp_rv_ncpus &&
+	    generation == smp_rv_generation)
 		cpu_spinwait();
 
-	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
 }
@@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
 	if (ncpus == 0)
 		panic("ncpus is 0 with map=0x%x", map);
 
-	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
-	/* set static function pointers */
+	atomic_add_acq_int(&smp_rv_generation, 1);
+
+	/* Pass rendezvous parameters via global variables. */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
 	smp_rv_action_func = action_func;
@@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_waiters[2] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
-	/* signal other processors, which will enter the IPI with interrupts off */
+	/*
+	 * Signal other processors, which will enter the IPI with
+	 * interrupts off.
+	 */
 	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
 
 	/* Check if the current CPU is in the map */
 	if ((map & (1 << curcpu)) != 0)
Re: proposed smp_rendezvous change
On Monday, May 16, 2011 3:27:47 pm Andriy Gapon wrote:

on 16/05/2011 21:21 John Baldwin said the following:

How about this:
...
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;

I want to ask once again - please pretty please convince me that the above cpu_spinwait() loop is really needed and, by extension, that the assignments should be moved behind it.  Please :)

Well, moving the assignments down is a style fix for one, and we can always remove the first rendezvous as a follow-up if desired.

However, suppose you have an arch where sending an IPI is not a barrier (this seems unlikely).  In that case, the atomic_add_acq_int() will not succeed (and return) until it has seen the earlier write by the CPU initiating the rendezvous to clear smp_rv_waiters[0] to zero.

The actual spin on the smp_rv_waiters[] value is not strictly necessary, however, and is probably just cut and pasted to match the other uses of values in the smp_rv_waiters[] array.  (atomic_add_acq_int() could spin on architectures where it is implemented using compare-and-swap (e.g. sparc64) or load-locked/store-conditional (e.g. Alpha).)

-- 
John Baldwin
Re: proposed smp_rendezvous change
On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the redenvouz setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()? 
-- John Baldwin
Re: proposed smp_rendezvous change
On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?
I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

Max
Re: proposed smp_rendezvous change
On Monday 16 May 2011 17:54:54 Attilio Rao wrote: 2011/5/16 Max Laier m...@love2party.net: On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?
I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do. When an interrupt fires, it goes on the old interrupt/kernel context, which means it does not have a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context. Preempting just means another thread will be scheduled in the middle of another thread's execution path. This code is perfectly fine if you consider curthread won't be descheduled while it is executing. Well, no - it is not. With this you can end up with a curthread that has td_critnest=0 and td_owepreempt=1 in interrupt context.
If you use a spinlock on such a thread, it will do the preemption at the point where you drop the spinlock, which is bad in some circumstances. One example is the smp_rendezvous case we are discussing here.

Max
Re: proposed smp_rendezvous change
2011/5/17 Max Laier m...@love2party.net: On Monday 16 May 2011 17:54:54 Attilio Rao wrote: 2011/5/16 Max Laier m...@love2party.net: On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however.
I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?

I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do. When an interrupt fires, it goes on the old interrupt/kernel context, which means it does not have a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context. Preempting just means another thread will be scheduled in the middle of another thread's execution path.
This code is perfectly fine if you consider curthread won't be descheduled while it is executing. Well, no - it is not. With this you can end up with a curthread that has td_critnest=0 and td_owepreempt=1 in interrupt context. If you use a spinlock on such a thread, it will do the preemption at the point where you drop the spinlock, which is bad in some circumstances. One example is the smp_rendezvous case we are discussing here. These circumstances are further protected by another call to critical_enter(), by consumers or upper-layer callers. This is why, for example, spinlock_enter() does call critical_enter() even if it actually disables interrupts, and why we disable preemption in other cases where the interrupts are already disabled. Attilio -- Peace can only be achieved by understanding - A. Einstein
Re: proposed smp_rendezvous change
2011/5/16 Max Laier m...@love2party.net: On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?
I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do. When an interrupt fires, it goes on the old interrupt/kernel context, which means it does not have a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context. Preempting just means another thread will be scheduled in the middle of another thread's execution path. This code is perfectly fine if you consider curthread won't be descheduled while it is executing. Attilio -- Peace can only be achieved by understanding - A. Einstein
Re: proposed smp_rendezvous change
on 14/05/2011 18:25 John Baldwin said the following: On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in TSC code. Non-essential changes: - ditch initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action() As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well? Well, one certainly can use IPIs as memory barrier, but my point was that we have other ways to have a memory barrier and using IPI for that was not necessary (and a little bit harmful to performance) in this case. Essential changes (the fix): - re-use freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with rendezvous (i.e. it's not going to access any members of smp_rv_* pseudo-structure) - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. Just a quick question - have you noticed that because of the change above the smp_rv_waiters[2] of which I spoke was not the same smp_rv_waiters[2] as in the original code? Because I removed smp_rv_waiters[0], smp_rv_waiters[2] is actually some new smp_rv_waiters[3]. And well, I think I described exactly the same scenario as you did in my email on the svn mailing list. So of course I had it in mind: http://www.mail-archive.com/svn-src-all@freebsd.org/msg38637.html My mistake: for clarity, I should not have mixed different changes into the same patch.
I should have provided two patches: one that adds smp_rv_waiters[3] and its handling and one that removes smp_rv_waiters[0]. I would like to see my proposed patch actually tested, if possible, before it's dismissed :-)

-- Andriy Gapon
Re: proposed smp_rendezvous change
on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx).

-- Andriy Gapon
Re: proposed smp_rendezvous change
on 15/05/2011 07:33 Max Laier said the following: On Saturday 14 May 2011 11:25:36 John Baldwin wrote: On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in TSC code. Non-essential changes: - ditch initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action() As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well? Note that if the barrier is required on any platform, then we have to move all the initializations after the _acq, otherwise it is pointless indeed. Yeah, I still would like to get an explanation why that synchronization is needed. I can understand the better-safe-than-sorry mentality, I often stick to it myself. But for the benefit of improving objective knowledge it would be useful to understand why that spin-waiting could be useful, especially given your observation about where the assignments are made. Essential changes (the fix): - re-use freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with rendezvous (i.e. it's not going to access any members of smp_rv_* pseudo-structure) - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator.
All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous. Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock. NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set, which is cached in the rendezvous handler and to exit the last synchronization point if either smp_rv_waiters[2] is high enough, or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered. It seems to address the issue, albeit a bit difficult to understand why this is correct. I'd like to see some comments in the code instead of just the commit log, but otherwise ... please go ahead with this. I think it works approximately the same as what I suggested. Only my patch forces the next master CPU to wait until all slave CPUs are fully done with the previous rendezvous (and aware of that fact), while this patch provides a side-channel to tell late slave CPUs that their rendezvous was completed when there is a new rendezvous started by a new master CPU.
NetApp's patch: Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing the smp_rendezvous_action() when running inside a virtual machine. In particular it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0. This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect this by spinning on
Re: proposed smp_rendezvous change
on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx). Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members, and thus hopefully should be clearer.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
 
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

-- Andriy Gapon
Re: proposed smp_rendezvous change
On 5/15/11 10:53 AM, Andriy Gapon wrote: on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx). Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members, and thus hopefully should be clearer.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
 
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change. I do think this will still be ok and I probably just didn't explain it well to Neel. I wonder though if the bump shouldn't happen until after the call of the local teardown function?

-- John Baldwin
Re: proposed smp_rendezvous change
On Sunday 15 May 2011 11:16:03 John Baldwin wrote: On 5/15/11 10:53 AM, Andriy Gapon wrote: on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow up to my previous question. Have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only master CPU (and under smp_ipi_mtx). Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members and thus hopefully should be clearer. 
Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
 
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change. I do think this will still be ok and I probably just didn't explain it well to Neel. I wonder though if the bump shouldn't happen until after the call of the local teardown function? I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us.
It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

Thanks,
  Max

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: proposed smp_rendezvous change
on 15/05/2011 19:09 Max Laier said the following: I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us. It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

No reason. And I haven't even said that I prefer it :-) I just wanted to show and explain it, as apparently there was some misunderstanding about it. I think that the generation count approach could even have a little bit better performance, while perhaps being a tiny bit less obvious.

-- 
Andriy Gapon
Re: proposed smp_rendezvous change
on 15/05/2011 18:16 John Baldwin said the following: On 5/15/11 10:53 AM, Andriy Gapon wrote: on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx).

Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action(), and thus does not renumber all the smp_rv_waiters[] members, and so hopefully should be clearer.
Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221943)
+++ sys/kern/subr_smp.c (working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *ar
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];

 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)

        /* spin on exit rendezvous */
        atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[3], 1);
                return;
+       }
        while (smp_rv_waiters[2] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[3], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       while (smp_rv_waiters[3] < smp_rv_ncpus)
+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
        smp_rv_func_arg = arg;
        smp_rv_waiters[1] = 0;
        smp_rv_waiters[2] = 0;
+       smp_rv_waiters[3] = 0;
        atomic_store_rel_int(&smp_rv_waiters[0], 0);

        /* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change. I do think this will still be ok and I probably just didn't explain it well to Neel. I wonder though if the bump shouldn't happen until after the call of the local teardown function?

That should not be necessary according to my understanding, unless we have a reason to care about potentially disappearing function text.
The teardown function pointer and argument are locally cached, and I don't see any problems with a new rendezvous being set up and run while teardowns of the previous one are still running.

On the other hand, you might be onto something here. If the argument is allocated in temporary storage (e.g. on the stack) and the teardown function actually uses the argument, then it's not safe to deallocate the argument until all of the teardowns are done. In this case the master CPU should wait not only until all actions are done, but also until all teardowns are done. The current code doesn't do that. And I am not sure if we actually would want to add this safety net, or just document the behavior and put the burden on smp_rendezvous API users.

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e. with non-trivial setup, action and teardown (except perhaps for jkim's stress test). OpenSolaris seems to be doing just fine with the following three types of simpler CPU cross-calls (as they call them):

1. async - a master CPU requests an action to be executed on target CPUs but doesn't wait for anything - we have no explicit equivalent for this
2. call - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, no_barrier)
3. synch - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it and the slave CPUs are also waiting for their siblings to be done - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, NULL)

It seems that our current code doesn't use/need more than that either. But I really like the capability that
Re: proposed smp_rendezvous change
On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()

As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well?

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator. All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason, so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous. Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g.
if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock.

NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set; the handler caches it and exits the last synchronization point if either smp_rv_waiters[2] is high enough or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered.

NetApp's patch:

Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing smp_rendezvous_action() when running inside a virtual machine. In particular, it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied, and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues, it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0. This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect against this by spinning on the exit rendezvous only if the generation of the smp_rendezvous() matches what we started out with before incrementing 'smp_rv_waiters[2]'.

Differences ... 
//private//sys/kern/subr_smp.c#3 (text)

@@ -127,6 +127,7 @@
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void * volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;

 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -418,6 +419,7 @@
 void
 smp_rendezvous_action(void)
 {
+       int generation;
        void* local_func_arg = smp_rv_func_arg;
        void (*local_setup_func)(void*) = smp_rv_setup_func;
        void (*local_action_func)(void*) = smp_rv_action_func;
@@ -457,11 +459,14 @@
        if (local_action_func != NULL)
                local_action_func(local_func_arg);

+       generation = atomic_load_acq_int(&smp_rv_generation);
+
        /* spin on exit rendezvous */
        atomic_add_int(&smp_rv_waiters[2], 1);
        if (local_teardown_func == smp_no_rendevous_barrier)
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       while (smp_rv_waiters[2] < smp_rv_ncpus &&
+           generation == smp_rv_generation)
                cpu_spinwait();
Re: proposed smp_rendezvous change
On Saturday 14 May 2011 11:25:36 John Baldwin wrote: On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()

As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well?

Note that if the barrier is required on any platform, then we have to move all the initializations after the _acq, otherwise it is pointless indeed.

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator. All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason, so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous.
Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock.

NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set; the handler caches it and exits the last synchronization point if either smp_rv_waiters[2] is high enough or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered.

It seems to address the issue, although it is a bit difficult to understand why this is correct. I'd like to see some comments in the code instead of just the commit log, but otherwise ... please go ahead with this. I'll open a new thread with the second issue we have been seeing wrt rmlocks, which is related, but different.

NetApp's patch:

Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing smp_rendezvous_action() when running inside a virtual machine. In particular, it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied, and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues, it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0.
This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect against this by spinning on the exit rendezvous only if the generation of the smp_rendezvous() matches what we started out with before incrementing 'smp_rv_waiters[2]'.

Differences ...

//private//sys/kern/subr_smp.c#3 (text)

@@ -127,6 +127,7 @@
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void * volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;

 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -418,6 +419,7 @@
 void
 smp_rendezvous_action(void)
 {
+       int generation;
        void*
Re: proposed smp_rendezvous change
On Friday 13 May 2011 09:43:25 Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()
- shift smp_rv_waiters indexes by one because of the above

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)

this ncpus isn't the one you are looking for.

I have a patch of my own that is attached. This might be overdoing it to some extent, but it has been running very well on our system for quite some time, which now uses rmlock heavily. The rmlock diff is unrelated, but since I have this diff around for some time now ...

+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */

diff --git a/src/sys/kern/kern_rmlock.c b/src/sys/kern/kern_rmlock.c
index f91e61a..a30c16b 100644
--- a/src/sys/kern/kern_rmlock.c
+++ b/src/sys/kern/kern_rmlock.c
@@ -154,6 +154,7 @@ rm_tracker_remove(struct pcpu *pc, struct rm_priotracker *tracker)
 static void
 rm_cleanIPI(void *arg)
 {
+       TAILQ_HEAD(,rm_priotracker) tmp_list = TAILQ_HEAD_INITIALIZER(tmp_list);
        struct pcpu *pc;
        struct rmlock *rm = arg;
        struct rm_priotracker *tracker;
@@ -165,12 +166,12 @@ rm_cleanIPI(void *arg)
                tracker = (struct rm_priotracker *)queue;
                if (tracker->rmp_rmlock == rm && tracker->rmp_flags == 0) {
                        tracker->rmp_flags = RMPF_ONQUEUE;
-                       mtx_lock_spin(&rm_spinlock);
-                       LIST_INSERT_HEAD(&rm->rm_activeReaders, tracker,
-                           rmp_qentry);
-                       mtx_unlock_spin(&rm_spinlock);
+                       TAILQ_INSERT_HEAD(&tmp_list, tracker, rmp_qentry);
                }
        }
+       mtx_lock_spin(&rm_spinlock);
+       TAILQ_CONCAT(&rm->rm_activeReaders, &tmp_list, rmp_qentry);
+       mtx_unlock_spin(&rm_spinlock);
 }

 CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
@@ -186,7 +187,7 @@ rm_init_flags(struct rmlock *rm, const char *name, int opts)
        if (opts & RM_RECURSE)
                liflags |= LO_RECURSABLE;
        rm->rm_writecpus = all_cpus;
-       LIST_INIT(&rm->rm_activeReaders);
+       TAILQ_INIT(&rm->rm_activeReaders);
        if (opts & RM_SLEEPABLE) {
                liflags |= RM_SLEEPABLE;
                sx_init_flags(&rm->rm_lock_sx, "rmlock_sx", SX_RECURSE);
@@ -281,7 +282,7 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
        if
Re: proposed smp_rendezvous change
On Friday 13 May 2011 09:43:25 Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()
- shift smp_rv_waiters indexes by one because of the above

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

You really need a read/load with a memory fence here in order to make sure you get up to date values on all CPUs.

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               cpu_spinwait();
+

This is not the ncpus you are looking for.

I have a patch of my own, that might be overdoing it ... but it is attached nonetheless ... The rmlock is a separate issue, but what led me to discover the problem with the rendezvous.

        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */

commit 48f16f01c28cc3006faefa024b0d3e4bcf02fb60
Author: ndire ndire@b72e2a10-2d34-0410-9a71-d3beadf02b57
Date: Tue Dec 14 18:49:55 2010 +

    Reintegrate BR_CHOPUVILLE.

    git-svn-id: https://svn.west.isilon.com/repo/onefs/head@169088 b72e2a10-2d34-0410-9a71-d3beadf02b57

diff --git a/src/sys/kern/kern_rmlock.c b/src/sys/kern/kern_rmlock.c
index f91e61a..a30c16b 100644
--- a/src/sys/kern/kern_rmlock.c
+++ b/src/sys/kern/kern_rmlock.c
@@ -154,6 +154,7 @@ rm_tracker_remove(struct pcpu *pc, struct rm_priotracker *tracker)
 static void
 rm_cleanIPI(void *arg)
 {
+       TAILQ_HEAD(,rm_priotracker) tmp_list = TAILQ_HEAD_INITIALIZER(tmp_list);
        struct pcpu *pc;
        struct rmlock *rm = arg;
        struct rm_priotracker *tracker;
@@ -165,12 +166,12 @@ rm_cleanIPI(void *arg)
                tracker = (struct rm_priotracker *)queue;
                if (tracker->rmp_rmlock == rm && tracker->rmp_flags == 0) {
                        tracker->rmp_flags = RMPF_ONQUEUE;
-                       mtx_lock_spin(&rm_spinlock);
-                       LIST_INSERT_HEAD(&rm->rm_activeReaders, tracker,
-                           rmp_qentry);
-                       mtx_unlock_spin(&rm_spinlock);
+                       TAILQ_INSERT_HEAD(&tmp_list, tracker, rmp_qentry);
                }
        }
+       mtx_lock_spin(&rm_spinlock);
+       TAILQ_CONCAT(&rm->rm_activeReaders, &tmp_list, rmp_qentry);
+       mtx_unlock_spin(&rm_spinlock);
 }

 CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
@@ -186,7 +187,7 @@ rm_init_flags(struct rmlock *rm, const char *name, int opts)
        if (opts & RM_RECURSE)
                liflags |= LO_RECURSABLE;
Re: proposed smp_rendezvous change
on 13/05/2011 17:41 Max Laier said the following: this ncpus isn't the one you are looking for.

Thank you! Here's an updated patch:

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
Re: proposed smp_rendezvous change
On Friday 13 May 2011 11:28:33 Andriy Gapon wrote: on 13/05/2011 17:41 Max Laier said the following: this ncpus isn't the one you are looking for. Thank you! Here's an updated patch:

Can you attach the patch, so I can apply it locally? This code is really hard to read without context. Some more comments inline ...

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-

You really need this for architectures that need the memory barrier to ensure consistency. We also need to move the reads of smp_rv_* below this point to provide a consistent view.

        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+

This does not help you at all. Imagine the following (unlikely, but not impossible) case:

CPU A: starts a rendezvous including itself, finishes the action first (i.e. CPU A is the first one to see smp_rv_waiters[2] == smp_rv_ncpus), drops the lock and starts a new rendezvous. smp_rv_waiters[2] == smp_rv_ncpus is still true on that CPU, but ...

CPU B: might have increased smp_rv_waiters[2] for the first rendezvous, but never saw smp_rv_waiters[2] == smp_rv_ncpus, still ...

CPU A is allowed to start a new rendezvous, which will leave CPU B stranded and can lead to a deadlock. I think this is also possible with another CPU starting the second rendezvous.

        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
Re: proposed smp_rendezvous change
On Friday 13 May 2011 11:50:57 Max Laier wrote: On Friday 13 May 2011 11:28:33 Andriy Gapon wrote: on 13/05/2011 17:41 Max Laier said the following: this ncpus isn't the one you are looking for. Thank you! Here's an updated patch: Can you attach the patch, so I can apply it locally? This code is really hard to read without context. Some more comments inline ...

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-

You really need this for architectures that need the memory barrier to ensure consistency. We also need to move the reads of smp_rv_* below this point to provide a consistent view.

        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+

Disregard this ... I misread the diff. You are indeed using [2] correctly as the all-clear semaphore. I still believe that it is safer/cleaner to do this spin before releasing the lock instead (see my patch).

This does not help you at all. Imagine the following (unlikely, but not impossible) case:

CPU A: starts a rendezvous including itself, finishes the action first (i.e. CPU A is the first one to see smp_rv_waiters[2] == smp_rv_ncpus), drops the lock and starts a new rendezvous. smp_rv_waiters[2] == smp_rv_ncpus is still true on that CPU, but ...

CPU B: might have increased smp_rv_waiters[2] for the first rendezvous, but never saw smp_rv_waiters[2] == smp_rv_ncpus, still ...

CPU A is allowed to start a new rendezvous, which will leave CPU B stranded and can lead to a deadlock. I think this is also possible with another CPU starting the second rendezvous.

        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
Re: proposed smp_rendezvous change
on 13/05/2011 18:50 Max Laier said the following:

On Friday 13 May 2011 11:28:33 Andriy Gapon wrote:

on 13/05/2011 17:41 Max Laier said the following:
this ncpus isn't the one you are looking for.

Thank you! Here's an updated patch:

Can you attach the patch, so I can apply it locally? This code is really hard to read without context. Some more comments inline ...

Attached.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*)   = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-

You really need this for architectures that need the memory barrier to ensure consistency. We also need to move the reads of smp_rv_* below this point to provide a consistent view.

I thought that this would be automatically handled by the fact that the master CPU sets smp_rv_waiters[0] using an atomic operation with release semantics. But I am not very proficient in these matters...

But I fail to see why we need to require that all CPUs gather at this point/condition. That is, my point is that we don't start a new rendezvous until a previous one is completely finished. Then we set up the new rendezvous, finish the setup with an operation with release semantics and only then notify the target CPUs. I can't see how the slave CPUs would see stale values in the rendezvous pseudo-object, but, OTOH, I am not very familiar with architectures that have weaker memory consistency rules as compared to x86.

-- Andriy Gapon

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*)   = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
        local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
        smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
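The release/acquire pairing Andriy relies on can be illustrated in userland with C11 atomics (the kernel spells these atomic_store_rel_int / atomic_load_acq_int). This is a hypothetical sketch, not the FreeBSD code; "param", "published", and "publish_and_wait" are illustrative names. The guarantee is that plain stores made before a release store are visible to any thread that observes that store with an acquire load:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int param;            /* plain (non-atomic) rendezvous parameter */
static atomic_int published; /* stands in for notifying the slave CPUs */
static int seen;

static void *slave(void *unused)
{
	(void)unused;
	/* acquire: once we see the flag, we also see the param store */
	while (atomic_load_explicit(&published, memory_order_acquire) == 0)
		;
	seen = param;
	return NULL;
}

int publish_and_wait(int value)
{
	pthread_t t;

	atomic_store(&published, 0);
	seen = -1;
	pthread_create(&t, NULL, slave, NULL);

	param = value;  /* ordinary store of the "rendezvous setup" ... */
	/* ... release: ordered before the flag becomes visible */
	atomic_store_explicit(&published, 1, memory_order_release);

	pthread_join(t, NULL);
	return seen;
}
```

This is also why the wait-for-previous-rendezvous spin matters on weakly ordered architectures: the release store only helps a slave that has not started reading the pseudo-object before the master finished writing it.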
Re: proposed smp_rendezvous change
on 13/05/2011 20:13 Max Laier said the following:

Disregard this ... I misread the diff. You are indeed using [2] correctly as the all-clear semaphore. I still believe that it is safer/cleaner to do this spin before releasing the lock instead (see my patch).

Maybe. I consider my approach a minor optimization - that is, I think that normally smp_rendezvous calls would be sparse enough to never require that synchronization (as past experience suggests). So synchronizing/delaying the master CPU at the end of smp_rendezvous would slightly hurt performance. Having the check at the start should trigger the synchronization only when it is really required.

-- Andriy Gapon
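The "check at the start" placement Andriy prefers can be sketched as a toy userland model (NCPUS, all_clear, and the function names are illustrative, not the kernel code): the master spins on the all-clear counter when it begins the next rendezvous rather than when it finishes the current one, so sparse callers normally pay nothing.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPUS 4

static atomic_int all_clear = NCPUS; /* previous rendezvous fully done */

/* Each slave's very last touch of the shared rendezvous state. */
void slave_done(void)
{
	atomic_fetch_add_explicit(&all_clear, 1, memory_order_release);
}

/* Convenience for the sketch: all CPUs finish their teardown. */
void all_slaves_done(void)
{
	for (int i = 0; i < NCPUS; i++)
		slave_done();
}

/* Returns 0 when no waiting was needed (the expected common case). */
int begin_rendezvous(void)
{
	int waited = 0;

	while (atomic_load_explicit(&all_clear, memory_order_acquire) < NCPUS)
		waited = 1; /* real code: cpu_spinwait() */
	atomic_store_explicit(&all_clear, 0, memory_order_relaxed);
	return waited;
}
```

When rendezvous calls are sparse, all_clear has already reached NCPUS by the time the next one starts and begin_rendezvous returns without spinning; only back-to-back calls pay for the synchronization.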