Re: proposed smp_rendezvous change
on 15/05/2011 19:24 Andriy Gapon said the following:

on 15/05/2011 19:09 Max Laier said the following:

I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us. It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

No reason. And I even haven't said that I prefer it :-) I just wanted to show and explain it as apparently there was some misunderstanding about it. I think that the generation count approach could even have slightly better performance while perhaps being a tiny bit less obvious.

Resurrecting this conversation. Here's a link to the thread on gmane for your convenience:
http://article.gmane.org/gmane.os.freebsd.current/132682

It seems that we haven't fully accounted for the special case of master CPU behavior in either my proposed fix that added a new wait count or in the committed fix that added a generation count. This has been pointed out by kib at the recent mini-DevSummit in Kyiv.
The problem can be illustrated by the following example:
- the teardown function is a real function; that is, not NULL and not smp_no_rendevous_barrier (sic)
- the arg parameter is a non-NULL pointer
- the teardown function actually accesses memory pointed to by arg
- the master CPU deallocates the memory pointed to by arg right after the smp_rendezvous() call returns (either via free(9) or by stack management, etc.)

Quoting the code:

	MPASS(generation == smp_rv_generation);
	atomic_add_int(&smp_rv_waiters[2], 1);
	if (local_teardown_func != smp_no_rendevous_barrier) {
		while (smp_rv_waiters[2] < smp_rv_ncpus &&
		    generation == smp_rv_generation)
			cpu_spinwait();

		if (local_teardown_func != NULL)
			local_teardown_func(local_func_arg);
	}

The generation count helps slave CPUs avoid getting stuck in the while loop, but it doesn't prevent a stale/different arg from being passed to the teardown function.

So here is a patch, which is a variation of my earlier proposal, that should fix this issue:
http://people.freebsd.org/~avg/smp_rendezvous.diff
The patch is not tested yet.

The patch undoes the smp_rv_generation change and introduces a new smp_rv_waiters[] element. Perhaps the code could be optimized by making waiting on smp_rv_waiters[3] conditional on some combination of values for the teardown function and the arg, but I decided to follow the KISS principle here to make the code a little bit more robust.

-- Andriy Gapon

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: proposed smp_rendezvous change
on 17/05/2011 18:51 John Baldwin said the following:

On Tuesday, May 17, 2011 10:34:41 am Andriy Gapon wrote:

on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way. It wouldn't work on Alpha, for example. All load_acq is is a load with a memory barrier to order other loads after it. It is still free to load stale data. Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok. How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-) Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any. Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)? You see what I am getting at?

Yeah, either of those would work. At this point just leaving the atomic_add_int() as-is would be the smallest diff. :)

Yes, atomic_add_int() for smp_rv_waiters[0] is completely OK; it's the spinwait loop on it that is a bit wasteful. And probably the extra smp_rv_waiters[] element used only for this purpose. Do you agree?

Convenience link to the gmane archive of this thread:
http://article.gmane.org/gmane.os.freebsd.current/132730

-- Andriy Gapon
Re: proposed smp_rendezvous change
On Monday, July 18, 2011 7:34:11 am Andriy Gapon wrote:

on 15/05/2011 19:24 Andriy Gapon said the following:

on 15/05/2011 19:09 Max Laier said the following:

I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us. It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

No reason. And I even haven't said that I prefer it :-) I just wanted to show and explain it as apparently there was some misunderstanding about it. I think that the generation count approach could even have slightly better performance while perhaps being a tiny bit less obvious.

Resurrecting this conversation. Here's a link to the thread on gmane for your convenience:
http://article.gmane.org/gmane.os.freebsd.current/132682

It seems that we haven't fully accounted for the special case of master CPU behavior in either my proposed fix that added a new wait count or in the committed fix that added a generation count. This has been pointed out by kib at the recent mini-DevSummit in Kyiv.
The problem can be illustrated by the following example:
- the teardown function is a real function; that is, not NULL and not smp_no_rendevous_barrier (sic)
- the arg parameter is a non-NULL pointer
- the teardown function actually accesses memory pointed to by arg
- the master CPU deallocates the memory pointed to by arg right after the smp_rendezvous() call returns (either via free(9) or by stack management, etc.)

Quoting the code:

	MPASS(generation == smp_rv_generation);
	atomic_add_int(&smp_rv_waiters[2], 1);
	if (local_teardown_func != smp_no_rendevous_barrier) {
		while (smp_rv_waiters[2] < smp_rv_ncpus &&
		    generation == smp_rv_generation)
			cpu_spinwait();

		if (local_teardown_func != NULL)
			local_teardown_func(local_func_arg);
	}

The generation count helps slave CPUs avoid getting stuck in the while loop, but it doesn't prevent a stale/different arg from being passed to the teardown function.

So here is a patch, which is a variation of my earlier proposal, that should fix this issue:
http://people.freebsd.org/~avg/smp_rendezvous.diff
The patch is not tested yet.

The patch undoes the smp_rv_generation change and introduces a new smp_rv_waiters[] element. Perhaps the code could be optimized by making waiting on smp_rv_waiters[3] conditional on some combination of values for the teardown function and the arg, but I decided to follow the KISS principle here to make the code a little bit more robust.

Yes, I actually prefer the more robust case as it is clearer. This is fine with me. It would be good to get Max's feedback and possibly Neel's as well.

-- John Baldwin
Re: proposed smp_rendezvous change
On Wednesday, May 18, 2011 4:27:50 pm Max Laier wrote:

On 05/17/2011 01:35 PM, John Baldwin wrote:
...
Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit(). But it also wasn't as fancy in the critical_exit() case. Here is what I have, and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe). Hmm, actually, it will need to use the read and clear:

Looks good to me. I was slightly surprised by this part of the patch:

Index: kern/kern_synch.c
===
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));

But that is in fact correct and much more expressive and safe than the version we had before.

Ok, I need to stress test this some first.

Thanks, Max

P.S. I'd like to see this and the rendezvous changes in stable/7 in the not too distant future. Mind if I MFH these when you are done - unless you are planning to do it already, anyway.

I will certainly merge them. These are critical fixes. :(

-- John Baldwin
Re: proposed smp_rendezvous change
On Wednesday, May 18, 2011 4:27:50 pm Max Laier wrote:

On 05/17/2011 01:35 PM, John Baldwin wrote:
...
Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit(). But it also wasn't as fancy in the critical_exit() case. Here is what I have, and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe). Hmm, actually, it will need to use the read and clear:

Looks good to me. I was slightly surprised by this part of the patch:

Index: kern/kern_synch.c
===
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));

But that is in fact correct and much more expressive and safe than the version we had before.

Actually, this is something that was made obsolete by ups's change to critical_exit() several years ago. However, after sleeping on this last night, I think that critical_enter() and critical_exit() are a very hot path and that it may be worth adding some additional complexity to keep them as cheap as possible.

Also, your original patch for rm locks is actually correct. They will not miss a preemption, because rm_cleanIPI will not schedule a thread to run while it holds the spin lock. However, I'd like to be a bit more future proof so that other IPI handler authors don't have to deal with this obscure race. Given that, I've generalized your fix and moved it up into the SMP rendezvous code itself, along with a big comment to explain why it works and a KASSERT().
I think this is also MFC'able back to 7 as well:

Index: kern/subr_smp.c
===
--- kern/subr_smp.c	(revision 222032)
+++ kern/subr_smp.c	(working copy)
@@ -316,6 +316,9 @@
 	void (*local_action_func)(void*);
 	void (*local_teardown_func)(void*);
 	int generation;
+#ifdef INVARIANTS
+	int owepreempt;
+#endif
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
@@ -330,6 +333,33 @@
 	generation = smp_rv_generation;
 
 	/*
+	 * Use a nested critical section to prevent any preemptions
+	 * from occurring during a rendezvous action routine.
+	 * Specifically, if a rendezvous handler is invoked via an IPI
+	 * and the interrupted thread was in the critical_exit()
+	 * function after setting td_critnest to 0 but before
+	 * performing a deferred preemption, this routine can be
+	 * invoked with td_critnest set to 0 and td_owepreempt true.
+	 * In that case, a critical_exit() during the rendezvous
+	 * action would trigger a preemption which is not permitted in
+	 * a rendezvous action.  To fix this, wrap all of the
+	 * rendezvous action handlers in a critical section.  We
+	 * cannot use a regular critical section, however, as having
+	 * critical_exit() preempt from this routine would also be
+	 * problematic (the preemption must not occur before the IPI
+	 * has been acknowledged via an EOI).  Instead, we
+	 * intentionally ignore td_owepreempt when leaving the
+	 * critical section.  This should be harmless because we do
+	 * not permit rendezvous action routines to schedule threads,
+	 * and thus td_owepreempt should never transition from 0 to 1
+	 * during this routine.
+	 */
+	td->td_critnest++;
+#ifdef INVARIANTS
+	owepreempt = td->td_owepreempt;
+#endif
+
+	/*
 	 * If requested, run a setup function before the main action
 	 * function.  Ensure all CPUs have completed the setup
 	 * function before moving on to the action function.
@@ -362,14 +392,18 @@
 	 */
 	MPASS(generation == smp_rv_generation);
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
-		return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus &&
-	    generation == smp_rv_generation)
-		cpu_spinwait();
+	if (local_teardown_func != smp_no_rendevous_barrier) {
+		while (smp_rv_waiters[2] < smp_rv_ncpus &&
+		    generation == smp_rv_generation)
+			cpu_spinwait();
 
-	if (local_teardown_func != NULL)
-		local_teardown_func(local_func_arg);
+		if (local_teardown_func != NULL)
+
Re: proposed smp_rendezvous change
On 05/17/2011 01:35 PM, John Baldwin wrote:
...
Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit(). But it also wasn't as fancy in the critical_exit() case. Here is what I have, and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe). Hmm, actually, it will need to use the read and clear:

Looks good to me. I was slightly surprised by this part of the patch:

Index: kern/kern_synch.c
===
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));

But that is in fact correct and much more expressive and safe than the version we had before.

Thanks, Max

P.S. I'd like to see this and the rendezvous changes in stable/7 in the not too distant future. Mind if I MFH these when you are done - unless you are planning to do it already, anyway.
Re: proposed smp_rendezvous change
on 16/05/2011 23:09 John Baldwin said the following:

On Monday, May 16, 2011 3:27:47 pm Andriy Gapon wrote:

on 16/05/2011 21:21 John Baldwin said the following:

How about this:
...
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;

I want to ask once again - please pretty please convince me that the above cpu_spinwait() loop is really needed and, by extension, that the assignments should be moved behind it. Please :)

Well, moving the assignments down is a style fix for one, and we can always remove the first rendezvous as a follow-up if desired.

OK, makes sense.

However, suppose you have an arch where sending an IPI is not a barrier (this seems unlikely). In that case, the atomic_add_acq_int() will not succeed (and return) until it has seen the earlier write by the CPU initiating the rendezvous to clear smp_rv_waiters[0] to zero. The actual spin on the smp_rv_waiters[] value is not strictly necessary, however, and

On this below.

is probably just cut and pasted to match the other uses of values in the smp_rv_waiters[] array. (atomic_add_acq_int() could spin on architectures where it is implemented using compare-and-swap (e.g.
sparc64) or locked-load conditional-store (e.g. Alpha).)

When you say "not strictly necessary", do you mean "not necessary"? If you do not mean that, then when could it be (non-strictly) necessary? :)

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

-- Andriy Gapon
Re: proposed smp_rendezvous change
On 5/17/11 4:03 AM, Andriy Gapon wrote:

on 16/05/2011 23:09 John Baldwin said the following:

is probably just cut and pasted to match the other uses of values in the smp_rv_waiters[] array. (atomic_add_acq_int() could spin on architectures where it is implemented using compare-and-swap (e.g. sparc64) or locked-load conditional-store (e.g. Alpha).)

When you say "not strictly necessary", do you mean "not necessary"? If you do not mean that, then when could it be (non-strictly) necessary? :)

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

No, you could replace it with:

	atomic_add_acq_int(&smp_rv_waiters[0], 1);

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

-- John Baldwin
Re: proposed smp_rendezvous change
On 5/16/11 6:05 PM, Max Laier wrote:

On Monday 16 May 2011 17:54:54 Attilio Rao wrote:

2011/5/16 Max Laier <m...@love2party.net>:

On Monday 16 May 2011 16:46:03 John Baldwin wrote:

On Monday, May 16, 2011 4:30:44 pm Max Laier wrote:

On Monday 16 May 2011 14:21:27 John Baldwin wrote:

Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption?

Right now I just did a manual td_critnest++/--, but I guess ...

Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler.

... the earliest point where it is safe to preempt is after doing the

	atomic_add_int(&smp_rv_waiters[2], 1);

so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!?

Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do).

The spin unlock boils down to a critical_exit(), and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU.

No, it is only set on curthread by curthread.

This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that, however.
I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?

I found the stack that led me to this patch in the first place:

#0 sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1 0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2 0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3 0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4 0x8027adea in rm_cleanIPI (arg=value optimized out) at src/sys/kern/kern_rmlock.c:180
#5 0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6 0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7 0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8 0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that yet ... but this:

	if (td->td_critnest == 1) {
		td->td_critnest = 0;
		if (td->td_owepreempt) {
			td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do.

When an interrupt fires, it goes on the old interrupt/kernel context, which means it has not a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context.

Preempting just means another thread will be scheduled in the middle of another thread's execution path. This code is perfectly fine if you consider that curthread won't be descheduled while it is executing.

Well, no - it is not.
With this you can end up with a curthread that has td_critnest=0 and td_owepreempt=1 in interrupt context. If you use a spinlock on such a thread, it will do the preemption at the point where you drop the spinlock, and this is bad in some circumstances. One example is the smp_rendezvous case we are discussing here.

Yes, the owepreempt flag can "leak" because we allow it to be set while td_critnest is 0. That is the problem, and I think we added this as an optimization in the last round of changes to this routine to avoid excessive context switches. However, I think we will just have to revert that optimization and prevent that state from occurring.

Hmm, the log message for the change said it was to avoid races.

http://svnweb.freebsd.org/base?view=revision&revision=144777

That is what introduced the bug of mixing those two states.

Hmmm, the reason for moving the td_critnest == 0 up
Re: proposed smp_rendezvous change
on 17/05/2011 14:56 John Baldwin said the following:

On 5/17/11 4:03 AM, Andriy Gapon wrote:

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

No, you could replace it with:

	atomic_add_acq_int(&smp_rv_waiters[0], 1);

What about

	(void)atomic_load_acq(&smp_rv_waiters[0]);

In my opinion that should ensure that the hardware must post the latest value from a master CPU to memory of smp_rv_waiters[0] and a slave CPU gets it from there. And also, because of the memory barriers inserted by store_rel on the master CPU and load_acq on the slave CPU, the latest values of all other smp_rv_* fields should become visible to the slave CPU.

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

Thank you for the great explanation!
Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas call, which should take care of hardware synchronization.

-- Andriy Gapon
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 9:46:34 am Andriy Gapon wrote:

on 17/05/2011 14:56 John Baldwin said the following:

On 5/17/11 4:03 AM, Andriy Gapon wrote:

Couldn't [Shouldn't] the whole:

	/* Ensure we have up-to-date values. */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);
	while (smp_rv_waiters[0] < smp_rv_ncpus)
		cpu_spinwait();

be just replaced with:

	rmb();

Or a proper MI function that does just a read memory barrier, if rmb() is not that.

No, you could replace it with:

	atomic_add_acq_int(&smp_rv_waiters[0], 1);

What about

	(void)atomic_load_acq(&smp_rv_waiters[0]);

In my opinion that should ensure that the hardware must post the latest value from a master CPU to memory of smp_rv_waiters[0] and a slave CPU gets it from there. And also, because of the memory barriers inserted by store_rel on the master CPU and load_acq on the slave CPU, the latest values of all other smp_rv_* fields should become visible to the slave CPU.

No, it doesn't quite work that way. It wouldn't work on Alpha, for example. All load_acq is is a load with a memory barrier to order other loads after it. It is still free to load stale data. Only a read-modify-write operation would actually block until it could access an up-to-date value.

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

Thank you for the great explanation!
Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas call, which should take care of hardware synchronization.

sparc64's load_acq() is stronger than the MI effect of load_acq(). On ia64, which uses ld.acq, or Alpha (originally), which uses a membar and a simple load, the guarantees are only what I stated above (and would not be sufficient). Note that Alpha borrowed heavily from MIPS, and the MIPS atomic implementation is mostly identical to the old Alpha one (using conditional stores, etc.). The MIPS atomic_load_acq():

#define	ATOMIC_STORE_LOAD(WIDTH)			\
static __inline uint##WIDTH##_t				\
atomic_load_acq_##WIDTH(__volatile uint##WIDTH##_t *p)	\
{							\
	uint##WIDTH##_t v;				\
							\
	v = *p;						\
	mips_sync();					\
	return (v);					\
}							\

-- John Baldwin
Re: proposed smp_rendezvous change
on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way. It wouldn't work on Alpha, for example. All load_acq is is a load with a memory barrier to order other loads after it. It is still free to load stale data. Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok. How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-) Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any. Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)? You see what I am getting at?

The key being that atomic_add_acq_int() will block (either in hardware or software) until it can safely perform the atomic operation. That means waiting until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is visible to the current CPU. On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it may sit in a store buffer). That is fine so long as an attempt to update that value atomically (using cas or a conditional-store, etc.) fails. For those platforms, the atomic(9) API is required to spin until it succeeds. This is why the mtx code spins if it can't set MTX_CONTESTED, for example.

Thank you for the great explanation!
Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas call, which should take care of hardware synchronization.

sparc64's load_acq() is stronger than the MI effect of load_acq(). On ia64

Oh, well, my expectation was that the MI effect of atomic_load (emphasis on atomic_) was to get a non-stale value.

which uses ld.acq, or Alpha (originally), which uses a membar and a simple load, the guarantees are only what I stated above (and would not be sufficient). Note that Alpha borrowed heavily from MIPS, and the MIPS atomic implementation is mostly identical to the old Alpha one (using conditional stores, etc.).
The MIPS atomic_load_acq():

#define	ATOMIC_STORE_LOAD(WIDTH)			\
static __inline uint##WIDTH##_t				\
atomic_load_acq_##WIDTH(__volatile uint##WIDTH##_t *p)	\
{							\
	uint##WIDTH##_t v;				\
							\
	v = *p;						\
	mips_sync();					\
	return (v);					\
}							\

I should have checked this myself. Thank you for patiently explaining these things to me.

-- Andriy Gapon
Re: proposed smp_rendezvous change
On 05/17/2011 05:16 AM, John Baldwin wrote:
...
Index: kern/kern_switch.c
===
--- kern/kern_switch.c	(revision 221536)
+++ kern/kern_switch.c	(working copy)
@@ -192,15 +192,22 @@
 critical_exit(void)
 {
 	struct thread *td;
-	int flags;
+	int flags, owepreempt;
 
 	td = curthread;
 	KASSERT(td->td_critnest != 0,
 	    ("critical_exit: td_critnest == 0"));
 	if (td->td_critnest == 1) {
+		owepreempt = td->td_owepreempt;
+		td->td_owepreempt = 0;
+		/*
+		 * XXX: Should move compiler_memory_barrier() from
+		 * rmlock to a header.
+		 */

XXX: If we get an interrupt at this point and td_owepreempt was zero, the new interrupt will re-set it, because td_critnest is still non-zero. So we still end up with a thread that is leaking an owepreempt *and* lose a preemption.

We really need an atomic_readandclear() which gives us a local copy of td_owepreempt *and* clears critnest in the same operation. Sadly, that is rather expensive. It is possible to implement with a flag for owepreempt, but that means that all writes to critnest must then be atomic. Either because we know we have interrupts disabled (i.e. setting owepreempt can be a RMW), or with a proper atomic_add/set/...

I'm not sure what the performance impact of this will be. One would hope that an atomic_add without a memory barrier isn't much more expensive than a compiler-generated read-modify-write, though. Especially since this cacheline should be local and exclusive to us anyway.

+		__asm __volatile("":::"memory");
 		td->td_critnest = 0;
-		if (td->td_owepreempt) {
+		if (owepreempt) {
 			td->td_critnest = 1;
 			thread_lock(td);
 			td->td_critnest--;
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 10:34:41 am Andriy Gapon wrote:

on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way.  It wouldn't work on Alpha for example.  load_acq is just a load with a memory barrier to order other loads after it.  It is still free to load stale data.  Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok.  How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-)  Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any.  Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)?  You see what I am getting at?

Yeah, either of those would work.  At this point just leaving the atomic_add_int() as-is would be the smallest diff. :)

-- 
John Baldwin
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 12:20:40 pm Max Laier wrote:

On 05/17/2011 05:16 AM, John Baldwin wrote:
...
Index: kern/kern_switch.c
...

XXX: If we get an interrupt at this point and td_owepreempt was zero, the new interrupt will re-set it, because td_critnest is still non-zero.  So we still end up with a thread that is leaking an owepreempt *and* lose a preemption.

I don't see how this can still leak owepreempt.  The nested interrupt should do nothing (except for possibly set owepreempt) until td_critnest is 0.  However, we can certainly lose preemptions.

I wonder if we can abuse the high bit of td_critnest for the owepreempt flag so it is all stored in one cookie.  We only set owepreempt while holding thread_lock() (so interrupts are disabled), so I think we would be ok and not need atomic ops.

Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let me think some more.

-- 
John Baldwin
Re: proposed smp_rendezvous change
On 05/17/2011 09:56 AM, John Baldwin wrote:

On Tuesday, May 17, 2011 12:20:40 pm Max Laier wrote:

On 05/17/2011 05:16 AM, John Baldwin wrote:
...
Index: kern/kern_switch.c
...

XXX: If we get an interrupt at this point and td_owepreempt was zero, the new interrupt will re-set it, because td_critnest is still non-zero.  So we still end up with a thread that is leaking an owepreempt *and* lose a preemption.

I don't see how this can still leak owepreempt.  The nested interrupt should do nothing (except for possibly set owepreempt) until td_critnest is 0.

Exactly.  The interrupt sets owepreempt, and after we return here we set td_critnest to 0 and exit without clearing owepreempt.  Hence we leak the owepreempt.

However, we can certainly lose preemptions.  I wonder if we can abuse the high bit of td_critnest for the owepreempt flag so it is all stored in one cookie.  We only set owepreempt while holding thread_lock() (so interrupts are disabled), so I think we would be ok and not need atomic ops.  Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let me think some more.

I think these two really belong into one single variable.  Setting the owepreempt flag can be a normal RMW.  Increasing and decreasing critnest must be atomic (otherwise we could lose the flag), and dropping the final reference would work like this:

	if ((curthread->td_critnest & TD_CRITNEST_MASK) == 1) {
		unsigned int owe;

		owe = atomic_readandclear(&curthread->td_critnest);
		if (owe & TD_OWEPREEMPT_FLAG) {
			/* do the switch */
		}
	}

That should do it ...
I can put that into a patch, if we agree that's the right thing to do.

Thanks,
  Max
Re: proposed smp_rendezvous change
on 17/05/2011 18:51 John Baldwin said the following:

On Tuesday, May 17, 2011 10:34:41 am Andriy Gapon wrote:

on 17/05/2011 16:58 John Baldwin said the following:

No, it doesn't quite work that way.  It wouldn't work on Alpha for example.  load_acq is just a load with a memory barrier to order other loads after it.  It is still free to load stale data.  Only a read-modify-write operation would actually block until it could access an up-to-date value.

Hmm, ok.  How about atomic_add_acq_int(&smp_rv_waiters[0], 0)? :-)  Or an equivalent MI action that doesn't actually change the smp_rv_waiters[0] value, if there could be any.  Maybe an explicit atomic_cmpset_acq_int(&smp_rv_waiters[0], 0, 0)?  You see what I am getting at?

Yeah, either of those would work.  At this point just leaving the atomic_add_int() as-is would be the smallest diff. :)

Yes, I agree about the smallest diff.  But if we are going to remove that first spin-loop, then we would effectively be wasting one volatile variable.  Only a tiny waste, but it could be confusing.

Whatever we decide, this was very educational for me.

-- 
Andriy Gapon
Re: proposed smp_rendezvous change
On Tuesday, May 17, 2011 3:16:45 pm Max Laier wrote:

On 05/17/2011 09:56 AM, John Baldwin wrote:
...

I don't see how this can still leak owepreempt.  The nested interrupt should do nothing (except for possibly set owepreempt) until td_critnest is 0.

Exactly.  The interrupt sets owepreempt, and after we return here we set td_critnest to 0 and exit without clearing owepreempt.  Hence we leak the owepreempt.

However, we can certainly lose preemptions.  I wonder if we can abuse the high bit of td_critnest for the owepreempt flag so it is all stored in one cookie.  We only set owepreempt while holding thread_lock() (so interrupts are disabled), so I think we would be ok and not need atomic ops.  Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let me think some more.

I think these two really belong into one single variable.  Setting the owepreempt flag can be a normal RMW.  Increasing and decreasing critnest must be atomic (otherwise we could lose the flag), and dropping the final reference would work like this:

	if ((curthread->td_critnest & TD_CRITNEST_MASK) == 1) {
		unsigned int owe;

		owe = atomic_readandclear(&curthread->td_critnest);
		if (owe & TD_OWEPREEMPT_FLAG) {
			/* do the switch */
		}
	}

That should do it ...  I can put that into a patch, if we agree that's the right thing to do.

Yeah, I already have a patch to do that, but hadn't added atomic ops to critical_enter() and critical_exit().  But it also wasn't as fancy in the critical_exit() case.  Here is what I have and I think it might actually be ok (it doesn't use an atomic read and clear, but I think it is safe).  Hmm, actually, it will need to use the read and clear:

Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c	(revision 222024)
+++ kern/sched_ule.c	(working copy)
@@ -1785,7 +1785,6 @@
 	td->td_oncpu = NOCPU;
 	if (!(flags & SW_PREEMPT))
 		td->td_flags &= ~TDF_NEEDRESCHED;
-	td->td_owepreempt = 0;
 	tdq->tdq_switchcnt++;
 	/*
 	 * The lock pointer in an idle thread should never change.  Reset it
@@ -2066,7 +2065,7 @@
 	flags = SW_INVOL | SW_PREEMPT;
 	if (td->td_critnest > 1)
-		td->td_owepreempt = 1;
+		td->td_critnest |= TD_OWEPREEMPT;
 	else if (TD_IS_IDLETHREAD(td))
 		mi_switch(flags | SWT_REMOTEWAKEIDLE, NULL);
 	else
@@ -2261,7 +2260,7 @@
 		return;
 	if (!sched_shouldpreempt(pri, cpri, 0))
 		return;
-	ctd->td_owepreempt = 1;
+	ctd->td_critnest |= TD_OWEPREEMPT;
 }
 
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c	(revision 222024)
+++ kern/kern_switch.c	(working copy)
@@ -183,7 +183,7 @@
 	struct thread *td;
 
 	td = curthread;
-	td->td_critnest++;
+	atomic_add_int(&td->td_critnest, 1);
 	CTR4(KTR_CRITICAL, "critical_enter by thread %p (%ld, %s) to %d", td,
 	    (long)td->td_proc->p_pid, td->td_name, td->td_critnest);
 }
@@ -192,18 +192,16 @@
 critical_exit(void)
 {
 	struct thread *td;
-	int flags;
+	int flags, owepreempt;
 
 	td = curthread;
 	KASSERT(td->td_critnest != 0, ("critical_exit: td_critnest == 0"));
-	if (td->td_critnest == 1) {
-		td->td_critnest = 0;
-		if (td->td_owepreempt) {
-			td->td_critnest = 1;
+	if ((td->td_critnest & ~TD_OWEPREEMPT) == 1) {
+		owepreempt = atomic_readandclear_int(&td->td_owepreempt);
+		if (owepreempt & TD_OWEPREEMPT) {
 			thread_lock(td);
-			td->td_critnest--;
 			flags = SW_INVOL | SW_PREEMPT;
 			if (TD_IS_IDLETHREAD(td))
 				flags |= SWT_IDLE;
@@ -213,7 +211,7 @@
 			thread_unlock(td);
Re: proposed smp_rendezvous change
On Sunday, May 15, 2011 12:36:45 pm Andriy Gapon wrote:

on 15/05/2011 18:16 John Baldwin said the following:
On 5/15/11 10:53 AM, Andriy Gapon wrote:
on 15/05/2011 10:12 Andriy Gapon said the following:
on 14/05/2011 18:25 John Baldwin said the following:

Hmmm, so this is not actually sufficient.  NetApp ran into a very similar race with virtual CPUs in BHyVe.  In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race.  The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]?  It's always only the master CPU (and under smp_ipi_mtx).

Here's a cleaner version of my approach to the fix.  This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action(), and thus does not renumber all the smp_rv_waiters[] members, so hopefully it should be clearer.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *ar
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change.  I do think this will still be ok and I probably just didn't explain it well to Neel.  I wonder though if the bump shouldn't happen until after the call of the local teardown function?

That should not be necessary according to my understanding, unless we have a reason to care about potentially disappearing function text.  The teardown function pointer and argument are locally cached, and I don't see any problems with a new rendezvous being set up and run while teardowns of the previous one are still running.

On the other hand, you might be onto something here.  If the argument is allocated in temporary storage (e.g. on the stack) and the teardown function actually uses the argument, then it's not safe to deallocate the argument until all of the teardowns are done.  In this case the master CPU should wait not only until all actions are done, but also until all teardowns are done.  The current code doesn't do that.  And I am not sure if we actually want to add this safety net or just document the behavior and put the burden on smp_rendezvous API users.

As we chatted about on IRC, I think we should err on the side of making this API less complex (so that all of the actions for a given rendezvous finish before the next rendezvous begins), as this stuff is already fairly complex.  We should only add more complexity if we really need it.

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e. with non-trivial setup, action and teardown (except perhaps for jkim's stress test).  OpenSolaris seems to be doing just fine with the following three types of simpler CPU cross-calls (as they call them):

1. async - a master CPU requests an action to be executed on target CPUs but doesn't wait for anything - we have no explicit equivalent for this

2. call - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it - this would
Re: proposed smp_rendezvous change
I'd like to propose that we move forward with fixing the race before redesigning the API - please.

We agree that there is a problem with the exit rendezvous.  Let's fix that.  We agree that the generation approach proposed by NetApp is the right way to go?  Can we check it in, then?  Again, I'd like some comments in the code if at all possible.

A second commit should take care of the entry sync - which may or may not be required on some architectures.  If we decide that we don't need it, we should remove it.  Otherwise, we must move all the assignments from the global smp_rv_* variables to the stack in smp_rendezvous_action to after the sync.  Otherwise we should just kill it.  In any case, it would be nice to clean this up.

After that, I have one more bugfix for rmlock(9) - one of the three consumers of this API (that I am aware of).  As it uses a spinlock inside its IPI action, we have to mark smp_rendezvous_action a critical section, otherwise the spin unlock might yield ... with the expected consequences.  It is arguable whether you should be allowed to use spinlocks in the rendezvous at all, but that's beside the point.

On Monday 16 May 2011 11:52:10 John Baldwin wrote:

On Sunday, May 15, 2011 12:36:45 pm Andriy Gapon wrote:
...
As we chatted about on IRC, I think we should err on the side of making this API less complex (so that all of the actions for a given rendezvous finish before the next rendezvous begins), as this stuff is already fairly complex.  We should only add more complexity if we really need it.

+1

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e. with non-trivial setup, action and teardown (except perhaps for jkim's stress test).  OpenSolaris seems to be doing just fine with the following three types of simpler CPU cross-calls (as they call them):

1. async - a master CPU requests an action to be executed on target CPUs but doesn't wait for anything - we have no explicit equivalent for this

I'd like to see something like this.  Otherwise we end up with hacks like callouts that sched_bind and stuff like that.  I can think of plenty of use cases for this API.

2. call - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, no_barrier)

3. synch - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it, and the slave CPUs are also waiting for their siblings to be done - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, NULL)

Seems that our current code doesn't use/need more than that as well.  But I really like the capability that smp_rendezvous_cpus() potentially provides.

I would be fine with stripping down our rendezvous API a bit.  They could all still use a single IPI and handler, but perhaps provide a few wrapper APIs that give simpler semantics like this.  Back to my statement above, we've had this uber-generic rendezvous API for a while, but if we don't really need it to be quite so complex I'd be happy to dumb down the API to what we really need so the implementation and use can be simpler.

Again, agreed.
Re: proposed smp_rendezvous change
On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:

I'd like to propose that we move forward with fixing the race, before redesigning the API - please.  We agree that there is a problem with the exit rendezvous.  Let's fix that.  We agree that the generation approach proposed by NetApp is the right way to go?  Can we check it in, then?  Again, I'd like some comments in the code if at all possible.

How about this:

Index: subr_smp.c
===================================================================
--- subr_smp.c	(revision 221939)
+++ subr_smp.c	(working copy)
@@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;
+	generation = smp_rv_generation;
+
+	/*
+	 * If requested, run a setup function before the main action
+	 * function.  Ensure all CPUs have completed the setup
+	 * function before moving on to the action function.
+	 */
 	if (local_setup_func != smp_no_rendevous_barrier) {
 		if (smp_rv_setup_func != NULL)
 			smp_rv_setup_func(smp_rv_func_arg);
-
-		/* spin on entry rendezvous */
 		atomic_add_int(&smp_rv_waiters[1], 1);
 		while (smp_rv_waiters[1] < smp_rv_ncpus)
 			cpu_spinwait();
 	}
 
-	/* action function */
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
-	/* spin on exit rendezvous */
+	/*
+	 * Signal that the main action has been completed.  If a
+	 * full exit rendezvous is requested, then all CPUs will
+	 * wait here until all CPUs have finished the main action.
+	 *
+	 * Note that the write by the last CPU to finish the action
+	 * may become visible to different CPUs at different times.
+	 * As a result, the CPU that initiated the rendezvous may
+	 * exit the rendezvous and drop the lock allowing another
+	 * rendezvous to be initiated on the same CPU or a different
+	 * CPU.  In that case the exit sentinel may be cleared before
+	 * all CPUs have noticed causing those CPUs to hang forever.
+	 * Workaround this by using a generation count to notice when
+	 * this race occurs and to exit the rendezvous in that case.
+	 */
 	atomic_add_int(&smp_rv_waiters[2], 1);
 	if (local_teardown_func == smp_no_rendevous_barrier)
 		return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus)
+	while (smp_rv_waiters[2] < smp_rv_ncpus &&
+	    generation == smp_rv_generation)
 		cpu_spinwait();
 
-	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
 }
@@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
 	if (ncpus == 0)
 		panic("ncpus is 0 with map=0x%x", map);
 
-	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
-	/* set static function pointers */
+	atomic_add_acq_int(&smp_rv_generation, 1);
+
+	/* Pass rendezvous parameters via global variables. */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
 	smp_rv_action_func = action_func;
@@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_waiters[2] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
-	/* signal other processors, which will enter the IPI with interrupts off */
+	/*
+	 * Signal other processors, which will enter the IPI with
+	 * interrupts off.
+	 */
 	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
 
 	/* Check if the current CPU is in the map */
 	if ((map & (1 << curcpu)) != 0)
 		smp_rendezvous_action();
 
+	/*
+	 *
Re: proposed smp_rendezvous change
on 16/05/2011 21:21 John Baldwin said the following:

How about this:
...
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;

I want to ask once again - please pretty please convince me that the above cpu_spinwait() loop is really needed and, by extension, that the assignments should be moved behind it.  Please :)

-- 
Andriy Gapon
Re: proposed smp_rendezvous change
On Monday 16 May 2011 14:21:27 John Baldwin wrote:

On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
I'd like to propose that we move forward with fixing the race, before redesigning the API - please.  We agree that there is a problem with the exit rendezvous.  Let's fix that.  We agree that the generation approach proposed by NetApp is the right way to go?  Can we check it in, then?  Again, I'd like some comments in the code if at all possible.

How about this:

Index: subr_smp.c
===================================================================
--- subr_smp.c	(revision 221939)
+++ subr_smp.c	(working copy)
@@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*) = smp_rv_setup_func;
-	void (*local_action_func)(void*) = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;
+	generation = smp_rv_generation;
+
+	/*
+	 * If requested, run a setup function before the main action
+	 * function.  Ensure all CPUs have completed the setup
+	 * function before moving on to the action function.
+	 */
 	if (local_setup_func != smp_no_rendevous_barrier) {
 		if (smp_rv_setup_func != NULL)
 			smp_rv_setup_func(smp_rv_func_arg);
-
-		/* spin on entry rendezvous */
 		atomic_add_int(&smp_rv_waiters[1], 1);
 		while (smp_rv_waiters[1] < smp_rv_ncpus)
 			cpu_spinwait();
 	}
 
-	/* action function */
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
-	/* spin on exit rendezvous */
+	/*
+	 * Signal that the main action has been completed.  If a
+	 * full exit rendezvous is requested, then all CPUs will
+	 * wait here until all CPUs have finished the main action.
+	 *
+	 * Note that the write by the last CPU to finish the action
+	 * may become visible to different CPUs at different times.
+	 * As a result, the CPU that initiated the rendezvous may
+	 * exit the rendezvous and drop the lock allowing another
+	 * rendezvous to be initiated on the same CPU or a different
+	 * CPU.  In that case the exit sentinel may be cleared before
+	 * all CPUs have noticed causing those CPUs to hang forever.
+	 * Workaround this by using a generation count to notice when
+	 * this race occurs and to exit the rendezvous in that case.
+	 */
+	MPASS(generation == smp_rv_generation);
 	atomic_add_int(&smp_rv_waiters[2], 1);
 	if (local_teardown_func == smp_no_rendevous_barrier)
 		return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus)
+	while (smp_rv_waiters[2] < smp_rv_ncpus &&
+	    generation == smp_rv_generation)
 		cpu_spinwait();
 
-	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
 }
@@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
 	if (ncpus == 0)
 		panic("ncpus is 0 with map=0x%x", map);
 
-	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
-	/* set static function pointers */
+	atomic_add_acq_int(&smp_rv_generation, 1);
+
+	/* Pass rendezvous parameters via global variables. */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
 	smp_rv_action_func = action_func;
@@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_waiters[2] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
-	/* signal other processors, which will enter the IPI with interrupts off */
+	/*
+	 * Signal other processors, which will enter the IPI with
+	 * interrupts off.
+	 */
 	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
 
 	/* Check if the current CPU is in the map */
 	if ((map & (1 << curcpu)) != 0)
Re: proposed smp_rendezvous change
On Monday, May 16, 2011 3:27:47 pm Andriy Gapon wrote:

on 16/05/2011 21:21 John Baldwin said the following:

How about this:
...
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;

I want to ask once again - please pretty please convince me that the above cpu_spinwait() loop is really needed and, by extension, that the assignments should be moved behind it.  Please :)

Well, moving the assignments down is a style fix for one, and we can always remove the first rendezvous as a follow-up if desired.

However, suppose you have an arch where sending an IPI is not a barrier (this seems unlikely).  In that case, the atomic_add_acq_int() will not succeed (and return) until it has seen the earlier write by the CPU initiating the rendezvous to clear smp_rv_waiters[0] to zero.

The actual spin on the smp_rv_waiters[] value is not strictly necessary, however, and is probably just cut and pasted to match the other uses of values in the smp_rv_waiters[] array.  (atomic_add_acq_int() could spin on architectures where it is implemented using compare-and-swap (e.g. sparc64) or load-locked/store-conditional (e.g. Alpha).)

-- 
John Baldwin
Re: proposed smp_rendezvous change
On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the redenvouz setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()? 
-- John Baldwin
Re: proposed smp_rendezvous change
On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?
I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

Max
Re: proposed smp_rendezvous change
On Monday 16 May 2011 17:54:54 Attilio Rao wrote: 2011/5/16 Max Laier m...@love2party.net: On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?
I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do. When an interrupt fires, it goes on the old interrupt/kernel context, which means it does not have a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context. Preempting just means another thread will be scheduled in the middle of another thread's execution path. This code is perfectly fine if you consider curthread won't be descheduled while it is executing. Well, no - it is not. With this you can end up with a curthread that has td_critnest=0 and td_owepreempt=1 in interrupt context.
If you use a spinlock on such a thread, it will do the preemption at the point where you drop the spinlock, which is bad in some circumstances. One example is the smp_rendezvous case we are discussing here.

Max
Re: proposed smp_rendezvous change
2011/5/17 Max Laier m...@love2party.net: On Monday 16 May 2011 17:54:54 Attilio Rao wrote: 2011/5/16 Max Laier m...@love2party.net: On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however.
I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?

I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do. When an interrupt fires, it goes on the old interrupt/kernel context, which means it does not have a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context. Preempting just means another thread will be scheduled in the middle of another thread's execution path.
This code is perfectly fine if you consider curthread won't be descheduled while it is executing. Well, no - it is not. With this you can end up with a curthread that has td_critnest=0 and td_owepreempt=1 in interrupt context. If you use a spinlock on such a thread, it will do the preemption at the point where you drop the spinlock, which is bad in some circumstances. One example is the smp_rendezvous case we are discussing here. These circumstances are further protected by another call to critical_enter(), by consumers or upper-layer callers. This is why, for example, spinlock_enter() does call critical_enter() even if it actually disables interrupts, and why we disable preemption in other cases where the interrupts are already disabled. Attilio -- Peace can only be achieved by understanding - A. Einstein
Re: proposed smp_rendezvous change
2011/5/16 Max Laier m...@love2party.net: On Monday 16 May 2011 16:46:03 John Baldwin wrote: On Monday, May 16, 2011 4:30:44 pm Max Laier wrote: On Monday 16 May 2011 14:21:27 John Baldwin wrote: Yes, we need to fix that. Humm, it doesn't preempt when you do a critical_exit() though? Or do you use a hand-rolled critical exit that doesn't do a deferred preemption? Right now I just did a manual td_critnest++/--, but I guess ... Ah, ok, so you would lose a preemption. That's not really ideal. Actually, I'm curious how the spin unlock inside the IPI could yield the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is ok as long as the critical_exit() just defers the preemption to the end of the IPI handler. ... the earliest point where it is safe to preempt is after doing the atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs again. However, since we don't accept new IPIs until we signal EOI in the MD code (on amd64), this might still not be a good place to do the yield?!? Hmm, yeah, you would want to do the EOI before you yield. However, we could actually move the EOI up before calling the MI code so long as we leave interrupts disabled for the duration of the handler (which we do). The spin unlock boils down to a critical_exit() and unless we did a critical_enter() at some point during the rendezvous setup, we will yield() if we owepreempt. I'm not quite sure how that can happen, but it seems like there is a path that allows the scheduler to set it from a foreign CPU. No, it is only set on curthread by curthread. This is actually my main question. I've no idea how this could happen unless the rmlock code is actually triggering a wakeup or sched_add() in its rendezvous handler. I don't see anything in rm_cleanIPI() that would do that however. I wonder if your original issue was really fixed just by the first patch you had which fixed the race in smp_rendezvous()?
I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xff011a97, newtd=0xff006e6784b0, flags=4) at src/sys/kern/sched_ule.c:1939
#1  0x80285c7f in mi_switch (flags=6, newtd=0x0) at src/sys/kern/kern_synch.c:475
#2  0x802a2cb3 in critical_exit () at src/sys/kern/kern_switch.c:185
#3  0x80465807 in spinlock_exit () at src/sys/amd64/amd64/machdep.c:1458
#4  0x8027adea in rm_cleanIPI (arg=<value optimized out>) at src/sys/kern/kern_rmlock.c:180
#5  0x802b9887 in smp_rendezvous_action () at src/sys/kern/subr_smp.c:402
#6  0x8045e2a4 in Xrendezvous () at src/sys/amd64/amd64/apic_vector.S:235
#7  0x802a2c6e in critical_exit () at src/sys/kern/kern_switch.c:179
#8  0x804365ba in uma_zfree_arg (zone=0xff009ff4b5a0, item=0xff000f34cd40, udata=0xff000f34ce08) at src/sys/vm/uma_core.c:2370
. . .

and now that I look at it again, it is clear that critical_exit() just isn't interrupt safe. I'm not sure how to fix that, yet ... but this:

    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt) {
            td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as long as there are so many people taking care of it, I'll stay hidden in my corner. I just wanted to tell that I think you are misunderstanding what a critical section is supposed to do. When an interrupt fires, it goes on the old interrupt/kernel context, which means it does not have a context of its own. That is the reason why we disable interrupts on spinlocks (or a similar workaround for !x86 architectures) and this is why spinlocks are the only protection usable in code that runs in interrupt context. Preempting just means another thread will be scheduled in the middle of another thread's execution path. This code is perfectly fine if you consider curthread won't be descheduled while it is executing. Attilio -- Peace can only be achieved by understanding - A. Einstein
Re: proposed smp_rendezvous change
on 14/05/2011 18:25 John Baldwin said the following: On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in TSC code. Non-essential changes: - ditch initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action() As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well? Well, one certainly can use IPIs as memory barrier, but my point was that we have other ways to have a memory barrier and using IPI for that was not necessary (and a little bit harmful to performance) in this case. Essential changes (the fix): - re-use freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with rendezvous (i.e. it's not going to access any members of smp_rv_* pseudo-structure) - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. Just a quick question - have you noticed that because of the change above the smp_rv_waiters[2] of which I spoke was not the same smp_rv_waiters[2] as in the original code? Because I removed smp_rv_waiters[0], smp_rv_waiters[2] is actually some new smp_rv_waiters[3]. And well, I think I described exactly the same scenario as you did in my email on the svn mailing list. So of course I had it in mind: http://www.mail-archive.com/svn-src-all@freebsd.org/msg38637.html My mistake: for clarity, I should not have mixed different changes into the same patch.
I should have provided two patches: one that adds smp_rv_waiters[3] and its handling and one that removes smp_rv_waiters[0]. I would like to see my proposed patch actually tested, if possible, before it's dismissed :-)

-- Andriy Gapon
Re: proposed smp_rendezvous change
on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx).

-- Andriy Gapon
Re: proposed smp_rendezvous change
on 15/05/2011 07:33 Max Laier said the following: On Saturday 14 May 2011 11:25:36 John Baldwin wrote: On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in TSC code. Non-essential changes: - ditch initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action() As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well? Note that if the barrier is required on any platform, then we have to move all the initializations after the _acq, otherwise it is pointless indeed. Yeah, I still would like to get an explanation why that synchronization is needed. I can understand the better-safe-than-sorry mentality, I often stick to it myself. But for the benefit of improving objective knowledge it would be useful to understand why that spin-waiting could be useful, especially given your observation about where the assignments are made. Essential changes (the fix): - re-use freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with rendezvous (i.e. it's not going to access any members of smp_rv_* pseudo-structure) - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator.
All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous. Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock. NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set, which is cached in the rendezvous handler and to exit the last synchronization point if either smp_rv_waiters[2] is high enough, or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered. It seems to address the issue, albeit a bit difficult to understand why this is correct. I'd like to see some comments in the code instead of just the commit log, but otherwise ... please go ahead with this. I think it works approximately the same as what I suggested. Only my patch forces the next master CPU to wait until all slave CPUs are fully done with the previous rendezvous (and aware of that fact), while this patch provides a side-channel to tell late slave CPUs that their rendezvous was completed when there is a new rendezvous started by a new master CPU.
NetApp's patch: Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing the smp_rendezvous_action() when running inside a virtual machine. In particular it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0. This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect this by spinning on
Re: proposed smp_rendezvous change
on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx). Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members, and thus hopefully should be clearer.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
 
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

-- Andriy Gapon
Re: proposed smp_rendezvous change
On 5/15/11 10:53 AM, Andriy Gapon wrote: on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx). Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members, and thus hopefully should be clearer.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
 
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change. I do think this will still be ok and I probably just didn't explain it well to Neel. I wonder though if the bump shouldn't happen until after the call of the local teardown function?

-- John Baldwin
Re: proposed smp_rendezvous change
On Sunday 15 May 2011 11:16:03 John Baldwin wrote: On 5/15/11 10:53 AM, Andriy Gapon wrote: on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. As a follow up to my previous question. Have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only master CPU (and under smp_ipi_mtx). Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members and thus hopefully should be clearer. 
Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c	(revision 221943)
+++ sys/kern/subr_smp.c	(working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *arg);
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)
 
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters[2], 1);
-	if (local_teardown_func == smp_no_rendevous_barrier)
+	if (local_teardown_func == smp_no_rendevous_barrier) {
+		atomic_add_int(&smp_rv_waiters[3], 1);
 		return;
+	}
 	while (smp_rv_waiters[2] < smp_rv_ncpus)
 		cpu_spinwait();
 
+	atomic_add_int(&smp_rv_waiters[3], 1);
+
 	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	while (smp_rv_waiters[3] < smp_rv_ncpus)
+		cpu_spinwait();
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_func_arg = arg;
 	smp_rv_waiters[1] = 0;
 	smp_rv_waiters[2] = 0;
+	smp_rv_waiters[3] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
 	/* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change. I do think this will still be ok and I probably just didn't explain it well to Neel. I wonder though if the bump shouldn't happen until after the call of the local teardown function? I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us.
It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

Thanks,
  Max

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: proposed smp_rendezvous change
on 15/05/2011 19:09 Max Laier said the following: I don't think we ever intended to synchronize the local teardown part, and I believe that is the correct behavior for this API. This version is sufficiently close to what I have, so I am reasonably sure that it will work for us. It seems, however, that if we move the check to after picking up the lock anyway, the generation approach has even less impact and I am starting to prefer that solution. Andriy, is there any reason why you'd prefer your approach over the generation version?

No reason. And I haven't even said that I prefer it :-) I just wanted to show and explain it, as apparently there was some misunderstanding about it. I think that the generation count approach could even have a little bit better performance, while perhaps being a tiny bit less obvious.

-- 
Andriy Gapon
Re: proposed smp_rendezvous change
on 15/05/2011 18:16 John Baldwin said the following: On 5/15/11 10:53 AM, Andriy Gapon wrote: on 15/05/2011 10:12 Andriy Gapon said the following: on 14/05/2011 18:25 John Baldwin said the following: Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

As a follow-up to my previous question: have you noticed that in my patch no slave CPU actually waits/spins on smp_rv_waiters[2]? It's always only the master CPU (and under smp_ipi_mtx).

Here's a cleaner version of my approach to the fix. This one does not remove the initial wait on smp_rv_waiters[0] in smp_rendezvous_action(), and thus does not renumber all the smp_rv_waiters[] members, and so hopefully should be clearer.
Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221943)
+++ sys/kern/subr_smp.c (working copy)
@@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *ar
 static void (*volatile smp_rv_action_func)(void *arg);
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
-static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_waiters[4];

 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -338,11 +338,15 @@ smp_rendezvous_action(void)

        /* spin on exit rendezvous */
        atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[3], 1);
                return;
+       }
        while (smp_rv_waiters[2] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[3], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       while (smp_rv_waiters[3] < smp_rv_ncpus)
+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
        smp_rv_func_arg = arg;
        smp_rv_waiters[1] = 0;
        smp_rv_waiters[2] = 0;
+       smp_rv_waiters[3] = 0;
        atomic_store_rel_int(&smp_rv_waiters[0], 0);

        /* signal other processors, which will enter the IPI with interrupts off */

Ahh, so the bump is after the change. I do think this will still be ok and I probably just didn't explain it well to Neel. I wonder though if the bump shouldn't happen until after the call of the local teardown function?

That should not be necessary according to my understanding, unless we have a reason to care about potentially disappearing function text.
The teardown function pointer and argument are locally cached, and I don't see any problems with a new rendezvous being set up and run while teardowns of the previous one are still running.

On the other hand, you might be onto something here. If the argument is allocated in temporary storage (e.g. on the stack) and the teardown function actually uses the argument, then it's not safe to deallocate the argument until all of the teardowns are done. In this case the master CPU should wait not only until all actions are done, but also until all teardowns are done. The current code doesn't do that. And I am not sure if we actually would want to add this safety net, or just document the behavior and put the burden on smp_rendezvous API users.

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e. with non-trivial setup, action and teardown (except perhaps for jkim's stress test). OpenSolaris seems to be doing just fine with the following three types of simpler CPU cross-calls (as they call them):

1. async - a master CPU requests an action to be executed on target CPUs but doesn't wait for anything - we have no explicit equivalent for this
2. call - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, no_barrier)
3. synch - a master CPU requests an action to be executed on target CPUs and waits until all the target CPUs have finished executing it and the slave CPUs are also waiting for their siblings to be done - this would be equivalent to smp_rendezvous_cpus(no_barrier, action, NULL)

It seems that our current code doesn't use/need more than that either. But I really like the capability that
Re: proposed smp_rendezvous change
On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()

As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well?

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator. All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason, so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous. Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g.
if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock.

NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set; the handler caches it and exits the last synchronization point if either smp_rv_waiters[2] is high enough or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered.

NetApp's patch:

Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing smp_rendezvous_action() when running inside a virtual machine. In particular, it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied, and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues, it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0. This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect against this by spinning on the exit rendezvous only if the generation of the smp_rendezvous() matches what we started out with before incrementing 'smp_rv_waiters[2]'.

Differences ... 
//private//sys/kern/subr_smp.c#3 (text)

@@ -127,6 +127,7 @@
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void * volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;

 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -418,6 +419,7 @@
 void
 smp_rendezvous_action(void)
 {
+       int generation;
        void* local_func_arg = smp_rv_func_arg;
        void (*local_setup_func)(void*) = smp_rv_setup_func;
        void (*local_action_func)(void*) = smp_rv_action_func;
@@ -457,11 +459,14 @@
        if (local_action_func != NULL)
                local_action_func(local_func_arg);

+       generation = atomic_load_acq_int(&smp_rv_generation);
+
        /* spin on exit rendezvous */
        atomic_add_int(&smp_rv_waiters[2], 1);
        if (local_teardown_func == smp_no_rendevous_barrier)
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       while (smp_rv_waiters[2] < smp_rv_ncpus &&
+           generation == smp_rv_generation)
                cpu_spinwait();
Re: proposed smp_rendezvous change
On Saturday 14 May 2011 11:25:36 John Baldwin wrote: On 5/13/11 9:43 AM, Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()

As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well?

Note that if the barrier is required on any platform, then we have to move all the initializations after the _acq, otherwise it is pointless indeed.

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case, because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished.

A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator. All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason, so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous.
Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock.

NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set; the handler caches it and exits the last synchronization point if either smp_rv_waiters[2] is high enough or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered.

It seems to address the issue, although it is a bit difficult to understand why this is correct. I'd like to see some comments in the code instead of just the commit log, but otherwise ... please go ahead with this. I'll open a new thread with the second issue we have been seeing wrt rmlocks, which is related, but different.

NetApp's patch:

Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing smp_rendezvous_action() when running inside a virtual machine. In particular, it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied, and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues, it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0.
This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect against this by spinning on the exit rendezvous only if the generation of the smp_rendezvous() matches what we started out with before incrementing 'smp_rv_waiters[2]'.

Differences ...

//private//sys/kern/subr_smp.c#3 (text)

@@ -127,6 +127,7 @@
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void * volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;

 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -418,6 +419,7 @@
 void
 smp_rendezvous_action(void)
 {
+       int generation;
        void*
Re: proposed smp_rendezvous change
On Friday 13 May 2011 09:43:25 Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()
- shift smp_rv_waiters indexes by one because of the above

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)

this ncpus isn't the one you are looking for.

I have a patch of my own that is attached. This might be overdoing it to some extent, but it has been running very well on our system for quite some time, which now uses rmlock heavily. The rmlock diff is unrelated, but since I have this diff around for some time now ...

+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */

diff --git a/src/sys/kern/kern_rmlock.c b/src/sys/kern/kern_rmlock.c
index f91e61a..a30c16b 100644
--- a/src/sys/kern/kern_rmlock.c
+++ b/src/sys/kern/kern_rmlock.c
@@ -154,6 +154,7 @@ rm_tracker_remove(struct pcpu *pc, struct rm_priotracker *tracker)
 static void
 rm_cleanIPI(void *arg)
 {
+       TAILQ_HEAD(,rm_priotracker) tmp_list = TAILQ_HEAD_INITIALIZER(tmp_list);
        struct pcpu *pc;
        struct rmlock *rm = arg;
        struct rm_priotracker *tracker;
@@ -165,12 +166,12 @@ rm_cleanIPI(void *arg)
                tracker = (struct rm_priotracker *)queue;
                if (tracker->rmp_rmlock == rm && tracker->rmp_flags == 0) {
                        tracker->rmp_flags = RMPF_ONQUEUE;
-                       mtx_lock_spin(&rm_spinlock);
-                       LIST_INSERT_HEAD(&rm->rm_activeReaders, tracker,
-                           rmp_qentry);
-                       mtx_unlock_spin(&rm_spinlock);
+                       TAILQ_INSERT_HEAD(&tmp_list, tracker, rmp_qentry);
                }
        }
+       mtx_lock_spin(&rm_spinlock);
+       TAILQ_CONCAT(&rm->rm_activeReaders, &tmp_list, rmp_qentry);
+       mtx_unlock_spin(&rm_spinlock);
 }

 CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
@@ -186,7 +187,7 @@ rm_init_flags(struct rmlock *rm, const char *name, int opts)
        if (opts & RM_RECURSE)
                liflags |= LO_RECURSABLE;
        rm->rm_writecpus = all_cpus;
-       LIST_INIT(&rm->rm_activeReaders);
+       TAILQ_INIT(&rm->rm_activeReaders);
        if (opts & RM_SLEEPABLE) {
                liflags |= RM_SLEEPABLE;
                sx_init_flags(&rm->rm_lock_sx, "rmlock_sx", SX_RECURSE);
@@ -281,7 +282,7 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
        if
Re: proposed smp_rendezvous change
On Friday 13 May 2011 09:43:25 Andriy Gapon wrote: This is a change in the vein of what I've been doing in the xcpu branch, and it's supposed to fix the issue exposed by the recent commit that (probably unintentionally) stress-tests smp_rendezvous in the TSC code.

Non-essential changes:
- ditch the initial, and in my opinion useless, pre-setup rendezvous in smp_rendezvous_action()
- shift smp_rv_waiters indexes by one because of the above

Essential changes (the fix):
- re-use the freed smp_rv_waiters[2] to indicate that a slave/target is really fully done with the rendezvous (i.e. it's not going to access any members of the smp_rv_* pseudo-structure)
- spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use the smp_rv_* pseudo-structure too early

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

You really need a read/load with a memory fence here in order to make sure you get up to date values on all CPUs.

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               cpu_spinwait();
+

This is not the ncpus you are looking for.

I have a patch of my own, that might be overdoing it ... but it is attached nonetheless ... The rmlock is a separate issue, but what led me to discover the problem with the rendezvous.

        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */

commit 48f16f01c28cc3006faefa024b0d3e4bcf02fb60
Author: ndire ndire@b72e2a10-2d34-0410-9a71-d3beadf02b57
Date: Tue Dec 14 18:49:55 2010 +

    Reintegrate BR_CHOPUVILLE.

    git-svn-id: https://svn.west.isilon.com/repo/onefs/head@169088 b72e2a10-2d34-0410-9a71-d3beadf02b57

diff --git a/src/sys/kern/kern_rmlock.c b/src/sys/kern/kern_rmlock.c
index f91e61a..a30c16b 100644
--- a/src/sys/kern/kern_rmlock.c
+++ b/src/sys/kern/kern_rmlock.c
@@ -154,6 +154,7 @@ rm_tracker_remove(struct pcpu *pc, struct rm_priotracker *tracker)
 static void
 rm_cleanIPI(void *arg)
 {
+       TAILQ_HEAD(,rm_priotracker) tmp_list = TAILQ_HEAD_INITIALIZER(tmp_list);
        struct pcpu *pc;
        struct rmlock *rm = arg;
        struct rm_priotracker *tracker;
@@ -165,12 +166,12 @@ rm_cleanIPI(void *arg)
                tracker = (struct rm_priotracker *)queue;
                if (tracker->rmp_rmlock == rm && tracker->rmp_flags == 0) {
                        tracker->rmp_flags = RMPF_ONQUEUE;
-                       mtx_lock_spin(&rm_spinlock);
-                       LIST_INSERT_HEAD(&rm->rm_activeReaders, tracker,
-                           rmp_qentry);
-                       mtx_unlock_spin(&rm_spinlock);
+                       TAILQ_INSERT_HEAD(&tmp_list, tracker, rmp_qentry);
                }
        }
+       mtx_lock_spin(&rm_spinlock);
+       TAILQ_CONCAT(&rm->rm_activeReaders, &tmp_list, rmp_qentry);
+       mtx_unlock_spin(&rm_spinlock);
 }

 CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
@@ -186,7 +187,7 @@ rm_init_flags(struct rmlock *rm, const char *name, int opts)
        if (opts & RM_RECURSE)
                liflags |= LO_RECURSABLE;
Re: proposed smp_rendezvous change
on 13/05/2011 17:41 Max Laier said the following: this ncpus isn't the one you are looking for.

Thank you! Here's an updated patch:

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
Re: proposed smp_rendezvous change
On Friday 13 May 2011 11:28:33 Andriy Gapon wrote: on 13/05/2011 17:41 Max Laier said the following: this ncpus isn't the one you are looking for. Thank you! Here's an updated patch:

Can you attach the patch, so I can apply it locally? This code is really hard to read without context. Some more comments inline ...

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-

You really need this for architectures that need the memory barrier to ensure consistency. We also need to move the reads of smp_rv_* below this point to provide a consistent view.

        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+

This does not help you at all. Imagine the following (unlikely, but not impossible) case:

CPU A: starts a rendezvous including itself, finishes the action first (i.e. CPU A is the first one to see smp_rv_waiters[2] == smp_rv_ncpus), drops the lock and starts a new rendezvous. smp_rv_waiters[2] == smp_rv_ncpus is still true on that CPU, but ...

CPU B: might have increased smp_rv_waiters[2] for the first rendezvous, but never saw smp_rv_waiters[2] == smp_rv_ncpus, still ...

CPU A is allowed to start a new rendezvous, which will leave CPU B stranded and can lead to a deadlock. I think this is also possible with another CPU starting the second rendezvous.

        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
Re: proposed smp_rendezvous change
On Friday 13 May 2011 11:50:57 Max Laier wrote: On Friday 13 May 2011 11:28:33 Andriy Gapon wrote: on 13/05/2011 17:41 Max Laier said the following: this ncpus isn't the one you are looking for. Thank you! Here's an updated patch: Can you attach the patch, so I can apply it locally? This code is really hard to read without context. Some more comments inline ...

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*) = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-

You really need this for architectures that need the memory barrier to ensure consistency. We also need to move the reads of smp_rv_* below this point to provide a consistent view.

        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
                local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+

Disregard this ... I misread the diff. You are indeed using [2] correctly as the all-clear semaphore. I still believe that it is safer/cleaner to do this spin before releasing the lock instead (see my patch).

This does not help you at all. Imagine the following (unlikely, but not impossible) case:

CPU A: starts a rendezvous including itself, finishes the action first (i.e. CPU A is the first one to see smp_rv_waiters[2] == smp_rv_ncpus), drops the lock and starts a new rendezvous. smp_rv_waiters[2] == smp_rv_ncpus is still true on that CPU, but ...

CPU B: might have increased smp_rv_waiters[2] for the first rendezvous, but never saw smp_rv_waiters[2] == smp_rv_ncpus, still ...

CPU A is allowed to start a new rendezvous, which will leave CPU B stranded and can lead to a deadlock. I think this is also possible with another CPU starting the second rendezvous.

        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
                smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
Re: proposed smp_rendezvous change
on 13/05/2011 18:50 Max Laier said the following:

On Friday 13 May 2011 11:28:33 Andriy Gapon wrote:

on 13/05/2011 17:41 Max Laier said the following:
this ncpus isn't the one you are looking for.

Thank you! Here's an updated patch:

Can you attach the patch, so I can apply it locally? This code is really hard to read without context. Some more comments inline ...

Attached.

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*)   = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-

You really need this for architectures that need the memory barrier to ensure consistency. We also need to move the reads of smp_rv_* below this point to provide a consistent view.

I thought that this would be automatically handled by the fact that the master CPU sets smp_rv_waiters[0] using an atomic operation with release semantics. But I am not very proficient in these matters...

But I fail to see why we need to require that all CPUs gather at this point/condition. That is, my point is that we don't start a new rendezvous until a previous one is completely finished. Then we set up the new rendezvous, finish the setup with an operation with release semantics and only then notify the target CPUs. I can't see how the slave CPUs would see stale values in the rendezvous pseudo-object, but, OTOH, I am not very familiar with architectures that have weaker memory consistency rules as compared to x86.

-- Andriy Gapon

Index: sys/kern/subr_smp.c
===================================================================
--- sys/kern/subr_smp.c (revision 221835)
+++ sys/kern/subr_smp.c (working copy)
@@ -316,19 +316,14 @@
        void (*local_action_func)(void*)   = smp_rv_action_func;
        void (*local_teardown_func)(void*) = smp_rv_teardown_func;

-       /* Ensure we have up-to-date values. */
-       atomic_add_acq_int(&smp_rv_waiters[0], 1);
-       while (smp_rv_waiters[0] < smp_rv_ncpus)
-               cpu_spinwait();
-
        /* setup function */
        if (local_setup_func != smp_no_rendevous_barrier) {
                if (smp_rv_setup_func != NULL)
                        smp_rv_setup_func(smp_rv_func_arg);

                /* spin on entry rendezvous */
-               atomic_add_int(&smp_rv_waiters[1], 1);
-               while (smp_rv_waiters[1] < smp_rv_ncpus)
+               atomic_add_int(&smp_rv_waiters[0], 1);
+               while (smp_rv_waiters[0] < smp_rv_ncpus)
                        cpu_spinwait();
        }

@@ -337,12 +332,16 @@
        local_action_func(local_func_arg);

        /* spin on exit rendezvous */
-       atomic_add_int(&smp_rv_waiters[2], 1);
-       if (local_teardown_func == smp_no_rendevous_barrier)
+       atomic_add_int(&smp_rv_waiters[1], 1);
+       if (local_teardown_func == smp_no_rendevous_barrier) {
+               atomic_add_int(&smp_rv_waiters[2], 1);
                return;
-       while (smp_rv_waiters[2] < smp_rv_ncpus)
+       }
+       while (smp_rv_waiters[1] < smp_rv_ncpus)
                cpu_spinwait();

+       atomic_add_int(&smp_rv_waiters[2], 1);
+
        /* teardown function */
        if (local_teardown_func != NULL)
                local_teardown_func(local_func_arg);
@@ -377,6 +376,10 @@
        /* obtain rendezvous lock */
        mtx_lock_spin(&smp_ipi_mtx);

+       /* Wait for any previous unwaited rendezvous to finish. */
+       while (atomic_load_acq_int(&smp_rv_waiters[2]) < smp_rv_ncpus)
+               cpu_spinwait();
+
        /* set static function pointers */
        smp_rv_ncpus = ncpus;
        smp_rv_setup_func = setup_func;
@@ -395,7 +398,7 @@
        smp_rendezvous_action();

        if (teardown_func == smp_no_rendevous_barrier)
-               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
+               while (atomic_load_acq_int(&smp_rv_waiters[1]) < ncpus)
                        cpu_spinwait();

        /* release lock */
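The release/acquire pairing Andriy relies on can be illustrated in userland with C11 atomics (the kernel spells these atomic_store_rel_int / atomic_load_acq_int). This is a hypothetical sketch, not the FreeBSD code; "param", "published", and "publish_and_wait" are illustrative names. The guarantee is that plain stores made before a release store are visible to any thread that observes that store with an acquire load:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int param;            /* plain (non-atomic) rendezvous parameter */
static atomic_int published; /* stands in for notifying the slave CPUs */
static int seen;

static void *slave(void *unused)
{
	(void)unused;
	/* acquire: once we see the flag, we also see the param store */
	while (atomic_load_explicit(&published, memory_order_acquire) == 0)
		;
	seen = param;
	return NULL;
}

int publish_and_wait(int value)
{
	pthread_t t;

	atomic_store(&published, 0);
	seen = -1;
	pthread_create(&t, NULL, slave, NULL);

	param = value;  /* ordinary store of the "rendezvous setup" ... */
	/* ... release: ordered before the flag becomes visible */
	atomic_store_explicit(&published, 1, memory_order_release);

	pthread_join(t, NULL);
	return seen;
}
```

This is also why the wait-for-previous-rendezvous spin matters on weakly ordered architectures: the release store only helps a slave that has not started reading the pseudo-object before the master finished writing it.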
Re: proposed smp_rendezvous change
on 13/05/2011 20:13 Max Laier said the following:

Disregard this ... I misread the diff. You are indeed using [2] correctly as the all-clear semaphore. I still believe that it is safer/cleaner to do this spin before releasing the lock instead (see my patch).

Maybe. I consider my approach a minor optimization - that is, I think that normally smp_rendezvous calls would be sparse enough to never require that synchronization (as past experience suggests). So synchronizing/delaying the master CPU at the end of smp_rendezvous would slightly hurt performance. Having the check at the start should trigger the synchronization only when it is really required.

-- Andriy Gapon
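The "check at the start" placement Andriy prefers can be sketched as a toy userland model (NCPUS, all_clear, and the function names are illustrative, not the kernel code): the master spins on the all-clear counter when it begins the next rendezvous rather than when it finishes the current one, so sparse callers normally pay nothing.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPUS 4

static atomic_int all_clear = NCPUS; /* previous rendezvous fully done */

/* Each slave's very last touch of the shared rendezvous state. */
void slave_done(void)
{
	atomic_fetch_add_explicit(&all_clear, 1, memory_order_release);
}

/* Convenience for the sketch: all CPUs finish their teardown. */
void all_slaves_done(void)
{
	for (int i = 0; i < NCPUS; i++)
		slave_done();
}

/* Returns 0 when no waiting was needed (the expected common case). */
int begin_rendezvous(void)
{
	int waited = 0;

	while (atomic_load_explicit(&all_clear, memory_order_acquire) < NCPUS)
		waited = 1; /* real code: cpu_spinwait() */
	atomic_store_explicit(&all_clear, 0, memory_order_relaxed);
	return waited;
}
```

When rendezvous calls are sparse, all_clear has already reached NCPUS by the time the next one starts and begin_rendezvous returns without spinning; only back-to-back calls pay for the synchronization.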