Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, Aug 15, 2014 at 04:26:01PM +0200, Oleg Nesterov wrote:
> On 08/15, Frederic Weisbecker wrote:
> >
> > 2014-08-14 16:39 GMT+02:00 Oleg Nesterov:
> > > On 08/14, Frederic Weisbecker wrote:
> > >>
> > >> I mean the read side doesn't use a lock with seqlocks. It's only made
> > >> of barriers and sequence numbers to ensure the reader doesn't read
> > >> some half-complete update. But other than that it can as well see the
> > >> update n - 1 since barriers don't enforce latest results.
> > >
> > > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > > write_seqcount_begin-update-write_seqcount_end can miss the "update"
> > > part along with the ->sequence modifications.
> > >
> > > But I still can't understand how this can lead to non-monotonic
> > > results, could you spell it out?
> >
> > Well, let's say clock = T.
> > CPU 0 updates at T + 1.
> > Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> > while CPU 2 still reads T.
> > If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> > possible that CPU 2 still sees T. With the spinlocked version that
> > thing can't happen; the second round would read at least T + 1 for
> > both CPUs.
>
> But this is fine? And CPU 2 doesn't see a non-monotonic result?
>
> OK, this could be wrong if, say,
>
> 	void print_clock(void)
> 	{
> 		lock(SOME_LOCK);
> 		printk(..., clock_gettime());
> 		unlock(SOME_LOCK);
> 	}
>
> printed non-monotonic numbers when print_clock() is called on CPU_1 and
> then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
> they were already visible to CPU_1 under the same lock. IOW,
>
> 	int T = 0;	/* can be incremented at any time */
>
> 	void check_monotony(void)
> 	{
> 		static int t = 0;
>
> 		lock(SOME_LOCK);
> 		BUG(t > T);
> 		t = T;
> 		unlock(SOME_LOCK);
> 	}
>
> must work correctly (ignoring overflow) even if T is changed without
> SOME_LOCK.
>
> Otherwise, without some sort of synchronization, the different results
> on CPU_1/2 should be fine.
> Or am I still missing your point?

No, I think you're right: as long as ordering against something else is
involved, monotonicity is enforced.

Now I'm trying to think of a case where SMP ordering isn't involved.
Perhaps some usecase based on coupling CPU-local clocks and
clock_gettime(), where a drift between the two could appear. But using a
local clock probably only makes sense for local usecases, where the
thread clock update would be local as well. So that's probably not a
problem.

Now what if somebody couples multithread process-wide clocks with
per-CPU local clocks? Well, that's probably too foolish to be considered.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Rik van Riel wrote:
>
> On 08/15/2014 12:49 PM, Oleg Nesterov wrote:
>
> > Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
> > Just I do not know whether it's worth the trouble.
>
> If we don't know whether it is worth the trouble, it is probably best
> to stick to a well-known generic locking algorithm, instead of brewing
> our own and trying to maintain it.

Perhaps. I am obviously biased and can't judge ;) Plus, again, I do
understand that your approach has some advantages too.

> Now to see if this change to cputime_adjust does the trick :)
>
> +++ b/kernel/sched/cputime.c
> @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
>  	 * If the tick based count grows faster than the scheduler one,
>  	 * the result of the scaling may go backward.
>  	 * Let's enforce monotonicity.
> +	 * Atomic exchange protects against concurrent cputime_adjust().
>  	 */
> -	prev->stime = max(prev->stime, stime);
> -	prev->utime = max(prev->utime, utime);
> +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> +		cmpxchg(&prev->stime, rtime, stime);
> +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> +		cmpxchg(&prev->utime, rtime, utime);

Yes, perhaps we need something like this in any case. To remind, at
least do_task_stat() calls task_cputime_adjusted() locklessly, although
we could fix this separately.

But I do not think the change above is enough. With this change
cputime_adjust() can race with itself. Yes, this guarantees monotonicity
even if it is called locklessly, but it can lead to "obviously
inconsistent" numbers. And I don't think we can ignore this.
If we could, then we could remove the scale_stime recalculation and
change cputime_adjust() to simply do:

	static void cputime_adjust(struct task_cputime *curr,
				   struct cputime *prev,
				   cputime_t *ut, cputime_t *st)
	{
		/* enforce monotonicity */
		*ut = prev->utime = max(prev->utime, curr->utime);
		*st = prev->stime = max(prev->stime, curr->stime);
	}

Yes, we have this problem either way. And personally I think that this
"enforce monotonicity" logic is pointless, userspace could take care of
it, but it is too late to complain.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15/2014 12:49 PM, Oleg Nesterov wrote:

> Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
> Just I do not know whether it's worth the trouble.

If we don't know whether it is worth the trouble, it is probably best
to stick to a well-known generic locking algorithm, instead of brewing
our own and trying to maintain it.

I have fixed the other locking issue you pointed out, Oleg. Now to see
if this change to cputime_adjust does the trick :)

+++ b/kernel/sched/cputime.c
@@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
 	 * If the tick based count grows faster than the scheduler one,
 	 * the result of the scaling may go backward.
 	 * Let's enforce monotonicity.
+	 * Atomic exchange protects against concurrent cputime_adjust().
 	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, utime);
+	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+		cmpxchg(&prev->stime, rtime, stime);
+	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+		cmpxchg(&prev->utime, rtime, utime);
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Oleg Nesterov wrote:
>
> However, if we only want to make sys_times() more scalable, then
> perhaps the "lockless" version of thread_group_cputime() makes more
> sense. And given that do_sys_times() uses current we can simplify it;
> is_dead is not possible and we do not need to take ->siglock twice:
>
>	void current_group_cputime(struct task_cputime *times)
>	{
>		struct task_struct *tsk = current, *t;
>		spinlock_t *siglock = &tsk->sighand->siglock;
>		struct signal_struct *sig = tsk->signal;
>		bool lockless = true;
>		u64 exec;
>
>	retry:
>		spin_lock_irq(siglock);
>		times->utime = sig->utime;
>		times->stime = sig->stime;
>		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>
>		if (lockless)
>			spin_unlock_irq(siglock);
>
>		rcu_read_lock();
>		for_each_thread(tsk, t) {
>			cputime_t utime, stime;
>
>			task_cputime(t, &utime, &stime);
>			times->utime += utime;
>			times->stime += stime;
>			times->sum_exec_runtime += task_sched_runtime(t);
>		}
>		rcu_read_unlock();
>
>		if (lockless) {
>			lockless = false;
>			spin_unlock_wait(siglock);
>			smp_rmb();
>			if (exec != sig->sum_sched_runtime)
>				goto retry;
>		} else {
>			spin_unlock_irq(siglock);
>		}
>	}

Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
Just I do not know whether it's worth the trouble.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Peter Zijlstra wrote:
>
> Also; why do we care about PROCESS_CPUTIME? People should really not use
> it. What are the 'valid' usecases you guys care about?

I do not really know. IIUC, the problematic usecase is sys_times(). I
agree with Mike, "don't do this if you have a lot of threads". But
perhaps the kernel can help applications which already abuse times().

However, if we only want to make sys_times() more scalable, then perhaps
the "lockless" version of thread_group_cputime() makes more sense. And
given that do_sys_times() uses current we can simplify it; is_dead is
not possible and we do not need to take ->siglock twice:

	void current_group_cputime(struct task_cputime *times)
	{
		struct task_struct *tsk = current, *t;
		spinlock_t *siglock = &tsk->sighand->siglock;
		struct signal_struct *sig = tsk->signal;
		bool lockless = true;
		u64 exec;

	retry:
		spin_lock_irq(siglock);
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = exec = sig->sum_sched_runtime;

		if (lockless)
			spin_unlock_irq(siglock);

		rcu_read_lock();
		for_each_thread(tsk, t) {
			cputime_t utime, stime;

			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
		rcu_read_unlock();

		if (lockless) {
			lockless = false;
			spin_unlock_wait(siglock);
			smp_rmb();
			if (exec != sig->sum_sched_runtime)
				goto retry;
		} else {
			spin_unlock_irq(siglock);
		}
	}

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote:
>
> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	struct signal_struct *sig = tsk->signal;
>  	cputime_t utime, stime;
>  	struct task_struct *t;
> -
> -	times->utime = sig->utime;
> -	times->stime = sig->stime;
> -	times->sum_exec_runtime = sig->sum_sched_runtime;
> +	unsigned int seq, nextseq;
>
>  	rcu_read_lock();
> -	for_each_thread(tsk, t) {
> -		task_cputime(t, &utime, &stime);
> -		times->utime += utime;
> -		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> -	}
> +	/* Attempt a lockless read on the first round. */
> +	nextseq = 0;
> +	do {
> +		seq = nextseq;
> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		for_each_thread(tsk, t) {
> +			task_cputime(t, &utime, &stime);
> +			times->utime += utime;
> +			times->stime += stime;
> +			times->sum_exec_runtime += task_sched_runtime(t);
> +		}
> +		/*
> +		 * If a writer is currently active, seq will be odd, and
> +		 * read_seqbegin_or_lock will take the lock.
> +		 */
> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> +	} while (need_seqretry(&sig->stats_lock, seq));
> +	done_seqretry(&sig->stats_lock, seq);
>  	rcu_read_unlock();
>  }

I still think this is not right. Let me quote my previous email (which
quoted the same hunk):

	Almost cosmetic nit, but afaics this patch expands the rcu
	critical section for no reason. We only need
	rcu_read_lock/unlock around for_each_thread() below.
Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I
mean, this code can still livelock in theory. Just suppose that another
CPU does write_seqlock/write_sequnlock right after
read_seqbegin_or_lock(). In this case "seq & 1" will never be true and
thus "or_lock" will never happen.

IMO, this should be fixed. Either we should guarantee forward progress
or we should not play with read_seqbegin_or_lock() at all.

This code assumes that sooner or later "nextseq = raw_read_seqcount()"
should return an odd counter, but in theory this can never happen.

And if we want to fix this we do not need 2 counters; we just need to
set "seq = 1" manually after need_seqretry() == T. Say, like
__dentry_path() does. (But unlike __dentry_path() we do not need to
worry about rcu_read_unlock, so the code will be simpler.)

I am wondering if it makes sense to introduce

	bool read_seqretry_or_lock(const seqlock_t *sl, int *seq)
	{
		if (*seq & 1) {
			read_sequnlock_excl(sl);
			return false;
		}

		if (!read_seqretry(sl, *seq))
			return false;

		*seq = 1;
		return true;
	}

Or did I miss your reply?

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Frederic Weisbecker wrote:
>
> 2014-08-14 16:39 GMT+02:00 Oleg Nesterov:
> > On 08/14, Frederic Weisbecker wrote:
> >>
> >> I mean the read side doesn't use a lock with seqlocks. It's only made
> >> of barriers and sequence numbers to ensure the reader doesn't read
> >> some half-complete update. But other than that it can as well see the
> >> update n - 1 since barriers don't enforce latest results.
> >
> > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > write_seqcount_begin-update-write_seqcount_end can miss the "update"
> > part along with the ->sequence modifications.
> >
> > But I still can't understand how this can lead to non-monotonic
> > results, could you spell it out?
>
> Well, let's say clock = T.
> CPU 0 updates at T + 1.
> Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> while CPU 2 still reads T.
> If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> possible that CPU 2 still sees T. With the spinlocked version that
> thing can't happen; the second round would read at least T + 1 for
> both CPUs.

But this is fine? And CPU 2 doesn't see a non-monotonic result?

OK, this could be wrong if, say,

	void print_clock(void)
	{
		lock(SOME_LOCK);
		printk(..., clock_gettime());
		unlock(SOME_LOCK);
	}

printed non-monotonic numbers when print_clock() is called on CPU_1 and
then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
they were already visible to CPU_1 under the same lock. IOW,

	int T = 0;	/* can be incremented at any time */

	void check_monotony(void)
	{
		static int t = 0;

		lock(SOME_LOCK);
		BUG(t > T);
		t = T;
		unlock(SOME_LOCK);
	}

must work correctly (ignoring overflow) even if T is changed without
SOME_LOCK.

Otherwise, without some sort of synchronization, the different results
on CPU_1/2 should be fine.

Or am I still missing your point?

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, Aug 15, 2014 at 11:37:33AM +0200, Mike Galbraith wrote:
> On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote:
> > On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> > > For the N threads doing this on N cores case, seems rq->lock hammering
> > > will still be a source of major box wide pain. Is there any correctness
> > > reason to add up unaccounted ->on_cpu beans, or is that just value
> > > added?
> >
> > That delta can be arbitrarily large with nohz_full. And without
> > nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
> > reported clock resolution.
> >
> > Having a non-constant error bound is annoying, for you never quite know
> > what to expect.
>
> Ah, yeah, that could get rather large.
>
> > Also; why do we care about PROCESS_CPUTIME? People should really not use
> > it. What are the 'valid' usecases you guys care about?
>
> I don't care much, said "don't do that" before I saw a similar big box
> problem had popped up with times().

Urgh, yes, times()... Now I don't think we do very accurate accounting
of those particular numbers, so we could fudge some of that. Typically
we only do TICK_NSEC granularity accounting on the user/system divide
anyhow, seeing how putting timestamp reads in the kernel<->user switch
is _expensive_ -- see NOHZ_FULL.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote:
> On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> > For the N threads doing this on N cores case, seems rq->lock hammering
> > will still be a source of major box wide pain. Is there any correctness
> > reason to add up unaccounted ->on_cpu beans, or is that just value
> > added?
>
> That delta can be arbitrarily large with nohz_full. And without
> nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
> reported clock resolution.
>
> Having a non-constant error bound is annoying, for you never quite know
> what to expect.

Ah, yeah, that could get rather large.

> Also; why do we care about PROCESS_CPUTIME? People should really not use
> it. What are the 'valid' usecases you guys care about?

I don't care much, said "don't do that" before I saw a similar big box
problem had popped up with times().

	-Mike
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> For the N threads doing this on N cores case, seems rq->lock hammering
> will still be a source of major box wide pain. Is there any correctness
> reason to add up unaccounted ->on_cpu beans, or is that just value
> added?

That delta can be arbitrarily large with nohz_full. And without
nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
reported clock resolution.

Having a non-constant error bound is annoying, for you never quite know
what to expect.

Also; why do we care about PROCESS_CPUTIME? People should really not use
it. What are the 'valid' usecases you guys care about?
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote: For the N threads doing this on N cores case, seems rq-lock hammering will still be a source of major box wide pain. Is there any correctness reason to add up unaccounted -on_cpu beans, or is that just value added? That delta can be arbitrarily large with nohz_full. And without nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the reported clock resolution. Having a non-constant error bound is annoying for you never quite know what to expect. Also; why do we care about PROCESS_CPUTIME? People should really not use it. What are the 'valid' usecases you guys care about? pgpKXt9BDsAqg.pgp Description: PGP signature
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote: On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote: For the N threads doing this on N cores case, seems rq-lock hammering will still be a source of major box wide pain. Is there any correctness reason to add up unaccounted -on_cpu beans, or is that just value added? That delta can be arbitrarily large with nohz_full. And without nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the reported clock resolution. Having a non-constant error bound is annoying for you never quite know what to expect. Ah, yeah, that could get rather large. Also; why do we care about PROCESS_CPUTIME? People should really not use it. What are the 'valid' usecases you guys care about? I don't care much, said don't do that before I saw a similar big box problem had popped up with times(). -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, Aug 15, 2014 at 11:37:33AM +0200, Mike Galbraith wrote: On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote: On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote: For the N threads doing this on N cores case, seems rq-lock hammering will still be a source of major box wide pain. Is there any correctness reason to add up unaccounted -on_cpu beans, or is that just value added? That delta can be arbitrarily large with nohz_full. And without nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the reported clock resolution. Having a non-constant error bound is annoying for you never quite know what to expect. Ah, yeah, that could get rather large. Also; why do we care about PROCESS_CPUTIME? People should really not use it. What are the 'valid' usecases you guys care about? I don't care much, said don't do that before I saw a similar big box problem had popped up with times(). Urgh, yes times().. Now I don't think we do very accurate accounting of those particular numbers, so we could fudge some of that. Typically we only do TICK_NSEC granularity accounting on user/system divide anyhow, seeing how putting timestamp reads in the kerneluser switch is _expensive_ -- see NOHZ_FULL. pgp5yv9CvnsYp.pgp Description: PGP signature
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Frederic Weisbecker wrote: 2014-08-14 16:39 GMT+02:00 Oleg Nesterov o...@redhat.com: On 08/14, Frederic Weisbecker wrote: I mean the read side doesn't use a lock with seqlocks. It's only made of barriers and sequence numbers to ensure the reader doesn't read some half-complete update. But other than that it can as well see the update n - 1 since barriers don't enforce latest results. Yes, sure, read_seqcount_begin/read_seqcount_retry right after write_seqcount_begin-update-write_seqcount_begin can miss update part along with -sequence modifications. But I still can't understand how this can lead to non-monotonic results, could you spell? Well lets say clock = T. CPU 0 updates at T + 1. Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1 while CPU 1 still reads T. If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's possible that CPU 2 still sees T. With the spinlocked version that thing can't happen, the second round would read at least T + 1 for both CPUs. But this is fine? And CPU 2 doesn't see a non-monotonic result? OK, this could be wrong if, say, void print_clock(void) { lock(SOME_LOCK); printk(..., clock_gettime()); unlock(SOME_LOCK); } printed the non-monotonic numbers if print_clock() is called on CPU_1 and then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if they were already visible to CPU_1 under the same lock. IOW, int T = 0; /* can be incremented at any time */ void check_monotony(void) { static int t = 0; lock(SOME_LOCK); BUG(t T); T = t; unlock(SOME_LOCK); } must work corrrectly (ignoring overflow) even if T is changed without SOME_LOCK. Otherwise, without some sort of synchronization the different results on CPU_1/2 should be fine. Or I am still missing your point? Oleg. 
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote: @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) struct signal_struct *sig = tsk-signal; cputime_t utime, stime; struct task_struct *t; - - times-utime = sig-utime; - times-stime = sig-stime; - times-sum_exec_runtime = sig-sum_sched_runtime; + unsigned int seq, nextseq; rcu_read_lock(); - for_each_thread(tsk, t) { - task_cputime(t, utime, stime); - times-utime += utime; - times-stime += stime; - times-sum_exec_runtime += task_sched_runtime(t); - } + /* Attempt a lockless read on the first round. */ + nextseq = 0; + do { + seq = nextseq; + read_seqbegin_or_lock(sig-stats_lock, seq); + times-utime = sig-utime; + times-stime = sig-stime; + times-sum_exec_runtime = sig-sum_sched_runtime; + + for_each_thread(tsk, t) { + task_cputime(t, utime, stime); + times-utime += utime; + times-stime += stime; + times-sum_exec_runtime += task_sched_runtime(t); + } + /* + * If a writer is currently active, seq will be odd, and + * read_seqbegin_or_lock will take the lock. + */ + nextseq = raw_read_seqcount(sig-stats_lock.seqcount); + } while (need_seqretry(sig-stats_lock, seq)); + done_seqretry(sig-stats_lock, seq); rcu_read_unlock(); } I still think this is not right. Let me quote my previous email, @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) struct signal_struct *sig = tsk-signal; cputime_t utime, stime; struct task_struct *t; - - times-utime = sig-utime; - times-stime = sig-stime; - times-sum_exec_runtime = sig-sum_sched_runtime; + unsigned int seq, nextseq; rcu_read_lock(); Almost cosmetic nit, but afaics this patch expands the rcu critical section for no reason. We only need rcu_read_lock/unlock around for_each_thread() below. 
+ nextseq = 0; + do { + seq = nextseq; + read_seqbegin_or_lock(sig-stats_lock, seq); + times-utime = sig-utime; + times-stime = sig-stime; + times-sum_exec_runtime = sig-sum_sched_runtime; + + for_each_thread(tsk, t) { + task_cputime(t, utime, stime); + times-utime += utime; + times-stime += stime; + times-sum_exec_runtime += task_sched_runtime(t); + } + /* + * If a writer is currently active, seq will be odd, and + * read_seqbegin_or_lock will take the lock. + */ + nextseq = raw_read_seqcount(sig-stats_lock.seqcount); + } while (need_seqretry(sig-stats_lock, seq)); + done_seqretry(sig-stats_lock, seq); Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I mean, this code still can livelock in theory. Just suppose that anoter CPU does write_seqlock/write_sequnlock right after read_seqbegin_or_lock(). In this case seq 1 will be never true and thus or_lock will never happen. IMO, this should be fixed. Either we should guarantee the forward progress or we should not play with read_seqbegin_or_lock() at all. This code assumes that sooner or later nextseq = raw_read_seqcount() should return the odd counter, but in theory this can never happen. And if we want to fix this we do not need 2 counters, just we need to set seq = 1 manually after need_seqretry() == T. Say, like __dentry_path() does. (but unlike __dentry_path() we do not need to worry about rcu_read_unlock so the code will be simpler). I am wondering if it makes sense to introduce bool read_seqretry_or_lock(const seqlock_t *sl, int *seq) { if (*seq 1) { read_sequnlock_excl(lock); return false; } if (!read_seqretry(lock, *seq)) return false; *seq = 1; return true; } Or I missed your reply? Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Peter Zijlstra wrote: Also; why do we care about PROCESS_CPUTIME? People should really not use it. What are the 'valid' usecases you guys care about? I do not really know. IIUC, the problematic usecase is sys_times(). I agree with Mike, don't do this if you have a lot of threads. But perhaps the kernel can help to applications which already abuse times(). However, if we only want to make sys_times() more scalable(), then perhaps the lockless version of thread_group_cputime() makes more sense. And given that do_sys_times() uses current we can simplify it; is_dead is not possible and we do not need to take -siglock twice: void current_group_cputime(struct task_cputime *times) { struct task_struct *tsk = current, *t; struct spinlock_t *siglock = tsk-sighand-siglock; struct signal_struct *sig = tsk-signal; bool lockless = true; u64 exec; retry: spin_lock_irq(siglock); times-utime = sig-utime; times-stime = sig-stime; times-sum_exec_runtime = exec = sig-sum_sched_runtime; if (lockless) spin_unlock_irq(siglock); rcu_read_lock(); for_each_thread(tsk, t) { cputime_t utime, stime; task_cputime(t, utime, stime); times-utime += utime; times-stime += stime; times-sum_exec_runtime += task_sched_runtime(t); } rcu_read_unlock(); if (lockless) { lockless = false; spin_unlock_wait(siglock); smp_rmb(); if (exec != sig-sum_sched_runtime) goto retry; } else { spin_unlock_irq(siglock); } } Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Oleg Nesterov wrote: However, if we only want to make sys_times() more scalable(), then perhaps the lockless version of thread_group_cputime() makes more sense. And given that do_sys_times() uses current we can simplify it; is_dead is not possible and we do not need to take -siglock twice: void current_group_cputime(struct task_cputime *times) { struct task_struct *tsk = current, *t; struct spinlock_t *siglock = tsk-sighand-siglock; struct signal_struct *sig = tsk-signal; bool lockless = true; u64 exec; retry: spin_lock_irq(siglock); times-utime = sig-utime; times-stime = sig-stime; times-sum_exec_runtime = exec = sig-sum_sched_runtime; if (lockless) spin_unlock_irq(siglock); rcu_read_lock(); for_each_thread(tsk, t) { cputime_t utime, stime; task_cputime(t, utime, stime); times-utime += utime; times-stime += stime; times-sum_exec_runtime += task_sched_runtime(t); } rcu_read_unlock(); if (lockless) { lockless = false; spin_unlock_wait(siglock); smp_rmb(); if (exec != sig-sum_sched_runtime) goto retry; } else { spin_unlock_irq(siglock); } } Just in case... Yes, sure, seqlock_t stats_lock is more scalable. Just I do not know it's worth the trouble. Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15/2014 12:49 PM, Oleg Nesterov wrote:
> Just in case... Yes, sure, seqlock_t stats_lock is more scalable. Just I
> do not know whether it's worth the trouble.

If we don't know whether it is worth the trouble, it is probably best to
stick to a well-known generic locking algorithm, instead of brewing our
own and trying to maintain it.

I have fixed the other locking issue you pointed out, Oleg. Now to see if
this change to cputime_adjust() does the trick :)

+++ b/kernel/sched/cputime.c
@@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
 	 * If the tick based count grows faster than the scheduler one,
 	 * the result of the scaling may go backward.
 	 * Let's enforce monotonicity.
+	 * Atomic exchange protects against concurrent cputime_adjust.
 	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, utime);
+	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+		cmpxchg(&prev->stime, rtime, stime);
+	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+		cmpxchg(&prev->utime, rtime, utime);
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/15, Rik van Riel wrote:
>
> On 08/15/2014 12:49 PM, Oleg Nesterov wrote:
> > Just in case... Yes, sure, seqlock_t stats_lock is more scalable. Just
> > I do not know whether it's worth the trouble.
>
> If we don't know whether it is worth the trouble, it is probably best to
> stick to a well-known generic locking algorithm, instead of brewing our
> own and trying to maintain it.

Perhaps. I am obviously biased and can't judge ;) Plus, again, I do
understand that your approach has some advantages too.

> Now to see if this change to cputime_adjust() does the trick :)
>
> +++ b/kernel/sched/cputime.c
> @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
>  	 * If the tick based count grows faster than the scheduler one,
>  	 * the result of the scaling may go backward.
>  	 * Let's enforce monotonicity.
> +	 * Atomic exchange protects against concurrent cputime_adjust.
>  	 */
> -	prev->stime = max(prev->stime, stime);
> -	prev->utime = max(prev->utime, utime);
> +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> +		cmpxchg(&prev->stime, rtime, stime);
> +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> +		cmpxchg(&prev->utime, rtime, utime);

Yes, perhaps we need something like this in any case. To remind, at least
do_task_stat() calls task_cputime_adjusted() lockless, although we could
fix this separately.

But I do not think the change above is enough. With this change
cputime_adjust() can race with itself. Yes, this guarantees monotonicity
even if it is called lockless, but this can lead to obviously inconsistent
numbers. And I don't think we can ignore this.

If we could, then we could remove the scale_stime recalculation and change
cputime_adjust() to simply do:

	static void cputime_adjust(struct task_cputime *curr,
				   struct cputime *prev,
				   cputime_t *ut, cputime_t *st)
	{
		/* enforce monotonicity */
		*ut = prev->utime = max(prev->utime, curr->utime);
		*st = prev->stime = max(prev->stime, curr->stime);
	}

Yes, we have this problem either way.

And personally I think that this "enforce monotonicity" logic is pointless,
userspace could take care, but it is too late to complain.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Fri, Aug 15, 2014 at 04:26:01PM +0200, Oleg Nesterov wrote:
> On 08/15, Frederic Weisbecker wrote:
> >
> > 2014-08-14 16:39 GMT+02:00 Oleg Nesterov o...@redhat.com:
> > > On 08/14, Frederic Weisbecker wrote:
> > >>
> > >> I mean the read side doesn't use a lock with seqlocks. It's only made
> > >> of barriers and sequence numbers to ensure the reader doesn't read
> > >> some half-complete update. But other than that it can as well see the
> > >> update n - 1 since barriers don't enforce latest results.
> > >
> > > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > > write_seqcount_begin-update-write_seqcount_end can miss "update" part
> > > along with ->sequence modifications.
> > >
> > > But I still can't understand how this can lead to non-monotonic
> > > results, could you spell?
> >
> > Well lets say clock = T.
> > CPU 0 updates at T + 1.
> > Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> > while CPU 2 still reads T.
> > If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> > possible that CPU 2 still sees T. With the spinlocked version that
> > thing can't happen, the second round would read at least T + 1 for
> > both CPUs.
>
> But this is fine? And CPU 2 doesn't see a non-monotonic result?
>
> OK, this could be wrong if, say,
>
>	void print_clock(void)
>	{
>		lock(SOME_LOCK);
>		printk(..., clock_gettime());
>		unlock(SOME_LOCK);
>	}
>
> printed the non-monotonic numbers if print_clock() is called on CPU_1 and
> then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
> they were already visible to CPU_1 under the same lock. IOW,
>
>	int T = 0;	/* can be incremented at any time */
>
>	void check_monotony(void)
>	{
>		static int t = 0;
>
>		lock(SOME_LOCK);
>		BUG(t > T);
>		t = T;
>		unlock(SOME_LOCK);
>	}
>
> must work correctly (ignoring overflow) even if T is changed without
> SOME_LOCK.
>
> Otherwise, without some sort of synchronization the different results on
> CPU_1/2 should be fine.
>
> Or I am still missing your point?

No I think you're right, as long as ordering against something else is
involved, monotonicity is enforced.

Now I'm trying to think about a case where SMP ordering isn't involved.
Perhaps some usecase based on coupling CPU local clocks and clock_gettime()
where a drift between both can appear. Now using a local clock probably
only makes sense in the context of local usecases where the thread clock
update would be local as well. So that's probably not a problem.

Now what if somebody couples multithread process wide clocks with per CPU
local clocks. Well that's probably too foolish to be considered.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Thu, 2014-08-14 at 19:48 +0200, Oleg Nesterov wrote:
> On 08/14, Oleg Nesterov wrote:
> >
> > OK, lets forget about alternative approach for now. We can reconsider
> > it later. At least I have to admit that seqlock is more straightforward.
>
> Yes.
>
> But just for record, the "lockless" version doesn't look that bad to me,
>
>	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>	{
>		struct signal_struct *sig = tsk->signal;
>		bool lockless, is_dead;
>		struct task_struct *t;
>		unsigned long flags;
>		u64 exec;
>
>		lockless = true;
>		is_dead = !lock_task_sighand(p, &flags);
>	 retry:
>		times->utime = sig->utime;
>		times->stime = sig->stime;
>		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>		if (is_dead)
>			return;
>
>		if (lockless)
>			unlock_task_sighand(p, &flags);
>
>		rcu_read_lock();
>		for_each_thread(tsk, t) {
>			cputime_t utime, stime;
>			task_cputime(t, &utime, &stime);
>			times->utime += utime;
>			times->stime += stime;
>			times->sum_exec_runtime += task_sched_runtime(t);
>		}
>		rcu_read_unlock();
>
>		if (lockless) {
>			lockless = false;
>			is_dead = !lock_task_sighand(p, &flags);
>			if (is_dead || exec != sig->sum_sched_runtime)
>				goto retry;
>		}
>		unlock_task_sighand(p, &flags);
>	}
>
> The obvious problem is that we should shift lock_task_sighand() from the
> callers to thread_group_cputime() first, or add
> thread_group_cputime_lockless() and change the current users one by one.
>
> And of course, stats_lock is more generic.

Yours looks nice to me, particularly in that it doesn't munge structure
layout, could perhaps be backported to fix up production kernels.

For the N threads doing this on N cores case, seems rq->lock hammering will
still be a source of major box wide pain. Is there any correctness reason
to add up unaccounted ->on_cpu beans, or is that just value added? Seems to
me it can't matter, as you traverse, what you added up on previous threads
becomes ever more stale as you proceed, so big boxen would be better off
not doing that.

	-Mike
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
2014-08-14 16:39 GMT+02:00 Oleg Nesterov :
> On 08/14, Frederic Weisbecker wrote:
>>
>> 2014-08-14 3:57 GMT+02:00 Rik van Riel :
>> >
>> > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
>> >> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> >>
>> >> I'm worried about such lockless solution based on RCU or read
>> >> seqcount because we lose the guarantee that an update is
>> >> immediately visible by all subsequent readers.
>> >>
>> >> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
>> >> after that call clock_gettime(), with the spinlock we were
>> >> guaranteed to see the new update. Now with a pure seqlock read
>> >> approach, we guarantee a read sequence coherency but we don't
>> >> guarantee the freshest update result.
>> >>
>> >> So that looks like a source of non monotonic results.
>> >
>> > Which update are you worried about, specifically?
>> >
>> > The seq_write_lock to update the usage stat in p->signal will lock out
>> > the seqlock read side used to check those results.
>> >
>> > Is there another kind of thing read by cpu_clock_sample_group that you
>> > believe is not excluded by the seq_lock?
>>
>> I mean the read side doesn't use a lock with seqlocks. It's only made
>> of barriers and sequence numbers to ensure the reader doesn't read
>> some half-complete update. But other than that it can as well see the
>> update n - 1 since barriers don't enforce latest results.
>
> Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> write_seqcount_begin-update-write_seqcount_end can miss "update" part
> along with ->sequence modifications.
>
> But I still can't understand how this can lead to non-monotonic results,
> could you spell?

Well lets say clock = T.
CPU 0 updates at T + 1.
Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
while CPU 2 still reads T.

If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
possible that CPU 2 still sees T. With the spinlocked version that
thing can't happen, the second round would read at least T + 1 for
both CPUs.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Thu, 14 Aug 2014 18:12:47 +0200
Oleg Nesterov wrote:

> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process is called from inside the critical section. This looks
> simpler at first glance.
>
> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> accounted unconditionally, even if the group leader exits.

OK, this is what I have now. I am still getting backwards time sometimes,
but only tiny increments. This suggests that cputime_adjust() may be the
culprit, and I have no good idea on how to fix that yet...

Should task_cputime_adjusted and thread_group_cputime_adjusted pass in the
address of a seqlock to use in case the values in prev need to be updated?

Should we check whether the values in prev changed during the time spent
in the function?

Is this a race between task_cputime_adjusted and other writers of
signal->utime and signal->stime, instead of task_cputime_adjusted racing
with itself?

I am not sure what the best approach here is...

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a lot,
because the calling function already ensures tsk sticks around, and the
rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel 
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  | 48 +++---
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 
 6 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..c1a0ef2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,32 +115,34 @@ static void __exit_signal(struct task_struct *tsk)
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
-		/*
-		 * Accumulate here the counters for all threads but the
-		 * group leader as they die, so they can be added into
-		 * the process-wide totals when those are taken.
-		 * The group leader stays around as a zombie as long
-		 * as there are other threads. When it gets reaped,
-		 * the exit.c code will add its counts into these totals.
-		 * We won't ever get here for the group leader, since it
-		 * will have been the last reference on the signal_struct.
-		 */
-		task_cputime(tsk, &utime, &stime);
-		sig->utime += utime;
-		sig->stime += stime;
-		sig->gtime += task_gtime(tsk);
-		sig->min_flt += tsk->min_flt;
-		sig->maj_flt += tsk->maj_flt;
-		sig->nvcsw += tsk->nvcsw;
-		sig->nivcsw += tsk->nivcsw;
-		sig->inblock += task_io_get_inblock(tsk);
-		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
-		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
+	/*
+	 * Accumulate here the counters for all threads but the
+	 * group leader as they die, so they can be added into
+	 * the process-wide totals when those are taken.
+	 * The group leader stays around as a zombie as long
+	 * as there are other threads. When it gets reaped,
+	 * the exit.c code will add its counts into these totals.
+	 * We won't ever get here for the group leader, since it
+	 * will have been the last reference on the signal_struct.
+	 */
+
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> > On 08/14, Rik van Riel wrote:
> >>
> >> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> >>>
> >>> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> >>> __unhash_process is called from inside the critical section. This
> >>> looks simpler at first glance.
> >>
> >> The problem with that is that wait_task_zombie() calls
> >> thread_group_cputime_adjusted() in that if() branch, and
> >> that code ends up taking the seqlock for read...
> >
> > Not sure I understand... This modifies parent->signal->c* counters,
> > and obviously the exiting thread is not the member of parent's thread
> > group, so thread_group_cputime_adjusted(parent) can never account the
> > exiting child twice simply because it won't see it?
>
> You are right, the tree of processes only goes one way,
> so there should be no deadlock in taking psig->stats_lock
> and having thread_group_cputime_adjusted take sig->stats_lock
> for read within that section.
>
> However, it might need some lockdep annotation to keep
> lockdep from thinking we might take the same lock recursively :)

But wait_task_zombie() can (and should) call
thread_group_cputime_adjusted(zombie_child) outside of parent's ->siglock
or ->stats_lock, so this should be safe.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
>>
>> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
>>>
>>> Or you can expand the scope of write_seqlock/write_sequnlock, so that
>>> __unhash_process is called from inside the critical section. This looks
>>> simpler at first glance.
>>
>> The problem with that is that wait_task_zombie() calls
>> thread_group_cputime_adjusted() in that if() branch, and
>> that code ends up taking the seqlock for read...
>
> Not sure I understand... This modifies parent->signal->c* counters,
> and obviously the exiting thread is not the member of parent's thread
> group, so thread_group_cputime_adjusted(parent) can never account the
> exiting child twice simply because it won't see it?

You are right, the tree of processes only goes one way,
so there should be no deadlock in taking psig->stats_lock
and having thread_group_cputime_adjusted take sig->stats_lock
for read within that section.

However, it might need some lockdep annotation to keep
lockdep from thinking we might take the same lock recursively :)

>> However, in __exit_signal that approach should work.
>
> Yes,
>
>>> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
>>> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
>>> accounted unconditionally, even if the group leader exits.
>>>
>>> Probably this is not a big problem, and sys_times() or clock_gettime()
>>> do not care at all because they use current.
>>>
>>> But without this change thread_group_cputime(reaped_zombie) won't look
>>> at this task_struct at all, this can lead to non-monotonic result if
>>> it was previously called when this task was alive (non-reaped).
>>
>> You mean this whole block needs to run regardless of whether
>> the group is dead?
>>
>>	task_cputime(tsk, &utime, &stime);
>>	write_seqlock(&sig->stats_lock);
>>	sig->utime += utime;
>>	sig->stime += stime;
>>	sig->gtime += task_gtime(tsk);
>>	sig->min_flt += tsk->min_flt;
>>	sig->maj_flt += tsk->maj_flt;
>>	sig->nvcsw += tsk->nvcsw;
>>	sig->nivcsw += tsk->nivcsw;
>>	sig->inblock += task_io_get_inblock(tsk);
>>	sig->oublock += task_io_get_oublock(tsk);
>>	task_io_accounting_add(&sig->ioac, &tsk->ioac);
>>	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
>
> Yes.

Let me give that a try and see what happens :)

>> How does that square with wait_task_zombie reaping the
>> statistics of the whole group with thread_group_cputime_adjusted()
>> when the group leader is exiting?
>
> Again, not sure I understand... thread_group_cputime_adjusted() in
> wait_task_zombie() is fine in any case. Nobody but us can reap this
> zombie.
>
> It seems that we misunderstood each other, let me try again. Just to
> simplify, suppose we have, say,
>
>	sys_times_by_pid(pid, ...)
>	{
>		rcu_read_lock();
>		task = find_task_by_vpid(pid);
>		if (task)
>			get_task_struct(task);
>		rcu_read_unlock();
>
>		if (!task)
>			return -ESRCH;
>
>		thread_group_cputime(task, ...);
>		copy_to_user();
>		return 0;
>	}
>
> Note that this task can exit right after rcu_read_unlock(), and it can
> be also reaped (by its parent or by itself) and removed from the thread
> list. In this case for_each_thread() will see no threads, and thus it
> will only read task->signal->*time.
>
> This means that sys_times_by_pid() can simply return the wrong result
> instead of failure. Say, it can even return "all zeros" if this task was
> single-threaded.

Ahh, that makes sense.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Oleg Nesterov wrote:
>
> But just for record, the "lockless" version doesn't look that bad to me,
>
>	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>	{
>		struct signal_struct *sig = tsk->signal;
>		bool lockless, is_dead;
>		struct task_struct *t;
>		unsigned long flags;
>		u64 exec;
>
>		lockless = true;
>		is_dead = !lock_task_sighand(p, &flags);
>	 retry:
>		times->utime = sig->utime;
>		times->stime = sig->stime;
>		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>		if (is_dead)
>			return;
>
>		if (lockless)
>			unlock_task_sighand(p, &flags);
>
>		rcu_read_lock();
>		for_each_thread(tsk, t) {
>			cputime_t utime, stime;
>			task_cputime(t, &utime, &stime);
>			times->utime += utime;
>			times->stime += stime;
>			times->sum_exec_runtime += task_sched_runtime(t);
>		}
>		rcu_read_unlock();
>
>		if (lockless) {
>			lockless = false;
>			is_dead = !lock_task_sighand(p, &flags);
>			if (is_dead || exec != sig->sum_sched_runtime)
>				goto retry;
>		}
>		unlock_task_sighand(p, &flags);
>	}
>
> The obvious problem is that we should shift lock_task_sighand() from the
> callers to thread_group_cputime() first, or add
> thread_group_cputime_lockless() and change the current users one by one.

OTOH, it is simple to convert do_sys_times() and posix_cpu_clock_get_task()
to use the lockless version, and avoid the new stats_lock and other changes
it needs.

> And of course, stats_lock is more generic.

Yes, this is true in any case. So I simply do not know.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> >
> > Or you can expand the scope of write_seqlock/write_sequnlock, so that
> > __unhash_process is called from inside the critical section. This looks
> > simpler at first glance.
>
> The problem with that is that wait_task_zombie() calls
> thread_group_cputime_adjusted() in that if() branch, and
> that code ends up taking the seqlock for read...

Not sure I understand... This modifies parent->signal->c* counters,
and obviously the exiting thread is not the member of parent's thread
group, so thread_group_cputime_adjusted(parent) can never account the
exiting child twice simply because it won't see it?

> However, in __exit_signal that approach should work.

Yes,

> > Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> > need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> > accounted unconditionally, even if the group leader exits.
> >
> > Probably this is not a big problem, and sys_times() or clock_gettime()
> > do not care at all because they use current.
> >
> > But without this change thread_group_cputime(reaped_zombie) won't look
> > at this task_struct at all, this can lead to non-monotonic result if
> > it was previously called when this task was alive (non-reaped).
>
> You mean this whole block needs to run regardless of whether
> the group is dead?
>
>	task_cputime(tsk, &utime, &stime);
>	write_seqlock(&sig->stats_lock);
>	sig->utime += utime;
>	sig->stime += stime;
>	sig->gtime += task_gtime(tsk);
>	sig->min_flt += tsk->min_flt;
>	sig->maj_flt += tsk->maj_flt;
>	sig->nvcsw += tsk->nvcsw;
>	sig->nivcsw += tsk->nivcsw;
>	sig->inblock += task_io_get_inblock(tsk);
>	sig->oublock += task_io_get_oublock(tsk);
>	task_io_accounting_add(&sig->ioac, &tsk->ioac);
>	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

Yes.

> How does that square with wait_task_zombie reaping the
> statistics of the whole group with thread_group_cputime_adjusted()
> when the group leader is exiting?

Again, not sure I understand... thread_group_cputime_adjusted() in
wait_task_zombie() is fine in any case. Nobody but us can reap this
zombie.

It seems that we misunderstood each other, let me try again. Just to
simplify, suppose we have, say,

	sys_times_by_pid(pid, ...)
	{
		rcu_read_lock();
		task = find_task_by_vpid(pid);
		if (task)
			get_task_struct(task);
		rcu_read_unlock();

		if (!task)
			return -ESRCH;

		thread_group_cputime(task, ...);
		copy_to_user();
		return 0;
	}

Note that this task can exit right after rcu_read_unlock(), and it can
be also reaped (by its parent or by itself) and removed from the thread
list. In this case for_each_thread() will see no threads, and thus it
will only read task->signal->*time.

This means that sys_times_by_pid() can simply return the wrong result
instead of failure. Say, it can even return "all zeros" if this task was
single-threaded.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Oleg Nesterov wrote:
>
> OK, lets forget about alternative approach for now. We can reconsider
> it later. At least I have to admit that seqlock is more straightforward.

Yes.

But just for record, the "lockless" version doesn't look that bad to me,

	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
	{
		struct signal_struct *sig = tsk->signal;
		bool lockless, is_dead;
		struct task_struct *t;
		unsigned long flags;
		u64 exec;

		lockless = true;
		is_dead = !lock_task_sighand(p, &flags);
	 retry:
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
		if (is_dead)
			return;

		if (lockless)
			unlock_task_sighand(p, &flags);

		rcu_read_lock();
		for_each_thread(tsk, t) {
			cputime_t utime, stime;
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
		rcu_read_unlock();

		if (lockless) {
			lockless = false;
			is_dead = !lock_task_sighand(p, &flags);
			if (is_dead || exec != sig->sum_sched_runtime)
				goto retry;
		}
		unlock_task_sighand(p, &flags);
	}

The obvious problem is that we should shift lock_task_sighand() from the
callers to thread_group_cputime() first, or add
thread_group_cputime_lockless() and change the current users one by one.

And of course, stats_lock is more generic.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
>>
>> On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
>>> On 08/13, Rik van Riel wrote:
>>>>
>>>> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
>>>> {
>>>> 	cputime_t tgutime, tgstime, cutime, cstime;
>>>>
>>>> -	spin_lock_irq(&current->sighand->siglock);
>>>> 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
>>>> 	cutime = current->signal->cutime;
>>>> 	cstime = current->signal->cstime;
>>>> -	spin_unlock_irq(&current->sighand->siglock);
>>>
>>> Ah, wait, there is another problem afaics...
>>
>> Last night I worked on another problem with this code.
>>
>> After propagating the stats from a dying task to the signal struct,
>> we need to make sure that that task's stats are not counted twice.
>
> Heh indeed ;) Can't understand how I missed that.
>
>> This requires zeroing the stats under the write_seqlock, which was
>> easy enough to add.
>
> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process is called from inside the critical section. This looks
> simpler at first glance.

The problem with that is that wait_task_zombie() calls
thread_group_cputime_adjusted() in that if() branch, and
that code ends up taking the seqlock for read...

However, in __exit_signal that approach should work.

> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> accounted unconditionally, even if the group leader exits.
>
> Probably this is not a big problem, and sys_times() or clock_gettime()
> do not care at all because they use current.
>
> But without this change thread_group_cputime(reaped_zombie) won't look
> at this task_struct at all, this can lead to non-monotonic result if
> it was previously called when this task was alive (non-reaped).

You mean this whole block needs to run regardless of whether
the group is dead?

	task_cputime(tsk, &utime, &stime);
	write_seqlock(&sig->stats_lock);
	sig->utime += utime;
	sig->stime += stime;
	sig->gtime += task_gtime(tsk);
	sig->min_flt += tsk->min_flt;
	sig->maj_flt += tsk->maj_flt;
	sig->nvcsw += tsk->nvcsw;
	sig->nivcsw += tsk->nivcsw;
	sig->inblock += task_io_get_inblock(tsk);
	sig->oublock += task_io_get_oublock(tsk);
	task_io_accounting_add(&sig->ioac, &tsk->ioac);
	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

How does that square with wait_task_zombie reaping the
statistics of the whole group with thread_group_cputime_adjusted()
when the group leader is exiting? Could that lead to things being
double-counted?

Or do you mean ONLY ->sum_sched_runtime is unconditionally accounted
in __exit_signal(), because wait_task_zombie() seems to be missing
that one?
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
> > On 08/13, Rik van Riel wrote:
> >>
> >> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
> >> {
> >> 	cputime_t tgutime, tgstime, cutime, cstime;
> >>
> >> -	spin_lock_irq(&current->sighand->siglock);
> >> 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
> >> 	cutime = current->signal->cutime;
> >> 	cstime = current->signal->cstime;
> >> -	spin_unlock_irq(&current->sighand->siglock);
> >
> > Ah, wait, there is another problem afaics...
>
> Last night I worked on another problem with this code.
>
> After propagating the stats from a dying task to the signal struct,
> we need to make sure that that task's stats are not counted twice.

Heh indeed ;) Can't understand how I missed that.

> This requires zeroing the stats under the write_seqlock, which was
> easy enough to add.

Or you can expand the scope of write_seqlock/write_sequnlock, so that
__unhash_process is called from inside the critical section. This looks
simpler at first glance.

Hmm, wait, it seems there is yet another problem ;) Afaics, you also
need to modify __exit_signal() so that ->sum_sched_runtime/etc are
accounted unconditionally, even if the group leader exits.

Probably this is not a big problem, and sys_times() or clock_gettime()
do not care at all because they use current.

But without this change thread_group_cputime(reaped_zombie) won't look
at this task_struct at all, this can lead to non-monotonic result if
it was previously called when this task was alive (non-reaped).

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
> On 08/13, Rik van Riel wrote:
>>
>> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
>> {
>> 	cputime_t tgutime, tgstime, cutime, cstime;
>>
>> -	spin_lock_irq(&current->sighand->siglock);
>> 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
>> 	cutime = current->signal->cutime;
>> 	cstime = current->signal->cstime;
>> -	spin_unlock_irq(&current->sighand->siglock);
>
> Ah, wait, there is another problem afaics...

Last night I worked on another problem with this code.

After propagating the stats from a dying task to the signal struct,
we need to make sure that that task's stats are not counted twice.

This requires zeroing the stats under the write_seqlock, which was
easy enough to add. We cannot rely on any state in the task that was
set outside of the write_seqlock...

> thread_group_cputime_adjusted()->cputime_adjust() plays with
> signal->prev_cputime and thus it needs siglock or stats_lock to
> ensure it can't race with itself. Not sure it is safe to simply
> take the lock in cputime_adjust(), this should be checked.
>
> OTOH, do_task_stat() already calls task_cputime_adjusted() lockless
> and this looks wrong or I missed something. So perhaps we need a
> lock in or around cputime_adjust() anyway.

I'll take a look at this.
--
All rights reversed
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Frederic Weisbecker wrote:
>
> 2014-08-14 3:57 GMT+02:00 Rik van Riel :
> > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> >> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> >>
> >> I'm worried about such a lockless solution based on RCU or read
> >> seqcount because we lose the guarantee that an update is
> >> immediately visible by all subsequent readers.
> >>
> >> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
> >> after that call clock_gettime(). With the spinlock we were
> >> guaranteed to see the new update. Now with a pure seqlock read
> >> approach, we guarantee read sequence coherency but we don't
> >> guarantee the freshest update result.
> >>
> >> So that looks like a source of non-monotonic results.
> >
> > Which update are you worried about, specifically?
> >
> > The seq_write_lock to update the usage stat in p->signal will lock out
> > the seqlock read side used to check those results.
> >
> > Is there another kind of thing read by cpu_clock_sample_group that you
> > believe is not excluded by the seq_lock?
>
> I mean the read side doesn't use a lock with seqlocks. It's only made
> of barriers and sequence numbers to ensure the reader doesn't read
> some half-complete update. But other than that it can as well see
> update n - 1 since barriers don't enforce latest results.

Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
write_seqcount_begin-update-write_seqcount_end can miss the "update" part
along with the ->sequence modifications.

But I still can't understand how this can lead to non-monotonic results,
could you spell?

Oleg.
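The property both sides are circling — a seqcount read gives a *consistent* snapshot, but not necessarily the *latest* one — is easy to see in a minimal userspace model. The names below are hypothetical; the kernel's real primitives live in `<linux/seqlock.h>` and this sketch only models their sequence-number discipline with C11 atomics.

```c
#include <stdatomic.h>

/* Minimal userspace model of the seqcount read protocol (hypothetical
 * names).  The retry loop guarantees only that utime and stime come
 * from the *same* update; nothing forces a reader on another CPU to
 * see the very latest update rather than update n - 1. */

struct cstats {
    atomic_uint   seq;      /* even = quiescent, odd = writer active */
    unsigned long utime;
    unsigned long stime;
};

static void cstats_update(struct cstats *s, unsigned long u, unsigned long st)
{
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_acquire); /* odd */
    s->utime = u;
    s->stime = st;
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release); /* even */
}

static void cstats_read(struct cstats *s, unsigned long *u, unsigned long *st)
{
    unsigned int seq;
    do {
        do {    /* wait until no writer is mid-update */
            seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        } while (seq & 1);
        *u = s->utime;
        *st = s->stime;
        /* retry if a writer ran while we were copying */
    } while (atomic_load_explicit(&s->seq, memory_order_acquire) != seq);
}
```

Nothing in `cstats_read()` orders two different readers against each other, which is exactly the "CPU 1 sees T+1 while CPU 2 still sees T" scenario; Oleg's point in the reply is that this is only a problem if the two reads are themselves ordered by some other synchronization.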
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/13, Rik van Riel wrote:
>
> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
> {
> 	cputime_t tgutime, tgstime, cutime, cstime;
>
> -	spin_lock_irq(&current->sighand->siglock);
> 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
> 	cutime = current->signal->cutime;
> 	cstime = current->signal->cstime;
> -	spin_unlock_irq(&current->sighand->siglock);

Ah, wait, there is another problem afaics...

thread_group_cputime_adjusted()->cputime_adjust() plays with
signal->prev_cputime and thus it needs siglock or stats_lock to
ensure it can't race with itself. Not sure it is safe to simply
take the lock in cputime_adjust(), this should be checked.

OTOH, do_task_stat() already calls task_cputime_adjusted() lockless
and this looks wrong or I missed something. So perhaps we need a
lock in or around cputime_adjust() anyway.

Oleg.
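For context, the reason cputime_adjust() "plays with" signal->prev_cputime is that it clamps the reported times so they never go backwards, which makes it a read-modify-write on a shared cache. The sketch below is a hypothetical userspace model of just that clamp (the real function also scales utime/stime against sum_exec_runtime, which is omitted here); it shows why two unsynchronized callers racing on the cache could publish regressing results.

```c
#include <stdatomic.h>

/* Model of the monotonic clamp in cputime_adjust() (hypothetical
 * names, scaling logic omitted).  Without the exclusion, two
 * concurrent callers could both read the old prev values and publish
 * results that go backwards relative to each other. */

struct prev_cputime {
    atomic_flag   lock;     /* stands in for siglock / stats_lock */
    unsigned long utime;
    unsigned long stime;
};

static void cputime_adjust_model(struct prev_cputime *p,
                                 unsigned long raw_u, unsigned long raw_s,
                                 unsigned long *u, unsigned long *s)
{
    while (atomic_flag_test_and_set(&p->lock))
        ;                                    /* exclude other adjusters */
    if (raw_u < p->utime) raw_u = p->utime;  /* never report less than before */
    if (raw_s < p->stime) raw_s = p->stime;
    p->utime = raw_u;                        /* remember what we reported */
    p->stime = raw_s;
    atomic_flag_clear(&p->lock);
    *u = raw_u;
    *s = raw_s;
}
```

Calling this with raw samples (10, 10) and then (8, 12) reports (10, 10) and then (10, 12): the second utime sample is clamped up to the cached value rather than regressing.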
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Frederic Weisbecker wrote:
>
> 2014-08-14 15:22 GMT+02:00 Oleg Nesterov :
> > On 08/13, Rik van Riel wrote:
> >>
> >> @@ -646,6 +646,7 @@ struct signal_struct {
> >> 	 * Live threads maintain their own counters and add to these
> >> 	 * in __exit_signal, except for the group leader.
> >> 	 */
> >> +	seqlock_t stats_lock;
> >
> > Ah. Somehow I thought that you were going to use seqcount_t and fallback
> > to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
> > seqlock_t.
> >
> > OK, I won't argue, this can make the seqbegin_or_lock simpler...
>
> Is this really needed? seqlocks are useful when we have concurrent
> updaters. But updaters of thread stats should be under the thread lock
> already, right? If we have only one updater at a time, seqcount should
> be enough.

Yes, this is what I meant. Although I can see 2 reasons to use seqlock_t:

1. It can simplify the seqbegin-or-lock logic. If nothing else, you
   simply can't use read_seqbegin_or_lock() to take ->siglock. But this
   is just syntactic sugar.

2. If we use ->siglock in the fallback path, we need to verify that
   thread_group_cputime() is never called with ->siglock held first.

   Or, we need a fat comment to explain that need_seqretry == T is not
   possible if it is called under ->siglock, and thus the "fallback to
   lock_task_sighand" must always be safe. But in this case we need to
   ensure that the caller didn't do write_seqcount_begin().

So perhaps seqlock_t makes more sense, at least initially...

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
2014-08-14 15:22 GMT+02:00 Oleg Nesterov :
> On 08/13, Rik van Riel wrote:
>>
>> On Wed, 13 Aug 2014 20:45:11 +0200
>> Oleg Nesterov wrote:
>>
>> > That said, it is not that I am really sure that seqcount_t in ->signal
>> > is actually worse, not to mention that this is subjective anyway. IOW,
>> > I am not going to really fight with your approach ;)
>>
>> This is what it looks like, on top of your for_each_thread series
>> from yesterday:
>
> OK, let's forget about the alternative approach for now. We can reconsider
> it later. At least I have to admit that seqlock is more straightforward.
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -646,6 +646,7 @@ struct signal_struct {
>> 	 * Live threads maintain their own counters and add to these
>> 	 * in __exit_signal, except for the group leader.
>> 	 */
>> +	seqlock_t stats_lock;
>
> Ah. Somehow I thought that you were going to use seqcount_t and fallback
> to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
> seqlock_t.
>
> OK, I won't argue, this can make the seqbegin_or_lock simpler...

Is this really needed? seqlocks are useful when we have concurrent
updaters. But updaters of thread stats should be under the thread lock
already, right? If we have only one updater at a time, seqcount should
be enough.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
2014-08-14 3:57 GMT+02:00 Rik van Riel :
> On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
>> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>>> --- a/kernel/time/posix-cpu-timers.c
>>> +++ b/kernel/time/posix-cpu-timers.c
>>> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>>> 		if (same_thread_group(tsk, current))
>>> 			err = cpu_clock_sample(which_clock, tsk, &rtn);
>>> 	} else {
>>> -		unsigned long flags;
>>> -		struct sighand_struct *sighand;
>>> -
>>> -		/*
>>> -		 * while_each_thread() is not yet entirely RCU safe,
>>> -		 * keep locking the group while sampling process
>>> -		 * clock for now.
>>> -		 */
>>> -		sighand = lock_task_sighand(tsk, &flags);
>>> -		if (!sighand)
>>> -			return err;
>>> -
>>> 		if (tsk == current || thread_group_leader(tsk))
>>> 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
>>> -
>>> -		unlock_task_sighand(tsk, &flags);
>>> 	}
>>
>> I'm worried about such a lockless solution based on RCU or read
>> seqcount because we lose the guarantee that an update is
>> immediately visible by all subsequent readers.
>>
>> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
>> after that call clock_gettime(). With the spinlock we were
>> guaranteed to see the new update. Now with a pure seqlock read
>> approach, we guarantee read sequence coherency but we don't
>> guarantee the freshest update result.
>>
>> So that looks like a source of non-monotonic results.
>
> Which update are you worried about, specifically?
>
> The seq_write_lock to update the usage stat in p->signal will lock out
> the seqlock read side used to check those results.
>
> Is there another kind of thing read by cpu_clock_sample_group that you
> believe is not excluded by the seq_lock?

I mean the read side doesn't use a lock with seqlocks. It's only made
of barriers and sequence numbers to ensure the reader doesn't read
some half-complete update. But other than that it can as well see
update n - 1 since barriers don't enforce latest results.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/13, Rik van Riel wrote:
>
> On Wed, 13 Aug 2014 20:45:11 +0200
> Oleg Nesterov wrote:
>
> > That said, it is not that I am really sure that seqcount_t in ->signal
> > is actually worse, not to mention that this is subjective anyway. IOW,
> > I am not going to really fight with your approach ;)
>
> This is what it looks like, on top of your for_each_thread series
> from yesterday:

OK, let's forget about the alternative approach for now. We can reconsider
it later. At least I have to admit that seqlock is more straightforward.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -646,6 +646,7 @@ struct signal_struct {
> 	 * Live threads maintain their own counters and add to these
> 	 * in __exit_signal, except for the group leader.
> 	 */
> +	seqlock_t stats_lock;

Ah. Somehow I thought that you were going to use seqcount_t and fallback
to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
seqlock_t.

OK, I won't argue, this can make the seqbegin_or_lock simpler...

> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 	struct signal_struct *sig = tsk->signal;
> 	cputime_t utime, stime;
> 	struct task_struct *t;
> -
> -	times->utime = sig->utime;
> -	times->stime = sig->stime;
> -	times->sum_exec_runtime = sig->sum_sched_runtime;
> +	unsigned int seq, nextseq;
>
> 	rcu_read_lock();

Almost cosmetic nit, but afaics this patch expands the rcu critical
section for no reason. We only need rcu_read_lock/unlock around
for_each_thread() below.

> +	nextseq = 0;
> +	do {
> +		seq = nextseq;
> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		for_each_thread(tsk, t) {
> +			task_cputime(t, &utime, &stime);
> +			times->utime += utime;
> +			times->stime += stime;
> +			times->sum_exec_runtime += task_sched_runtime(t);
> +		}
> +		/*
> +		 * If a writer is currently active, seq will be odd, and
> +		 * read_seqbegin_or_lock will take the lock.
> +		 */
> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> +	} while (need_seqretry(&sig->stats_lock, seq));
> +	done_seqretry(&sig->stats_lock, seq);

Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I mean,
this code still can livelock in theory. Just suppose that another CPU does
write_seqlock/write_sequnlock right after read_seqbegin_or_lock(). In this
case "seq & 1" will never be true and thus "or_lock" will never happen.

IMO, this should be fixed. Either we should guarantee forward progress or
we should not play with read_seqbegin_or_lock() at all. This code assumes
that sooner or later "nextseq = raw_read_seqcount()" should return the odd
counter, but in theory this can never happen.

And if we want to fix this we do not need 2 counters, we just need to set
"seq = 1" manually after need_seqretry() == T. Say, like __dentry_path()
does. (But unlike __dentry_path() we do not need to worry about
rcu_read_unlock, so the code will be simpler.)

I am wondering if it makes sense to introduce

	bool read_seqretry_or_lock(seqlock_t *sl, int *seq)
	{
		if (*seq & 1) {
			read_sequnlock_excl(sl);
			return false;
		}

		if (!read_seqretry(sl, *seq))
			return false;

		*seq = 1;
		return true;
	}

Oleg.
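The forward-progress fix Oleg describes — force the second pass onto the lock instead of retrying the lockless read indefinitely — looks roughly like this as a userspace model. All names are hypothetical; the kernel pattern uses `read_seqbegin_or_lock()`/`need_seqretry()`, and the spin flag below stands in for the seqlock's internal spinlock.

```c
#include <stdatomic.h>

/* Userspace model (hypothetical names) of the bounded-retry pattern:
 * one optimistic lockless pass; if a concurrent writer invalidates
 * it, fall back to the writer's lock, which excludes writers and so
 * guarantees forward progress even under a storm of updates. */

struct sstats {
    atomic_flag   lock;     /* models the seqlock's internal spinlock */
    atomic_uint   seq;
    unsigned long utime, stime;
};

static void sstats_update(struct sstats *s, unsigned long u, unsigned long st)
{
    while (atomic_flag_test_and_set(&s->lock))
        ;
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_acquire); /* odd */
    s->utime = u;
    s->stime = st;
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release); /* even */
    atomic_flag_clear(&s->lock);
}

static void sstats_snapshot(struct sstats *s, unsigned long *u, unsigned long *st)
{
    unsigned int seq;

    /* pass 1: lockless */
    do {
        seq = atomic_load_explicit(&s->seq, memory_order_acquire);
    } while (seq & 1);
    *u = s->utime;
    *st = s->stime;
    if (atomic_load_explicit(&s->seq, memory_order_acquire) == seq)
        return;             /* consistent snapshot, done */

    /* pass 2: take the lock; writers are excluded, so this terminates */
    while (atomic_flag_test_and_set(&s->lock))
        ;
    *u = s->utime;
    *st = s->stime;
    atomic_flag_clear(&s->lock);
}
```

Forcing `seq = 1` (any odd value) before the retry, as Oleg suggests, achieves the same effect in the kernel API: `read_seqbegin_or_lock()` interprets an odd `seq` as "take the lock this time".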
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14/2014 12:12 PM, Oleg Nesterov wrote: On 08/14, Rik van Riel wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 08/14/2014 10:24 AM, Oleg Nesterov wrote: On 08/13, Rik van Riel wrote: @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) { cputime_t tgutime, tgstime, cutime, cstime; - spin_lock_irq(current-sighand-siglock); thread_group_cputime_adjusted(current, tgutime, tgstime); cutime = current-signal-cutime; cstime = current-signal-cstime; - spin_unlock_irq(current-sighand-siglock); Ah, wait, there is another problem afaics... Last night I worked on another problem with this code. After propagating the stats from a dying task to the signal struct, we need to make sure that that task's stats are not counted twice. Heh indeed ;) Can't understand how I missed that. This requires zeroing the stats under the write_seqlock, which was easy enough to add. Or you can expand the scope of write_seqlock/write_sequnlock, so that __unhash_process in called from inside the critical section. This looks simpler at first glance. The problem with that is that wait_task_zombie() calls thread_group_cputime_adjusted() in that if() branch, and that code ends up taking the seqlock for read... However, in __exit_signal that approach should work. Hmm, wait, it seems there is yet another problem ;) Afaics, you also need to modify __exit_signal() so that -sum_sched_runtime/etc are accounted unconditionally, even if the group leader exits. Probably this is not a big problem, and sys_times() or clock_gettime() do not care at all because they use current. But without this change thread_group_cputime(reaped_zombie) won't look at this task_struct at all, this can lead to non-monotonic result if it was previously called when this task was alive (non-reaped). You mean this whole block needs to run regardless of whether the group is dead? 
	task_cputime(tsk, &utime, &stime);
	write_seqlock(&sig->stats_lock);
	sig->utime += utime;
	sig->stime += stime;
	sig->gtime += task_gtime(tsk);
	sig->min_flt += tsk->min_flt;
	sig->maj_flt += tsk->maj_flt;
	sig->nvcsw += tsk->nvcsw;
	sig->nivcsw += tsk->nivcsw;
	sig->inblock += task_io_get_inblock(tsk);
	sig->oublock += task_io_get_oublock(tsk);
	task_io_accounting_add(&sig->ioac, &tsk->ioac);
	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

How does that square with wait_task_zombie reaping the statistics of
the whole group with thread_group_cputime_adjusted() when the group
leader is exiting? Could that lead to things being double-counted?

Or do you mean ONLY ->sum_sched_runtime is unconditionally accounted
in __exit_signal(), because wait_task_zombie() seems to be missing
that one?
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Oleg Nesterov wrote:
> OK, let's forget about the alternative approach for now. We can
> reconsider it later. At least I have to admit that seqlock is more
> straightforward.

Yes. But just for the record, the lockless version doesn't look that
bad to me,

	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
	{
		struct signal_struct *sig = tsk->signal;
		bool lockless, is_dead;
		struct task_struct *t;
		unsigned long flags;
		u64 exec;

		lockless = true;
		is_dead = !lock_task_sighand(p, &flags);
	retry:
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
		if (is_dead)
			return;

		if (lockless)
			unlock_task_sighand(p, &flags);

		rcu_read_lock();
		for_each_thread(tsk, t) {
			cputime_t utime, stime;
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
		rcu_read_unlock();

		if (lockless) {
			lockless = false;
			is_dead = !lock_task_sighand(p, &flags);
			if (is_dead || exec != sig->sum_sched_runtime)
				goto retry;
		}
		unlock_task_sighand(p, &flags);
	}

The obvious problem is that we should shift lock_task_sighand() from
the callers to thread_group_cputime() first, or add
thread_group_cputime_lockless() and change the current users one by
one. And of course, stats_lock is more generic.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote:
> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> > Or you can expand the scope of write_seqlock/write_sequnlock, so
> > that __unhash_process is called from inside the critical section.
> > This looks simpler at first glance.
>
> The problem with that is that wait_task_zombie() calls
> thread_group_cputime_adjusted() in that if() branch, and that code
> ends up taking the seqlock for read...

Not sure I understand... This modifies parent->signal->c* counters,
and obviously the exiting thread is not a member of the parent's
thread group, so thread_group_cputime_adjusted(parent) can never
account the exiting child twice, simply because it won't see it?

> However, in __exit_signal that approach should work.

Yes,

> > Hmm, wait, it seems there is yet another problem ;)
> >
> > Afaics, you also need to modify __exit_signal() so that
> > ->sum_sched_runtime/etc are accounted unconditionally, even if the
> > group leader exits. Probably this is not a big problem, and
> > sys_times() or clock_gettime() do not care at all because they use
> > current.
> >
> > But without this change thread_group_cputime(reaped_zombie) won't
> > look at this task_struct at all, this can lead to non-monotonic
> > result if it was previously called when this task was alive
> > (non-reaped).
>
> You mean this whole block needs to run regardless of whether the
> group is dead?
>
> 	task_cputime(tsk, &utime, &stime);
> 	write_seqlock(&sig->stats_lock);
> 	sig->utime += utime;
> 	sig->stime += stime;
> 	sig->gtime += task_gtime(tsk);
> 	sig->min_flt += tsk->min_flt;
> 	sig->maj_flt += tsk->maj_flt;
> 	sig->nvcsw += tsk->nvcsw;
> 	sig->nivcsw += tsk->nivcsw;
> 	sig->inblock += task_io_get_inblock(tsk);
> 	sig->oublock += task_io_get_oublock(tsk);
> 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
> 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

Yes.

> How does that square with wait_task_zombie reaping the statistics of
> the whole group with thread_group_cputime_adjusted() when the group
> leader is exiting?

Again, not sure I understand... thread_group_cputime_adjusted() in
wait_task_zombie() is fine in any case. Nobody but us can reap this
zombie.

It seems that we misunderstood each other, let me try again. Just to
simplify, suppose we have, say,

	sys_times_by_pid(pid, ...)
	{
		rcu_read_lock();
		task = find_task_by_vpid(pid);
		if (task)
			get_task_struct(task);
		rcu_read_unlock();
		if (!task)
			return -ESRCH;

		thread_group_cputime(task, ...);
		copy_to_user();
		return 0;
	}

Note that this task can exit right after rcu_read_unlock(), and it can
also be reaped (by its parent or by itself) and removed from the
thread list.

In this case for_each_thread() will see no threads, and thus it will
only read task->signal->*time. This means that sys_times_by_pid() can
simply return the wrong result instead of a failure. Say, it can even
return all zeros if this task was single-threaded.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Oleg Nesterov wrote:
> But just for the record, the lockless version doesn't look that bad
> to me,
>
> 	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 	{
> 		struct signal_struct *sig = tsk->signal;
> 		bool lockless, is_dead;
> 		struct task_struct *t;
> 		unsigned long flags;
> 		u64 exec;
>
> 		lockless = true;
> 		is_dead = !lock_task_sighand(p, &flags);
> 	retry:
> 		times->utime = sig->utime;
> 		times->stime = sig->stime;
> 		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
> 		if (is_dead)
> 			return;
>
> 		if (lockless)
> 			unlock_task_sighand(p, &flags);
>
> 		rcu_read_lock();
> 		for_each_thread(tsk, t) {
> 			cputime_t utime, stime;
> 			task_cputime(t, &utime, &stime);
> 			times->utime += utime;
> 			times->stime += stime;
> 			times->sum_exec_runtime += task_sched_runtime(t);
> 		}
> 		rcu_read_unlock();
>
> 		if (lockless) {
> 			lockless = false;
> 			is_dead = !lock_task_sighand(p, &flags);
> 			if (is_dead || exec != sig->sum_sched_runtime)
> 				goto retry;
> 		}
> 		unlock_task_sighand(p, &flags);
> 	}
>
> The obvious problem is that we should shift lock_task_sighand() from
> the callers to thread_group_cputime() first, or add
> thread_group_cputime_lockless() and change the current users one by
> one.

OTOH, it is simple to convert do_sys_times() and
posix_cpu_clock_get_task() to use the lockless version, and avoid the
new stats_lock and the other changes it needs.

> And of course, stats_lock is more generic.

Yes, this is true in any case. So I simply do not know.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
> > On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> > > Or you can expand the scope of write_seqlock/write_sequnlock, so
> > > that __unhash_process is called from inside the critical
> > > section. This looks simpler at first glance.
> >
> > The problem with that is that wait_task_zombie() calls
> > thread_group_cputime_adjusted() in that if() branch, and that code
> > ends up taking the seqlock for read...
>
> Not sure I understand... This modifies parent->signal->c* counters,
> and obviously the exiting thread is not a member of the parent's
> thread group, so thread_group_cputime_adjusted(parent) can never
> account the exiting child twice, simply because it won't see it?

You are right, the tree of processes only goes one way, so there
should be no deadlock in taking psig->stats_lock and having
thread_group_cputime_adjusted() take sig->stats_lock for read within
that section.

However, it might need some lockdep annotation to keep lockdep from
thinking we might take the same lock recursively :)

> > You mean this whole block needs to run regardless of whether the
> > group is dead?
> >
> > 	task_cputime(tsk, &utime, &stime);
> > 	write_seqlock(&sig->stats_lock);
> > 	sig->utime += utime;
> > 	sig->stime += stime;
> > 	sig->gtime += task_gtime(tsk);
> > 	sig->min_flt += tsk->min_flt;
> > 	sig->maj_flt += tsk->maj_flt;
> > 	sig->nvcsw += tsk->nvcsw;
> > 	sig->nivcsw += tsk->nivcsw;
> > 	sig->inblock += task_io_get_inblock(tsk);
> > 	sig->oublock += task_io_get_oublock(tsk);
> > 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
> > 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
>
> Yes.

Let me give that a try and see what happens :)

> > How does that square with wait_task_zombie reaping the statistics
> > of the whole group with thread_group_cputime_adjusted() when the
> > group leader is exiting?
>
> Again, not sure I understand... thread_group_cputime_adjusted() in
> wait_task_zombie() is fine in any case. Nobody but us can reap this
> zombie.
>
> It seems that we misunderstood each other, let me try again. Just to
> simplify, suppose we have, say,
>
> 	sys_times_by_pid(pid, ...)
> 	{
> 		rcu_read_lock();
> 		task = find_task_by_vpid(pid);
> 		if (task)
> 			get_task_struct(task);
> 		rcu_read_unlock();
> 		if (!task)
> 			return -ESRCH;
>
> 		thread_group_cputime(task, ...);
> 		copy_to_user();
> 		return 0;
> 	}
>
> Note that this task can exit right after rcu_read_unlock(), and it
> can also be reaped (by its parent or by itself) and removed from the
> thread list.
>
> In this case for_each_thread() will see no threads, and thus it will
> only read task->signal->*time. This means that sys_times_by_pid()
> can simply return the wrong result instead of a failure. Say, it can
> even return all zeros if this task was single-threaded.

Ahh, that makes sense.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/14, Rik van Riel wrote:
> On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> > Not sure I understand... This modifies parent->signal->c* counters,
> > and obviously the exiting thread is not a member of the parent's
> > thread group, so thread_group_cputime_adjusted(parent) can never
> > account the exiting child twice, simply because it won't see it?
>
> You are right, the tree of processes only goes one way, so there
> should be no deadlock in taking psig->stats_lock and having
> thread_group_cputime_adjusted() take sig->stats_lock for read within
> that section.
>
> However, it might need some lockdep annotation to keep lockdep from
> thinking we might take the same lock recursively :)

But wait_task_zombie() can (and should) call
thread_group_cputime_adjusted(zombie_child) outside of the parent's
->siglock or ->stats_lock, so this should be safe.

Oleg.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Thu, 14 Aug 2014 18:12:47 +0200
Oleg Nesterov <o...@redhat.com> wrote:

> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process is called from inside the critical section. This
> looks simpler at first glance.
>
> Hmm, wait, it seems there is yet another problem ;)
>
> Afaics, you also need to modify __exit_signal() so that
> ->sum_sched_runtime/etc are accounted unconditionally, even if the
> group leader exits.

OK, this is what I have now. I am still getting backwards time
sometimes, but only in tiny increments. This suggests that
cputime_adjust() may be the culprit, and I have no good idea on how to
fix that yet...

Should task_cputime_adjusted and thread_group_cputime_adjusted pass in
the address of a seqlock to use in case the values in prev need to be
updated? Should we check whether the values in prev changed during the
time spent in the function? Is this a race between
task_cputime_adjusted and other writers of signal->utime and
signal->stime, instead of task_cputime_adjusted racing with itself?

I am not sure what the best approach here is...

---8<---
Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have
scalability issues on large systems, due to both functions being
serialized with a lock.

The lock protects against reporting a wrong value, due to a thread in
the task group exiting, its statistics reporting up to the signal
struct, and that exited task's statistics being counted twice (or not
at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code
can be moved into thread_group_cputime, slightly simplifying the
calling functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <r...@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  | 48 +++---
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14
 6 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..c1a0ef2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,32 +115,34 @@ static void __exit_signal(struct task_struct *tsk)
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
-		/*
-		 * Accumulate here the counters for all threads but the
-		 * group leader as they die, so they can be added into
-		 * the process-wide totals when those are taken.
-		 * The group leader stays around as a zombie as long
-		 * as there are other threads. When it gets reaped,
-		 * the exit.c code will add its counts into these totals.
-		 * We won't ever get here for the group leader, since it
-		 * will have been the last reference on the signal_struct.
-		 */
-		task_cputime(tsk, &utime, &stime);
-		sig->utime += utime;
-		sig->stime += stime;
-		sig->gtime += task_gtime(tsk);
-		sig->min_flt += tsk->min_flt;
-		sig->maj_flt += tsk->maj_flt;
-		sig->nvcsw += tsk->nvcsw;
-		sig->nivcsw += tsk->nivcsw;
-		sig->inblock += task_io_get_inblock(tsk);
-		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
-		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}

+	/*
+	 * Accumulate here the counters for all threads but the
+	 * group leader as they die, so they can be added into
+	 * the process-wide totals when those are taken.
+	 * The group leader stays around as a zombie as long
+	 * as there are other threads. When it gets reaped,
+	 * the exit.c code will add its counts into these totals.
+	 * We won't ever get here for the group leader, since it
+	 * will have been the last reference on the
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
2014-08-14 16:39 GMT+02:00 Oleg Nesterov o...@redhat.com:
> On 08/14, Frederic Weisbecker wrote:
> > 2014-08-14 3:57 GMT+02:00 Rik van Riel r...@redhat.com:
> > > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> > > > On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> > > >
> > > > I'm worried about such lockless solution based on RCU or read
> > > > seqcount because we lose the guarantee that an update is
> > > > immediately visible by all subsequent readers.
> > > >
> > > > Say CPU 0 updates the thread time and both CPU 1 and CPU 2
> > > > right after that call clock_gettime(), with the spinlock we
> > > > were guaranteed to see the new update. Now with a pure seqlock
> > > > read approach, we guarantee a read sequence coherency but we
> > > > don't guarantee the freshest update result.
> > > >
> > > > So that looks like a source of non monotonic results.
> > >
> > > Which update are you worried about, specifically? The
> > > seq_write_lock to update the usage stat in p->signal will lock
> > > out the seqlock read side used to check those results. Is there
> > > another kind of thing read by cpu_clock_sample_group that you
> > > believe is not excluded by the seq_lock?
> >
> > I mean the read side doesn't use a lock with seqlocks. It's only
> > made of barriers and sequence numbers to ensure the reader doesn't
> > read some half-complete update. But other than that it can as well
> > see the update n - 1 since barriers don't enforce latest results.
>
> Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> write_seqcount_begin-update-write_seqcount_begin can miss "update"
> part along with ->sequence modifications.
>
> But I still can't understand how this can lead to non-monotonic
> results, could you spell?

Well, let's say clock = T.

CPU 0 updates at T + 1.

Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
while CPU 2 still reads T.

If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
possible that CPU 2 still sees T. With the spinlocked version that
thing can't happen; the second round would read at least T + 1 for
both CPUs.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Thu, 2014-08-14 at 19:48 +0200, Oleg Nesterov wrote:
> On 08/14, Oleg Nesterov wrote:
> > OK, let's forget about the alternative approach for now. We can
> > reconsider it later. At least I have to admit that seqlock is more
> > straightforward.
>
> Yes. But just for the record, the lockless version doesn't look that
> bad to me,
>
> 	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 	{
> 		struct signal_struct *sig = tsk->signal;
> 		bool lockless, is_dead;
> 		struct task_struct *t;
> 		unsigned long flags;
> 		u64 exec;
>
> 		lockless = true;
> 		is_dead = !lock_task_sighand(p, &flags);
> 	retry:
> 		times->utime = sig->utime;
> 		times->stime = sig->stime;
> 		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
> 		if (is_dead)
> 			return;
>
> 		if (lockless)
> 			unlock_task_sighand(p, &flags);
>
> 		rcu_read_lock();
> 		for_each_thread(tsk, t) {
> 			cputime_t utime, stime;
> 			task_cputime(t, &utime, &stime);
> 			times->utime += utime;
> 			times->stime += stime;
> 			times->sum_exec_runtime += task_sched_runtime(t);
> 		}
> 		rcu_read_unlock();
>
> 		if (lockless) {
> 			lockless = false;
> 			is_dead = !lock_task_sighand(p, &flags);
> 			if (is_dead || exec != sig->sum_sched_runtime)
> 				goto retry;
> 		}
> 		unlock_task_sighand(p, &flags);
> 	}
>
> The obvious problem is that we should shift lock_task_sighand() from
> the callers to thread_group_cputime() first, or add
> thread_group_cputime_lockless() and change the current users one by
> one. And of course, stats_lock is more generic.

Yours looks nice to me, particularly in that it doesn't munge
structure layout, and could perhaps be backported to fix up production
kernels.

For the N threads doing this on N cores case, it seems rq->lock
hammering will still be a source of major box-wide pain. Is there any
correctness reason to add up unaccounted ->on_cpu beans, or is that
just value added? Seems to me it can't matter: as you traverse, what
you added up on previous threads becomes ever more stale as you
proceed, so big boxen would be better off not doing that.

-Mike
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> --- a/kernel/time/posix-cpu-timers.c
>> +++ b/kernel/time/posix-cpu-timers.c
>> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>>  	if (same_thread_group(tsk, current))
>>  		err = cpu_clock_sample(which_clock, tsk, );
>>  	} else {
>> -		unsigned long flags;
>> -		struct sighand_struct *sighand;
>> -
>> -		/*
>> -		 * while_each_thread() is not yet entirely RCU safe,
>> -		 * keep locking the group while sampling process
>> -		 * clock for now.
>> -		 */
>> -		sighand = lock_task_sighand(tsk, );
>> -		if (!sighand)
>> -			return err;
>> -
>>  		if (tsk == current || thread_group_leader(tsk))
>>  			err = cpu_clock_sample_group(which_clock, tsk, );
>> -
>> -		unlock_task_sighand(tsk, );
>>  	}
>
> I'm worried about such lockless solution based on RCU or read
> seqcount because we lose the guarantee that an update is
> immediately visible by all subsequent readers.
>
> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
> after that call clock_gettime(), with the spinlock we were
> guaranteed to see the new update. Now with a pure seqlock read
> approach, we guarantee a read sequence coherency but we don't
> guarantee the freshest update result.
>
> So that looks like a source of non monotonic results.

Which update are you worried about, specifically?

The seq_write_lock to update the usage stat in p->signal will lock out
the seqlock read side used to check those results.

Is there another kind of thing read by cpu_clock_sample_group that you
believe is not excluded by the seq_lock?

--
All rights reversed
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>  	if (same_thread_group(tsk, current))
>  		err = cpu_clock_sample(which_clock, tsk, );
>  	} else {
> -		unsigned long flags;
> -		struct sighand_struct *sighand;
> -
> -		/*
> -		 * while_each_thread() is not yet entirely RCU safe,
> -		 * keep locking the group while sampling process
> -		 * clock for now.
> -		 */
> -		sighand = lock_task_sighand(tsk, );
> -		if (!sighand)
> -			return err;
> -
>  		if (tsk == current || thread_group_leader(tsk))
>  			err = cpu_clock_sample_group(which_clock, tsk, );
> -
> -		unlock_task_sighand(tsk, );
>  	}

I'm worried about such a lockless solution based on RCU or read
seqcount, because we lose the guarantee that an update is immediately
visible by all subsequent readers.

Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right after
that call clock_gettime(); with the spinlock we were guaranteed to see
the new update. Now with a pure seqlock read approach, we guarantee
read sequence coherency, but we don't guarantee the freshest update
result.

So that looks like a source of non-monotonic results.

> 	if (!err)
[PATCH RFC] time,signal: protect resource use statistics with seqlock
On Wed, 13 Aug 2014 20:45:11 +0200
Oleg Nesterov <o...@redhat.com> wrote:

> That said, it is not that I am really sure that seqcount_t in
> ->signal is actually worse, not to mention that this is subjective
> anyway. IOW, I am not going to really fight with your approach ;)

This is what it looks like, on top of your for_each_thread series from
yesterday:

---8<---
Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have
scalability issues on large systems, due to both functions being
serialized with a lock.

The lock protects against reporting a wrong value, due to a thread in
the task group exiting, its statistics reporting up to the signal
struct, and that exited task's statistics being counted twice (or not
at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code
can be moved into thread_group_cputime, slightly simplifying the
calling functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <r...@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++-
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..8092e59 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -126,6 +126,7 @@ static void __exit_signal(struct task_struct *tsk)
 	 * will have been the last reference on the signal_struct.
 	 */
 	task_cputime(tsk, &utime, &stime);
+	write_seqlock(&sig->stats_lock);
 	sig->utime += utime;
 	sig->stime += stime;
 	sig->gtime += task_gtime(tsk);
@@ -137,6 +138,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->oublock += task_io_get_oublock(tsk);
 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+	write_sequnlock(&sig->stats_lock);
 }

 sig->nr_threads--;
@@ -1043,6 +1045,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 	spin_lock_irq(&p->real_parent->sighand->siglock);
 	psig = p->real_parent->signal;
 	sig = p->signal;
+	write_seqlock(&psig->stats_lock);
 	psig->cutime += tgutime + sig->cutime;
 	psig->cstime += tgstime + sig->cstime;
 	psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1065,6 +1068,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 	psig->cmaxrss = maxrss;
 	task_io_accounting_add(&psig->ioac, &p->ioac);
 	task_io_accounting_add(&psig->ioac, &sig->ioac);
+	write_sequnlock(&psig->stats_lock);
 	spin_unlock_irq(&p->real_parent->sighand->siglock);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);

 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..b5f1c58 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;

 	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime
[PATCH RFC] time,signal: protect resource use statistics with seqlock
On Wed, 13 Aug 2014 20:45:11 +0200 Oleg Nesterov wrote:

> That said, it is not that I am really sure that seqcount_t in ->signal
> is actually worse, not to mention that this is subjective anyway. IOW,
> I am not going to really fight with your approach ;)

This is what it looks like, on top of your for_each_thread series
from yesterday:

---8<---
Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have
scalability issues on large systems, due to both functions being
serialized with a lock.

The lock protects against reporting a wrong value, due to a thread
in the task group exiting, its statistics reporting up to the signal
struct, and that exited task's statistics being counted twice
(or not at all).

Protecting that with a lock results in times and clock_gettime
being completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather
and propagate statistics. As an additional benefit, the protection
code can be moved into thread_group_cputime, slightly simplifying
the calling functions.

In the case of posix_cpu_clock_get_task things can be simplified
a lot, because the calling function already ensures tsk sticks
around, and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++-
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
	 * Live threads maintain their own counters and add to these
	 * in __exit_signal, except for the group leader.
	 */
+	seqlock_t stats_lock;
	cputime_t utime, stime, cutime, cstime;
	cputime_t gtime;
	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..8092e59 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -126,6 +126,7 @@ static void __exit_signal(struct task_struct *tsk)
		 * will have been the last reference on the signal_struct.
		 */
		task_cputime(tsk, &utime, &stime);
+		write_seqlock(&sig->stats_lock);
		sig->utime += utime;
		sig->stime += stime;
		sig->gtime += task_gtime(tsk);
@@ -137,6 +138,7 @@ static void __exit_signal(struct task_struct *tsk)
		sig->oublock += task_io_get_oublock(tsk);
		task_io_accounting_add(&sig->ioac, &tsk->ioac);
		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+		write_sequnlock(&sig->stats_lock);
	}

	sig->nr_threads--;
@@ -1043,6 +1045,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
		spin_lock_irq(&p->real_parent->sighand->siglock);
		psig = p->real_parent->signal;
		sig = p->signal;
+		write_seqlock(&psig->stats_lock);
		psig->cutime += tgutime + sig->cutime;
		psig->cstime += tgstime + sig->cstime;
		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1065,6 +1068,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
			psig->cmaxrss = maxrss;
		task_io_accounting_add(&psig->ioac, &p->ioac);
		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_sequnlock(&psig->stats_lock);
		spin_unlock_irq(&p->real_parent->sighand->siglock);
	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
	sig->curr_target = tsk;
	init_sigpending(&sig->shared_pending);
	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);

	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..b5f1c58 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
	struct signal_struct *sig = tsk->signal;
	cputime_t utime, stime;
	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;

	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
> 		if (same_thread_group(tsk, current))
> 			err = cpu_clock_sample(which_clock, tsk, &rtn);
> 	} else {
> -		unsigned long flags;
> -		struct sighand_struct *sighand;
> -
> -		/*
> -		 * while_each_thread() is not yet entirely RCU safe,
> -		 * keep locking the group while sampling process
> -		 * clock for now.
> -		 */
> -		sighand = lock_task_sighand(tsk, &flags);
> -		if (!sighand)
> -			return err;
> -
> 		if (tsk == current || thread_group_leader(tsk))
> 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
> -
> -		unlock_task_sighand(tsk, &flags);
> 	}
>
> 	if (!err)

I'm worried about such a lockless solution based on RCU or read seqcount, because we lose the guarantee that an update is immediately visible to all subsequent readers.

Say CPU 0 updates the thread time and both CPU 1 and CPU 2 call clock_gettime() right after that: with the spinlock we were guaranteed to see the new update. Now with a pure seqlock read approach, we guarantee read sequence coherency, but we don't guarantee the freshest update result. So that looks like a source of non-monotonic results.
Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> --- a/kernel/time/posix-cpu-timers.c
>> +++ b/kernel/time/posix-cpu-timers.c
>> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>> 		if (same_thread_group(tsk, current))
>> 			err = cpu_clock_sample(which_clock, tsk, &rtn);
>> 	} else {
>> -		unsigned long flags;
>> -		struct sighand_struct *sighand;
>> -
>> -		/*
>> -		 * while_each_thread() is not yet entirely RCU safe,
>> -		 * keep locking the group while sampling process
>> -		 * clock for now.
>> -		 */
>> -		sighand = lock_task_sighand(tsk, &flags);
>> -		if (!sighand)
>> -			return err;
>> -
>> 		if (tsk == current || thread_group_leader(tsk))
>> 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
>> -
>> -		unlock_task_sighand(tsk, &flags);
>> 	}
>
> I'm worried about such a lockless solution based on RCU or read seqcount,
> because we lose the guarantee that an update is immediately visible to all
> subsequent readers.
>
> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 call
> clock_gettime() right after that: with the spinlock we were guaranteed to
> see the new update. Now with a pure seqlock read approach, we guarantee a
> read sequence coherency but we don't guarantee the freshest update result.
> So that looks like a source of non monotonic results.

Which update are you worried about, specifically?

The write_seqlock used to update the usage stats in p->signal will lock out the seqlock read side used to check those results.

Is there another kind of thing read by cpu_clock_sample_group that you believe is not excluded by the seqlock?
--
All rights reversed