Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Frederic Weisbecker
On Fri, Aug 15, 2014 at 04:26:01PM +0200, Oleg Nesterov wrote:
> On 08/15, Frederic Weisbecker wrote:
> >
> > 2014-08-14 16:39 GMT+02:00 Oleg Nesterov :
> > > On 08/14, Frederic Weisbecker wrote:
> > >>
> > >> I mean the read side doesn't use a lock with seqlocks. It's only made
> > >> of barriers and sequence numbers to ensure the reader doesn't read
> > >> some half-complete update. But other than that it can as well see the
> > >> update n - 1 since barriers don't enforce latest results.
> > >
> > > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > > write_seqcount_begin-update-write_seqcount_end can miss the "update" part
> > > along with ->sequence modifications.
> > >
> > > But I still can't understand how this can lead to non-monotonic results,
> > > could you spell?
> >
> > Well lets say clock = T.
> > CPU 0 updates at T + 1.
> > Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> > while CPU 2 still reads T.
> > If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> > possible that CPU 2 still sees T. With the spinlocked version that
> > thing can't happen, the second round would read at least T + 1 for
> > both CPUs.
> 
> But this is fine? And CPU 2 doesn't see a non-monotonic result?
> 
> OK, this could be wrong if, say,
> 
>   void print_clock(void)
>   {
>   lock(SOME_LOCK);
>   printk(..., clock_gettime());
>   unlock(SOME_LOCK);
>   }
>   
> printed the non-monotonic numbers if print_clock() is called on CPU_1 and
> then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
> they were already visible to CPU_1 under the same lock. IOW,
> 
>   int T = 0;  /* can be incremented at any time */
> 
>   void check_monotony(void)
>   {
>   static int t = 0;
> 
>   lock(SOME_LOCK);
>   BUG(t > T);
>   t = T;
>   unlock(SOME_LOCK);
>   }
> 
> must work correctly (ignoring overflow) even if T is changed without
> SOME_LOCK.
> 
> Otherwise, without some sort of synchronization the different results on
> CPU_1/2 should be fine.
> 
> Or am I still missing your point?

No I think you're right, as long as ordering against something else is involved,
monotonicity is enforced.

Now I'm trying to think about a case where SMP ordering isn't involved.
Perhaps some usecase based on coupling CPU-local clocks with clock_gettime(),
where a drift between the two could appear. But using a local clock probably only
makes sense for local usecases, where the thread clock update would be local as
well, so that's probably not a problem. Now what if somebody couples
multithreaded process-wide clocks with per-CPU local clocks? Well, that's
probably too foolish to be considered.
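
For reference, the seqcount read side being discussed looks roughly like this
(a minimal sketch against the stats_lock seqlock_t proposed in the patch; API
names follow include/linux/seqlock.h). The reader takes no lock and only
retries if it overlapped a writer, so it may legitimately return the value
published by update n - 1:

        unsigned int seq;
        u64 runtime;

        do {
                seq = read_seqbegin(&sig->stats_lock);  /* snapshot the sequence count */
                runtime = sig->sum_sched_runtime;       /* read the protected data */
        } while (read_seqretry(&sig->stats_lock, seq)); /* retry only if a writer overlapped */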


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Oleg Nesterov
On 08/15, Rik van Riel wrote:
>
>
> On 08/15/2014 12:49 PM, Oleg Nesterov wrote:
>
> > Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
> > Just I do not know whether it's worth the trouble.
>
> If we don't know whether it is worth the trouble, it is probably best
> to stick to a well-known generic locking algorithm, instead of brewing
> our own and trying to maintain it.

Perhaps. I am obviously biased and can't judge ;) Plus, again, I do
understand that your approach has some advantages too.

> Now to see if this change to cputime_adjust does the trick :)
>
> +++ b/kernel/sched/cputime.c
> @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
>* If the tick based count grows faster than the scheduler one,
>* the result of the scaling may go backward.
>* Let's enforce monotonicity.
> +  * Atomic exchange protects against concurrent cputime_adjust.
>*/
> - -   prev->stime = max(prev->stime, stime);
> - -   prev->utime = max(prev->utime, utime);
> + while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> + cmpxchg(&prev->stime, rtime, stime);
> + while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> + cmpxchg(&prev->utime, rtime, utime);

Yes, perhaps we need something like this in any case. As a reminder, at least
do_task_stat() calls task_cputime_adjusted() lockless, although we could
fix this separately.

But I do not think the change above is enough. With this change cputime_adjust()
can race with itself. Yes, this guarantees monotonicity even if it is called
lockless, but this can lead to "obviously inconsistent" numbers.

And I don't think we can ignore this. If we could, then we could remove the
scale_stime recalculation and change cputime_adjust() to simply do:

static void cputime_adjust(struct task_cputime *curr,
                           struct cputime *prev,
                           cputime_t *ut, cputime_t *st)
{
        /* enforce monotonicity */
        *ut = prev->utime = max(prev->utime, curr->utime);
        *st = prev->stime = max(prev->stime, curr->stime);
}

Yes, we have this problem either way. And personally I think that this
"enforce monotonicity" logic is pointless, userspace could take care,
but it is too late to complain.

Oleg.



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Rik van Riel

On 08/15/2014 12:49 PM, Oleg Nesterov wrote:

> Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
> Just I do not know whether it's worth the trouble.

If we don't know whether it is worth the trouble, it is probably best
to stick to a well-known generic locking algorithm, instead of brewing
our own and trying to maintain it.

I have fixed the other locking issue you pointed out, Oleg.

Now to see if this change to cputime_adjust does the trick :)

+++ b/kernel/sched/cputime.c
@@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
 * If the tick based count grows faster than the scheduler one,
 * the result of the scaling may go backward.
 * Let's enforce monotonicity.
+* Atomic exchange protects against concurrent cputime_adjust.
 */
- - prev->stime = max(prev->stime, stime);
- - prev->utime = max(prev->utime, utime);
+   while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+   cmpxchg(&prev->stime, rtime, stime);
+   while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+   cmpxchg(&prev->utime, rtime, utime);
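
For readers unfamiliar with the idiom, a minimal user-space sketch of the same
"advance only if greater" loop; __sync_val_compare_and_swap stands in for the
kernel's cmpxchg(), and the names here are illustrative only:

        #include <stdint.h>

        /* Advance *prev to at least val, tolerating concurrent callers.
         * A loser of the CAS re-reads *prev and retries only while its value
         * is still larger, so *prev never moves backwards. */
        static void advance_to(uint64_t *prev, uint64_t val)
        {
                uint64_t old;

                while (val > (old = __atomic_load_n(prev, __ATOMIC_RELAXED))) {
                        if (__sync_val_compare_and_swap(prev, old, val) == old)
                                break;  /* we installed val; done */
                }
        }

Note that two concurrent callers can still each win the race on a different
field (stime vs. utime), which is the "inconsistent pair" concern Oleg raises
elsewhere in the thread.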



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Oleg Nesterov
On 08/15, Oleg Nesterov wrote:
>
> However, if we only want to make sys_times() more scalable, then
> perhaps the "lockless" version of thread_group_cputime() makes more
> sense. And given that do_sys_times() uses current we can simplify it;
> is_dead is not possible and we do not need to take ->siglock twice:
>
>   void current_group_cputime(struct task_cputime *times)
>   {
>   struct task_struct *tsk = current, *t;
>   spinlock_t *siglock = &tsk->sighand->siglock;
>   struct signal_struct *sig = tsk->signal;
>   bool lockless = true;
>   u64 exec;
>
>retry:
>   spin_lock_irq(siglock);
>   times->utime = sig->utime;
>   times->stime = sig->stime;
>   times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>
>   if (lockless)
>   spin_unlock_irq(siglock);
>
>   rcu_read_lock();
>   for_each_thread(tsk, t) {
>   cputime_t utime, stime;
>   task_cputime(t, &utime, &stime);
>   times->utime += utime;
>   times->stime += stime;
>   times->sum_exec_runtime += task_sched_runtime(t);
>   }
>   rcu_read_unlock();
>
>   if (lockless) {
>   lockless = false;
>   spin_unlock_wait(siglock);
>   smp_rmb();
>   if (exec != sig->sum_sched_runtime)
>   goto retry;
>   } else {
>   spin_unlock_irq(siglock);
>   }
>   }

Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable. Just
I do not know whether it's worth the trouble.

Oleg.



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Oleg Nesterov
On 08/15, Peter Zijlstra wrote:
>
> Also; why do we care about PROCESS_CPUTIME? People should really not use
> it. What are the 'valid' usecases you guys care about?

I do not really know. IIUC, the problematic usecase is sys_times().

I agree with Mike, "don't do this if you have a lot of threads". But
perhaps the kernel can help applications which already abuse times().

However, if we only want to make sys_times() more scalable, then
perhaps the "lockless" version of thread_group_cputime() makes more
sense. And given that do_sys_times() uses current we can simplify it;
is_dead is not possible and we do not need to take ->siglock twice:

void current_group_cputime(struct task_cputime *times)
{
        struct task_struct *tsk = current, *t;
        spinlock_t *siglock = &tsk->sighand->siglock;
        struct signal_struct *sig = tsk->signal;
        bool lockless = true;
        u64 exec;

 retry:
        spin_lock_irq(siglock);
        times->utime = sig->utime;
        times->stime = sig->stime;
        times->sum_exec_runtime = exec = sig->sum_sched_runtime;

        if (lockless)
                spin_unlock_irq(siglock);

        rcu_read_lock();
        for_each_thread(tsk, t) {
                cputime_t utime, stime;
                task_cputime(t, &utime, &stime);
                times->utime += utime;
                times->stime += stime;
                times->sum_exec_runtime += task_sched_runtime(t);
        }
        rcu_read_unlock();

        if (lockless) {
                lockless = false;
                spin_unlock_wait(siglock);
                smp_rmb();
                if (exec != sig->sum_sched_runtime)
                        goto retry;
        } else {
                spin_unlock_irq(siglock);
        }
}
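
For illustration only, a hypothetical caller (not from the thread) showing how
do_sys_times() could use this helper in place of its thread_group_cputime() +
->siglock sequence; the cutime/cstime handling is left out:

        static void do_sys_times(struct tms *tms)
        {
                struct task_cputime tgtimes;

                /* lockless fast path; current_group_cputime() retries internally */
                current_group_cputime(&tgtimes);

                tms->tms_utime = cputime_to_clock_t(tgtimes.utime);
                tms->tms_stime = cputime_to_clock_t(tgtimes.stime);
                /* tms_cutime/tms_cstime would still be read under ->siglock (omitted) */
        }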

Oleg.



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Oleg Nesterov
On 08/14, Rik van Riel wrote:
>
> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>   struct signal_struct *sig = tsk->signal;
>   cputime_t utime, stime;
>   struct task_struct *t;
> -
> - times->utime = sig->utime;
> - times->stime = sig->stime;
> - times->sum_exec_runtime = sig->sum_sched_runtime;
> + unsigned int seq, nextseq;
>  
>   rcu_read_lock();
> - for_each_thread(tsk, t) {
> - task_cputime(t, &utime, &stime);
> - times->utime += utime;
> - times->stime += stime;
> - times->sum_exec_runtime += task_sched_runtime(t);
> - }
> + /* Attempt a lockless read on the first round. */
> + nextseq = 0;
> + do {
> + seq = nextseq;
> + read_seqbegin_or_lock(&sig->stats_lock, &seq);
> + times->utime = sig->utime;
> + times->stime = sig->stime;
> + times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> + for_each_thread(tsk, t) {
> + task_cputime(t, &utime, &stime);
> + times->utime += utime;
> + times->stime += stime;
> + times->sum_exec_runtime += task_sched_runtime(t);
> + }
> + /*
> +  * If a writer is currently active, seq will be odd, and
> +  * read_seqbegin_or_lock will take the lock.
> +  */
> + nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> + } while (need_seqretry(&sig->stats_lock, seq));
> + done_seqretry(&sig->stats_lock, seq);
>   rcu_read_unlock();
>  }

I still think this is not right. Let me quote my previous email,

> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>   struct signal_struct *sig = tsk->signal;
>   cputime_t utime, stime;
>   struct task_struct *t;
> -
> - times->utime = sig->utime;
> - times->stime = sig->stime;
> - times->sum_exec_runtime = sig->sum_sched_runtime;
> + unsigned int seq, nextseq;
>
>   rcu_read_lock();

Almost cosmetic nit, but afaics this patch expands the rcu critical section
for no reason. We only need rcu_read_lock/unlock around for_each_thread()
below.

> + nextseq = 0;
> + do {
> + seq = nextseq;
> + read_seqbegin_or_lock(&sig->stats_lock, &seq);
> + times->utime = sig->utime;
> + times->stime = sig->stime;
> + times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> + for_each_thread(tsk, t) {
> + task_cputime(t, &utime, &stime);
> + times->utime += utime;
> + times->stime += stime;
> + times->sum_exec_runtime += task_sched_runtime(t);
> + }
> + /*
> +  * If a writer is currently active, seq will be odd, and
> +  * read_seqbegin_or_lock will take the lock.
> +  */
> + nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> + } while (need_seqretry(&sig->stats_lock, seq));
> + done_seqretry(&sig->stats_lock, seq);

Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I mean,
this code still can livelock in theory. Just suppose that another CPU does
write_seqlock/write_sequnlock right after read_seqbegin_or_lock(). In this
case "seq & 1" will never be true and thus "or_lock" will never happen.

IMO, this should be fixed. Either we should guarantee forward progress
or we should not play with read_seqbegin_or_lock() at all. This code assumes
that sooner or later "nextseq = raw_read_seqcount()" should return the odd
counter, but in theory this can never happen.

And if we want to fix this we do not need 2 counters, we just need to set
"seq = 1" manually after need_seqretry() == T. Say, like __dentry_path() does
(but unlike __dentry_path() we do not need to worry about rcu_read_unlock, so
the code will be simpler).

I am wondering if it makes sense to introduce

bool read_seqretry_or_lock(seqlock_t *sl, int *seq)
{
        if (*seq & 1) {
                read_sequnlock_excl(sl);
                return false;
        }

        if (!read_seqretry(sl, *seq))
                return false;

        *seq = 1;
        return true;
}
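
A sketch of how the loop in thread_group_cputime() would then look, assuming
the helper above: the first pass runs locklessly, and if it raced with a
writer the retry takes the lock, so the second pass is guaranteed to finish:

        int seq = 0;

        do {
                read_seqbegin_or_lock(&sig->stats_lock, &seq);
                /* ... sum sig->utime/stime and the per-thread times ... */
        } while (read_seqretry_or_lock(&sig->stats_lock, &seq));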

Or did I miss your reply?

Oleg.


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Oleg Nesterov
On 08/15, Frederic Weisbecker wrote:
>
> 2014-08-14 16:39 GMT+02:00 Oleg Nesterov :
> > On 08/14, Frederic Weisbecker wrote:
> >>
> >> I mean the read side doesn't use a lock with seqlocks. It's only made
> >> of barriers and sequence numbers to ensure the reader doesn't read
> >> some half-complete update. But other than that it can as well see the
> >> update n - 1 since barriers don't enforce latest results.
> >
> > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > write_seqcount_begin-update-write_seqcount_end can miss the "update" part
> > along with ->sequence modifications.
> >
> > But I still can't understand how this can lead to non-monotonic results,
> > could you spell?
>
> Well lets say clock = T.
> CPU 0 updates at T + 1.
> Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> while CPU 2 still reads T.
> If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> possible that CPU 2 still sees T. With the spinlocked version that
> thing can't happen, the second round would read at least T + 1 for
> both CPUs.

But this is fine? And CPU 2 doesn't see a non-monotonic result?

OK, this could be wrong if, say,

void print_clock(void)
{
lock(SOME_LOCK);
printk(..., clock_gettime());
unlock(SOME_LOCK);
}

printed the non-monotonic numbers if print_clock() is called on CPU_1 and
then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
they were already visible to CPU_1 under the same lock. IOW,

int T = 0;  /* can be incremented at any time */

void check_monotony(void)
{
static int t = 0;

lock(SOME_LOCK);
BUG(t > T);
> t = T;
unlock(SOME_LOCK);
}

must work correctly (ignoring overflow) even if T is changed without
SOME_LOCK.

Otherwise, without some sort of synchronization the different results on
CPU_1/2 should be fine.

Or am I still missing your point?

Oleg.



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Peter Zijlstra
On Fri, Aug 15, 2014 at 11:37:33AM +0200, Mike Galbraith wrote:
> On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote: 
> > On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> > > For the N threads doing this on N cores case, seems rq->lock hammering
> > > will still be a source of major box wide pain.  Is there any correctness
> > > reason to add up unaccounted ->on_cpu beans, or is that just value
> > > added?  
> > 
> > That delta can be arbitrarily large with nohz_full. And without
> > nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
> > reported clock resolution.
> > 
> > Having a non-constant error bound is annoying for you never quite know
> > what to expect.
> 
> Ah, yeah, that could get rather large.
> 
> > Also; why do we care about PROCESS_CPUTIME? People should really not use
> > it. What are the 'valid' usecases you guys care about?
> 
> I don't care much, said "don't do that" before I saw a similar big box
> problem had popped up with times().

Urgh, yes times().. Now I don't think we do very accurate accounting of
those particular numbers, so we could fudge some of that. Typically we
only do TICK_NSEC granularity accounting on user/system divide anyhow,
seeing how putting timestamp reads in the kernel<>user switch is
_expensive_ -- see NOHZ_FULL.




Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Mike Galbraith
On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote: 
> On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> > For the N threads doing this on N cores case, seems rq->lock hammering
> > will still be a source of major box wide pain.  Is there any correctness
> > reason to add up unaccounted ->on_cpu beans, or is that just value
> > added?  
> 
> That delta can be arbitrarily large with nohz_full. And without
> nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
> reported clock resolution.
> 
> Having a non-constant error bound is annoying for you never quite know
> what to expect.

Ah, yeah, that could get rather large.

> Also; why do we care about PROCESS_CPUTIME? People should really not use
> it. What are the 'valid' usecases you guys care about?

I don't care much, said "don't do that" before I saw a similar big box
problem had popped up with times().

-Mike



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread Peter Zijlstra
On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> For the N threads doing this on N cores case, seems rq->lock hammering
> will still be a source of major box wide pain.  Is there any correctness
> reason to add up unaccounted ->on_cpu beans, or is that just value
> added?  

That delta can be arbitrarily large with nohz_full. And without
nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
reported clock resolution.
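(For scale, assuming HZ=1000: TICK_NSEC is roughly 1,000,000 ns, i.e. about
1 ms per CPU, so on a 128-CPU box that bound is on the order of 128 ms.)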

Having a non-constant error bound is annoying for you never quite know
what to expect.

Also; why do we care about PROCESS_CPUTIME? People should really not use
it. What are the 'valid' usecases you guys care about?


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Mike Galbraith
On Thu, 2014-08-14 at 19:48 +0200, Oleg Nesterov wrote: 
> On 08/14, Oleg Nesterov wrote:
> >
> > OK, let's forget about the alternative approach for now. We can reconsider
> > it later. At least I have to admit that seqlock is more straightforward.
> 
> Yes.
> 
> But just for the record, the "lockless" version doesn't look that bad to me,
> 
>   void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>   {
>   struct signal_struct *sig = tsk->signal;
>   bool lockless, is_dead;
>   struct task_struct *t;
>   unsigned long flags;
>   u64 exec;
> 
>   lockless = true;
>   is_dead = !lock_task_sighand(tsk, &flags);
>retry:
>   times->utime = sig->utime;
>   times->stime = sig->stime;
>   times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>   if (is_dead)
>   return;
> 
>   if (lockless)
>   unlock_task_sighand(tsk, &flags);
> 
>   rcu_read_lock();
>   for_each_thread(tsk, t) {
>   cputime_t utime, stime;
>   task_cputime(t, &utime, &stime);
>   times->utime += utime;
>   times->stime += stime;
>   times->sum_exec_runtime += task_sched_runtime(t);
>   }
>   rcu_read_unlock();
> 
>   if (lockless) {
>   lockless = false;
>   is_dead = !lock_task_sighand(tsk, &flags);
>   if (is_dead || exec != sig->sum_sched_runtime)
>   goto retry;
>   }
>   unlock_task_sighand(tsk, &flags);
>   }
> 
> The obvious problem is that we should shift lock_task_sighand() from the
> callers to thread_group_cputime() first, or add 
> thread_group_cputime_lockless()
> and change the current users one by one.
> 
> And of course, stats_lock is more generic.

Yours looks nice to me, particularly in that it doesn't munge structure
layout, and could perhaps be backported to fix up production kernels.

For the N threads doing this on N cores case, seems rq->lock hammering
will still be a source of major box wide pain.  Is there any correctness
reason to add up unaccounted ->on_cpu beans, or is that just value
added?  Seems to me it can't matter, as you traverse, what you added up
on previous threads becomes ever more stale as you proceed, so big boxen
would be better off not doing that.
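
For context on the rq->lock point, task_sched_runtime() has to take the
owning runqueue's lock to fold in whatever the task has run since the last
accounting, roughly along these lines (a schematic, abridged paraphrase, not
the exact kernel code):

        unsigned long long task_sched_runtime(struct task_struct *p)
        {
                unsigned long flags;
                struct rq *rq;
                u64 ns;

                rq = task_rq_lock(p, &flags);   /* the rq->lock being hammered */
                ns = p->se.sum_exec_runtime;
                if (task_current(rq, p)) {
                        /* add the delta the task has run since its last update
                         * (the "unaccounted ->on_cpu beans") */
                        update_rq_clock(rq);
                        ns += rq_clock_task(rq) - p->se.exec_start;
                }
                task_rq_unlock(rq, p, &flags);

                return ns;
        }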

-Mike



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Frederic Weisbecker
2014-08-14 16:39 GMT+02:00 Oleg Nesterov :
> On 08/14, Frederic Weisbecker wrote:
>>
>> 2014-08-14 3:57 GMT+02:00 Rik van Riel :
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA1
>> >
>> > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
>> >> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> >>
>> >> I'm worried about such lockless solution based on RCU or read
>> >> seqcount because we lose the guarantee that an update is
>> >> immediately visible by all subsequent readers.
>> >>
>> >> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
>> >> after that call clock_gettime(), with the spinlock we were
>> >> guaranteed to see the new update. Now with a pure seqlock read
>> >> approach, we guarantee a read sequence coherency but we don't
>> >> guarantee the freshest update result.
>> >>
>> >> So that looks like a source of non monotonic results.
>> >
>> > Which update are you worried about, specifically?
>> >
>> > The seq_write_lock to update the usage stat in p->signal will lock out
>> > the seqlock read side used to check those results.
>> >
>> > Is there another kind of thing read by cpu_clock_sample_group that you
>> > believe is not excluded by the seq_lock?
>>
>> I mean the read side doesn't use a lock with seqlocks. It's only made
>> of barriers and sequence numbers to ensure the reader doesn't read
>> some half-complete update. But other than that it can as well see the
>> update n - 1 since barriers don't enforce latest results.
>
> Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> write_seqcount_begin-update-write_seqcount_end can miss the "update" part
> along with ->sequence modifications.
>
> But I still can't understand how this can lead to non-monotonic results,
> could you spell?

Well lets say clock = T.
CPU 0 updates at T + 1.
Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
while CPU 2 still reads T.
If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
possible that CPU 2 still sees T. With the spinlocked version that
thing can't happen, the second round would read at least T + 1 for
both CPUs.


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Rik van Riel
On Thu, 14 Aug 2014 18:12:47 +0200
Oleg Nesterov  wrote:

> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process in called from inside the critical section. This looks
> simpler at first glance.
> 
> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> accounted unconditionally, even if the group leader exits.

OK, this is what I have now.

I am still getting backwards time sometimes, but only tiny
increments. This suggests that cputime_adjust() may be the
culprit, and I have no good idea on how to fix that yet...

Should task_cputime_adjusted and thread_group_cputime_adjusted
pass in the address of a seqlock to use in case the values in
prev need to be updated?

Should we check whether the values in prev changed during the
time spent in the function?

Is this a race between task_cputime_adjusted and other writers
of signal->utime and signal->stime, instead of task_cputime_adjusted
racing with itself?

I am not sure what the best approach here is...

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel 
---
 include/linux/sched.h  |  1 +
 kernel/exit.c  | 48 +++---
 kernel/fork.c  |  1 +
 kernel/sched/cputime.c | 36 +++
 kernel/sys.c   |  2 --
 kernel/time/posix-cpu-timers.c | 14 
 6 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 * Live threads maintain their own counters and add to these
 * in __exit_signal, except for the group leader.
 */
+   seqlock_t stats_lock;
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..c1a0ef2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,32 +115,34 @@ static void __exit_signal(struct task_struct *tsk)
 
if (tsk == sig->curr_target)
sig->curr_target = next_thread(tsk);
-   /*
-* Accumulate here the counters for all threads but the
-* group leader as they die, so they can be added into
-* the process-wide totals when those are taken.
-* The group leader stays around as a zombie as long
-* as there are other threads.  When it gets reaped,
-* the exit.c code will add its counts into these totals.
-* We won't ever get here for the group leader, since it
-* will have been the last reference on the signal_struct.
-*/
-   task_cputime(tsk, &utime, &stime);
-   sig->utime += utime;
-   sig->stime += stime;
-   sig->gtime += task_gtime(tsk);
-   sig->min_flt += tsk->min_flt;
-   sig->maj_flt += tsk->maj_flt;
-   sig->nvcsw += tsk->nvcsw;
-   sig->nivcsw += tsk->nivcsw;
-   sig->inblock += task_io_get_inblock(tsk);
-   sig->oublock += task_io_get_oublock(tsk);
-   task_io_accounting_add(&sig->ioac, &tsk->ioac);
-   sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
 
+   /*
+* Accumulate here the counters for all threads but the
+* group leader as they die, so they can be added into
+* the process-wide totals when those are taken.
+* The group leader stays around as a zombie as long
+* as there are other threads.  When it gets reaped,
+* the exit.c code will add its counts into these totals.
+* We won't ever get here for the group leader, since it
+* will have been the last reference on the signal_struct.
+*/
+ 

Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> > On 08/14, Rik van Riel wrote:
> >>
> >> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> >>>
> >>> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> >>> __unhash_process in called from inside the critical section. This looks
> >>> simpler at first glance.
> >>
> >> The problem with that is that wait_task_zombie() calls
> >> thread_group_cputime_adjusted() in that if() branch, and
> >> that code ends up taking the seqlock for read...
> >
> > Not sure I understand... This modifies parent->signal->c* counters,
> > and obviously the exiting thread is not the member of parent's thread
> > group, so thread_group_cputime_adjusted(parent) can never account the
> > exiting child twice simply because it won't see it?
>
> You are right, the tree of processes only goes one way,
> so there should be no deadlock in taking psig->stats_lock
> and having thread_group_cputime_adjusted take sig->stats_lock
> for read within that section.
>
> However, it might need some lockdep annotation to keep
> lockdep from thinking we might take the same lock recursively :)

But wait_task_zombie() can (and should) call
thread_group_cputime_adjusted(zombie_child) outside of parent's ->siglock
or ->stats_lock, so this should be safe.

Oleg.



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Rik van Riel
On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
>>
>> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
>>>
>>> Or you can expand the scope of write_seqlock/write_sequnlock, so that
>>> __unhash_process in called from inside the critical section. This looks
>>> simpler at first glance.
>>
>> The problem with that is that wait_task_zombie() calls
>> thread_group_cputime_adjusted() in that if() branch, and
>> that code ends up taking the seqlock for read...
> 
> Not sure I understand... This modifies parent->signal->c* counters,
> and obviously the exiting thread is not the member of parent's thread
> group, so thread_group_cputime_adjusted(parent) can never account the
> exiting child twice simply because it won't see it?

You are right, the tree of processes only goes one way,
so there should be no deadlock in taking psig->stats_lock
and having thread_group_cputime_adjusted take sig->stats_lock
for read within that section.

However, it might need some lockdep annotation to keep
lockdep from thinking we might take the same lock recursively :)

>> However, in __exit_signal that approach should work.
> 
> Yes,
> 
>>> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
>>> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
>>> accounted unconditionally, even if the group leader exits.
>>>
>>> Probably this is not a big problem, and sys_times() or clock_gettime()
>>> do not care at all because they use current.
>>>
>>> But without this change thread_group_cputime(reaped_zombie) won't look
>>> at this task_struct at all, this can lead to non-monotonic result if
>>> it was previously called when this task was alive (non-reaped).
>>
>> You mean this whole block needs to run regardless of whether
>> the group is dead?
>>
>> task_cputime(tsk, &utime, &stime);
>> write_seqlock(&sig->stats_lock);
>> sig->utime += utime;
>> sig->stime += stime;
>> sig->gtime += task_gtime(tsk);
>> sig->min_flt += tsk->min_flt;
>> sig->maj_flt += tsk->maj_flt;
>> sig->nvcsw += tsk->nvcsw;
>> sig->nivcsw += tsk->nivcsw;
>> sig->inblock += task_io_get_inblock(tsk);
>> sig->oublock += task_io_get_oublock(tsk);
>> task_io_accounting_add(&sig->ioac, &tsk->ioac);
>> sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
> 
> Yes.

Let me give that a try and see what happens :)

>> How does that square with wait_task_zombie reaping the
>> statistics of the whole group with thread_group_cputime_adjusted()
>> when the group leader is exiting?
> 
> Again, not sure I understand... thread_group_cputime_adjusted() in
> wait_task_zombie() is fine in any case. Nobody but us can reap this
> zombie.
> 
> It seems that we misunderstood each other, let me try again. Just to
> simplify, suppose we have, say,
> 
>   sys_times_by_pid(pid, ...)
>   {
>   rcu_read_lock();
>   task = find_task_by_vpid(pid);
>   if (task)
>   get_task_struct(task);
>   rcu_read_unlock();
> 
>   if (!task)
>   return -ESRCH;
> 
>   thread_group_cputime(task, ...);
>   copy_to_user();
>   return 0;
>   }
> 
> Note that this task can exit right after rcu_read_unlock(), and it can
> be also reaped (by its parent or by itself) and removed from the thread
> list. In this case for_each_thread() will see no threads, and thus it
> will only read task->signal->*time.
> 
> This means that sys_times_by_pid() can simply return the wrong result
> instead of failure. Say, it can even return "all zeros" if this task was
> single-threaded.

Ahh, that makes sense.



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Oleg Nesterov wrote:
>
> But just for the record, the "lockless" version doesn't look that bad to me,
>
>   void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>   {
>   struct signal_struct *sig = tsk->signal;
>   bool lockless, is_dead;
>   struct task_struct *t;
>   unsigned long flags;
>   u64 exec;
>
>   lockless = true;
>   is_dead = !lock_task_sighand(tsk, &flags);
>retry:
>   times->utime = sig->utime;
>   times->stime = sig->stime;
>   times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>   if (is_dead)
>   return;
>
>   if (lockless)
>   unlock_task_sighand(tsk, &flags);
>
>   rcu_read_lock();
>   for_each_thread(tsk, t) {
>   cputime_t utime, stime;
>   task_cputime(t, &utime, &stime);
>   times->utime += utime;
>   times->stime += stime;
>   times->sum_exec_runtime += task_sched_runtime(t);
>   }
>   rcu_read_unlock();
>
>   if (lockless) {
>   lockless = false;
>   is_dead = !lock_task_sighand(tsk, &flags);
>   if (is_dead || exec != sig->sum_sched_runtime)
>   goto retry;
>   }
>   unlock_task_sighand(tsk, &flags);
>   }
>
> The obvious problem is that we should shift lock_task_sighand() from the
> callers to thread_group_cputime() first, or add 
> thread_group_cputime_lockless()
> and change the current users one by one.

OTOH, it is simple to convert do_sys_times() and posix_cpu_clock_get_task()
to use the lockless version, and avoid the new stats_lock and other changes
it needs.

> And of course, stats_lock is more generic.

Yes, this is true in any case.

So I simply do not know.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
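
For illustration, with a thread_group_cputime() that does its own locking or retrying, do_sys_times() shrinks to roughly the following (a sketch following the quoted diff, not a separate proposal):

    void do_sys_times(struct tms *tms)
    {
            cputime_t tgutime, tgstime, cutime, cstime;

            /* No siglock here: the group totals are snapshotted consistently
             * inside thread_group_cputime_adjusted() itself. */
            thread_group_cputime_adjusted(current, &tgutime, &tgstime);
            cutime = current->signal->cutime;
            cstime = current->signal->cstime;
            tms->tms_utime = cputime_to_clock_t(tgutime);
            tms->tms_stime = cputime_to_clock_t(tgstime);
            tms->tms_cutime = cputime_to_clock_t(cutime);
            tms->tms_cstime = cputime_to_clock_t(cstime);
    }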


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> >
> > Or you can expand the scope of write_seqlock/write_sequnlock, so that
> > __unhash_process is called from inside the critical section. This looks
> > simpler at first glance.
>
> The problem with that is that wait_task_zombie() calls
> thread_group_cputime_adjusted() in that if() branch, and
> that code ends up taking the seqlock for read...

Not sure I understand... This modifies parent->signal->c* counters,
and obviously the exiting thread is not the member of parent's thread
group, so thread_group_cputime_adjusted(parent) can never account the
exiting child twice simply because it won't see it?

> However, in __exit_signal that approach should work.

Yes,

> > Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> > need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> > accounted unconditionally, even if the group leader exits.
> >
> > Probably this is not a big problem, and sys_times() or clock_gettime()
> > do not care at all because they use current.
> >
> > But without this change thread_group_cputime(reaped_zombie) won't look
> > at this task_struct at all, this can lead to non-monotonic result if
> > it was previously called when this task was alive (non-reaped).
>
> You mean this whole block needs to run regardless of whether
> the group is dead?
>
> task_cputime(tsk, &utime, &stime);
> write_seqlock(&sig->stats_lock);
> sig->utime += utime;
> sig->stime += stime;
> sig->gtime += task_gtime(tsk);
> sig->min_flt += tsk->min_flt;
> sig->maj_flt += tsk->maj_flt;
> sig->nvcsw += tsk->nvcsw;
> sig->nivcsw += tsk->nivcsw;
> sig->inblock += task_io_get_inblock(tsk);
> sig->oublock += task_io_get_oublock(tsk);
> task_io_accounting_add(&sig->ioac, &tsk->ioac);
> sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

Yes.

> How does that square with wait_task_zombie reaping the
> statistics of the whole group with thread_group_cputime_adjusted()
> when the group leader is exiting?

Again, not sure I understand... thread_group_cputime_adjusted() in
wait_task_zombie() is fine in any case. Nobody but us can reap this
zombie.

It seems that we misunderstood each other, let me try again. Just to
simplify, suppose we have, say,

sys_times_by_pid(pid, ...)
{
rcu_read_lock();
task = find_task_by_vpid(pid);
if (task)
get_task_struct(task);
rcu_read_unlock();

if (!task)
return -ESRCH;

thread_group_cputime(task, ...);
copy_to_user();
return 0;
}

Note that this task can exit right after rcu_read_unlock(), and it can
be also reaped (by its parent or by itself) and removed from the thread
list. In this case for_each_thread() will see no threads, and thus it
will only read task->signal->*time.

This means that sys_times_by_pid() can simply return the wrong result
instead of failure. Say, It can even return "all zeros" if this task was
single-threaded.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Oleg Nesterov wrote:
>
> OK, lets forget about alternative approach for now. We can reconsider
> it later. At least I have to admit that seqlock is more straightforward.

Yes.

But just for record, the "lockless" version doesn't look that bad to me,

void thread_group_cputime(struct task_struct *tsk, struct task_cputime 
*times)
{
struct signal_struct *sig = tsk->signal;
bool lockless, is_dead;
struct task_struct *t;
unsigned long flags;
u64 exec;

lockless = true;
is_dead = !lock_task_sighand(p, &flags);
 retry:
times->utime = sig->utime;
times->stime = sig->stime;
times->sum_exec_runtime = exec = sig->sum_sched_runtime;
if (is_dead)
return;

if (lockless)
unlock_task_sighand(p, &flags);

rcu_read_lock();
for_each_thread(tsk, t) {
cputime_t utime, stime;
task_cputime(t, &utime, &stime);
times->utime += utime;
times->stime += stime;
times->sum_exec_runtime += task_sched_runtime(t);
}
rcu_read_unlock();

if (lockless) {
lockless = false;
is_dead = !lock_task_sighand(p, &flags);
if (is_dead || exec != sig->sum_sched_runtime)
goto retry;
}
unlock_task_sighand(p, &flags);
}

The obvious problem is that we should shift lock_task_sighand() from the
callers to thread_group_cputime() first, or add thread_group_cputime_lockless()
and change the current users one by one.

And of course, stats_lock is more generic.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Rik van Riel
On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
>>> On 08/13, Rik van Riel wrote:

>>>> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) {
>>>> cputime_t tgutime, tgstime, cutime, cstime;
>>>>
>>>> -  spin_lock_irq(&current->sighand->siglock);
>>>> thread_group_cputime_adjusted(current, &tgutime, &tgstime);
>>>> cutime = current->signal->cutime; cstime =
>>>> current->signal->cstime; -
>>>> spin_unlock_irq(&current->sighand->siglock);
>>>
>>> Ah, wait, there is another problem afaics...
>>
>> Last night I worked on another problem with this code.
>>
>> After propagating the stats from a dying task to the signal struct,
>> we need to make sure that that task's stats are not counted twice.
> 
> Heh indeed ;) Can't understand how I missed that.
> 
>> This requires zeroing the stats under the write_seqlock, which was
>> easy enough to add.
> 
> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process is called from inside the critical section. This looks
> simpler at first glance.

The problem with that is that wait_task_zombie() calls
thread_group_cputime_adjusted() in that if() branch, and
that code ends up taking the seqlock for read...

However, in __exit_signal that approach should work.

> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> accounted unconditionally, even if the group leader exits.
> 
> Probably this is not a big problem, and sys_times() or clock_gettime()
> do not care at all because they use current.
> 
> But without this change thread_group_cputime(reaped_zombie) won't look
> at this task_struct at all, this can lead to non-monotonic result if
> it was previously called when this task was alive (non-reaped).

You mean this whole block needs to run regardless of whether
the group is dead?

task_cputime(tsk, &utime, &stime);
write_seqlock(&sig->stats_lock);
sig->utime += utime;
sig->stime += stime;
sig->gtime += task_gtime(tsk);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
sig->nvcsw += tsk->nvcsw;
sig->nivcsw += tsk->nivcsw;
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

How does that square with wait_task_zombie reaping the
statistics of the whole group with thread_group_cputime_adjusted()
when the group leader is exiting?

Could that lead to things being double-counted?

Or do you mean ONLY ->sum_sched_runtime is unconditionally
accounted in __exit_signal(), because wait_task_zombie() seems
to be missing that one?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Rik van Riel wrote:
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
> > On 08/13, Rik van Riel wrote:
> >>
> >> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) {
> >> cputime_t tgutime, tgstime, cutime, cstime;
> >>
> >> -  spin_lock_irq(&current->sighand->siglock);
> >> thread_group_cputime_adjusted(current, &tgutime, &tgstime);
> >> cutime = current->signal->cutime; cstime =
> >> current->signal->cstime; -
> >> spin_unlock_irq(&current->sighand->siglock);
> >
> > Ah, wait, there is another problem afaics...
>
> Last night I worked on another problem with this code.
>
> After propagating the stats from a dying task to the signal struct,
> we need to make sure that that task's stats are not counted twice.

Heh indeed ;) Can't understand how I missed that.

> This requires zeroing the stats under the write_seqlock, which was
> easy enough to add.

Or you can expand the scope of write_seqlock/write_sequnlock, so that
__unhash_process is called from inside the critical section. This looks
simpler at first glance.

Hmm, wait, it seems there is yet another problem ;) Afaics, you also
need to modify __exit_signal() so that ->sum_sched_runtime/etc are
accounted unconditionally, even if the group leader exits.

Probably this is not a big problem, and sys_times() or clock_gettime()
do not care at all because they use current.

But without this change thread_group_cputime(reaped_zombie) won't look
at this task_struct at all, this can lead to non-monotonic result if
it was previously called when this task was alive (non-reaped).

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Rik van Riel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
> On 08/13, Rik van Riel wrote:
>> 
>> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) { 
>> cputime_t tgutime, tgstime, cutime, cstime;
>> 
>> -spin_lock_irq(&current->sighand->siglock); 
>> thread_group_cputime_adjusted(current, &tgutime, &tgstime); 
>> cutime = current->signal->cutime; cstime =
>> current->signal->cstime; -
>> spin_unlock_irq(&current->sighand->siglock);
> 
> Ah, wait, there is another problem afaics...

Last night I worked on another problem with this code.

After propagating the stats from a dying task to the signal struct,
we need to make sure that that task's stats are not counted twice.

This requires zeroing the stats under the write_seqlock, which was
easy enough to add. We cannot rely on any state in the task that
was set outside of the write_seqlock...

> thread_group_cputime_adjusted()->cputime_adjust() plays with 
> signal->prev_cputime and thus it needs siglock or stats_lock to
> ensure it can't race with itself. Not sure it is safe to simply
> take the lock in cputime_adjust(), this should be checked.
> 
> OTOH, do_task_stat() already calls task_cputime_adjusted() lockless
> and this looks wrong or I missed something. So perhaps we need a
> lock in or around cputime_adjust() anyway.

I'll take a look at this.

- -- 
All rights reversed
-BEGIN PGP SIGNATURE-
Version: GnuPG v1

iQEcBAEBAgAGBQJT7NfHAAoJEM553pKExN6DTVIH/RIFVl42fM+cBpiSavSa2s4k
B0ykVu/VwFbqoYVo5I5joSl25IpU5Xma3AwMBQHoJ7aY9a8w63iGFMoycKcDWbrY
nOyaOTvR92aMdn/GuGwS/XlU83PwIbLEyYWFrvn0CrnBqJw9pHz/sLYsvP/jASem
LbUStuWFzqGyasb4lJVZmLQKaIVhy30CM5Y3llTFuc16zyH/YG65tUasG+aR2miA
g3CiPOHP/IY0vZ+L3YYlLthLY4acVX/bwImE0vsx9fY+rG4hgj5xF9b0CnbaN41g
62oJ4jkFSH/voNFPjR7I5AnKpSeMsBqW2/l1tHlcaKcNCtkd9nri/HinxXN5bN4=
=dfSt
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Frederic Weisbecker wrote:
>
> 2014-08-14 3:57 GMT+02:00 Rik van Riel :
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> >> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> >>
> >> I'm worried about such lockless solution based on RCU or read
> >> seqcount because we lose the guarantee that an update is
> >> immediately visible by all subsequent readers.
> >>
> >> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
> >> after that call clock_gettime(), with the spinlock we were
> >> guaranteed to see the new update. Now with a pure seqlock read
> >> approach, we guarantee a read sequence coherency but we don't
> >> guarantee the freshest update result.
> >>
> >> So that looks like a source of non monotonic results.
> >
> > Which update are you worried about, specifically?
> >
> > The seq_write_lock to update the usage stat in p->signal will lock out
> > the seqlock read side used to check those results.
> >
> > Is there another kind of thing read by cpu_clock_sample_group that you
> > believe is not excluded by the seq_lock?
>
> I mean the read side doesn't use a lock with seqlocks. It's only made
> of barriers and sequence numbers to ensure the reader doesn't read
> some half-complete update. But other than that it can as well see the
> update n - 1 since barriers don't enforce latest results.

Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
write_seqcount_begin-update-write_seqcount_end can miss "update" part
along with ->sequence modifications.

But I still can't understand how this can lead to non-monotonic results,
could you spell?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/13, Rik van Riel wrote:
>
> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
>  {
>   cputime_t tgutime, tgstime, cutime, cstime;
>
> - spin_lock_irq(&current->sighand->siglock);
>   thread_group_cputime_adjusted(current, &tgutime, &tgstime);
>   cutime = current->signal->cutime;
>   cstime = current->signal->cstime;
> - spin_unlock_irq(&current->sighand->siglock);

Ah, wait, there is another problem afaics...

thread_group_cputime_adjusted()->cputime_adjust() plays with
signal->prev_cputime and thus it needs siglock or stats_lock to ensure
it can't race with itself. Not sure it is safe to simply take the lock
in cputime_adjust(), this should be checked.

OTOH, do_task_stat() already calls task_cputime_adjusted() lockless and
this looks wrong or I missed something. So perhaps we need a lock in or
around cputime_adjust() anyway.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
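
The race is that cputime_adjust() caches its scaled result in signal->prev_cputime, so two unlocked callers can interleave and publish a smaller value after a larger one. One possible shape of a fix, shown only to make the problem concrete (the helper name and the lock parameter are made up here, nothing like this was posted in the thread), is to update prev under a writer lock supplied by the caller:

    /* Illustrative only: serialize prev_cputime updates so reported values
     * can only move forward; 'lock' would be siglock or the new stats_lock. */
    static void cputime_advance_locked(seqlock_t *lock, struct cputime *prev,
                                       cputime_t utime, cputime_t stime,
                                       cputime_t *ut, cputime_t *st)
    {
            write_seqlock(lock);
            if (utime > prev->utime)
                    prev->utime = utime;
            if (stime > prev->stime)
                    prev->stime = stime;
            *ut = prev->utime;
            *st = prev->stime;
            write_sequnlock(lock);
    }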


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Frederic Weisbecker wrote:
>
> 2014-08-14 15:22 GMT+02:00 Oleg Nesterov :
> > On 08/13, Rik van Riel wrote:
> >>
> >> @@ -646,6 +646,7 @@ struct signal_struct {
> >>* Live threads maintain their own counters and add to these
> >>* in __exit_signal, except for the group leader.
> >>*/
> >> + seqlock_t stats_lock;
> >
> > Ah. Somehow I thought that you were going to use seqcount_t and fallback
> > to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
> > seqlock_t.
> >
> > OK, I won't argue, this can make the seqbegin_or_lock simpler...
>
> Is this really needed? seqlock are useful when we have concurrent
> updaters. But updaters of thread stats should be under the thread lock
> already, right? If we have only one updater at a time, seqcount should
> be enough.

Yes, this is what I meant. Although I can see 2 reasons to use seqlock_t:

1. It can simplify the seqbegin-or-lock logic. If nothing else,
   you simply can't use read_seqbegin_or_lock() to take ->siglock.
   But this is just syntactic sugar.

2. If we use ->siglock in fallback path, we need to verify that
   thread_group_cputime() is never called with ->siglock held first.

   Or, we need a fat comment to explain that need_seqretry == T is not
   possible if it is called under ->siglock, and thus "fallback to
   lock_task_sighand" must be always safe. But in this case we need
   to ensure that the caller didn't do write_seqcount_begin().

So perhaps seqlock_t makes more sense at least initially...

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Frederic Weisbecker
2014-08-14 15:22 GMT+02:00 Oleg Nesterov :
> On 08/13, Rik van Riel wrote:
>>
>> On Wed, 13 Aug 2014 20:45:11 +0200
>> Oleg Nesterov  wrote:
>>
>> > That said, it is not that I am really sure that seqcount_t in ->signal
>> > is actually worse, not to mention that this is subjective anyway. IOW,
>> > I am not going to really fight with your approach ;)
>>
>> This is what it looks like, on top of your for_each_thread series
>> from yesterday:
>
> OK, lets forget about alternative approach for now. We can reconsider
> it later. At least I have to admit that seqlock is more straightforward.
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -646,6 +646,7 @@ struct signal_struct {
>>* Live threads maintain their own counters and add to these
>>* in __exit_signal, except for the group leader.
>>*/
>> + seqlock_t stats_lock;
>
> Ah. Somehow I thought that you were going to use seqcount_t and fallback
> to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
> seqlock_t.
>
> OK, I won't argue, this can make the seqbegin_or_lock simpler...

Is this really needed? seqlock are useful when we have concurrent
updaters. But updaters of thread stats should be under the thread lock
already, right? If we have only one updater at a time, seqcount should
be enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Frederic Weisbecker
2014-08-14 3:57 GMT+02:00 Rik van Riel :
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
>> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>>> --- a/kernel/time/posix-cpu-timers.c
>>> +++ b/kernel/time/posix-cpu-timers.c
>>> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>>>  	if (same_thread_group(tsk, current))
>>>  		err = cpu_clock_sample(which_clock, tsk, &rtn);
>>>  } else {
>>> -	unsigned long flags;
>>> -	struct sighand_struct *sighand;
>>> -
>>> -	/*
>>> -	 * while_each_thread() is not yet entirely RCU safe,
>>> -	 * keep locking the group while sampling process
>>> -	 * clock for now.
>>> -	 */
>>> -	sighand = lock_task_sighand(tsk, &flags);
>>> -	if (!sighand)
>>> -		return err;
>>> -
>>>  	if (tsk == current || thread_group_leader(tsk))
>>>  		err = cpu_clock_sample_group(which_clock, tsk, &rtn);
>>> -
>>> -	unlock_task_sighand(tsk, &flags);
>>>  }
>>
>> I'm worried about such lockless solution based on RCU or read
>> seqcount because we lose the guarantee that an update is
>> immediately visible by all subsequent readers.
>>
>> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
>> after that call clock_gettime(), with the spinlock we were
>> guaranteed to see the new update. Now with a pure seqlock read
>> approach, we guarantee a read sequence coherency but we don't
>> guarantee the freshest update result.
>>
>> So that looks like a source of non monotonic results.
>
> Which update are you worried about, specifically?
>
> The seq_write_lock to update the usage stat in p->signal will lock out
> the seqlock read side used to check those results.
>
> Is there another kind of thing read by cpu_clock_sample_group that you
> believe is not excluded by the seq_lock?

I mean the read side doesn't use a lock with seqlocks. It's only made
of barriers and sequence numbers to ensure the reader doesn't read
some half-complete update. But other than that it can as well see the
update n - 1 since barriers don't enforce latest results.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
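
For reference, the read side Frederic describes is just the ordinary retry loop; it guarantees a consistent snapshot but nothing orders the reader after the most recent writer (sketch assuming the stats_lock from the posted patch and a struct signal_struct *sig in scope):

    unsigned int seq;
    cputime_t utime, stime;

    do {
            seq = read_seqbegin(&sig->stats_lock);  /* no lock taken, only a read barrier */
            utime = sig->utime;
            stime = sig->stime;
    } while (read_seqretry(&sig->stats_lock, seq)); /* retry if a writer ran meanwhile */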


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/13, Rik van Riel wrote:
>
> On Wed, 13 Aug 2014 20:45:11 +0200
> Oleg Nesterov  wrote:
>
> > That said, it is not that I am really sure that seqcount_t in ->signal
> > is actually worse, not to mention that this is subjective anyway. IOW,
> > I am not going to really fight with your approach ;)
>
> This is what it looks like, on top of your for_each_thread series
> from yesterday:

OK, lets forget about alternative approach for now. We can reconsider
it later. At least I have to admit that seqlock is more straightforward.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -646,6 +646,7 @@ struct signal_struct {
>* Live threads maintain their own counters and add to these
>* in __exit_signal, except for the group leader.
>*/
> + seqlock_t stats_lock;

Ah. Somehow I thought that you were going to use seqcount_t and fallback
to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
seqlock_t.

OK, I won't argue, this can make the seqbegin_or_lock simpler...

> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, 
> struct task_cputime *times)
>   struct signal_struct *sig = tsk->signal;
>   cputime_t utime, stime;
>   struct task_struct *t;
> -
> - times->utime = sig->utime;
> - times->stime = sig->stime;
> - times->sum_exec_runtime = sig->sum_sched_runtime;
> + unsigned int seq, nextseq;
>
>   rcu_read_lock();

Almost cosmetic nit, but afaics this patch expands the rcu critical section
for no reason. We only need rcu_read_lock/unlock around for_each_thread()
below.

> + nextseq = 0;
> + do {
> + seq = nextseq;
> + read_seqbegin_or_lock(&sig->stats_lock, &seq);
> + times->utime = sig->utime;
> + times->stime = sig->stime;
> + times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> + for_each_thread(tsk, t) {
> + task_cputime(t, &utime, &stime);
> + times->utime += utime;
> + times->stime += stime;
> + times->sum_exec_runtime += task_sched_runtime(t);
> + }
> + /*
> +  * If a writer is currently active, seq will be odd, and
> +  * read_seqbegin_or_lock will take the lock.
> +  */
> + nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> + } while (need_seqretry(&sig->stats_lock, seq));
> + done_seqretry(&sig->stats_lock, seq);

Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I mean,
this code still can livelock in theory. Just suppose that another CPU does
write_seqlock/write_sequnlock right after read_seqbegin_or_lock(). In this
case "seq & 1" will never be true and thus "or_lock" will never happen.

IMO, this should be fixed. Either we should guarantee the forward progress
or we should not play with read_seqbegin_or_lock() at all. This code assumes
that sooner or later "nextseq = raw_read_seqcount()" should return the odd
counter, but in theory this can never happen.

And if we want to fix this we do not need 2 counters, just we need to set
"seq = 1" manually after need_seqretry() == T. Say, like __dentry_path() does.
(but unlike __dentry_path() we do not need to worry about rcu_read_unlock so
the code will be simpler).

I am wondering if it makes sense to introduce

bool read_seqretry_or_lock(seqlock_t *lock, int *seq)
{
if (*seq & 1) {
read_sequnlock_excl(lock);
return false;
}

if (!read_seqretry(lock, *seq))
return false;

*seq = 1;
return true;
}

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
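
In other words, forward progress with read_seqbegin_or_lock() is normally guaranteed by forcing the locked pass explicitly once a lockless pass fails, as __dentry_path() does. A sketch of that pattern applied here (illustrative, not the code that was eventually merged):

    int seq = 0;    /* even: first pass is lockless */

    do {
            read_seqbegin_or_lock(&sig->stats_lock, &seq);
            /* ... read sig->utime, sig->stime and walk the threads ... */
            if (!need_seqretry(&sig->stats_lock, seq))
                    break;  /* snapshot was consistent */
            seq = 1;        /* odd: next pass takes the lock and cannot fail */
    } while (1);
    done_seqretry(&sig->stats_lock, seq);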


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Oleg Nesterov
On 08/14, Rik van Riel wrote:

 On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
  On 08/14, Rik van Riel wrote:
 
  On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
 
  Or you can expand the scope of write_seqlock/write_sequnlock, so that
   __unhash_process is called from inside the critical section. This looks
  simpler at first glance.
 
  The problem with that is that wait_task_zombie() calls
  thread_group_cputime_adjusted() in that if() branch, and
  that code ends up taking the seqlock for read...
 
Not sure I understand... This modifies parent->signal->c* counters,
  and obviously the exiting thread is not the member of parent's thread
  group, so thread_group_cputime_adjusted(parent) can never account the
  exiting child twice simply because it won't see it?

 You are right, the tree of processes only goes one way,
 so there should be no deadlock in taking psig->stats_lock
 and having thread_group_cputime_adjusted take sig->stats_lock
 for read within that section.

 However, it might need some lockdep annotation to keep
 lockdep from thinking we might take the same lock recursively :)

But wait_task_zombie() can (and should) call
thread_group_cputime_adjusted(zombie_child) outside of parent's ->siglock
or ->stats_lock so this should be safe.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Rik van Riel
On Thu, 14 Aug 2014 18:12:47 +0200
Oleg Nesterov o...@redhat.com wrote:

 Or you can expand the scope of write_seqlock/write_sequnlock, so that
 __unhash_process is called from inside the critical section. This looks
 simpler at first glance.
 
 Hmm, wait, it seems there is yet another problem ;) Afaics, you also
 need to modify __exit_signal() so that ->sum_sched_runtime/etc are
 accounted unconditionally, even if the group leader exits.

OK, this is what I have now.

I am still getting backwards time sometimes, but only tiny
increments. This suggests that cputime_adjust() may be the
culprit, and I have no good idea on how to fix that yet...

Should task_cputime_adjusted and thread_group_cputime_adjusted
pass in the address of a seqlock to use in case the values in
prev need to be updated?

Should we check whether the values in prev changed during the
time spent in the function?

Is this a race between task_cputime_adjusted and other writers
of signal->utime and signal->stime, instead of task_cputime_adjusted
racing with itself?

I am not sure what the best approach here is...
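
One shape such a fix could take -- purely a sketch of the suspected race, not
something proposed in this thread, with the struct and helper names assumed:
serialize only the read-modify-write of the cached prev values, so two racing
callers cannot each publish a result older than what the other already
returned.

	/* Sketch only: guard the cached "prev" values that cputime_adjust()
	 * uses to keep results monotonic.  Two racing callers can otherwise
	 * both read the old prev, compute, and store out of order. */
	struct prev_cputime_guard {
		cputime_t utime;
		cputime_t stime;
		raw_spinlock_t lock;		/* assumed extra field */
	};

	static void store_prev_cputime(struct prev_cputime_guard *prev,
				       cputime_t utime, cputime_t stime)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&prev->lock, flags);
		/* Never publish a value smaller than one already handed out. */
		prev->utime = max(prev->utime, utime);
		prev->stime = max(prev->stime, stime);
		raw_spin_unlock_irqrestore(&prev->lock, flags);
	}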

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel r...@redhat.com
---
 include/linux/sched.h  |  1 +
 kernel/exit.c  | 48 +++---
 kernel/fork.c  |  1 +
 kernel/sched/cputime.c | 36 +++
 kernel/sys.c   |  2 --
 kernel/time/posix-cpu-timers.c | 14 
 6 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 * Live threads maintain their own counters and add to these
 * in __exit_signal, except for the group leader.
 */
+   seqlock_t stats_lock;
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..c1a0ef2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,32 +115,34 @@ static void __exit_signal(struct task_struct *tsk)
 
if (tsk == sig->curr_target)
sig->curr_target = next_thread(tsk);
-   /*
-* Accumulate here the counters for all threads but the
-* group leader as they die, so they can be added into
-* the process-wide totals when those are taken.
-* The group leader stays around as a zombie as long
-* as there are other threads.  When it gets reaped,
-* the exit.c code will add its counts into these totals.
-* We won't ever get here for the group leader, since it
-* will have been the last reference on the signal_struct.
-*/
-   task_cputime(tsk, &utime, &stime);
-   sig->utime += utime;
-   sig->stime += stime;
-   sig->gtime += task_gtime(tsk);
-   sig->min_flt += tsk->min_flt;
-   sig->maj_flt += tsk->maj_flt;
-   sig->nvcsw += tsk->nvcsw;
-   sig->nivcsw += tsk->nivcsw;
-   sig->inblock += task_io_get_inblock(tsk);
-   sig->oublock += task_io_get_oublock(tsk);
-   task_io_accounting_add(&sig->ioac, &tsk->ioac);
-   sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
 
+   /*
+* Accumulate here the counters for all threads but the
+* group leader as they die, so they can be added into
+* the process-wide totals when those are taken.
+* The group leader stays around as a zombie as long
+* as there are other threads.  When it gets reaped,
+* the exit.c code will add its counts into these totals.
+* We won't ever get here for the group leader, since it
+* will have been the last reference on the 

Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Frederic Weisbecker
2014-08-14 16:39 GMT+02:00 Oleg Nesterov o...@redhat.com:
 On 08/14, Frederic Weisbecker wrote:

 2014-08-14 3:57 GMT+02:00 Rik van Riel r...@redhat.com:
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA1
 
  On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
  On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
 
  I'm worried about such lockless solution based on RCU or read
  seqcount because we lose the guarantee that an update is
  immediately visible by all subsequent readers.
 
  Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
  after that call clock_gettime(), with the spinlock we were
  guaranteed to see the new update. Now with a pure seqlock read
  approach, we guarantee a read sequence coherency but we don't
  guarantee the freshest update result.
 
  So that looks like a source of non monotonic results.
 
  Which update are you worried about, specifically?
 
  The seq_write_lock to update the usage stat in p->signal will lock out
  the seqlock read side used to check those results.
 
  Is there another kind of thing read by cpu_clock_sample_group that you
  believe is not excluded by the seq_lock?

 I mean the read side doesn't use a lock with seqlocks. It's only made
 of barriers and sequence numbers to ensure the reader doesn't read
 some half-complete update. But other than that it can as well see the
 update n - 1 since barriers don't enforce latest results.

 Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
 write_seqcount_begin-update-write_seqcount_begin can miss "update" part
 along with ->sequence modifications.

 But I still can't understand how this can lead to non-monotonic results,
 could you spell?

Well lets say clock = T.
CPU 0 updates at T + 1.
Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
while CPU 2 still reads T.
If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
possible that CPU 2 still sees T. With the spinlocked version that
thing can't happen, the second round would read at least T + 1 for
both CPUs.
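
To spell out the pattern under discussion (field name from the patch; this is
an illustration, not posted code): the read section below retries if it raced
with a writer, but nothing forces it to observe the very latest completed
update.

	u64 sample_runtime(struct signal_struct *sig)
	{
		unsigned int seq;
		u64 sum;

		do {
			seq = read_seqbegin(&sig->stats_lock);
			/* Consistent snapshot, but possibly the pre-update value. */
			sum = sig->sum_sched_runtime;
		} while (read_seqretry(&sig->stats_lock, seq));

		return sum;
	}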


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-14 Thread Mike Galbraith
On Thu, 2014-08-14 at 19:48 +0200, Oleg Nesterov wrote: 
 On 08/14, Oleg Nesterov wrote:
 
  OK, lets forget about alternative approach for now. We can reconsider
 it later. At least I have to admit that seqlock is more straightforward.
 
 Yes.
 
 But just for record, the lockless version doesn't look that bad to me,
 
  void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
  {
  	struct signal_struct *sig = tsk->signal;
  	bool lockless, is_dead;
  	struct task_struct *t;
  	unsigned long flags;
  	u64 exec;
  
  	lockless = true;
  	is_dead = !lock_task_sighand(tsk, &flags);
  retry:
  	times->utime = sig->utime;
  	times->stime = sig->stime;
  	times->sum_exec_runtime = exec = sig->sum_sched_runtime;
  	if (is_dead)
  		return;
  
  	if (lockless)
  		unlock_task_sighand(tsk, &flags);
  
  	rcu_read_lock();
  	for_each_thread(tsk, t) {
  		cputime_t utime, stime;
  		task_cputime(t, &utime, &stime);
  		times->utime += utime;
  		times->stime += stime;
  		times->sum_exec_runtime += task_sched_runtime(t);
  	}
  	rcu_read_unlock();
  
  	if (lockless) {
  		lockless = false;
  		is_dead = !lock_task_sighand(tsk, &flags);
  		if (is_dead || exec != sig->sum_sched_runtime)
  			goto retry;
  	}
  	unlock_task_sighand(tsk, &flags);
  }
 
 The obvious problem is that we should shift lock_task_sighand() from the
 callers to thread_group_cputime() first, or add 
 thread_group_cputime_lockless()
 and change the current users one by one.
 
 And of course, stats_lock is more generic.

Yours looks nice to me, particularly in that it doesn't munge structure
layout, could perhaps be backported to fix up production kernels.

For the N threads doing this on N cores case, seems rq-lock hammering
will still be a source of major box wide pain.  Is there any correctness
reason to add up unaccounted ->on_cpu beans, or is that just value
added?  Seems to me it can't matter, as you traverse, what you added up
on previous threads becomes ever more stale as you proceed, so big boxen
would be better off not doing that.
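
Sketched out, the cheaper variant hinted at here would look something like the
following (illustrative only, names assumed; whether dropping the rq->lock
sampling is acceptable is exactly the open question):

	/* Skip task_sched_runtime()'s rq->lock round trip: read only the
	 * runtime accounted at the last tick/context switch, accepting
	 * that time the thread is burning on its CPU right now is missed. */
	static u64 thread_runtime_unlocked(struct task_struct *t)
	{
		return t->se.sum_exec_runtime;
	}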

-Mike



Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-13 Thread Rik van Riel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> --- a/kernel/time/posix-cpu-timers.c
>> +++ b/kernel/time/posix-cpu-timers.c
>> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>>  	if (same_thread_group(tsk, current))
>>  		err = cpu_clock_sample(which_clock, tsk, &rtn);
>>  	} else {
>> -	unsigned long flags;
>> -	struct sighand_struct *sighand;
>> -
>> -	/*
>> -	 * while_each_thread() is not yet entirely RCU safe,
>> -	 * keep locking the group while sampling process
>> -	 * clock for now.
>> -	 */
>> -	sighand = lock_task_sighand(tsk, &flags);
>> -	if (!sighand)
>> -		return err;
>> -
>>  	if (tsk == current || thread_group_leader(tsk))
>>  		err = cpu_clock_sample_group(which_clock, tsk, &rtn);
>> -
>> -	unlock_task_sighand(tsk, &flags);
>>  	}
> 
> I'm worried about such lockless solution based on RCU or read
> seqcount because we lose the guarantee that an update is
> immediately visible by all subsequent readers.
> 
> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
> after that call clock_gettime(), with the spinlock we were
> guaranteed to see the new update. Now with a pure seqlock read
> approach, we guarantee a read sequence coherency but we don't
> guarantee the freshest update result.
> 
> So that looks like a source of non monotonic results.

Which update are you worried about, specifically?

The seq_write_lock to update the usage stat in p->signal will lock out
the seqlock read side used to check those results.

Is there another kind of thing read by cpu_clock_sample_group that you
believe is not excluded by the seq_lock?


- -- 
All rights reversed
-BEGIN PGP SIGNATURE-
Version: GnuPG v1

iQEcBAEBAgAGBQJT7BdtAAoJEM553pKExN6DngEH/1CJuBb6xij08AoZNQuW4WNQ
geKakADsYTz8FmutbGi+lJEHNKAMZQ5wYbyFNczPAX/fVJsOlC92Qtfwy5z/VupN
QzlRHh79ZJR5/6xGddlu/8LjGrMIXwKqShIeKtTzoENS+rxA82l42zoXTagal4yX
Peb5/Q7cBMg9pFZzUMITEJssQhDtyTN1rXiU5IsEG54PhDbSgFk7HK1cO46tRefb
x1WbUKZKUViVKnoKYhJqd6FDSWuPtL5EpebvMVj9TZ29JBQTMDGx8saUezjuY0YL
STAv/wqigmbbcNnloJpr3gO1iJMkknv3jHk6Bfv1Dil8Um1D3mBAAKFK+Ov8Rk0=
=kU1O
-END PGP SIGNATURE-


Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-13 Thread Frederic Weisbecker
On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>  	if (same_thread_group(tsk, current))
>  		err = cpu_clock_sample(which_clock, tsk, &rtn);
>  	} else {
> -	unsigned long flags;
> -	struct sighand_struct *sighand;
> -
> -	/*
> -	 * while_each_thread() is not yet entirely RCU safe,
> -	 * keep locking the group while sampling process
> -	 * clock for now.
> -	 */
> -	sighand = lock_task_sighand(tsk, &flags);
> -	if (!sighand)
> -		return err;
> -
>  	if (tsk == current || thread_group_leader(tsk))
>  		err = cpu_clock_sample_group(which_clock, tsk, &rtn);
> -
> -	unlock_task_sighand(tsk, &flags);
>  	}

I'm worried about such lockless solution based on RCU or read seqcount because
we lose the guarantee that an update is immediately visible by all subsequent
readers.

Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right after that
call clock_gettime(), with the spinlock we were guaranteed to see the new
update. Now with a pure seqlock read approach, we guarantee a
read sequence coherency but we don't guarantee the freshest update result.

So that looks like a source of non monotonic results.

>  
>   if (!err)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC] time,signal: protect resource use statistics with seqlock

2014-08-13 Thread Rik van Riel
On Wed, 13 Aug 2014 20:45:11 +0200
Oleg Nesterov  wrote:

> That said, it is not that I am really sure that seqcount_t in ->signal
> is actually worse, not to mention that this is subjective anyway. IOW,
> I am not going to really fight with your approach ;)

This is what it looks like, on top of your for_each_thread series
from yesterday:

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel 
---
 include/linux/sched.h  |  1 +
 kernel/exit.c  |  4 
 kernel/fork.c  |  1 +
 kernel/sched/cputime.c | 36 +++-
 kernel/sys.c   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 * Live threads maintain their own counters and add to these
 * in __exit_signal, except for the group leader.
 */
+   seqlock_t stats_lock;
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..8092e59 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -126,6 +126,7 @@ static void __exit_signal(struct task_struct *tsk)
 * will have been the last reference on the signal_struct.
 */
task_cputime(tsk, &utime, &stime);
+   write_seqlock(&sig->stats_lock);
sig->utime += utime;
sig->stime += stime;
sig->gtime += task_gtime(tsk);
@@ -137,6 +138,7 @@ static void __exit_signal(struct task_struct *tsk)
sig->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+   write_sequnlock(&sig->stats_lock);
}
 
sig->nr_threads--;
@@ -1043,6 +1045,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct 
task_struct *p)
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
+   write_seqlock(&psig->stats_lock);
psig->cutime += tgutime + sig->cutime;
psig->cstime += tgstime + sig->cstime;
psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1065,6 +1068,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct 
task_struct *p)
psig->cmaxrss = maxrss;
task_io_accounting_add(&psig->ioac, &p->ioac);
task_io_accounting_add(&psig->ioac, &sig->ioac);
+   write_sequnlock(&psig->stats_lock);
spin_unlock_irq(&p->real_parent->sighand->siglock);
}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct 
task_struct *tsk)
sig->curr_target = tsk;
init_sigpending(&sig->shared_pending);
INIT_LIST_HEAD(>posix_timers);
+   seqlock_init(&sig->stats_lock);
 
hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..b5f1c58 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct 
task_cputime *times)
struct signal_struct *sig = tsk->signal;
cputime_t utime, stime;
struct task_struct *t;
-
-   times->utime = sig->utime;
-   times->stime = sig->stime;
-   times->sum_exec_runtime = sig->sum_sched_runtime;
+   unsigned int seq, nextseq;
 
rcu_read_lock();
-   for_each_thread(tsk, t) {
-   task_cputime(t, &utime, &stime);
-   times->utime 
