Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Stanislaw Gruszka
On Mon, Aug 15, 2016 at 10:58:04AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > > Is this really equivalent though? It updates one task instead of all
> > > tasks in the group and there is no guarantee that tsk == current.
> > 
> > Oh, my intention was to update runtime on current.
> > 
> 
> Ok, so minimally that would need addressing. However, then I would worry
> that two tasks in a group calling the function at the same time would
> see different results because each of them updated a different task.
> Such a situation is inherently race-prone anyway but it's a large enough
> functional difference to be worth calling out.

It races because we don't know which thread will call clock_gettime()
first. But once that happens, the second thread will see the runtime value
updated by the first one, since we call update_curr() for it under
task_rq_lock() (the change from commit 6e998916dfe3).
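
In other words, the property I rely on is that CLOCK_PROCESS_CPUTIME_ID never
appears to go backwards when it is read from different threads. That can be
checked from user space with something like the program below (a quick sketch
of mine, not a proper test case): each thread remembers the largest value any
thread has published and verifies that its own fresh reading is never smaller.

/* gcc -O2 -pthread cputime_monotonic.c -o cputime_monotonic */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Largest CLOCK_PROCESS_CPUTIME_ID value any thread has observed so far. */
static _Atomic uint64_t last_seen;

static uint64_t proc_cputime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *checker(void *arg)
{
	uint64_t prev, val, cur;
	long i;

	(void)arg;
	for (i = 0; i < 1000000; i++) {
		prev = atomic_load(&last_seen);	/* published before our read */
		val = proc_cputime_ns();
		if (val < prev)
			printf("went backwards: %llu < %llu\n",
			       (unsigned long long)val,
			       (unsigned long long)prev);
		/* Publish our reading if it is the largest seen so far. */
		cur = atomic_load(&last_seen);
		while (cur < val &&
		       !atomic_compare_exchange_weak(&last_seen, &cur, val))
			;
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, checker, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	puts("done");
	return 0;
}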

> Minimally, I don't think such a patch is a replacement for Giovanni's
> which is functionally equivalent to the current code but could be layered
> on top if it is proven to be ok.

I agree. I wanted to post my patch on top of Giovanni's.

> > > Glancing at it, it should monotonically increase but it looks like it
> > > would calculate stale data.
> > 
> > Yes, until the next tick on a CPU the patch does not count the partial
> > runtime of the thread running on that CPU. However, that was the
> > behaviour before commit d670ec13178d0 - that is how the old
> > thread_group_sched_runtime() function worked:
> > 
> 
> Sure, but does this patch not reintroduce the "SMP wobble" and the
> problem of "the diff of 'process' should always be >= the diff of
> 'thread'" ?

It should not reintroduce that problem, again because of the change from
commit 6e998916dfe3: when a thread reads sum_exec_runtime it also updates
that value, so the process-wide read then sees the updated value. I ran the
test case from the "SMP wobble" commit and the problem does not show up in
my tests.
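
For reference, the check is essentially of the following shape (a minimal
sketch of mine, not the exact reproducer from that commit message): a
per-thread CPU clock interval measured strictly inside a process CPU clock
interval must never be larger than the process-wide one.

/* gcc -O2 -pthread wobble_check.c -o wobble_check */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t read_ns(clockid_t clk)
{
	struct timespec ts;

	clock_gettime(clk, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *burn(void *arg)
{
	(void)arg;
	for (;;)
		;	/* keep other CPUs busy so their runtime matters */
	return NULL;
}

int main(void)
{
	uint64_t p0, t0, t1, p1;
	volatile unsigned long spin;
	pthread_t tid;
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&tid, NULL, burn, NULL);

	for (i = 0; i < 100000; i++) {
		p0 = read_ns(CLOCK_PROCESS_CPUTIME_ID);	/* outer interval start */
		t0 = read_ns(CLOCK_THREAD_CPUTIME_ID);	/* inner interval start */

		for (spin = 0; spin < 10000; spin++)
			;				/* do a bit of work */

		t1 = read_ns(CLOCK_THREAD_CPUTIME_ID);	/* inner interval end */
		p1 = read_ns(CLOCK_PROCESS_CPUTIME_ID);	/* outer interval end */

		if (p1 - p0 < t1 - t0) {
			printf("wobble: process diff %llu < thread diff %llu\n",
			       (unsigned long long)(p1 - p0),
			       (unsigned long long)(t1 - t0));
			return 1;
		}
	}
	puts("ok");
	return 0;
}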

Perhaps I should post the patch with a descriptive changelog and things
would be clearer ...

Stanislaw


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Mel Gorman
On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > Is this really equivalent though? It updates one task instead of all
> > tasks in the group and there is no guarantee that tsk == current.
> 
> Oh, my intention was to update runtime on current.
> 

Ok, so minimally that would need addressing. However, then I would worry
that two tasks in a group calling the function at the same time would
see different results because each of them updated a different task.
Such a situation is inherently race-prone anyway but it's a large enough
functional difference to be worth calling out.

Minimally, I don't think such a patch is a replacement for Giovanni's
which is functionally equivalent to the current code but could be layered
on top if it is proven to be ok.

> > Glancing at it, it should monotonically increase but it looks like it
> > would calculate stale data.
> 
> Yes, until the next tick on a CPU the patch does not count the partial
> runtime of the thread running on that CPU. However, that was the
> behaviour before commit d670ec13178d0 - that is how the old
> thread_group_sched_runtime() function worked:
> 

Sure, but does this patch not reintroduce the "SMP wobble" and the
problem of "the diff of 'process' should always be >= the diff of
'thread'" ?

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Wanpeng Li
2016-08-15 17:21 GMT+08:00 Stanislaw Gruszka :
> On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
>> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka :
>> > Hi
>> >
>> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> >> Nice detective work! I'm wondering, where do we stand if compared with a
>> >> pre-6e998916dfe3 kernel?
>> >>
>> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
>> >> and I suspect v3.17 does not run easily on a recent distro. Could you
>> >> attempt to revert the bad effects of 6e998916dfe3 perhaps, just to get
>> >> numbers - i.e. don't try to make the result correct, just see what the
>> >> performance gap is, roughly.
>> >>
>> >> If there's still a significant gap then it might make sense to optimize
>> >> this some more.
>> >
>> > I measured (partial) revert performance on 4.7 using mmtest instructions
>> > from Giovanni and also tested some other possible fix (draft version):
>> >
>> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> > index 75f98c5..54fdf6d 100644
>> > --- a/kernel/sched/cputime.c
>> > +++ b/kernel/sched/cputime.c
>> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >  	unsigned int seq, nextseq;
>> >  	unsigned long flags;
>> > 
>> > +	(void) task_sched_runtime(tsk);
>> > +
>> >  	rcu_read_lock();
>> >  	/* Attempt a lockless read on the first round. */
>> >  	nextseq = 0;
>> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >  		task_cputime(t, &utime, &stime);
>> >  		times->utime += utime;
>> >  		times->stime += stime;
>> > -		times->sum_exec_runtime += task_sched_runtime(t);
>> > +		times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>> Won't this leave the stats for the other threads not updated?
>
> Right, they will not be updated until the next tick or schedule() on the
> CPUs running those threads.

Yeah, I think this change will result in stale stats for the other threads
while they are running, until update_curr() is next called for them.

Regards,
Wanpeng Li


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Stanislaw Gruszka
On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka :
> > Hi
> >
> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> >> Nice detective work! I'm wondering, where do we stand if compared with a
> >> pre-6e998916dfe3 kernel?
> >>
> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
> >> and I suspect v3.17 does not run easily on a recent distro. Could you
> >> attempt to revert the bad effects of 6e998916dfe3 perhaps, just to get
> >> numbers - i.e. don't try to make the result correct, just see what the
> >> performance gap is, roughly.
> >>
> >> If there's still a significant gap then it might make sense to optimize
> >> this some more.
> >
> > I measured (partial) revert performance on 4.7 using mmtest instructions
> > from Giovanni and also tested some other possible fix (draft version):
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 75f98c5..54fdf6d 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >  	unsigned int seq, nextseq;
> >  	unsigned long flags;
> > 
> > +	(void) task_sched_runtime(tsk);
> > +
> >  	rcu_read_lock();
> >  	/* Attempt a lockless read on the first round. */
> >  	nextseq = 0;
> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >  		task_cputime(t, &utime, &stime);
> >  		times->utime += utime;
> >  		times->stime += stime;
> > -		times->sum_exec_runtime += task_sched_runtime(t);
> > +		times->sum_exec_runtime += t->se.sum_exec_runtime;
> 
> Won't this leave the stats for the other threads not updated?

Right, they will not be updated until the next tick or schedule() on the
CPUs running those threads.
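
It is easy to watch this from user space. A quick sketch of mine - assuming a
kernel with the draft patch applied, the spinning sibling's contribution
should show up below in tick-sized bursts, while on current kernels the
increments stay smooth:

/* gcc -O2 -pthread cputime_granularity.c -o cputime_granularity */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		;	/* sibling thread burning CPU on another core */
	return NULL;
}

static uint64_t proc_cputime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	struct timespec pause = { 0, 1000 * 1000 };	/* 1ms of wall clock */
	uint64_t prev, now;
	pthread_t tid;
	int i;

	pthread_create(&tid, NULL, spin, NULL);
	prev = proc_cputime_ns();

	/*
	 * Sample the process CPU clock once per millisecond of wall time.
	 * The spinner contributes roughly 1ms per sample; if its runtime is
	 * only folded in on tick/schedule, the increments arrive in
	 * tick-sized bursts instead of growing smoothly.
	 */
	for (i = 0; i < 50; i++) {
		nanosleep(&pause, NULL);
		now = proc_cputime_ns();
		printf("%2d: +%8llu ns\n", i, (unsigned long long)(now - prev));
		prev = now;
	}
	return 0;
}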

Stanislaw


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Stanislaw Gruszka
On Mon, Aug 15, 2016 at 09:33:49AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > > mmtest benchmark results are below (full compare-kernels.sh output is in 
> > > attachment):
> > > 
> > >    vanilla-4.7           revert          prefetch             patch
> > > 4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
> > > 5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
> > > 6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
> > > 6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
> > > 7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
> > > 7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
> > > 7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
> > > 7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
> > > 8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)
> > > 5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
> > > 5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
> > > 5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
> > > 6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
> > > 7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
> > > 7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
> > > 7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
> > > 7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
> > > 
> > > The patch works because we update sum_exec_runtime on the current
> > > thread, which assures we see a proper sum_exec_runtime value on
> > > different CPUs. I tested it with the reproducers from commits
> > > 6e998916dfe32 and d670ec13178d0; the patch did not break them. I'm
> > > going to run some other tests.
> > > 
> > > The patch is a draft version for early review: task_sched_runtime()
> > > will be simplified (since it is called only for the current thread)
> > > and possibly split into two functions, one that calls update_curr()
> > > and another that returns sum_exec_runtime (assuring it is consistent
> > > on 32-bit arches).
> > > 
> > > Stanislaw
> > 
> 
> Is this really equivalent though? It updates one task instead of all
> tasks in the group and there is no guarantee that tsk == current.

Oh, my intention was to update runtime on current.

> Glancing at it, it should monotonically increase but it looks like it
> would calculate stale data.

Yes, until the next tick on a CPU the patch does not count the partial
runtime of the thread running on that CPU. However, that was the behaviour
before commit d670ec13178d0 - that is how the old
thread_group_sched_runtime() function worked:

 /*
- * Return sum_exec_runtime for the thread group.
- * In case the task is currently running, return the sum plus current's
- * pending runtime that have not been accounted yet.
- *
- * Note that the thread group might have other running tasks as well,
- * so the return value not includes other pending runtime that other
- * running tasks might have.
- */
-unsigned long long thread_group_sched_runtime(struct task_struct *p)

Stanislaw


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Wanpeng Li
2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka :
> Hi
>
> On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> Nice detective work! I'm wondering, where do we stand if compared with a
>> pre-6e998916dfe3 kernel?
>>
>> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
>> and I suspect v3.17 does not run easily on a recent distro. Could you
>> attempt to revert the bad effects of 6e998916dfe3 perhaps, just to get
>> numbers - i.e. don't try to make the result correct, just see what the
>> performance gap is, roughly.
>>
>> If there's still a significant gap then it might make sense to optimize
>> this some more.
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
> 
> +	(void) task_sched_runtime(tsk);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  		task_cputime(t, &utime, &stime);
>  		times->utime += utime;
>  		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> +		times->sum_exec_runtime += t->se.sum_exec_runtime;

Won't this leave the stats for the other threads not updated?

Regards,
Wanpeng Li


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Mel Gorman
On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > mmtest benchmark results are below (full compare-kernels.sh output is in 
> > attachment):
> > 
> >    vanilla-4.7           revert          prefetch             patch
> > 4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
> > 5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
> > 6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
> > 6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
> > 7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
> > 7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
> > 7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
> > 7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
> > 8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)
> > 5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
> > 5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
> > 5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
> > 6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
> > 7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
> > 7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
> > 7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
> > 7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
> > 
> > The patch works because we update sum_exec_runtime on the current
> > thread, which assures we see a proper sum_exec_runtime value on
> > different CPUs. I tested it with the reproducers from commits
> > 6e998916dfe32 and d670ec13178d0; the patch did not break them. I'm
> > going to run some other tests.
> > 
> > The patch is a draft version for early review: task_sched_runtime()
> > will be simplified (since it is called only for the current thread)
> > and possibly split into two functions, one that calls update_curr()
> > and another that returns sum_exec_runtime (assuring it is consistent
> > on 32-bit arches).
> > 
> > Stanislaw
> 

Is this really equivalent though? It updates one task instead of all
tasks in the group and there is no guarantee that tsk == current.
Glancing at it, it should monotonically increase but it looks like it
would calculate stale data.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-15 Thread Giovanni Gherdovich
Hello Stanislaw,

On Fri, 2016-08-12 at 14:10 +0200, Stanislaw Gruszka wrote:
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
> 
> +	(void) task_sched_runtime(tsk);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  		task_cputime(t, &utime, &stime);
>  		times->utime += utime;
>  		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> +		times->sum_exec_runtime += t->se.sum_exec_runtime;
>  	}
>  	/* If lockless access failed, take the lock. */
>  	nextseq = 1;
> ---
> mmtest benchmark results are below (full compare-kernels.sh output is in 
> attachment):
> 
>    vanilla-4.7           revert          prefetch             patch
> 4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
> 5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
> 6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
> 6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
> 7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
> 7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
> 7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
> 7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
> 8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)
> 5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
> 5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
> 5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
> 6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
> 7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
> 7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
> 7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
> 7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
> 
> The patch works because we update sum_exec_runtime on the current thread,
> which assures we see a proper sum_exec_runtime value on different CPUs.
> I tested it with the reproducers from commits 6e998916dfe32 and
> d670ec13178d0; the patch did not break them. I'm going to run some other
> tests.
> 
> The patch is a draft version for early review: task_sched_runtime() will
> be simplified (since it is called only for the current thread) and
> possibly split into two functions, one that calls update_curr() and
> another that returns sum_exec_runtime (assuring it is consistent on
> 32-bit arches).
> 
> Stanislaw

Thank you for having a look at this.
Your patch performs very well, even better than the pre-6e998916dfe3
numbers I was aiming for. I confirm your results on my test machine
(Sandy Bridge, 32 cores, 2 NUMA nodes).
I didn't apply it on the very latest 4.8-rc but used what I had handy for
comparison (i.e. 4.7-rc7 and the parent of 6e998916dfe3).
As I said, my measurements match yours (my tables follow); looks like
your diff cures the problem while mine cures the symptoms.
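
For context, the workload is just N threads hammering the syscall under test
in a tight loop while the harness times the whole run; a minimal stand-alone
equivalent (an illustrative sketch, not the actual mmtests pound source) is
below, and my tables follow.

/* gcc -O2 -pthread pound_gettime.c -o pound_gettime && ./pound_gettime 8 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 2000000L

static void *pound(void *arg)
{
	struct timespec ts;
	long i;

	(void)arg;
	/* The syscall being measured; times() could be pounded the same way. */
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 2;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, pound, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%d threads: %.2f s\n", nthreads,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	free(tids);
	return 0;
}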

clock_gettime():

threads    4.7-rc7    3.18-rc3             4.7-rc7 + prefetch    4.7-rc7 + Stanislaw
                      (pre-6e998916dfe3)
2          3.48       2.23 ( 35.68%)       3.06 ( 11.83%)        1.08 ( 68.81%)
5          3.33       2.83 ( 14.84%)       3.25 (  2.40%)        0.71 ( 78.55%)
8          3.37       2.84 ( 15.80%)       3.26 (  3.30%)        0.56 ( 83.49%)
12         3.32       3.09 (  6.69%)       3.37 ( -1.60%)        0.42 ( 87.28%)
21         4.01       3.14 ( 21.70%)       3.90 (  2.74%)        0.35 ( 91.35%)
30         3.63       3.28 (  9.75%)       3.36 (  7.41%)        0.28 ( 92.23%)
48         3.71       3.02 ( 18.69%)       3.11 ( 16.27%)        0.39 ( 89.39%)
79         3.75       2.88 ( 23.23%)       3.16 ( 15.74%)        0.46 ( 87.76%)
110        3.81       2.95 ( 22.62%)       3.25 ( 14.80%)        0.56 ( 85.41%)
128        3.88       3.05 ( 21.28%)       3.31 ( 14.76%)        0.62 ( 84.10%)

times():

threads    4.7-rc7    3.18-rc3             4.7-rc7 + prefetch    4.7-rc7 + Stanislaw
                      (pre-6e998916dfe3)

Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-12 Thread Stanislaw Gruszka
Hi

On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a 
> pre-6e998916dfe3 kernel?
> 
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
> and I suspect v3.17 does not run easily on a recent distro. Could you
> attempt to revert the bad effects of 6e998916dfe3 perhaps, just to get
> numbers - i.e. don't try to make the result correct, just see what the
> performance gap is, roughly.
> 
> If there's still a significant gap then it might make sense to optimize
> this some more.

I measured (partial) revert performance on 4.7 using mmtest instructions
from Giovanni and also tested some other possible fix (draft version):

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	(void) task_sched_runtime(tsk);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 		task_cputime(t, &utime, &stime);
 		times->utime += utime;
 		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
+		times->sum_exec_runtime += t->se.sum_exec_runtime;
 	}
 	/* If lockless access failed, take the lock. */
 	nextseq = 1;
---
mmtest benchmark results are below (full compare-kernels.sh output is in 
attachment):

   vanilla-4.7           revert          prefetch             patch
4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)
5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)

The patch works because we update sum_exec_runtime on the current thread,
which assures we see a proper sum_exec_runtime value on different CPUs. I
tested it with the reproducers from commits 6e998916dfe32 and d670ec13178d0;
the patch did not break them. I'm going to run some other tests.
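
For reference, the 6e998916dfe32 check is basically of this shape (a minimal
sketch of mine, not the exact reproducer from that commit message): sleep on
the process CPU clock to an absolute target and verify that clock_gettime()
never reports a value before that target afterwards.

/* gcc -O2 -pthread nanosleep_check.c -o nanosleep_check */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *burn(void *arg)
{
	(void)arg;
	for (;;)
		;	/* keep the process CPU clock advancing */
	return NULL;
}

int main(void)
{
	struct timespec now, target, after;
	pthread_t tid;
	long early;

	pthread_create(&tid, NULL, burn, NULL);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now);

	/* Absolute target: 100ms of additional process CPU time. */
	target = now;
	target.tv_nsec += 100 * 1000 * 1000;
	if (target.tv_nsec >= 1000000000) {
		target.tv_sec++;
		target.tv_nsec -= 1000000000;
	}

	clock_nanosleep(CLOCK_PROCESS_CPUTIME_ID, TIMER_ABSTIME, &target, NULL);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after);

	/* After waking, the clock must not read earlier than the target. */
	if (after.tv_sec < target.tv_sec ||
	    (after.tv_sec == target.tv_sec && after.tv_nsec < target.tv_nsec)) {
		early = (target.tv_sec - after.tv_sec) * 1000000000L +
			(target.tv_nsec - after.tv_nsec);
		printf("woke early by %ld ns\n", early);
		return 1;
	}
	printf("ok\n");
	return 0;
}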

The patch is a draft version for early review: task_sched_runtime() will be
simplified (since it is called only for the current thread) and possibly
split into two functions, one that calls update_curr() and another that
returns sum_exec_runtime (assuring it is consistent on 32-bit arches).

Stanislaw

poundtime
                                     vanilla-4.7        revert             prefetch           mask
Min  real-pound_clock_gettime-2      4.38 (  0.00%)     2.73 ( 37.67%)     3.62 ( 17.35%)     1.19 ( 72.83%)
Min  real-pound_clock_gettime-5      5.40 (  0.00%)     4.76 ( 11.85%)     4.49 ( 16.85%)     0.99 ( 81.67%)
Min  real-pound_clock_gettime-8      5.83 (  0.00%)     4.88 ( 16.30%)     5.91 ( -1.37%)     0.88 ( 84.91%)
Min  real-pound_clock_gettime-12     6.55 (  0.00%)     4.87 ( 25.65%)     5.98 (  8.70%)     0.84 ( 87.18%)
Min  real-pound_clock_gettime-21     7.11 (  0.00%)     5.10 ( 28.27%)     6.63 (  6.75%)     0.85 ( 88.05%)
Min  real-pound_clock_gettime-30     7.56 (  0.00%)     5.20 ( 31.22%)     7.08 (  6.35%)     0.87 ( 88.49%)
Min  real-pound_clock_gettime-48     7.78 (  0.00%)     5.24 ( 32.65%)     7.20 (  7.46%)     0.92 (

Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-10 Thread Ingo Molnar

* Giovanni Gherdovich  wrote:

> Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
> inconsistency") fixed a problem whereby clock_nanosleep() followed by
> clock_gettime() could allow a task to wake early. It addressed the problem
> by calling the scheduling class's update_curr() when the cputimer starts.
> 
> Said change induced a considerable performance regression on the syscalls
> times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
> debuggers and applications that monitor their own performance that
> accidentally depend on the performance of these specific calls.
> 
> This patch mitigates the performance loss by prefetching data in the CPU
> cache, as stalls due to cache misses appear to be where most time is spent
> in our benchmarks.
> 
> Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
> box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
> variable number of threads, from 2 to 4*num_cpus; the results are in
> seconds and correspond to the average of 10 runs; the percentage gain is
> computed with (before-after)/before so a positive value is an improvement
> (it's faster). The improvement varies between a few percent for 5-20
> threads and more than 10% for 2 or >20 threads.
> 
> pound_clock_gettime:
> 
> threads   4.7-rc7   patched 4.7-rc7
> [num]     [secs]    [secs (percent)]
>   2       3.48      3.06 ( 11.83%)
>   5       3.33      3.25 (  2.40%)
>   8       3.37      3.26 (  3.30%)
>  12       3.32      3.37 ( -1.60%)
>  21       4.01      3.90 (  2.74%)
>  30       3.63      3.36 (  7.41%)
>  48       3.71      3.11 ( 16.27%)
>  79       3.75      3.16 ( 15.74%)
> 110       3.81      3.25 ( 14.80%)
> 128       3.88      3.31 ( 14.76%)

Nice detective work! I'm wondering, where do we stand if compared with a 
pre-6e998916dfe3 kernel?

I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
and I suspect v3.17 does not run easily on a recent distro. Could you attempt
to revert the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e.
don't try to make the result correct, just see what the performance gap is,
roughly.

If there's still a significant gap then it might make sense to optimize this
some more.

Thanks,

Ingo


Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()

2016-08-10 Thread Giovanni Gherdovich
Hello Ingo,

thank you for your reply.

Ingo Molnar 
> Nice detective work! I'm wondering, where do we stand if compared with a 
> pre-6e998916dfe3 kernel?

The data follows. A considerable part of the performance loss is recovered;
something is still on the table.

"3.18-pre-bug" is the parent of 6e998916dfe3, i.e. 6e998916dfe3^1
"3.18-bug" is the revision 6e998916dfe3 itself.
Figures are in seconds. Percentages refer to 3.18-pre-bug, negative = worse.


times()

threads    3.18-pre-bug      3.18-bug          4.7.0-rc7         4.7.0-rc7-patched

2          2.27 (  0.00%)    3.73 (-64.71%)    3.65 (-61.14%)    3.06 (-35.16%)
5          2.78 (  0.00%)    3.77 (-35.56%)    3.45 (-23.98%)    3.25 (-16.79%)
8          2.79 (  0.00%)    4.41 (-57.71%)    3.52 (-26.05%)    3.26 (-16.53%)
12         3.02 (  0.00%)    3.56 (-17.94%)    3.29 ( -9.08%)    3.37 (-11.74%)
21         3.10 (  0.00%)    4.61 (-48.74%)    4.07 (-31.34%)    3.90 (-25.89%)
30         3.33 (  0.00%)    5.75 (-72.53%)    3.87 (-16.01%)    3.36 ( -0.81%)
48         2.96 (  0.00%)    6.06 (-105.04%)   3.79 (-28.10%)    3.11 ( -5.14%)
79         2.88 (  0.00%)    6.24 (-116.83%)   3.88 (-34.81%)    3.16 ( -9.84%)
110        2.98 (  0.00%)    6.37 (-114.08%)   3.90 (-31.12%)    3.25 ( -9.07%)
128        3.10 (  0.00%)    6.35 (-104.61%)   4.00 (-28.87%)    3.31 ( -6.57%)


clock_gettime()

threads    3.18-pre-bug      3.18-bug          4.7.0-rc7         4.7.0-rc7-patched

2          2.23 (  0.00%)    3.68 (-64.56%)    3.48 (-55.48%)    3.25 (-45.41%)
5          2.83 (  0.00%)    3.78 (-33.42%)    3.33 (-17.43%)    3.17 (-12.03%)
8          2.84 (  0.00%)    4.31 (-52.12%)    3.37 (-18.76%)    3.22 (-13.43%)
12         3.09 (  0.00%)    3.61 (-16.74%)    3.32 ( -7.17%)    3.36 ( -8.47%)
21         3.14 (  0.00%)    4.63 (-47.36%)    4.01 (-27.71%)    3.92 (-24.68%)
30         3.28 (  0.00%)    5.75 (-75.37%)    3.63 (-10.80%)    3.40 ( -3.69%)
48         3.02 (  0.00%)    6.05 (-100.56%)   3.71 (-22.99%)    3.16 ( -4.64%)
79         2.88 (  0.00%)    6.30 (-118.90%)   3.75 (-30.26%)    3.28 (-13.93%)
110        2.95 (  0.00%)    6.46 (-119.00%)   3.81 (-29.24%)    3.38 (-14.69%)
128        3.05 (  0.00%)    6.42 (-110.08%)   3.88 (-27.04%)    3.38 (-10.70%)


Regards,
Giovanni Gherdovich

