Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
On Mon, Aug 15, 2016 at 10:58:04AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > > Is this really equivalent though? It updates one task instead of all
> > > tasks in the group and there is no guarantee that tsk == current.
> >
> > Oh, my intention was to update runtime on current.
>
> Ok, so minimally that would need addressing. However, then I would worry
> that two tasks in a group calling the function at the same time would
> see different results because each of them updated a different task.
> Such a situation is inherently race-prone anyway but it's a large enough
> functional difference to be worth calling out.

It races because we don't know which thread will call clock_gettime()
first. But once that happens, the second thread will see the updated
runtime value from the first thread, since we call update_curr() for it
under task_rq_lock() (a change from commit 6e998916dfe3).

> Minimally, I don't think such a patch is a replacement for Giovanni's
> which is functionally equivalent to the current code but could be layered
> on top if it is proven to be ok.

I agree. I wanted to post my patch on top of Giovanni's.

> > > Glancing at it, it should monotonically increase but it looks like it
> > > would calculate stale data.
> >
> > Yes, until the next tick on a CPU, the patch does not count partial
> > runtime of a thread running on that CPU. However that was the behaviour
> > before commit d670ec13178d0 - that's how the old
> > thread_group_sched_runtime() function worked:
>
> Sure, but does this patch not reintroduce the "SMP wobble" and the
> problem of "the diff of 'process' should always be >= the diff of
> 'thread'" ?

It should not reintroduce that problem, also because of the change from
commit 6e998916dfe3. When a thread reads sum_exec_runtime it also updates
that value, so a subsequent process-wide read sees the updated value. I
ran the test case from the "SMP wobble" commit and the problem does not
happen in my tests.

Perhaps I should post the patch with a descriptive changelog and things
would be clearer ...

Stanislaw
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > Is this really equivalent though? It updates one task instead of all
> > tasks in the group and there is no guarantee that tsk == current.
>
> Oh, my intention was to update runtime on current.

Ok, so minimally that would need addressing. However, then I would worry
that two tasks in a group calling the function at the same time would
see different results because each of them updated a different task.
Such a situation is inherently race-prone anyway but it's a large enough
functional difference to be worth calling out.

Minimally, I don't think such a patch is a replacement for Giovanni's
which is functionally equivalent to the current code but could be layered
on top if it is proven to be ok.

> > Glancing at it, it should monotonically increase but it looks like it
> > would calculate stale data.
>
> Yes, until the next tick on a CPU, the patch does not count partial
> runtime of a thread running on that CPU. However that was the behaviour
> before commit d670ec13178d0 - that's how the old
> thread_group_sched_runtime() function worked:

Sure, but does this patch not reintroduce the "SMP wobble" and the
problem of "the diff of 'process' should always be >= the diff of
'thread'" ?

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 17:21 GMT+08:00 Stanislaw Gruszka:
> On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
>> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka:
>> > Hi
>> >
>> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> >> Nice detective work! I'm wondering, where do we stand if compared
>> >> with a pre-6e998916dfe3 kernel?
>> >>
>> >> I admit this is a difficult question: 6e998916dfe3 does not revert
>> >> cleanly and I suspect v3.17 does not run easily on a recent distro.
>> >> Could you attempt to revert the bad effects of 6e998916dfe3 perhaps,
>> >> just to get numbers - i.e. don't try to make the result correct,
>> >> just see what the performance gap is, roughly.
>> >>
>> >> If there's still a significant gap then it might make sense to
>> >> optimize this some more.
>> >
>> > I measured (partial) revert performance on 4.7 using mmtest
>> > instructions from Giovanni and also tested some other possible fix
>> > (draft version):
>> >
>> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> > index 75f98c5..54fdf6d 100644
>> > --- a/kernel/sched/cputime.c
>> > +++ b/kernel/sched/cputime.c
>> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >  	unsigned int seq, nextseq;
>> >  	unsigned long flags;
>> >
>> > +	(void) task_sched_runtime(tsk);
>> > +
>> >  	rcu_read_lock();
>> >  	/* Attempt a lockless read on the first round. */
>> >  	nextseq = 0;
>> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >  		task_cputime(t, &utime, &stime);
>> >  		times->utime += utime;
>> >  		times->stime += stime;
>> > -		times->sum_exec_runtime += task_sched_runtime(t);
>> > +		times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>> Won't this leave stale stats for the other threads?
>
> No, not until the next tick/schedule() on the CPUs running those threads.

Yeah, I think this change will result in stats not being updated for
other threads while they are running, until the next update_curr() is
called for them.

Regards,
Wanpeng Li
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka:
> > Hi
> >
> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> >> Nice detective work! I'm wondering, where do we stand if compared
> >> with a pre-6e998916dfe3 kernel?
> >>
> >> I admit this is a difficult question: 6e998916dfe3 does not revert
> >> cleanly and I suspect v3.17 does not run easily on a recent distro.
> >> Could you attempt to revert the bad effects of 6e998916dfe3 perhaps,
> >> just to get numbers - i.e. don't try to make the result correct,
> >> just see what the performance gap is, roughly.
> >>
> >> If there's still a significant gap then it might make sense to
> >> optimize this some more.
> >
> > I measured (partial) revert performance on 4.7 using mmtest
> > instructions from Giovanni and also tested some other possible fix
> > (draft version):
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 75f98c5..54fdf6d 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >  	unsigned int seq, nextseq;
> >  	unsigned long flags;
> >
> > +	(void) task_sched_runtime(tsk);
> > +
> >  	rcu_read_lock();
> >  	/* Attempt a lockless read on the first round. */
> >  	nextseq = 0;
> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >  		task_cputime(t, &utime, &stime);
> >  		times->utime += utime;
> >  		times->stime += stime;
> > -		times->sum_exec_runtime += task_sched_runtime(t);
> > +		times->sum_exec_runtime += t->se.sum_exec_runtime;
>
> Won't this leave stale stats for the other threads?

No, not until the next tick/schedule() on the CPUs running those threads.

Stanislaw
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
On Mon, Aug 15, 2016 at 09:33:49AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > > mmtest benchmark results are below (full compare-kernels.sh output
> > > is in attachment):
> > >
> > > vanilla-4.7       revert            prefetch          patch
> > > 4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
> > > 5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
> > > 6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
> > > 6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
> > > 7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
> > > 7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
> > > 7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
> > > 7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
> > > 8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)
> > > 5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
> > > 5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
> > > 5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
> > > 6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
> > > 7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
> > > 7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
> > > 7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
> > > 7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
> > >
> > > The patch works because we update sum_exec_runtime on the current
> > > thread, which assures we see the proper sum_exec_runtime value on
> > > different CPUs. I tested it with reproducers from commits
> > > 6e998916dfe32 and d670ec13178d0; the patch did not break them. I'm
> > > going to run some other tests.
> > >
> > > The patch is a draft version for early review; task_sched_runtime()
> > > will be simplified (since it's called only on the current thread)
> > > and possibly split into two functions: one that calls update_curr()
> > > and another that returns sum_exec_runtime (assuring it's consistent
> > > on 32-bit arches).
> > >
> > > Stanislaw
>
> Is this really equivalent though? It updates one task instead of all
> tasks in the group and there is no guarantee that tsk == current.

Oh, my intention was to update runtime on current.

> Glancing at it, it should monotonically increase but it looks like it
> would calculate stale data.

Yes, until the next tick on a CPU, the patch does not count the partial
runtime of a thread running on that CPU. However, that was the behaviour
before commit d670ec13178d0 - that's how the old
thread_group_sched_runtime() function worked:

-/*
- * Return sum_exec_runtime for the thread group.
- * In case the task is currently running, return the sum plus current's
- * pending runtime that have not been accounted yet.
- *
- * Note that the thread group might have other running tasks as well,
- * so the return value not includes other pending runtime that other
- * running tasks might have.
- */
-unsigned long long thread_group_sched_runtime(struct task_struct *p)

Stanislaw
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka:
> Hi
>
> On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> Nice detective work! I'm wondering, where do we stand if compared
>> with a pre-6e998916dfe3 kernel?
>>
>> I admit this is a difficult question: 6e998916dfe3 does not revert
>> cleanly and I suspect v3.17 does not run easily on a recent distro.
>> Could you attempt to revert the bad effects of 6e998916dfe3 perhaps,
>> just to get numbers - i.e. don't try to make the result correct,
>> just see what the performance gap is, roughly.
>>
>> If there's still a significant gap then it might make sense to
>> optimize this some more.
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
>
> +	(void) task_sched_runtime(tsk);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  		task_cputime(t, &utime, &stime);
>  		times->utime += utime;
>  		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> +		times->sum_exec_runtime += t->se.sum_exec_runtime;

Won't this leave stale stats for the other threads?

Regards,
Wanpeng Li
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > mmtest benchmark results are below (full compare-kernels.sh output
> > is in attachment):
> >
> > vanilla-4.7       revert            prefetch          patch
> > 4.74 (  0.00%)    3.04 ( 35.93%)    4.09 ( 13.81%)    1.30 ( 72.59%)
> > 5.49 (  0.00%)    5.00 (  8.97%)    5.34 (  2.72%)    1.03 ( 81.16%)
> > 6.12 (  0.00%)    4.91 ( 19.73%)    5.97 (  2.40%)    0.90 ( 85.27%)
> > 6.68 (  0.00%)    4.90 ( 26.66%)    6.02 (  9.75%)    0.88 ( 86.89%)
> > 7.21 (  0.00%)    5.13 ( 28.85%)    6.70 (  7.09%)    0.87 ( 87.91%)
> > 7.66 (  0.00%)    5.22 ( 31.80%)    7.17 (  6.39%)    0.92 ( 88.01%)
> > 7.91 (  0.00%)    5.36 ( 32.22%)    7.30 (  7.72%)    0.95 ( 87.97%)
> > 7.95 (  0.00%)    5.35 ( 32.73%)    7.34 (  7.66%)    1.06 ( 86.66%)
> > 8.00 (  0.00%)    5.33 ( 33.31%)    7.38 (  7.73%)    1.13 ( 85.82%)
> > 5.61 (  0.00%)    3.55 ( 36.76%)    4.53 ( 19.23%)    2.29 ( 59.28%)
> > 5.66 (  0.00%)    4.32 ( 23.79%)    4.75 ( 16.18%)    3.65 ( 35.46%)
> > 5.98 (  0.00%)    4.97 ( 16.87%)    5.96 (  0.35%)    3.62 ( 39.40%)
> > 6.58 (  0.00%)    4.94 ( 24.93%)    6.04 (  8.32%)    3.63 ( 44.89%)
> > 7.19 (  0.00%)    5.18 ( 28.01%)    6.68 (  7.13%)    3.65 ( 49.22%)
> > 7.67 (  0.00%)    5.27 ( 31.29%)    7.16 (  6.63%)    3.62 ( 52.76%)
> > 7.88 (  0.00%)    5.36 ( 31.98%)    7.28 (  7.58%)    3.65 ( 53.71%)
> > 7.99 (  0.00%)    5.39 ( 32.52%)    7.40 (  7.42%)    3.65 ( 54.25%)
> >
> > The patch works because we update sum_exec_runtime on the current
> > thread, which assures we see the proper sum_exec_runtime value on
> > different CPUs. I tested it with reproducers from commits
> > 6e998916dfe32 and d670ec13178d0; the patch did not break them. I'm
> > going to run some other tests.
> >
> > The patch is a draft version for early review; task_sched_runtime()
> > will be simplified (since it's called only on the current thread)
> > and possibly split into two functions: one that calls update_curr()
> > and another that returns sum_exec_runtime (assuring it's consistent
> > on 32-bit arches).
> >
> > Stanislaw

Is this really equivalent though? It updates one task instead of all
tasks in the group and there is no guarantee that tsk == current.

Glancing at it, it should monotonically increase but it looks like it
would calculate stale data.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
Hello Stanislaw, On Fri, 2016-08-12 at 14:10 +0200, Stanislaw Gruszka wrote: > > I measured (partial) revert performance on 4.7 using mmtest instructions > from Giovanni and also tested some other possible fix (draft version): > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index 75f98c5..54fdf6d 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct > task_cputime *times) > unsigned int seq, nextseq; > unsigned long flags; > > + (void) task_sched_runtime(tsk); > + > rcu_read_lock(); > /* Attempt a lockless read on the first round. */ > nextseq = 0; > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct > task_cputime *times) > task_cputime(t, , ); > times->utime += utime; > times->stime += stime; > - times->sum_exec_runtime += task_sched_runtime(t); > + times->sum_exec_runtime += t->se.sum_exec_runtime; > } > /* If lockless access failed, take the lock. 
*/ > nextseq = 1; > --- > mmtest benchmark results are below (full compare-kernels.sh output is in > attachment): > > vanila-4.7revertprefetch patch > 4.74 ( 0.00%)3.04 ( 35.93%)4.09 ( 13.81%)1.30 ( > 72.59%) > 5.49 ( 0.00%)5.00 ( 8.97%)5.34 ( 2.72%)1.03 ( > 81.16%) > 6.12 ( 0.00%)4.91 ( 19.73%)5.97 ( 2.40%)0.90 ( > 85.27%) > 6.68 ( 0.00%)4.90 ( 26.66%)6.02 ( 9.75%)0.88 ( > 86.89%) > 7.21 ( 0.00%)5.13 ( 28.85%)6.70 ( 7.09%)0.87 ( > 87.91%) > 7.66 ( 0.00%)5.22 ( 31.80%)7.17 ( 6.39%)0.92 ( > 88.01%) > 7.91 ( 0.00%)5.36 ( 32.22%)7.30 ( 7.72%)0.95 ( > 87.97%) > 7.95 ( 0.00%)5.35 ( 32.73%)7.34 ( 7.66%)1.06 ( > 86.66%) > 8.00 ( 0.00%)5.33 ( 33.31%)7.38 ( 7.73%)1.13 ( > 85.82%) > 5.61 ( 0.00%)3.55 ( 36.76%)4.53 ( 19.23%)2.29 ( > 59.28%) > 5.66 ( 0.00%)4.32 ( 23.79%)4.75 ( 16.18%)3.65 ( > 35.46%) > 5.98 ( 0.00%)4.97 ( 16.87%)5.96 ( 0.35%)3.62 ( > 39.40%) > 6.58 ( 0.00%)4.94 ( 24.93%)6.04 ( 8.32%)3.63 ( > 44.89%) > 7.19 ( 0.00%)5.18 ( 28.01%)6.68 ( 7.13%)3.65 ( > 49.22%) > 7.67 ( 0.00%)5.27 ( 31.29%)7.16 ( 6.63%)3.62 ( > 52.76%) > 7.88 ( 0.00%)5.36 ( 31.98%)7.28 ( 7.58%)3.65 ( > 53.71%) > 7.99 ( 0.00%)5.39 ( 32.52%)7.40 ( 7.42%)3.65 ( > 54.25%) > > Patch works because we we update sum_exec_runtime on current thread > what assure we see proper sum_exec_runtime value on different CPUs. I > tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0, > patch did not break them. I'm going to run some other test. > > Patch is draft version for early review, task_sched_runtime() will be > simplified (since it's called only current thread) and possibly split > into two functions: one that call update_curr() and other that return > sum_exec_runtime (assure it's consistent on 32 bit arches). > > Stanislaw Thank you for having a look at this. Your patch performs very well, even better than the pre-6e998916dfe3 numbers I was aiming for. I confirm your results on my test machine (Sandy Bridge, 32 cores, 2 NUMA nodes). 
I didn't apply it on the very latest 4.8-rc but used what I had handy for
comparison (i.e. 4.7-rc7 and the parent of 6e998916dfe3). As I said, my
measurements match yours (my tables follow); it looks like your diff cures
the problem while mine cures the symptoms.

clock_gettime():

threads  4.7-rc7  3.18-rc3 (pre-6e998916dfe3)  4.7-rc7 + prefetch  4.7-rc7 + Stanislaw
      2     3.48       2.23 ( 35.68%)             3.06 ( 11.83%)      1.08 ( 68.81%)
      5     3.33       2.83 ( 14.84%)             3.25 (  2.40%)      0.71 ( 78.55%)
      8     3.37       2.84 ( 15.80%)             3.26 (  3.30%)      0.56 ( 83.49%)
     12     3.32       3.09 (  6.69%)             3.37 ( -1.60%)      0.42 ( 87.28%)
     21     4.01       3.14 ( 21.70%)             3.90 (  2.74%)      0.35 ( 91.35%)
     30     3.63       3.28 (  9.75%)             3.36 (  7.41%)      0.28 ( 92.23%)
     48     3.71       3.02 ( 18.69%)             3.11 ( 16.27%)      0.39 ( 89.39%)
     79     3.75       2.88 ( 23.23%)             3.16 ( 15.74%)      0.46 ( 87.76%)
    110     3.81       2.95 ( 22.62%)             3.25 ( 14.80%)      0.56 ( 85.41%)
    128     3.88       3.05 ( 21.28%)             3.31 ( 14.76%)      0.62 ( 84.10%)

times():

threads  4.7-rc7  3.18-rc3 (pre-6e998916dfe3)  4.7-rc7 + prefetch  4.7-rc7 + Stanislaw
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
Hi

On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?
>
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
> and I suspect v3.17 does not run easily on a recent distro. Could you
> attempt to revert the bad effects of 6e998916dfe3 perhaps, just to get
> numbers - i.e. don't try to make the result correct, just see what the
> performance gap is, roughly.
>
> If there's still a significant gap then it might make sense to optimize
> this some more.

I measured (partial) revert performance on 4.7 using mmtest instructions
from Giovanni and also tested some other possible fix (draft version):

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;

+	(void) task_sched_runtime(tsk);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 		task_cputime(t, &utime, &stime);
 		times->utime += utime;
 		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
+		times->sum_exec_runtime += t->se.sum_exec_runtime;
 	}
 	/* If lockless access failed, take the lock.
 	 */
 	nextseq = 1;
---

mmtest benchmark results are below (full compare-kernels.sh output is in
attachment):

   vanilla-4.7            revert          prefetch             patch
4.74 (  0.00%)   3.04 ( 35.93%)   4.09 ( 13.81%)   1.30 ( 72.59%)
5.49 (  0.00%)   5.00 (  8.97%)   5.34 (  2.72%)   1.03 ( 81.16%)
6.12 (  0.00%)   4.91 ( 19.73%)   5.97 (  2.40%)   0.90 ( 85.27%)
6.68 (  0.00%)   4.90 ( 26.66%)   6.02 (  9.75%)   0.88 ( 86.89%)
7.21 (  0.00%)   5.13 ( 28.85%)   6.70 (  7.09%)   0.87 ( 87.91%)
7.66 (  0.00%)   5.22 ( 31.80%)   7.17 (  6.39%)   0.92 ( 88.01%)
7.91 (  0.00%)   5.36 ( 32.22%)   7.30 (  7.72%)   0.95 ( 87.97%)
7.95 (  0.00%)   5.35 ( 32.73%)   7.34 (  7.66%)   1.06 ( 86.66%)
8.00 (  0.00%)   5.33 ( 33.31%)   7.38 (  7.73%)   1.13 ( 85.82%)
5.61 (  0.00%)   3.55 ( 36.76%)   4.53 ( 19.23%)   2.29 ( 59.28%)
5.66 (  0.00%)   4.32 ( 23.79%)   4.75 ( 16.18%)   3.65 ( 35.46%)
5.98 (  0.00%)   4.97 ( 16.87%)   5.96 (  0.35%)   3.62 ( 39.40%)
6.58 (  0.00%)   4.94 ( 24.93%)   6.04 (  8.32%)   3.63 ( 44.89%)
7.19 (  0.00%)   5.18 ( 28.01%)   6.68 (  7.13%)   3.65 ( 49.22%)
7.67 (  0.00%)   5.27 ( 31.29%)   7.16 (  6.63%)   3.62 ( 52.76%)
7.88 (  0.00%)   5.36 ( 31.98%)   7.28 (  7.58%)   3.65 ( 53.71%)
7.99 (  0.00%)   5.39 ( 32.52%)   7.40 (  7.42%)   3.65 ( 54.25%)

The patch works because we update sum_exec_runtime on the current thread,
which assures we see a proper sum_exec_runtime value on different CPUs. I
tested it with the reproducers from commits 6e998916dfe32 and d670ec13178d0;
the patch did not break them. I'm going to run some other tests.

The patch is a draft version for early review; task_sched_runtime() will be
simplified (since it's called only on the current thread) and possibly split
into two functions: one that calls update_curr() and another that returns
sum_exec_runtime (assuring it's consistent on 32-bit arches).
Stanislaw

pound_time                         vanilla-4.7           revert         prefetch             mask
Min  real-pound_clock_gettime-2    4.38 (  0.00%)   2.73 ( 37.67%)   3.62 ( 17.35%)   1.19 ( 72.83%)
Min  real-pound_clock_gettime-5    5.40 (  0.00%)   4.76 ( 11.85%)   4.49 ( 16.85%)   0.99 ( 81.67%)
Min  real-pound_clock_gettime-8    5.83 (  0.00%)   4.88 ( 16.30%)   5.91 ( -1.37%)   0.88 ( 84.91%)
Min  real-pound_clock_gettime-12   6.55 (  0.00%)   4.87 ( 25.65%)   5.98 (  8.70%)   0.84 ( 87.18%)
Min  real-pound_clock_gettime-21   7.11 (  0.00%)   5.10 ( 28.27%)   6.63 (  6.75%)   0.85 ( 88.05%)
Min  real-pound_clock_gettime-30   7.56 (  0.00%)   5.20 ( 31.22%)   7.08 (  6.35%)   0.87 ( 88.49%)
Min  real-pound_clock_gettime-48   7.78 (  0.00%)   5.24 ( 32.65%)   7.20 (  7.46%)   0.92 (
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
* Giovanni Gherdovich wrote:

> Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
> inconsistency") fixed a problem whereby clock_nanosleep() followed by
> clock_gettime() could allow a task to wake early. It addressed the problem
> by calling the scheduling class's update_curr() when the cputimer starts.
>
> Said change induced a considerable performance regression on the syscalls
> times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
> debuggers and applications that monitor their own performance that
> accidentally depend on the performance of these specific calls.
>
> This patch mitigates the performance loss by prefetching data into the CPU
> cache, as stalls due to cache misses appear to be where most time is spent
> in our benchmarks.
>
> Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
> box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
> variable number of threads, from 2 to 4*num_cpus; the results are in
> seconds and correspond to the average of 10 runs; the percentage gain is
> computed with (before-after)/before so a positive value is an improvement
> (it's faster). The improvement varies between a few percent for 5-20
> threads and more than 10% for 2 or >20 threads.
>
> pound_clock_gettime:
>
> threads    4.7-rc7    patched 4.7-rc7
> [num]      [secs]     [secs (percent)]
>     2       3.48       3.06 ( 11.83%)
>     5       3.33       3.25 (  2.40%)
>     8       3.37       3.26 (  3.30%)
>    12       3.32       3.37 ( -1.60%)
>    21       4.01       3.90 (  2.74%)
>    30       3.63       3.36 (  7.41%)
>    48       3.71       3.11 ( 16.27%)
>    79       3.75       3.16 ( 15.74%)
>   110       3.81       3.25 ( 14.80%)
>   128       3.88       3.31 ( 14.76%)

Nice detective work! I'm wondering, where do we stand if compared with a
pre-6e998916dfe3 kernel?

I admit this is a difficult question: 6e998916dfe3 does not revert cleanly
and I suspect v3.17 does not run easily on a recent distro. Could you
attempt to revert the bad effects of 6e998916dfe3 perhaps, just to get
numbers - i.e. don't try to make the result correct, just see what the
performance gap is, roughly.

If there's still a significant gap then it might make sense to optimize
this some more.

Thanks,

	Ingo
Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
Hello Ingo,

thank you for your reply.

Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?

The data follows. A considerable part of the performance loss is recovered;
something is still on the table.

"3.18-pre-bug" is the parent of 6e998916dfe3, i.e. 6e998916dfe3^1.
"3.18-bug" is the revision 6e998916dfe3 itself.

Figures are in seconds. Percentages refer to 3.18-pre-bug, negative = worse.

times()

threads  3.18-pre-bug      3.18-bug           4.7.0-rc7          4.7.0-rc7-patched
      2  2.27 (  0.00%)    3.73 ( -64.71%)    3.65 ( -61.14%)    3.06 ( -35.16%)
      5  2.78 (  0.00%)    3.77 ( -35.56%)    3.45 ( -23.98%)    3.25 ( -16.79%)
      8  2.79 (  0.00%)    4.41 ( -57.71%)    3.52 ( -26.05%)    3.26 ( -16.53%)
     12  3.02 (  0.00%)    3.56 ( -17.94%)    3.29 (  -9.08%)    3.37 ( -11.74%)
     21  3.10 (  0.00%)    4.61 ( -48.74%)    4.07 ( -31.34%)    3.90 ( -25.89%)
     30  3.33 (  0.00%)    5.75 ( -72.53%)    3.87 ( -16.01%)    3.36 (  -0.81%)
     48  2.96 (  0.00%)    6.06 (-105.04%)    3.79 ( -28.10%)    3.11 (  -5.14%)
     79  2.88 (  0.00%)    6.24 (-116.83%)    3.88 ( -34.81%)    3.16 (  -9.84%)
    110  2.98 (  0.00%)    6.37 (-114.08%)    3.90 ( -31.12%)    3.25 (  -9.07%)
    128  3.10 (  0.00%)    6.35 (-104.61%)    4.00 ( -28.87%)    3.31 (  -6.57%)

clock_gettime()

threads  3.18-pre-bug      3.18-bug           4.7.0-rc7          4.7.0-rc7-patched
      2  2.23 (  0.00%)    3.68 ( -64.56%)    3.48 ( -55.48%)    3.25 ( -45.41%)
      5  2.83 (  0.00%)    3.78 ( -33.42%)    3.33 ( -17.43%)    3.17 ( -12.03%)
      8  2.84 (  0.00%)    4.31 ( -52.12%)    3.37 ( -18.76%)    3.22 ( -13.43%)
     12  3.09 (  0.00%)    3.61 ( -16.74%)    3.32 (  -7.17%)    3.36 (  -8.47%)
     21  3.14 (  0.00%)    4.63 ( -47.36%)    4.01 ( -27.71%)    3.92 ( -24.68%)
     30  3.28 (  0.00%)    5.75 ( -75.37%)    3.63 ( -10.80%)    3.40 (  -3.69%)
     48  3.02 (  0.00%)    6.05 (-100.56%)    3.71 ( -22.99%)    3.16 (  -4.64%)
     79  2.88 (  0.00%)    6.30 (-118.90%)    3.75 ( -30.26%)    3.28 ( -13.93%)
    110  2.95 (  0.00%)    6.46 (-119.00%)    3.81 ( -29.24%)    3.38 ( -14.69%)
    128  3.05 (  0.00%)    6.42 (-110.08%)    3.88 ( -27.04%)    3.38 ( -10.70%)

Regards,
Giovanni Gherdovich