Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-19 Thread Jason Low
On Thu, 2015-03-19 at 10:59 -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 10:21 AM, Jason Low  wrote:
> >
> > I tested this patch on a 32-bit ARM system with 4 cores. Using the
> > generic 64-bit atomics, I did not see any performance change with this
> > patch, and the relevant functions (account_group_*_time(), etc.) don't
> > show up in perf reports.
> 
> Ok.
> 
> > One factor might be that locking/cacheline contention isn't as
> > apparent on smaller systems to begin with, and lib/atomic64.c also
> > mentions that "this is expected to used on systems with small numbers of
> > CPUs (<= 4 or so)".
> 
> Yeah, that's probably a valid argument anyway - 32-bit systems aren't
> really going to be multi-node big systems any more.
> 
> So I'm ok with the patch,

Okay, I will send out a v3 patch which also mentions its effect on
32-bit systems in the changelog, in addition to the changes that were
discussed in this thread (using WRITE_ONCE(), etc.).

Thanks,
Jason



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 10:21 AM, Jason Low  wrote:
>
> I tested this patch on a 32-bit ARM system with 4 cores. Using the
> generic 64-bit atomics, I did not see any performance change with this
> patch, and the relevant functions (account_group_*_time(), etc.) don't
> show up in perf reports.

Ok.

> One factor might be that locking/cacheline contention isn't as
> apparent on smaller systems to begin with, and lib/atomic64.c also
> mentions that "this is expected to used on systems with small numbers of
> CPUs (<= 4 or so)".

Yeah, that's probably a valid argument anyway - 32-bit systems aren't
really going to be multi-node big systems any more.

So I'm ok with the patch,

  Linus


Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-19 Thread Jason Low
On Mon, 2015-03-02 at 13:49 -0800, Jason Low wrote:
> On Mon, 2015-03-02 at 11:03 -0800, Linus Torvalds wrote:
> > On Mon, Mar 2, 2015 at 10:42 AM, Jason Low  wrote:
> > >
> > > This patch converts the timers to 64-bit atomic variables and uses
> > > atomic add to update them without a lock. With this patch, the percent
> > > of total time spent updating thread group cputimer timers was reduced
> > > from 30% down to less than 1%.
> > 
> > NAK.
> > 
> > Not because I think this is wrong, but because somebody needs to look
> > at the effects on 32-bit architectures too.
> 
> Okay, I will run some tests to see how this change affects the
> performance of itimers on 32-bit systems.

Hi Linus,

I tested this patch on a 32-bit ARM system with 4 cores. Using the
generic 64-bit atomics, I did not see any performance change with this
patch, and the relevant functions (account_group_*_time(), etc.) don't
show up in perf reports.

One factor might be that locking/cacheline contention isn't as
apparent on smaller systems to begin with, and lib/atomic64.c also
mentions that "this is expected to used on systems with small numbers of
CPUs (<= 4 or so)".



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Jason Low
On Thu, 2015-03-05 at 16:35 +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2015 at 10:42:11AM -0800, Jason Low wrote:

> > +/* Sample thread_group_cputimer values in "cputimer", copy results to "times" */
> > +static inline void sample_group_cputimer(struct task_cputime *times,
> > +					 struct thread_group_cputimer *cputimer)
> > +{
> > +	times->utime = atomic64_read(&cputimer->utime);
> > +	times->stime = atomic64_read(&cputimer->stime);
> > +	times->sum_exec_runtime = atomic64_read(&cputimer->sum_exec_runtime);
> 
> So, in the case we are calling that right after setting cputimer->running,
> I guess we are fine because we just updated cputimer with the freshest
> values.
> 
> But if we are reading this a while after, say several ticks further, there
> is a chance that we read stale values since we don't lock anymore.
> 
> I don't know if it matters or not, I guess it depends how stale it can be
> and how much precision we expect from posix cpu timers. It probably
> doesn't matter.
> 
> But just in case, atomic64_read_return(&cputimer->utime, 0) would make
> sure we get the freshest value because it performs a full barrier, at the
> cost of more overhead of course.

(Assuming that is atomic64_add_return :))

Yeah, there aren't any guarantees that we read the freshest value, but
since the lock isn't used to serialize subsequent accesses of
times->utime, etc., the values can potentially become stale by the
time they get used anyway, even with the locking.

So I'm not sure if using atomic64_add_return(0, &time) for the reads would
really provide much of a benefit once we factor in the extra overhead.
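
For reference, a full-barrier version of the sampling helper would look
roughly like this (just a sketch, and note that atomic64_add_return()
takes the value to add first and the atomic64_t pointer second):

	/* Sketch only: sample with full-barrier reads via atomic64_add_return(). */
	static inline void sample_group_cputimer(struct task_cputime *times,
						 struct thread_group_cputimer *cputimer)
	{
		/* Adding 0 returns the current value and implies a full barrier. */
		times->utime = atomic64_add_return(0, &cputimer->utime);
		times->stime = atomic64_add_return(0, &cputimer->stime);
		times->sum_exec_runtime = atomic64_add_return(0, &cputimer->sum_exec_runtime);
	}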



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Jason Low
On Thu, 2015-03-05 at 16:20 +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2015 at 01:44:04PM -0800, Linus Torvalds wrote:
> > On Mon, Mar 2, 2015 at 1:16 PM, Jason Low  wrote:
> > >
> > > In the original code, we set cputimer->running first so it is running while
> > > we call update_gt_cputime(). Now in this patch, we swapped the 2 calls
> > > such that we set running after calling update_gt_cputime(), so that
> > > wouldn't be an issue anymore.
> > 
> > Hmm. If you actually care about ordering, and 'running' should be
> > written to after the other things, then it might be best if you use
> > 
> >smp_store_release(&cputimer->running, 1);
> > 
> > which makes it clear that the store happens *after* what went before it.
> > 
> > Or at least have a "smp_wmb()" between the atomic64 updates and the
> > "WRITE_ONCE()".
> 
> FWIW, perhaps it can be reduced with an smp_mb__before_atomic() on the
> account_group_*_time() side,

Hi Frederic,

I think Linus might be referring to the updates in update_gt_cputime()?
Otherwise, if the atomic updates in account_group_*_time() are already
enough for correctness, then we might not want to add barriers in
the hot paths where they aren't necessary.

I was thinking about adding smp_store_release(&cputimer->running, 1)
to document that we want to write to the running field after the
operations in update_gt_cputime(). The overhead here won't be much
since, as you mentioned, it doesn't get called frequently.
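
Something like the following, in thread_group_cputimer() (a sketch against
this patch, not final code):

	/*
	 * Arm the timer only after update_gt_cputime() is done, with
	 * release semantics so the atomic64 updates above cannot be
	 * reordered past the store that marks the timer as running.
	 */
	thread_group_cputime(tsk, &sum);
	update_gt_cputime(cputimer, &sum);
	smp_store_release(&cputimer->running, 1);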

>  paired with smp_wmb() from the thread_group_cputimer()
> side. Arming cputimer->running shouldn't be too frequent, while cputime
> updates happen at least every tick...
> 
> Assuming smp_mb__before_atomic() is more lightweight than smp_load_acquire()
> of course.
> 
> > 
> > I guess that since you use cmpxchg in update_gt_cputime, the accesses
> > end up being ordered anyway, but it might be better to make that thing
> > very explicit.
> > 
> >Linus




Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Paul E. McKenney
On Thu, Mar 05, 2015 at 05:00:05PM +0100, Frederic Weisbecker wrote:
> On Thu, Mar 05, 2015 at 07:56:59AM -0800, Paul E. McKenney wrote:
> > On Thu, Mar 05, 2015 at 04:35:09PM +0100, Frederic Weisbecker wrote:
> > > So, in the case we are calling that right after setting
> > > cputimer->running, I guess we are fine because we just updated cputimer
> > > with the freshest values.
> > > 
> > > But if we are reading this a while after, say several ticks further,
> > > there is a chance that we read stale values since we don't lock anymore.
> > > 
> > > I don't know if it matters or not, I guess it depends how stale it can
> > > be and how much precision we expect from posix cpu timers. It probably
> > > doesn't matter.
> > > 
> > > But just in case, atomic64_read_return(&cputimer->utime, 0) would make
> > > sure we get the freshest value because it performs a full barrier, at
> > > the cost of more overhead of course.
> > 
> > Well, if we are running within a guest OS, we might be delayed at any point
> > for quite some time.  Even with interrupts disabled.
> 
> You mean delayed because of the overhead of atomic_add_return() or the
> stale value of the cputimer-> fields?

Because of preemption of the guest OS's VCPUs by the host OS.

Thanx, Paul



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Frederic Weisbecker
On Thu, Mar 05, 2015 at 07:56:59AM -0800, Paul E. McKenney wrote:
> On Thu, Mar 05, 2015 at 04:35:09PM +0100, Frederic Weisbecker wrote:
> > So, in the case we are calling that right after setting cputimer->running,
> > I guess we are fine because we just updated cputimer with the freshest
> > values.
> > 
> > But if we are reading this a while after, say several ticks further,
> > there is a chance that we read stale values since we don't lock anymore.
> > 
> > I don't know if it matters or not, I guess it depends how stale it can be
> > and how much precision we expect from posix cpu timers. It probably
> > doesn't matter.
> > 
> > But just in case, atomic64_read_return(&cputimer->utime, 0) would make
> > sure we get the freshest value because it performs a full barrier, at the
> > cost of more overhead of course.
> 
> Well, if we are running within a guest OS, we might be delayed at any point
> for quite some time.  Even with interrupts disabled.

You mean delayed because of the overhead of atomic_add_return() or the
stale value of the cputimer-> fields?


Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Paul E. McKenney
On Thu, Mar 05, 2015 at 04:35:09PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2015 at 10:42:11AM -0800, Jason Low wrote:
> > v1->v2:
> > - Peter suggested that cputimer->running does not need to be atomic,
> >   so we can leave it as an integer.
> > - Address a race condition that could occur in update_gt_cputime().
> > - Add helper functions to avoid repeating code.
> > 
> > While running a database workload, we found a scalability issue
> > with itimers.
> > 
> > Much of the problem was caused by the thread_group_cputimer spinlock.
> > Each time we account for group system/user time, we need to obtain a
> > thread_group_cputimer's spinlock to update the timers. On larger
> > systems (such as a 16-socket machine), this meant more than 30% of
> > total time was spent trying to obtain this kernel lock to update these
> > group timer stats.
> > 
> > This patch converts the timers to 64-bit atomic variables and uses
> > atomic add to update them without a lock. With this patch, the percent
> > of total time spent updating thread group cputimer timers was reduced
> > from 30% down to less than 1%.
> > 
> > Signed-off-by: Jason Low 
> > ---
> >  include/linux/init_task.h  |7 +++--
> >  include/linux/sched.h  |   10 ++-
> >  kernel/fork.c  |3 --
> >  kernel/sched/stats.h   |   12 ++--
> >  kernel/time/posix-cpu-timers.c |   55 +++
> >  5 files changed, 42 insertions(+), 45 deletions(-)
> > 
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 3037fc0..c4cdec7 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -50,9 +50,10 @@ extern struct fs_struct init_fs;
> > .cpu_timers = INIT_CPU_TIMERS(sig.cpu_timers),  \
> > .rlim   = INIT_RLIMITS, \
> > .cputimer   = { \
> > -   .cputime = INIT_CPUTIME,\
> > -   .running = 0,   \
> > -   .lock = __RAW_SPIN_LOCK_UNLOCKED(sig.cputimer.lock),\
> > +   .utime = ATOMIC64_INIT(0),  \
> > +   .stime = ATOMIC64_INIT(0),  \
> > +   .sum_exec_runtime = ATOMIC64_INIT(0),   \
> > +   .running = 0\
> > },  \
> > .cred_guard_mutex = \
> >  __MUTEX_INITIALIZER(sig.cred_guard_mutex), \
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 8db31ef..d6b0f76 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -588,9 +588,10 @@ struct task_cputime {
> >   * used for thread group CPU timer calculations.
> >   */
> >  struct thread_group_cputimer {
> > -   struct task_cputime cputime;
> > +   atomic64_t utime;
> > +   atomic64_t stime;
> > +   atomic64_t sum_exec_runtime;
> > int running;
> > -   raw_spinlock_t lock;
> >  };
> >  
> >  #include 
> > @@ -2942,11 +2943,6 @@ static __always_inline bool need_resched(void)
> >  void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
> >  void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times);
> >  
> > -static inline void thread_group_cputime_init(struct signal_struct *sig)
> > -{
> > -   raw_spin_lock_init(&sig->cputimer.lock);
> > -}
> > -
> >  /*
> >   * Reevaluate whether the task has signals pending delivery.
> >   * Wake the task if so.
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 4dc2dda..df9dfe9 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1037,9 +1037,6 @@ static void posix_cpu_timers_init_group(struct signal_struct *sig)
> >  {
> > unsigned long cpu_limit;
> >  
> > -   /* Thread group counters. */
> > -   thread_group_cputime_init(sig);
> > -
> > cpu_limit = ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur);
> > if (cpu_limit != RLIM_INFINITY) {
> > sig->cputime_expires.prof_exp = secs_to_cputime(cpu_limit);
> > diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> > index 4ab7043..adda94e 100644
> > --- a/kernel/sched/stats.h
> > +++ b/kernel/sched/stats.h
> > @@ -215,9 +215,7 @@ static inline void account_group_user_time(struct task_struct *tsk,
> > if (!cputimer_running(tsk))
> > return;
> >  
> > -   raw_spin_lock(&cputimer->lock);
> > -   cputimer->cputime.utime += cputime;
> > -   raw_spin_unlock(&cputimer->lock);
> > +   atomic64_add(cputime, &cputimer->utime);
> >  }
> >  
> >  /**
> > @@ -238,9 +236,7 @@ static inline void account_group_system_time(struct task_struct *tsk,
> > if (!cputimer_running(tsk))
> > return;
> >  
> > -   raw_spin_lock(&cputimer->lock);
> > -   cputime

Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Frederic Weisbecker
On Mon, Mar 02, 2015 at 10:42:11AM -0800, Jason Low wrote:
> v1->v2:
> - Peter suggested that cputimer->running does not need to be atomic,
>   so we can leave it as an integer.
> - Address a race condition that could occur in update_gt_cputime().
> - Add helper functions to avoid repeating code.
> 
> While running a database workload, we found a scalability issue
> with itimers.
> 
> Much of the problem was caused by the thread_group_cputimer spinlock.
> Each time we account for group system/user time, we need to obtain a
> thread_group_cputimer's spinlock to update the timers. On larger
> systems (such as a 16-socket machine), this meant more than 30% of
> total time was spent trying to obtain this kernel lock to update these
> group timer stats.
> 
> This patch converts the timers to 64-bit atomic variables and uses
> atomic add to update them without a lock. With this patch, the percent
> of total time spent updating thread group cputimer timers was reduced
> from 30% down to less than 1%.
> 
> Signed-off-by: Jason Low 
> ---
>  include/linux/init_task.h  |7 +++--
>  include/linux/sched.h  |   10 ++-
>  kernel/fork.c  |3 --
>  kernel/sched/stats.h   |   12 ++--
>  kernel/time/posix-cpu-timers.c |   55 +++
>  5 files changed, 42 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 3037fc0..c4cdec7 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -50,9 +50,10 @@ extern struct fs_struct init_fs;
>   .cpu_timers = INIT_CPU_TIMERS(sig.cpu_timers),  \
>   .rlim   = INIT_RLIMITS, \
>   .cputimer   = { \
> - .cputime = INIT_CPUTIME,\
> - .running = 0,   \
> - .lock = __RAW_SPIN_LOCK_UNLOCKED(sig.cputimer.lock),\
> + .utime = ATOMIC64_INIT(0),  \
> + .stime = ATOMIC64_INIT(0),  \
> + .sum_exec_runtime = ATOMIC64_INIT(0),   \
> + .running = 0\
>   },  \
>   .cred_guard_mutex = \
>__MUTEX_INITIALIZER(sig.cred_guard_mutex), \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..d6b0f76 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -588,9 +588,10 @@ struct task_cputime {
>   * used for thread group CPU timer calculations.
>   */
>  struct thread_group_cputimer {
> - struct task_cputime cputime;
> + atomic64_t utime;
> + atomic64_t stime;
> + atomic64_t sum_exec_runtime;
>   int running;
> - raw_spinlock_t lock;
>  };
>  
>  #include 
> @@ -2942,11 +2943,6 @@ static __always_inline bool need_resched(void)
>  void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
>  void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times);
>  
> -static inline void thread_group_cputime_init(struct signal_struct *sig)
> -{
> - raw_spin_lock_init(&sig->cputimer.lock);
> -}
> -
>  /*
>   * Reevaluate whether the task has signals pending delivery.
>   * Wake the task if so.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4dc2dda..df9dfe9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1037,9 +1037,6 @@ static void posix_cpu_timers_init_group(struct signal_struct *sig)
>  {
>   unsigned long cpu_limit;
>  
> - /* Thread group counters. */
> - thread_group_cputime_init(sig);
> -
>   cpu_limit = ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur);
>   if (cpu_limit != RLIM_INFINITY) {
>   sig->cputime_expires.prof_exp = secs_to_cputime(cpu_limit);
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 4ab7043..adda94e 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -215,9 +215,7 @@ static inline void account_group_user_time(struct task_struct *tsk,
>   if (!cputimer_running(tsk))
>   return;
>  
> - raw_spin_lock(&cputimer->lock);
> - cputimer->cputime.utime += cputime;
> - raw_spin_unlock(&cputimer->lock);
> + atomic64_add(cputime, &cputimer->utime);
>  }
>  
>  /**
> @@ -238,9 +236,7 @@ static inline void account_group_system_time(struct task_struct *tsk,
>   if (!cputimer_running(tsk))
>   return;
>  
> - raw_spin_lock(&cputimer->lock);
> - cputimer->cputime.stime += cputime;
> - raw_spin_unlock(&cputimer->lock);
> + atomic64_add(cputime, &cputimer->stime);
>  }
>  
>  /**
> @@ -261,7 +257,5 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
> 

Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-05 Thread Frederic Weisbecker
On Mon, Mar 02, 2015 at 01:44:04PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 1:16 PM, Jason Low  wrote:
> >
> > In the original code, we set cputimer->running first so it is running while
> > we call update_gt_cputime(). Now in this patch, we swapped the 2 calls
> > such that we set running after calling update_gt_cputime(), so that
> > wouldn't be an issue anymore.
> 
> Hmm. If you actually care about ordering, and 'running' should be
> written to after the other things, then it might be best if you use
> 
>smp_store_release(&cputimer->running, 1);
> 
> which makes it clear that the store happens *after* what went before it.
> 
> Or at least have a "smp_wmb()" between the atomic64 updates and the
> "WRITE_ONCE()".

FWIW, perhaps it can be reduced with an smp_mb__before_atomic() on the
account_group_*_time() side, paired with smp_wmb() from the
thread_group_cputimer() side. Arming cputimer->running shouldn't be too
frequent, while cputime updates happen at least every tick...

Assuming smp_mb__before_atomic() is more lightweight than smp_load_acquire()
of course. 

> 
> I guess that since you use cmpxchg in update_gt_cputime, the accesses
> end up being ordered anyway, but it might be better to make that thing
> very explicit.
> 
>Linus


Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Jason Low
On Mon, 2015-03-02 at 13:44 -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 1:16 PM, Jason Low  wrote:
> >
> > In the original code, we set cputimer->running first so it is running while
> > we call update_gt_cputime(). Now in this patch, we swapped the 2 calls
> > such that we set running after calling update_gt_cputime(), so that
> > wouldn't be an issue anymore.
> 
> Hmm. If you actually care about ordering, and 'running' should be
> written to after the other things, then it might be best if you use
> 
>smp_store_release(&cputimer->running, 1);
> 
> which makes it clear that the store happens *after* what went before it.
> 
> Or at least have a "smp_wmb()" between the atomic64 updates and the
> "WRITE_ONCE()".
> 
> I guess that since you use cmpxchg in update_gt_cputime, the accesses
> end up being ordered anyway, but it might be better to make that thing
> very explicit.

Yeah, I suppose the extra barrier (smp_mb() or smp_wmb()) might add more
overhead, but since this is not a common code path anyway, it would be
worth adding to make things more clear.
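
For example, a sketch of the smp_wmb() variant (not the final patch):

	update_gt_cputime(cputimer, &sum);
	/* Order the atomic64 updates above before the store to running. */
	smp_wmb();
	WRITE_ONCE(cputimer->running, 1);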



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Jason Low
On Mon, 2015-03-02 at 11:03 -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 10:42 AM, Jason Low  wrote:
> >
> > This patch converts the timers to 64-bit atomic variables and uses
> > atomic add to update them without a lock. With this patch, the percent
> > of total time spent updating thread group cputimer timers was reduced
> > from 30% down to less than 1%.
> 
> NAK.
> 
> Not because I think this is wrong, but because somebody needs to look
> at the effects on 32-bit architectures too.
> 
> In particular, check out lib/atomic64.c - which uses a hashed array of
> 16 spinlocks to do 64-bit atomics. That may or may not work ok in
> practice, but it does mean that now sample_group_cputimer() and
> update_gt_cputime() will take that (it ends up generally being the
> same) spinlock three times for the three atomic64_read()'s.

Okay, I will run some tests to see how this change affects the
performance of itimers on 32-bit systems.

While update_gt_cputime() shouldn't be an issue for performance
since it doesn't get called often, sample_group_cputimer() needing
to take a lock for each of its three atomic64_read() calls is something
that could impact performance, so we should take a look at that.
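
For reference, the reads in question from the patch; with GENERIC_ATOMIC64,
each atomic64_read() takes one of the hashed spinlocks:

	times->utime = atomic64_read(&cputimer->utime);
	times->stime = atomic64_read(&cputimer->stime);
	times->sum_exec_runtime = atomic64_read(&cputimer->sum_exec_runtime);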

Thanks,
Jason



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Linus Torvalds
On Mon, Mar 2, 2015 at 1:16 PM, Jason Low  wrote:
>
> In the original code, we set cputimer->running first so it is running while
> we call update_gt_cputime(). Now in this patch, we swapped the 2 calls
> such that we set running after calling update_gt_cputime(), so that
> wouldn't be an issue anymore.

Hmm. If you actually care about ordering, and 'running' should be
written to after the other things, then it might be best if you use

   smp_store_release(&cputimer->running, 1);

which makes it clear that the store happens *after* what went before it.

Or at least have a "smp_wmb()" between the atomic64 updates and the
"WRITE_ONCE()".

I guess that since you use cmpxchg in update_gt_cputime, the accesses
end up being ordered anyway, but it might be better to make that thing
very explicit.

   Linus


Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Jason Low
On Mon, 2015-03-02 at 20:40 +0100, Oleg Nesterov wrote:
> Well, I forgot everything about this code, but let me ask anyway ;)
> 
> On 03/02, Jason Low wrote:

> > @@ -222,13 +239,10 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
> >  * it.
> >  */
> > thread_group_cputime(tsk, &sum);
> > -   raw_spin_lock_irqsave(&cputimer->lock, flags);
> > -   cputimer->running = 1;
> > -   update_gt_cputime(&cputimer->cputime, &sum);
> > -   } else
> > -   raw_spin_lock_irqsave(&cputimer->lock, flags);
> > -   *times = cputimer->cputime;
> > -   raw_spin_unlock_irqrestore(&cputimer->lock, flags);
> > +   update_gt_cputime(cputimer, &sum);
> > +   ACCESS_ONCE(cputimer->running) = 1;
> 
> WRITE_ONCE() looks better... 

Okay, I can update that.

> but it is not clear to me why do we need it
> at all.

Peter suggested it here as we would now be updating the running field
without the lock:

https://lkml.org/lkml/2015/1/23/641



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Jason Low
On Mon, 2015-03-02 at 20:43 +0100, Oleg Nesterov wrote:
> On 03/02, Oleg Nesterov wrote:
> >
> > Well, I forgot everything about this code, but let me ask anyway ;)
> >
> > On 03/02, Jason Low wrote:
> > >
> > > -static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
> > > +static inline void __update_gt_cputime(atomic64_t *cputime, u64 sum_cputime)
> > >  {
> > > - if (b->utime > a->utime)
> > > - a->utime = b->utime;
> > > -
> > > - if (b->stime > a->stime)
> > > - a->stime = b->stime;
> > > + u64 curr_cputime;
> > > + /*
> > > +  * Set cputime to sum_cputime if sum_cputime > cputime. Use cmpxchg
> > > +  * to avoid race conditions with concurrent updates to cputime.
> > > +  */
> > > +retry:
> > > + curr_cputime = atomic64_read(cputime);
> > > + if (sum_cputime > curr_cputime) {
> > > + if (atomic64_cmpxchg(cputime, curr_cputime, sum_cputime) != curr_cputime)
> > > + goto retry;
> > > + }
> > > +}
> > >
> > > - if (b->sum_exec_runtime > a->sum_exec_runtime)
> > > - a->sum_exec_runtime = b->sum_exec_runtime;
> > > +static void update_gt_cputime(struct thread_group_cputimer *cputimer, struct task_cputime *sum)
> > > +{
> > > + __update_gt_cputime(&cputimer->utime, sum->utime);
> > > + __update_gt_cputime(&cputimer->stime, sum->stime);
> > > + __update_gt_cputime(&cputimer->sum_exec_runtime, sum->sum_exec_runtime);
> > >  }
> >
> > And this is called if !cputimer_running().
> >
> > So who else can update these atomic64_t's? The caller is called under ->siglock.
> > IOW, do we really need to cmpxchg/retry ?
> >
> > Just curious, I am sure I missed something.
> 
> Ah, sorry, I seem to understand.
> 
> We still can race with account_group_*time() even if ->running == 0. Because
> (say) account_group_exec_runtime() can race with 1 -> 0 -> 1 transition.
> 
> Or is there another reason?

Hi Oleg,

Yes, that 1 -> 0 -> 1 transition was the race that I had in mind. Thus,
I added the extra cmpxchg logic in update_gt_cputime() just to be safe.

In the original code, we set cputimer->running first so it is running while
we call update_gt_cputime(). Now in this patch, we swapped the 2 calls
such that we set running after calling update_gt_cputime(), so that
wouldn't be an issue anymore.
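
To spell out the race as I understand it (an illustration, not code from
the patch):

	/*
	 * CPU 0 (thread_group_cputimer)    CPU 1 (account_group_exec_runtime)
	 *
	 *                                  sees cputimer->running == 1
	 * running goes 1 -> 0
	 * sees running == 0, so calls
	 * update_gt_cputime()              atomic64_add(ns, &cputimer->sum_exec_runtime)
	 * sets running back to 1
	 *
	 * The concurrent atomic64_add() can land while update_gt_cputime()
	 * is still comparing and copying, hence the cmpxchg/retry loop
	 * instead of plain writes.
	 */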



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Oleg Nesterov
On 03/02, Oleg Nesterov wrote:
>
> Well, I forgot everything about this code, but let me ask anyway ;)
>
> On 03/02, Jason Low wrote:
> >
> > -static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
> > +static inline void __update_gt_cputime(atomic64_t *cputime, u64 sum_cputime)
> >  {
> > -   if (b->utime > a->utime)
> > -   a->utime = b->utime;
> > -
> > -   if (b->stime > a->stime)
> > -   a->stime = b->stime;
> > +   u64 curr_cputime;
> > +   /*
> > +* Set cputime to sum_cputime if sum_cputime > cputime. Use cmpxchg
> > +* to avoid race conditions with concurrent updates to cputime.
> > +*/
> > +retry:
> > +   curr_cputime = atomic64_read(cputime);
> > +   if (sum_cputime > curr_cputime) {
> > +   if (atomic64_cmpxchg(cputime, curr_cputime, sum_cputime) != curr_cputime)
> > +   goto retry;
> > +   }
> > +}
> >
> > -   if (b->sum_exec_runtime > a->sum_exec_runtime)
> > -   a->sum_exec_runtime = b->sum_exec_runtime;
> > +static void update_gt_cputime(struct thread_group_cputimer *cputimer, struct task_cputime *sum)
> > +{
> > +   __update_gt_cputime(&cputimer->utime, sum->utime);
> > +   __update_gt_cputime(&cputimer->stime, sum->stime);
> > +   __update_gt_cputime(&cputimer->sum_exec_runtime, sum->sum_exec_runtime);
> >  }
>
> And this is called if !cputimer_running().
>
> So who else can update these atomic64_t's? The caller is called under ->siglock.
> IOW, do we really need to cmpxchg/retry ?
>
> Just curious, I am sure I missed something.

Ah, sorry, I seem to understand.

We still can race with account_group_*time() even if ->running == 0. Because
(say) account_group_exec_runtime() can race with 1 -> 0 -> 1 transition.

Or is there another reason?

Oleg.



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Oleg Nesterov
Well, I forgot everything about this code, but let me ask anyway ;)

On 03/02, Jason Low wrote:
>
> -static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
> +static inline void __update_gt_cputime(atomic64_t *cputime, u64 sum_cputime)
>  {
> - if (b->utime > a->utime)
> - a->utime = b->utime;
> -
> - if (b->stime > a->stime)
> - a->stime = b->stime;
> + u64 curr_cputime;
> + /*
> +  * Set cputime to sum_cputime if sum_cputime > cputime. Use cmpxchg
> +  * to avoid race conditions with concurrent updates to cputime.
> +  */
> +retry:
> + curr_cputime = atomic64_read(cputime);
> + if (sum_cputime > curr_cputime) {
> + if (atomic64_cmpxchg(cputime, curr_cputime, sum_cputime) != curr_cputime)
> + goto retry;
> + }
> +}
>  
> - if (b->sum_exec_runtime > a->sum_exec_runtime)
> - a->sum_exec_runtime = b->sum_exec_runtime;
> +static void update_gt_cputime(struct thread_group_cputimer *cputimer, struct task_cputime *sum)
> +{
> + __update_gt_cputime(&cputimer->utime, sum->utime);
> + __update_gt_cputime(&cputimer->stime, sum->stime);
> + __update_gt_cputime(&cputimer->sum_exec_runtime, sum->sum_exec_runtime);
>  }

And this is called if !cputimer_running().

So who else can update these atomic64_t's? The caller is called under ->siglock.
IOW, do we really need to cmpxchg/retry ?

Just curious, I am sure I missed something.

> @@ -222,13 +239,10 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
>* it.
>*/
>   thread_group_cputime(tsk, &sum);
> - raw_spin_lock_irqsave(&cputimer->lock, flags);
> - cputimer->running = 1;
> - update_gt_cputime(&cputimer->cputime, &sum);
> - } else
> - raw_spin_lock_irqsave(&cputimer->lock, flags);
> - *times = cputimer->cputime;
> - raw_spin_unlock_irqrestore(&cputimer->lock, flags);
> + update_gt_cputime(cputimer, &sum);
> + ACCESS_ONCE(cputimer->running) = 1;

WRITE_ONCE() looks better... but it is not clear to me why do we need it
at all.

Oleg.



Re: [PATCH v2] sched, timer: Use atomics for thread_group_cputimer to improve scalability

2015-03-02 Thread Linus Torvalds
On Mon, Mar 2, 2015 at 10:42 AM, Jason Low  wrote:
>
> This patch converts the timers to 64-bit atomic variables and uses
> atomic add to update them without a lock. With this patch, the percent
> of total time spent updating thread group cputimer timers was reduced
> from 30% down to less than 1%.

NAK.

Not because I think this is wrong, but because somebody needs to look
at the effects on 32-bit architectures too.

In particular, check out lib/atomic64.c - which uses a hashed array of
16 spinlocks to do 64-bit atomics. That may or may not work ok in
practice, but it does mean that now sample_group_cputimer() and
update_gt_cputime() will take that (it ends up generally being the
same) spinlock three times for the three atomic64_read()'s.
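
For reference, the generic fallback looks roughly like this (a simplified
sketch, not the exact lib/atomic64.c code):

	#define NR_LOCKS 16
	static union {
		raw_spinlock_t lock;
		char pad[L1_CACHE_BYTES];
	} atomic64_lock[NR_LOCKS] __cacheline_aligned_in_smp;

	static inline raw_spinlock_t *lock_addr(const atomic64_t *v)
	{
		unsigned long addr = (unsigned long) v;

		/*
		 * Hashing on the address shifted by L1_CACHE_SHIFT means
		 * fields sharing a cacheline (like utime/stime/
		 * sum_exec_runtime here) tend to pick the same lock.
		 */
		addr >>= L1_CACHE_SHIFT;
		return &atomic64_lock[addr & (NR_LOCKS - 1)].lock;
	}

	long long atomic64_read(const atomic64_t *v)
	{
		unsigned long flags;
		raw_spinlock_t *lock = lock_addr(v);
		long long val;

		raw_spin_lock_irqsave(lock, flags);
		val = v->counter;
		raw_spin_unlock_irqrestore(lock, flags);
		return val;
	}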

Now, I think on x86, we end up using not lib/atomic64.c but our own
versions that use cmpxchg8b, which is probably fine from a performance
standpoint. But I see a lot of "select GENERIC_ATOMIC64" for other
architectures.

Anyway, it is *possible* that even on those 32-bit targets, the
atomic64's aren't any worse than the current spinlock in practice.  So
the "NAK" is in no way absolute - but I'd just like to hear that this
is all reasonably fine on 32-bit ARM and powerpc, for example.

Hmm?

 Linus