Re: Updated sys_membarrier() speedup patch, FYI

2017-07-28 Thread Andrew Hunter
On Thu, Jul 27, 2017 at 12:06 PM, Paul E. McKenney wrote:
> IPIing only those CPUs running threads in the same process as the
> thread invoking membarrier() would be very nice!  There is some LKML
> discussion on this topic, which is currently circling around making this
> determination reliable on all CPU families.  ARM and x86 are thought
> to be OK, PowerPC is thought to require a smallish patch, MIPS is
> a big question mark, and so on.
>

I'm not sure what you mean by the determination or how this is arch specific?

> But I am surprised when you say that the downgrade would not work, at
> least if you are not running with nohz_full CPUs.  The rcu_sched_qs()
> function simply sets a per-CPU quiescent-state flag.  The needed strong
> ordering is instead supplied by the combination of the code starting
> the grace period, reporting the setting of the quiescent-state flag
> to core RCU, and the code completing the grace period.  Each non-idle
> CPU will execute full memory barriers either in RCU_SOFTIRQ context,
> on entry to idle, on exit from idle, or within the grace-period kthread.
> In particular, a CPU running the same usermode thread for the entire
> grace period will execute the needed memory barriers in RCU_SOFTIRQ
> context shortly after taking a scheduling-clock interrupt.
>

Recall that I need more than just a memory barrier--also to interrupt
RSEQ critical sections in progress on those CPUs. I know this isn't
general purpose, I'm just saying a trivial downgrade wouldn't work for
me. :)  It would probably be sufficient to set NOTIFY_RESUME on all
cpus running my code (which is what my IPI function does anyway...)
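
Roughly, an illustrative sketch of that shape (not our actual code; the
helper names are placeholders and the real thing has more bookkeeping):

static void rseq_fence_ipi(void *info)
{
        /* The membarrier part. */
        smp_mb();
        /* The RSEQ part: force the interrupted task back through its
         * return-to-user work so it re-examines (and, if necessary,
         * aborts) any critical section it was in. */
        set_thread_flag(TIF_NOTIFY_RESUME);
}

static void rseq_fence(struct mm_struct *mm)
{
        /* Kick only CPUs that may be running threads of this process. */
        on_each_cpu_mask(mm_cpumask(mm), rseq_fence_ipi, NULL, 1);
}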


Re: Updated sys_membarrier() speedup patch, FYI

2017-07-28 Thread Andrew Hunter
On Thu, Jul 27, 2017 at 12:43 PM, Paul E. McKenney wrote:
> On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> IPIing only running threads of my process would be perfect. In fact
>> I might even be able to make use of "membarrier these threads
>> please" to reduce IPIs, when I change the topology from fully
>> connected to something more sparse, on larger machines.
>>

We do this as well--sometimes we only need RSEQ fences against
specific CPU(s), and thus pass a subset.

> +static void membarrier_private_expedited_ipi_each(void)
> +{
> +   int cpu;
> +
> +   for_each_online_cpu(cpu) {
> +   struct task_struct *p;
> +
> +   rcu_read_lock();
> +   p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +   if (p && p->mm == current->mm)
> +   smp_call_function_single(cpu, ipi_mb, NULL, 1);
> +   rcu_read_unlock();
> +   }
> +}
> +

We have the (simpler imho)

const struct cpumask *mask = mm_cpumask(mm);
/* possibly AND it with a user requested mask */
smp_call_function_many(mask, ipi_func, );

which I think will be faster on some archs (that support broadcast)
and have fewer problems with out-of-sync values (though we do have to
check in our IPI function that we haven't context-switched out).

Am I missing why this won't work?


Re: Updated sys_membarrier() speedup patch, FYI

2017-07-27 Thread Andrew Hunter
On Thu, Jul 27, 2017 at 11:12 AM, Paul E. McKenney wrote:
> Hello!
> But my main question is whether the throttling shown below is acceptable
> for your use cases, namely only one expedited sys_membarrier() permitted
> per scheduling-clock period (1 millisecond on many platforms), with any
> excess being silently converted to non-expedited form.

Google doesn't use sys_membarrier (that I know of...), but we do use
RSEQ fences, which implement membarrier plus a little extra to interrupt
RSEQ critical sections (via IPI--smp_call_function_many()).  One
important optimization here is that we only throw IPIs at CPUs running
the same mm as current (or a subset if requested by userspace), as
this is sufficient for the API guarantees we provide. I suspect a
similar optimization would largely mitigate DoS concerns, no? I don't
know if there are use cases not covered. To answer your question:
throttling these (or our equivalents) would be fine in terms of
userspace throughput. We haven't noticed performance problems
requiring such an intervention, however.

Furthermore: I wince a bit at the silent downgrade; I'd almost prefer
-EAGAIN or -EBUSY. In particular, again for RSEQ fence, the downgrade
simply wouldn't work; rcu_sched_qs() gets called at many points that
aren't sufficiently quiescent for RSEQ (in particular, when userspace
code is running!)  This is solvable, but worth thinking about.


Re: Migrated CFS task getting an unfair advantage

2016-03-09 Thread Andrew Hunter
At Google, we essentially reverted 88ec22d and the subsequent tweaks
to it, keeping vruntime absolute always, instead using
task_move_group_fair to change the basis of relative min_vruntime
between CPUs.  We found this made it a lot easier to reason about and
work with cross-CPU computations.  I could post the patch if it would
be of interest...
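
For illustration only (this is not the actual patch), the rebasing idea
amounts to something like:

static void rebase_entity_vruntime(struct sched_entity *se,
                                   struct cfs_rq *src_cfs_rq,
                                   struct cfs_rq *dst_cfs_rq)
{
        /* Preserve the entity's lag by shifting its (absolute) vruntime
         * from the source runqueue's basis to the destination's, rather
         * than de-normalizing at dequeue and re-normalizing at enqueue. */
        se->vruntime -= src_cfs_rq->min_vruntime;
        se->vruntime += dst_cfs_rq->min_vruntime;
}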

On Wed, Mar 9, 2016 at 5:06 AM, pavankumar kondeti wrote:
> Hi Peter,
>
> On Wed, Mar 9, 2016 at 5:34 PM, Peter Zijlstra wrote:
>> On Wed, Mar 09, 2016 at 02:52:57PM +0530, Pavan Kondeti wrote:
>>
>>> When a CFS task is enqueued during migration (load balance or change in
>>> affinity), its vruntime is normalized before updating the current and
>>> cfs_rq->min_vruntime.
>>
>> static void
>> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> {
>> /*
>>  * Update the normalized vruntime before updating min_vruntime
>>  * through calling update_curr().
>>  */
>> if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
>> se->vruntime += cfs_rq->min_vruntime;
>>
>> update_curr(cfs_rq);
>>
>> This, right? Some idiot wrote a comment but forgot to explain why.
>>
>>> If the current entity is a low priority task or belongs to a cgroup
>>> that has lower cpu.shares and it is the only entity queued, there is a
>>> possibility of big update to the cfs_rq->min_vruntime.
>>
>>> As the migrated task is normalized before this update, it gets an
>>> unfair advantage over tasks queued after this point. If the migrated
>>> task is a CPU hogger, the other CFS tasks queued on this CPU gets
>>> starved.
>>
>> Because it takes a whole while for the newly placed task to gain on the
>> previous task, right?
>>
>
> Yes. The newly woken up task vruntime is adjusted wrt the 
> cfs_rq->min_vruntime.
> The cfs_rq->min_vruntime can potentially be hundreds of msec beyond the
> migrated task.
>
>>> If we add the migrated task to destination CPU cfs_rq's rb tree before
>>> updating the current in enqueue_entity(), the cfs_rq->min_vruntime
>>> does not go beyond the newly migrated task. Is this an acceptable
>>> solution?
>>
>> Hurm.. so I'm not sure how that would solve anything. The existing task
>> would still be shot far into the future.
>>
> In my testing, the problem is gone with this approach.
>
> The update_min_vruntime() called from  update_curr() has a check to make
> sure that cfs_rq->min_vruntime  does not go beyond the leftmost entity
> (in this case it would be the migrated task) vruntime. so we don't see
> the migrated task getting any advantage.
>
>> What you want is to normalize after update_curr()... but we cannot do
>> that in the case cfs_rq->curr == se (which I suppose is what that
>> comment is on about).
>>
>> Does something like the below work?
>>
>
> Thanks for providing this patch. It solved the problem.
>
>> ---
>>  kernel/sched/fair.c | 20 ++--
>>  1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 33130529e9b5..3c114d971d84 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3157,17 +3157,25 @@ static inline void check_schedstat_required(void)
>>  static void
>>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>  {
>> +   bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
>> +   bool curr = cfs_rq->curr == se;
>> +
>> /*
>> -* Update the normalized vruntime before updating min_vruntime
>> -* through calling update_curr().
>> +* If we're the current task, we must renormalise before calling
>> +* update_curr().
>>  */
>> -   if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
>> +   if (renorm && curr)
>> se->vruntime += cfs_rq->min_vruntime;
>>
>> +   update_curr(cfs_rq);
>> +
>> /*
>> -* Update run-time statistics of the 'current'.
>> +* Otherwise, renormalise after, such that we're placed at the current
>> +* moment in time, instead of some random moment in the past.
>>  */
>> -   update_curr(cfs_rq);
>> +   if (renorm && !curr)
>> +   se->vruntime += cfs_rq->min_vruntime;
>> +
>> enqueue_entity_load_avg(cfs_rq, se);
>> account_entity_enqueue(cfs_rq, se);
>> update_cfs_shares(cfs_rq);
>> @@ -3183,7 +3191,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> update_stats_enqueue(cfs_rq, se);
>> check_spread(cfs_rq, se);
>> }
>> -   if (se != cfs_rq->curr)
>> +   if (!curr)
>> __enqueue_entity(cfs_rq, se);
>> se->on_rq = 1;
>>


Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread

2016-01-27 Thread Andrew Hunter
On Wed, Jan 27, 2016 at 9:36 AM, Mathieu Desnoyers wrote:
> - On Jan 27, 2016, at 12:24 PM, Thomas Gleixner t...@linutronix.de wrote:
>
>> On Wed, 27 Jan 2016, Josh Triplett wrote:
>>> With the dynamic allocation removed, this seems sensible to me.  One
>>> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>>> number should never need to hold a negative number.
>>
>> You try to block the future of computing: https://lwn.net/Articles/638673/
>
> Besides impossible architectures, there is actually a use-case for
> signedness here. It makes it possible to initialize the cpu number
> cache to a negative value, e.g. -1, in userspace. Then, a check for
> value < 0 can be used to figure out cases where the getcpu_cache
> system call is not implemented, and where a fallback (vdso or getcpu
> syscall) needs to be used.
>
> This is why I have chosen a signed type for the cpu cache so far.
>

In our internal version of this patch (part of the RSEQ system
discussed elsewhere) we have a signed CPU id for this reason.  I think
it's a good idea to keep that in userspace and it makes more sense to
match the user and kernel versions of the types.
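
For illustration, the userspace pattern the signed type enables looks
roughly like this (sketch only; the names are made up):

#define _GNU_SOURCE
#include <sched.h>    /* sched_getcpu() fallback */
#include <stdint.h>

/* -1 means "not registered / kernel support absent". */
static __thread int32_t cpu_cache = -1;

static inline int read_cpu(void)
{
        int32_t cpu = cpu_cache;  /* kept current by the kernel once registered */

        if (cpu >= 0)
                return cpu;       /* fast path: no syscall */
        return sched_getcpu();    /* fallback: vdso or getcpu() syscall */
}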


Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

2015-05-22 Thread Andrew Hunter
On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski wrote:
> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number.  Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count.  If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.
>

We had to deploy the userspace percpu API (percpu sharded locks,
{double,}compare-and-swap, atomic-increment, etc) universally to the
fleet without waiting for 100% kernel penetration, not to mention
wanting to disable the kernel acceleration in case of kernel bugs.
(Since this is mostly used in core infrastructure--malloc, various
statistics platforms, etc--in userspace, checking for availability
isn't feasible. The primitives have to work 100% of the time or it
would be too complex for our developers to bother using them.)

So we did basically this (without the lock stealing...): we have a
single per-cpu spin lock manipulated with atomics, which we take very
briefly to implement (e.g.) compare-and-swap.  The performance is
hugely worse; typical overheads are in the 10x range _without_ any
on-cpu contention. Uncontended atomics are much cheaper than they were
on pre-Nehalem chips, but they still can't hold a candle to
unsynchronized instructions.
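
A rough sketch of that fallback, purely for illustration (not our actual
implementation; MAX_CPUS and the layout are placeholders):

#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS 4096                   /* placeholder */

struct percpu_shard {
        atomic_int lock;                /* 0 = free, 1 = held */
        long       val;                 /* the per-cpu data itself */
} __attribute__((aligned(64)));         /* one cache line per shard */

static struct percpu_shard shards[MAX_CPUS];

/* "Percpu" compare-and-swap on this CPU's shard, done under a briefly
 * held per-cpu spinlock built from ordinary atomics.  Correct, but
 * roughly 10x the cost of the unsynchronized restartable sequence. */
static bool percpu_cas(long expected, long desired)
{
        struct percpu_shard *s = &shards[sched_getcpu()];
        bool ok;

        /* If we migrate after sched_getcpu(), we simply operate on the
         * old CPU's shard; the lock still keeps the op atomic. */
        while (atomic_exchange_explicit(&s->lock, 1, memory_order_acquire))
                ;                       /* on-cpu contention is rare */
        ok = (s->val == expected);
        if (ok)
                s->val = desired;
        atomic_store_explicit(&s->lock, 0, memory_order_release);
        return ok;
}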

As a fallback path for userspace, this is fine--if 5% of binaries on
busted kernels aren't quite as fast, we can work with that in exchange
for being able to write a percpu op without worrying about what to do
on -ENOSYS.  But it's just not fast enough to compete as the intended
way to do things.

AHH


Re: [PATCH] sched: fix timeval conversion to jiffies

2014-09-04 Thread Andrew Hunter
On Thu, Sep 4, 2014 at 2:36 PM, Paul Turner wrote:
> On Thu, Sep 4, 2014 at 2:30 PM, John Stultz wrote:
>> This seems to be a quite old bug.. Do you think this is needed for -stable?
>
> Seems reasonable to me.
>

I have no opinion: backport or don't at your preference.


[PATCH] sched: fix timeval conversion to jiffies

2014-09-04 Thread Andrew Hunter
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly when repeatedly stopping/starting an itimer:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.)  Doing
this repeatedly would cause unbounded growth in val.  So fix the math.

Here's what was wrong with the conversion: we essentially computed
(eliding seconds)

jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.
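
(Worked numbers, for illustration, with HZ=1000, so TICK_NSEC = 1000000
and USEC_JIFFIE_SC = 41.  The scaled multiplier is

  USEC_CONVERSION = ((1000 << 41) + 999999) / 1000000 = 2199023256

while the exact ratio is 2^41 / 1000 = 2199023255.552.  For usec = 10000:

  10000 * 2199023256      = 21990232560000   (4480 more than 10 * 2^41)
  + USEC_ROUND (2^41 - 1)  = 24189255815551   (>= 11 * 2^41 = 24189255811072)
  >> 41 (effectively)      = 11 jiffies, not 10.

With the exact ratio, the same round-up would have produced 10.)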

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies.  This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.

Tested: the following program:

#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

int main() {
  struct itimerval zero = {{0, 0}, {0, 0}};
  /* Initially set to 10 ms. */
  struct itimerval initial = zero;
  initial.it_interval.tv_usec = 10000;
  setitimer(ITIMER_PROF, &initial, NULL);
  /* Save and restore several times. */
  for (size_t i = 0; i < 10; ++i) {
    struct itimerval prev;
    setitimer(ITIMER_PROF, &zero, &prev);
    /* on old kernels, this goes up by TICK_USEC every iteration */
    printf("previous value: %ld %ld %ld %ld\n",
           prev.it_interval.tv_sec, prev.it_interval.tv_usec,
           prev.it_value.tv_sec, prev.it_value.tv_usec);
    setitimer(ITIMER_PROF, &prev, NULL);
  }
  return 0;
}

Signed-off-by: Andrew Hunter 
Reviewed-by: Paul Turner 
Reported-by: Aaron Jacobs 
Change-Id: I7cd0f0764847fd055d39531f54e6ea3dd3ce5453
---
 include/linux/jiffies.h | 12 ---
 kernel/time.c   | 54 +++--
 2 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 1f44466..c367cbd 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
 #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
 #endif
 #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
 #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
 
 #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
                                         TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION  \
-                    ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
-                                        TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion.  See there
- * for more details.  It is the scaled resolution rounding value.  Note
- * that it is a 64-bit value.  Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
 /*
  * The maximum jiffie value is (MAX_INT >> 1).  Here we translate that
  * into seconds.  The 64-bit case will overflow if we are not careful,
diff --git a/kernel/time.c b/kernel/time.c
index 7c7964c..3c49ab4 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
  * that a remainder subtract here would not do the right thing as the
  * resolution values don't fall on second boundries.  I.e. the line:
  * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
  *
  * Rather, we just shift

Re: [PATCH] sched: fix timeval conversion to jiffies

2014-09-04 Thread Andrew Hunter
On Wed, Sep 3, 2014 at 5:06 PM, John Stultz wrote:
> Maybe with the next version of the patch, before you get into the
> unwinding the math, you might practically describe what is broken,
> then explain how its broken.
>

Done.

> My quick read here is that we're converting a timespec -> jiffies, and
> in doing so we round up by one jiffy.
>
> This seems actually perfectly normal, as we usually end up rounding up
> by a jiffy in many cases since we don't want to undershoot any
> timeout, and we're always allowed return later then specified.

Well, yes, timeouts can be longer than specified, but what you said
technically applies just as much to code that arbitrarily multiplies
jiffies by 10 before returning, no? :)

The problem isn't the rounding, it's that the rounding is _wrong_: I'm
fine rounding 10100usec / (1000 usec/jiffie) = 11 jiffies. The current
code rounds 10000 usec / (1000 usec/jiffie) to 11.  I've rewritten the
description to make this clearer.

>
>> In particular, with HZ=1000, we consistently computed that 10000 usec
>> was 11 jiffies; the same was true for any exact multiple of
>> TICK_NSEC. This is obviously bad as a general rule, and caused
>> observable user problems with setitimer() at the very least:
>>
>> setitimer(ITIMER_PROF, &val, NULL);
>> setitimer(ITIMER_PROF, NULL, &val);
>>
>> would actually add a tick to val!
>
> So this looks like an issue. Since we convert and store the internal
> representation in jiffies, when we pull it back out, we get the
> rounded up value, which is larger then the timespec value originally
> submitted.  This is really the core issue, correct?

For the particular user bug reported to me, yes, this was the core
issue: some code that stopped and restarted an itimer found the
interval growing by 1ms each time.  But again, it wasn't that it was
rounded:  if we initially passed e.g. 10500 usec and got back 11000,
that'd be annoying but workable, because if we then went through
another cycle of enabling/disabling itimer, we'd set it to 11000 usec
and get back 11000 again.  What we have now instead adds a full jiffie
_every time_.

> Or does this
> manifest in other ways that are problematic (the patch seems to hint
> that it does)?
>

Um: hard to be sure. I can tell you that for PROF itimers we track the
difference between the requested timeval in ns and the given number of
jiffies (see the logic with it->error in check_cpu_itimer() in
kernel/posix-cpu-timers.c), so over long periods we'll get the right
sample rate (the only problem is the timeval reported back to
userspace, and the samples might not quite be uniform.)  It appears
there are no other uses of timeval_to_jiffies, so we're probably safe.
But in any case it's wrong as currently coded so we should fix it.

(Actually, one could use the same code tracking it->error to always
return the exact--unrounded--interval to userspace, but that's minor
compared to a constantly growing interval, and I'd rather not
complicate this patch any more.)

>> +   nsec = nsec + TICK_NSEC - 1;
>
>
> If we're still rounding up one tick here, does the same issue not persist?
>

Nope (at least not in tested configurations.) Since jiffie is
effectively defined as some integral number of nsec, this rounding
DTRT to round up to an integral number of jiffies. (The scaled
arithmetic might, for sufficiently large intervals/right values of
timeval produce some error, but I haven't been able to make it do so.)

> Overall the patch doesn't look too bad and does reduce the amount of
> ugly math with simpler code reuse.  But I'd like the patch description
> to be more clear about why this is needed and the impact.
>

Rewritten; sending v2.


[PATCH] sched: fix timeval conversion to jiffies

2014-08-07 Thread Andrew Hunter
timeval_to_jiffies rounding was broken.  It essentially computed
(eliding seconds)

jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and it rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value: rounding the division up by adding 2^USEC_JIFFIE_SC - 1, when
the error in scaling was added in, was sufficient to add one jiffie to
the final result.

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC. This is obviously bad as a general rule, and caused
observable user problems with setitimer() at the very least:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would actually add a tick to val!

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies.  This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.
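
For illustration, the new path is essentially (sketch; the actual patch
refactors timespec_to_jiffies into the shared helper shown in the diff
below):

unsigned long timeval_to_jiffies(const struct timeval *value)
{
        /* Convert usec to nsec up front; __timespec_to_jiffies() then
         * rounds up to a whole tick in nanoseconds. */
        return __timespec_to_jiffies(value->tv_sec,
                                     value->tv_usec * NSEC_PER_USEC);
}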

Tested: the following program:

#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

int main() {
  struct itimerval zero = {{0, 0}, {0, 0}};
  /* Initially set to 10 ms. */
  struct itimerval initial = zero;
  initial.it_interval.tv_usec = 10000;
  setitimer(ITIMER_PROF, &initial, NULL);
  /* Save and restore several times. */
  for (size_t i = 0; i < 10; ++i) {
    struct itimerval prev;
    setitimer(ITIMER_PROF, &zero, &prev);
    /* on old kernels, this goes up by TICK_USEC every iteration */
    printf("previous value: %ld %ld %ld %ld\n",
           prev.it_interval.tv_sec, prev.it_interval.tv_usec,
           prev.it_value.tv_sec, prev.it_value.tv_usec);
    setitimer(ITIMER_PROF, &prev, NULL);
  }
  return 0;
}

Signed-off-by: Andrew Hunter 
Reviewed-by: Paul Turner 
Reported-by: Aaron Jacobs 
Change-Id: I7cd0f0764847fd055d39531f54e6ea3dd3ce5453
---
 include/linux/jiffies.h | 12 ---
 kernel/time.c   | 54 +++--
 2 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 1f44466..c367cbd 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
 #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
 #endif
 #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
 #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
 
 #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
                                         TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION  \
-                    ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
-                                        TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion.  See there
- * for more details.  It is the scaled resolution rounding value.  Note
- * that it is a 64-bit value.  Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
 /*
  * The maximum jiffie value is (MAX_INT >> 1).  Here we translate that
  * into seconds.  The 64-bit case will overflow if we are not careful,
diff --git a/kernel/time.c b/kernel/time.c
index 7c7964c..3c49ab4 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
  * that a remainder subtract here would not do the right thing as the
  * resolution values don't fall on second boundries.  I.e. the line:
  * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
  *
  * Rather, we just shift the bits off the right.
  *
  * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
  * value to a scaled second value.
  */
-unsigned long
-timespec_to_jiffies(const struct timespec *value)
+static unsigned long
+__timespec_to_jiffies(unsigned long sec, long nsec)
 {
-   unsigned long sec = value->tv_sec;
-   long nsec = value->tv_nsec + TICK_NSEC - 1;
+   nsec = nsec + TICK_NSEC - 1;
 
if (sec >= MAX_SEC_IN_JIFFIES){
sec = MAX_SEC_IN_JIFFIES;
@@ -517,6 +520,13 @@ timespec_to_jiffies(const struct timespec *value)
 (NSEC_JIFFIE_SC - SEC_JIF


[RFC] [PATCH] x86: avoid per_cpu for APIC id tables

2013-07-17 Thread Andrew Hunter
Hi, I have a patch (following) that modifies handling of APIC id tables,
trading a small amount of space in the (NR_CPUS - nr_cpu_ids) >> 0 case for
faster accesses and slightly better cache layout (as APIC ids are mostly used
cross-cpu.)  I'm not an APIC expert so I'd appreciate some eyes on this, but
it shouldn't change any behavior whatsoever.  Thoughts? (We're likely to merge
this internally even if upstream judges the space loss too much of a cost, so
I'd like to know if there's some other problem I've missed that this causes.)

I've tested this cursorily in most of our internal configurations but not in
any particularly exotic hardware/config.


From e6bf354c05d98651e8c27f96582f0ab56992e58a Mon Sep 17 00:00:00 2001
From: Andrew Hunter 
Date: Tue, 16 Jul 2013 16:50:36 -0700
Subject: [PATCH] x86: avoid per_cpu for APIC id tables

DEFINE_PER_CPU(var) and friends go to lengths to arrange all of cpu
i's per cpu variables as contiguous with each other; this requires a
double indirection to reference a variable.

For data that is logically per-cpu but

a) rarely modified
b) commonly accessed from other CPUs

this is bad: no writes means we don't have to worry about cache ping
pong, and cross-CPU access means there's no cache savings from not
pulling in remote entries.  (Actually, it's worse than "no" cache
savings: instead of one cache line containing 32 useful APIC ids, it
will contain 3 useful APIC ids and much other percpu data from the
remote CPU we don't want.)  It's also slower to access, due to the
indirection.
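
Illustrative contrast (not taken from the patch; the names here are
placeholders):

DECLARE_PER_CPU_READ_MOSTLY(u16, foo_apicid);   /* per-cpu layout */
extern u16 foo_apicid_flat[NR_CPUS];            /* flat array layout */

static inline u16 remote_apicid_percpu(int cpu)
{
        /* per_cpu() goes through __per_cpu_offset[cpu]: an extra load,
         * and the line it lands on is full of unrelated percpu data. */
        return per_cpu(foo_apicid, cpu);
}

static inline u16 remote_apicid_flat(int cpu)
{
        /* One load; a 64-byte line holds 32 consecutive CPUs' ids. */
        return foo_apicid_flat[cpu];
}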

So instead use a flat array for APIC ids, most commonly used for IPIs
and the like.  This makes a measurable improvement (up to 10%) in some
benchmarks that heavily stress remote wakeups.

The one disadvantage is that we waste 8 bytes per unused CPU (NR_CPUS
- actual). But this is a fairly small amount of memory for reasonable
values of NR_CPUS.

Tested: builds and boots, runs a suite of wakeup-intensive tests without failure.

---
 arch/x86/include/asm/apic.h   |  5 ++---
 arch/x86/include/asm/smp.h|  8 +++
 arch/x86/kernel/acpi/boot.c   |  4 ++--
 arch/x86/kernel/apic/apic.c   | 42 ---
 arch/x86/kernel/apic/apic_numachip.c  |  2 +-
 arch/x86/kernel/apic/bigsmp_32.c  | 11 +
 arch/x86/kernel/apic/es7000_32.c  | 12 +-
 arch/x86/kernel/apic/ipi.c| 14 ++--
 arch/x86/kernel/apic/numaq_32.c   |  2 +-
 arch/x86/kernel/apic/summit_32.c  | 12 +-
 arch/x86/kernel/apic/x2apic_cluster.c | 12 +-
 arch/x86/kernel/apic/x2apic_phys.c|  2 +-
 arch/x86/kernel/apic/x2apic_uv_x.c|  4 ++--
 arch/x86/kernel/setup_percpu.c| 17 --
 arch/x86/kernel/smpboot.c |  2 +-
 arch/x86/mm/numa.c|  5 +
 arch/x86/platform/uv/tlb_uv.c |  2 +-
 17 files changed, 60 insertions(+), 96 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index f8119b5..9a80f49 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -547,8 +547,7 @@ static inline const struct cpumask *online_target_cpus(void)
return cpu_online_mask;
 }
 
-DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
-
+extern u16 x86_bios_cpu_apicid[NR_CPUS];
 
 static inline unsigned int read_apic_id(void)
 {
@@ -660,7 +659,7 @@ static inline void default_ioapic_phys_id_map(physid_mask_t *phys_map, physid_ma
 static inline int __default_cpu_present_to_apicid(int mps_cpu)
 {
if (mps_cpu < nr_cpu_ids && cpu_present(mps_cpu))
-   return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu);
+   return (int)x86_bios_cpu_apicid[mps_cpu];
else
return BAD_APICID;
 }
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index b073aae..25deeac0 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -53,10 +53,10 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
return per_cpu(cpu_llc_shared_map, cpu);
 }
 
-DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
-DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
+extern u16 x86_cpu_to_apicid[NR_CPUS];
+extern u16 x86_bios_cpu_apicid[NR_CPUS];
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32)
-DECLARE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid);
+extern u16 x86_cpu_to_logical_apicid[NR_CPUS];
 #endif
 
 /* Static state in head.S used to set up a CPU */
@@ -168,7 +168,7 @@ void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
-#define cpu_physical_id(cpu)   per_cpu(x86_cpu_to_apicid, cpu)
+#define cpu_physical_id(cpu)   x86_cpu_to_apicid[cpu]
 
 #else /* !CONFIG_SMP */
 #define wbinvd_on_cpu(cpu) wbinvd()
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index d81a972..5bae841 100644
--- a/arch/x86

[tip:perf/core] perf/x86: Reduce stack usage of x86_schedule_events()

2013-06-19 Thread tip-bot for Andrew Hunter
Commit-ID:  43b4578071c0e6d87761e113e05d45776cc75437
Gitweb: http://git.kernel.org/tip/43b4578071c0e6d87761e113e05d45776cc75437
Author: Andrew Hunter 
AuthorDate: Thu, 23 May 2013 11:07:03 -0700
Committer:  Ingo Molnar 
CommitDate: Wed, 19 Jun 2013 12:50:44 +0200

perf/x86: Reduce stack usage of x86_schedule_events()

x86_schedule_events() caches event constraints on the stack during
scheduling.  Given the number of possible events, this is 512 bytes of
stack; since it can be invoked under schedule() under god-knows-what,
this is causing stack blowouts.

Trade some space usage for stack safety: add a place to cache the
constraint pointer to struct perf_event.  For 8 bytes per event (1% of
its size) we can save the giant stack frame.

This shouldn't change any aspect of scheduling whatsoever and while in
theory the locality's a tiny bit worse, I doubt we'll see any
performance impact either.
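
A stripped-down userspace sketch of the shape change (the structures here
are simplified stand-ins, not the real perf ones; the 64-entry limit
mirrors X86_PMC_IDX_MAX and, with 8-byte pointers, gives the 512-byte
figure above):

/* constraint_cache_sketch.c */
#include <stdio.h>

#define IDX_MAX 64	/* stands in for X86_PMC_IDX_MAX */

struct constraint { int weight; };
struct hw_event   { struct constraint *constraint; };	/* cached pointer */
struct event      { struct hw_event hw; };

/*
 * Old shape: a local constraints[IDX_MAX] array, i.e. 64 pointers of
 * stack (512 bytes on 64-bit) on every scheduling call.
 */
static void schedule_stack_array(struct event **events, int n)
{
	struct constraint *constraints[IDX_MAX];
	int i;

	for (i = 0; i < n; i++)
		constraints[i] = events[i]->hw.constraint; /* fill simplified */
	/* ... scheduling would walk constraints[i] ... */
	(void)constraints;
}

/*
 * New shape: the pointer already lives in each event, so the frame
 * stays flat and scheduling follows events[i]->hw.constraint.
 */
static void schedule_cached_pointer(struct event **events, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		struct constraint *c = events[i]->hw.constraint;
		(void)c;	/* ... use c ... */
	}
}

int main(void)
{
	struct constraint c = { .weight = 1 };
	struct event e = { .hw = { .constraint = &c } };
	struct event *list[1] = { &e };

	schedule_stack_array(list, 1);
	schedule_cached_pointer(list, 1);
	printf("stack array cost per call: %zu bytes\n",
	       IDX_MAX * sizeof(struct constraint *));
	return 0;
}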

Tested: `perf stat whatever` does not blow up and produces
results that aren't hugely obviously wrong.  I'm not sure how to run
particularly good tests of perf code, but this should not produce any
functional change whatsoever.

Signed-off-by: Andrew Hunter 
Reviewed-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1369332423-4400-1-git-send-email-...@google.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event.c  | 28 ++-
 arch/x86/kernel/cpu/perf_event.h  |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 10 ++
 include/linux/perf_event.h|  4 
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1025f3c..e52a9e5 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -568,7 +568,7 @@ struct sched_state {
 struct perf_sched {
int max_weight;
int max_events;
-   struct event_constraint **constraints;
+   struct perf_event   **events;
struct sched_state  state;
int saved_states;
struct sched_state  saved[SCHED_STATES_MAX];
@@ -577,7 +577,7 @@ struct perf_sched {
 /*
  * Initialize interator that runs through all events and counters.
  */
-static void perf_sched_init(struct perf_sched *sched, struct event_constraint **c,
+static void perf_sched_init(struct perf_sched *sched, struct perf_event **events,
int num, int wmin, int wmax)
 {
int idx;
@@ -585,10 +585,10 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
memset(sched, 0, sizeof(*sched));
sched->max_events   = num;
sched->max_weight   = wmax;
-   sched->constraints  = c;
+   sched->events   = events;
 
for (idx = 0; idx < num; idx++) {
-   if (c[idx]->weight == wmin)
+   if (events[idx]->hw.constraint->weight == wmin)
break;
}
 
@@ -635,8 +635,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
if (sched->state.event >= sched->max_events)
return false;
 
-   c = sched->constraints[sched->state.event];
-
+   c = sched->events[sched->state.event]->hw.constraint;
/* Prefer fixed purpose counters */
if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
idx = INTEL_PMC_IDX_FIXED;
@@ -694,7 +693,7 @@ static bool perf_sched_next_event(struct perf_sched *sched)
if (sched->state.weight > sched->max_weight)
return false;
}
-   c = sched->constraints[sched->state.event];
+   c = sched->events[sched->state.event]->hw.constraint;
} while (c->weight != sched->state.weight);
 
sched->state.counter = 0;   /* start with first counter */
@@ -705,12 +704,12 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 /*
  * Assign a counter for each event.
  */
-int perf_assign_events(struct event_constraint **constraints, int n,
+int perf_assign_events(struct perf_event **events, int n,
int wmin, int wmax, int *assign)
 {
struct perf_sched sched;
 
-   perf_sched_init(&sched, constraints, n, wmin, wmax);
+   perf_sched_init(&sched, events, n, wmin, wmax);
 
do {
if (!perf_sched_find_counter(&sched))
@@ -724,7 +723,7 @@ int perf_assign_events(struct event_constraint **constraints, int n,
 
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-   struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
+   struct event_constraint *c;
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
int i, wmin, wmax, num = 0;
struct hw_perf_event *hwc

[PATCH] perf: reduce stack usage of schedule_events

2013-05-23 Thread Andrew Hunter
schedule_events caches event constraints on the stack during
scheduling.  Given the number of possible events, this is 512 bytes of
stack; since it can be invoked under schedule() under god-knows-what,
this is causing stack blowouts.

Trade some space usage for stack safety: add a place to cache the
constraint pointer to struct perf_event.  For 8 bytes per event (1% of
its size) we can save the giant stack frame.

This shouldn't change any aspect of scheduling whatsoever and while in
theory the locality's a tiny bit worse, I doubt we'll see any
performance impact either.

Tested: `perf stat whatever` does not blow up and produces
results that aren't hugely obviously wrong.  I'm not sure how to run
particularly good tests of perf code, but this should not produce any
functional change whatsoever.

Signed-off-by: Andrew Hunter 
Reviewed-by: Stephane Eranian 
---
 arch/x86/kernel/cpu/perf_event.c  | 28 ++-
 arch/x86/kernel/cpu/perf_event.h  |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 10 ++
 include/linux/perf_event.h|  4 
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index bf0f01a..e4bfc2b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -562,7 +562,7 @@ struct sched_state {
 struct perf_sched {
int max_weight;
int max_events;
-   struct event_constraint **constraints;
+   struct perf_event   **events;
struct sched_state  state;
int saved_states;
struct sched_state  saved[SCHED_STATES_MAX];
@@ -571,7 +571,7 @@ struct perf_sched {
 /*
  * Initialize interator that runs through all events and counters.
  */
-static void perf_sched_init(struct perf_sched *sched, struct event_constraint **c,
+static void perf_sched_init(struct perf_sched *sched, struct perf_event **events,
int num, int wmin, int wmax)
 {
int idx;
@@ -579,10 +579,10 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
memset(sched, 0, sizeof(*sched));
sched->max_events   = num;
sched->max_weight   = wmax;
-   sched->constraints  = c;
+   sched->events   = events;
 
for (idx = 0; idx < num; idx++) {
-   if (c[idx]->weight == wmin)
+   if (events[idx]->hw.constraint->weight == wmin)
break;
}
 
@@ -629,8 +629,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
if (sched->state.event >= sched->max_events)
return false;
 
-   c = sched->constraints[sched->state.event];
-
+   c = sched->events[sched->state.event]->hw.constraint;
/* Prefer fixed purpose counters */
if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
idx = INTEL_PMC_IDX_FIXED;
@@ -688,7 +687,7 @@ static bool perf_sched_next_event(struct perf_sched *sched)
if (sched->state.weight > sched->max_weight)
return false;
}
-   c = sched->constraints[sched->state.event];
+   c = sched->events[sched->state.event]->hw.constraint;
} while (c->weight != sched->state.weight);
 
sched->state.counter = 0;   /* start with first counter */
@@ -699,12 +698,12 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 /*
  * Assign a counter for each event.
  */
-int perf_assign_events(struct event_constraint **constraints, int n,
+int perf_assign_events(struct perf_event **events, int n,
int wmin, int wmax, int *assign)
 {
struct perf_sched sched;
 
-   perf_sched_init(&sched, constraints, n, wmin, wmax);
+   perf_sched_init(&sched, events, n, wmin, wmax);
 
do {
if (!perf_sched_find_counter(&sched))
@@ -718,7 +717,7 @@ int perf_assign_events(struct event_constraint **constraints, int n,
 
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-   struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
+   struct event_constraint *c;
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
int i, wmin, wmax, num = 0;
struct hw_perf_event *hwc;
@@ -726,8 +725,10 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
bitmap_zero(used_mask, X86_PMC_IDX_MAX);
 
for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
+   hwc = &cpuc->event_list[i]->hw;
c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
-   constraints[i] = c;
+   hwc->constraint = c;
+
wmin = min(wmin, c->weight);
wmax = max(wmax, c->weight);
}
@@ -737,7 +738,7 @@ int

Re: [PATCH 1/1] core-kernel: use multiply instead of shifts in hash_64

2012-07-12 Thread Andrew Hunter
On Tue, Jul 10, 2012 at 6:35 AM, Michael Tokarev  wrote:
> On 03.07.2012 00:25, Andrew Hunter wrote:
>> diff --git a/include/linux/hash.h b/include/linux/hash.h
>> index b80506b..daabc3d 100644
>> --- a/include/linux/hash.h
>> +++ b/include/linux/hash.h
>> @@ -34,7 +34,9 @@
>>  static inline u64 hash_64(u64 val, unsigned int bits)
>>  {
>>   u64 hash = val;
>> -
>> +#if BITS_PER_LONG == 64
>> + hash *= GOLDEN_RATIO_PRIME_64;
>> +#else
>>   /*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
>
> Hmm.  Does this comment make sense here now?
>

I haven't checked what output gcc provides for 32-bit kernels with
this or a literal multiply.  It's not even clear what optimization is
_asked_ for here (possibly the reduction of strength that we probably
don't even want.)
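
For reference, the two 64-bit variants under discussion can be compared in
isolation; this is my userspace reconstruction of the old include/linux/hash.h
shift sequence, so treat it as illustrative rather than authoritative:

/* hash64_compare.c: literal multiply vs. the shift/add expansion. */
#include <stdint.h>
#include <stdio.h>

#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001ULL

/* Literal multiply, the 64-bit path the patch proposes. */
static uint64_t hash_64_mul(uint64_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_PRIME_64) >> (64 - bits);
}

/* The shift/add expansion of the same prime, spelled out for gcc. */
static uint64_t hash_64_shifts(uint64_t val, unsigned int bits)
{
	uint64_t hash = val;
	uint64_t n = hash;

	n <<= 18; hash -= n;
	n <<= 33; hash -= n;
	n <<= 3;  hash += n;
	n <<= 3;  hash -= n;
	n <<= 4;  hash += n;
	n <<= 2;  hash += n;

	return hash >> (64 - bits);
}

int main(void)
{
	uint64_t v = 0x123456789abcdef0ULL;

	/* Both should agree: the shifts encode 0x9e37fffffffc0001. */
	printf("%llx %llx\n",
	       (unsigned long long)hash_64_mul(v, 32),
	       (unsigned long long)hash_64_shifts(v, 32));
	return 0;
}
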
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

