Re: Updated sys_membarrier() speedup patch, FYI
On Thu, Jul 27, 2017 at 12:06 PM, Paul E. McKenney wrote:
> IPIing only those CPUs running threads in the same process as the
> thread invoking membarrier() would be very nice! There is some LKML
> discussion on this topic, which is currently circling around making this
> determination reliable on all CPU families. ARM and x86 are thought
> to be OK, PowerPC is thought to require a smallish patch, MIPS is
> a big question mark, and so on.

I'm not sure what you mean by "the determination" or how this is arch specific?

> But I am surprised when you say that the downgrade would not work, at
> least if you are not running with nohz_full CPUs. The rcu_sched_qs()
> function simply sets a per-CPU quiescent-state flag. The needed strong
> ordering is instead supplied by the combination of the code starting
> the grace period, reporting the setting of the quiescent-state flag
> to core RCU, and the code completing the grace period. Each non-idle
> CPU will execute full memory barriers either in RCU_SOFTIRQ context,
> on entry to idle, on exit from idle, or within the grace-period kthread.
> In particular, a CPU running the same usermode thread for the entire
> grace period will execute the needed memory barriers in RCU_SOFTIRQ
> context shortly after taking a scheduling-clock interrupt.

Recall that I need more than just a memory barrier: I also need to interrupt RSEQ critical sections in progress on those CPUs. I know this isn't general purpose; I'm just saying a trivial downgrade wouldn't work for me. :) It would probably be sufficient to set NOTIFY_RESUME on all CPUs running my code (which is what my IPI function does anyway...)
Re: Updated sys_membarrier() speedup patch, FYI
On Thu, Jul 27, 2017 at 12:43 PM, Paul E. McKenney wrote:
> On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> IPIing only running threads of my process would be perfect. In fact
>> I might even be able to make use of "membarrier these threads
>> please" to reduce IPIs, when I change the topology from fully
>> connected to something more sparse, on larger machines.

We do this as well--sometimes we only need RSEQ fences against specific CPU(s), and thus pass a subset.

> +static void membarrier_private_expedited_ipi_each(void)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		struct task_struct *p;
> +
> +		rcu_read_lock();
> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		if (p && p->mm == current->mm)
> +			smp_call_function_single(cpu, ipi_mb, NULL, 1);
> +		rcu_read_unlock();
> +	}
> +}
> +

We have the (simpler, imho):

	const struct cpumask *mask = mm_cpumask(mm);
	/* possibly AND it with a user requested mask */
	smp_call_function_many(mask, ipi_func, ...);

which I think will be faster on some archs (that support broadcast) and have fewer problems with out-of-sync values (though we do have to check in our IPI function that we haven't context switched out). Am I missing why this won't work?
Re: Updated sys_membarrier() speedup patch, FYI
On Thu, Jul 27, 2017 at 11:12 AM, Paul E. McKenney wrote:
> Hello!
>
> But my main question is whether the throttling shown below is acceptable
> for your use cases, namely only one expedited sys_membarrier() permitted
> per scheduling-clock period (1 millisecond on many platforms), with any
> excess being silently converted to non-expedited form.

Google doesn't use sys_membarrier (that I know of...), but we do use RSEQ fences, which implement membarrier + a little extra to interrupt RSEQ critical sections (via IPI--smp_call_function_many).

One important optimization here is that we only throw IPIs to cpus running the same mm as current (or a subset if requested by userspace), as this is sufficient for the API guarantees we provide. I suspect a similar optimization would largely mitigate DOS concerns, no? I don't know if there are use cases not covered.

To answer your question: throttling these (or our equivalents) would be fine in terms of userspace throughput. We haven't noticed performance problems requiring such an intervention, however. Furthermore, I wince a bit at the silent downgrade; I'd almost prefer -EAGAIN or -EBUSY. In particular, again for RSEQ fence, the downgrade simply wouldn't work; rcu_sched_qs() gets called at many points that aren't sufficiently quiescent for RSEQ (in particular, when userspace code is running!) This is solvable, but worth thinking about.
Re: Migrated CFS task getting an unfair advantage
At Google, we essentially reverted 88ec22d and the subsequent tweaks to it, keeping vruntime absolute always, instead using task_move_group_fair to change the basis of relative min_vruntime between cpus. We found this made it a lot easier to reason about and work with cross-cpu computations. I could post the patch if it would be of interest...

On Wed, Mar 9, 2016 at 5:06 AM, Pavankumar Kondeti wrote:
> Hi Peter,
>
> On Wed, Mar 9, 2016 at 5:34 PM, Peter Zijlstra wrote:
>> On Wed, Mar 09, 2016 at 02:52:57PM +0530, Pavan Kondeti wrote:
>>
>>> When a CFS task is enqueued during migration (load balance or change in
>>> affinity), its vruntime is normalized before updating the current and
>>> cfs_rq->min_vruntime.
>>
>> static void
>> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> {
>> 	/*
>> 	 * Update the normalized vruntime before updating min_vruntime
>> 	 * through calling update_curr().
>> 	 */
>> 	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
>> 		se->vruntime += cfs_rq->min_vruntime;
>>
>> 	update_curr(cfs_rq);
>>
>> This, right? Some idiot wrote a comment but forgot to explain why.
>>
>>> If the current entity is a low priority task or belongs to a cgroup
>>> that has lower cpu.shares and it is the only entity queued, there is a
>>> possibility of big update to the cfs_rq->min_vruntime.
>>
>>> As the migrated task is normalized before this update, it gets an
>>> unfair advantage over tasks queued after this point. If the migrated
>>> task is a CPU hogger, the other CFS tasks queued on this CPU gets
>>> starved.
>>
>> Because it takes a whole while for the newly placed task to gain on the
>> previous task, right?
>
> Yes. The newly woken up task vruntime is adjusted wrt the
> cfs_rq->min_vruntime. The cfs_rq->min_vruntime can potentially be
> hundreds of msec beyond the migrated task.
>>> If we add the migrated task to destination CPU cfs_rq's rb tree before
>>> updating the current in enqueue_entity(), the cfs_rq->min_vruntime
>>> does not go beyond the newly migrated task. Is this an acceptable
>>> solution?
>>
>> Hurm.. so I'm not sure how that would solve anything. The existing task
>> would still be shot far into the future.
>
> In my testing, the problem is gone with this approach.
>
> The update_min_vruntime() called from update_curr() has a check to make
> sure that cfs_rq->min_vruntime does not go beyond the leftmost entity
> (in this case it would be the migrated task) vruntime. so we don't see
> the migrated task getting any advantage.
>
>> What you want is to normalize after update_curr()... but we cannot do
>> that in the case cfs_rq->curr == se (which I suppose is what that
>> comment is on about).
>>
>> Does something like the below work?
>
> Thanks for providing this patch. It solved the problem.
>
>> ---
>>  kernel/sched/fair.c | 20 ++++++++++++++------
>>  1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 33130529e9b5..3c114d971d84 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3157,17 +3157,25 @@ static inline void check_schedstat_required(void)
>>  static void
>>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>  {
>> +	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
>> +	bool curr = cfs_rq->curr == se;
>> +
>>  	/*
>> -	 * Update the normalized vruntime before updating min_vruntime
>> -	 * through calling update_curr().
>> +	 * If we're the current task, we must renormalise before calling
>> +	 * update_curr().
>>  	 */
>> -	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
>> +	if (renorm && curr)
>>  		se->vruntime += cfs_rq->min_vruntime;
>>
>> 	update_curr(cfs_rq);
>>
>>  	/*
>> -	 * Update run-time statistics of the 'current'.
>> +	 * Otherwise, renormalise after, such that we're placed at the current
>> +	 * moment in time, instead of some random moment in the past.
>>  	 */
>> -	update_curr(cfs_rq);
>> +	if (renorm && !curr)
>> +		se->vruntime += cfs_rq->min_vruntime;
>> +
>>  	enqueue_entity_load_avg(cfs_rq, se);
>>  	account_entity_enqueue(cfs_rq, se);
>>  	update_cfs_shares(cfs_rq);
>> @@ -3183,7 +3191,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>  		update_stats_enqueue(cfs_rq, se);
>>  		check_spread(cfs_rq, se);
>>  	}
>> -	if (se != cfs_rq->curr)
>> +	if (!curr)
>>  		__enqueue_entity(cfs_rq, se);
>>  	se->on_rq = 1;
Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread
On Wed, Jan 27, 2016 at 9:36 AM, Mathieu Desnoyers wrote:
> ----- On Jan 27, 2016, at 12:24 PM, Thomas Gleixner t...@linutronix.de wrote:
>
>> On Wed, 27 Jan 2016, Josh Triplett wrote:
>>> With the dynamic allocation removed, this seems sensible to me. One
>>> minor nit: s/int32_t/uint32_t/g, since a location intended to hold a CPU
>>> number should never need to hold a negative number.
>>
>> You try to block the future of computing: https://lwn.net/Articles/638673/
>
> Besides impossible architectures, there is actually a use-case for
> signedness here. It makes it possible to initialize the cpu number
> cache to a negative value, e.g. -1, in userspace. Then, a check for
> value < 0 can be used to figure out cases where the getcpu_cache
> system call is not implemented, and where a fallback (vdso or getcpu
> syscall) needs to be used.
>
> This is why I have chosen a signed type for the cpu cache so far.

In our internal version of this patch (part of the RSEQ system discussed elsewhere) we have a signed CPU id for this reason. I think it's a good idea to keep that in userspace, and it makes more sense to match the user and kernel versions of the types.
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski wrote:
> Create an array of user-managed locks, one per cpu. Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number. Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count. If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.

We had to deploy the userspace percpu API (percpu sharded locks, {double,}compare-and-swap, atomic-increment, etc.) universally to the fleet without waiting for 100% kernel penetration, not to mention wanting to disable the kernel acceleration in case of kernel bugs. (Since this is mostly used in core infrastructure--malloc, various statistics platforms, etc.--in userspace, checking for availability isn't feasible. The primitives have to work 100% of the time or it would be too complex for our developers to bother using them.)

So we did basically this (without the lock stealing...): we have a single per-cpu spin lock manipulated with atomics, which we take very briefly to implement (e.g.) compare-and-swap. The performance is hugely worse; typical overheads are in the 10x range _without_ any on-cpu contention. Uncontended atomics are much cheaper than they were on pre-Nehalem chips, but they still can't hold a candle to unsynchronized instructions.

As a fallback path for userspace, this is fine--if 5% of binaries on busted kernels aren't quite as fast, we can work with that in exchange for being able to write a percpu op without worrying about what to do on -ENOSYS. But it's just not fast enough to compete as the intended way to do things.
AHH
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: fix timeval conversion to jiffies
On Thu, Sep 4, 2014 at 2:36 PM, Paul Turner wrote:
> On Thu, Sep 4, 2014 at 2:30 PM, John Stultz wrote:
>> This seems to be a quite old bug.. Do you think this is needed for -stable?
>
> Seems reasonable to me.

I have no opinion: backport or don't at your preference.
[PATCH] sched: fix timeval conversion to jiffies
timeval_to_jiffies tried to round a timeval up to an integral number of jiffies, but the logic for doing so was incorrect: intervals corresponding to exactly N jiffies would become N+1. This manifested itself particularly when repeatedly stopping/starting an itimer:

	setitimer(ITIMER_PROF, &val, NULL);
	setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in terms of jiffies_ (say, the result of a previous rounding.) Doing this repeatedly would cause unbounded growth in val. So fix the math.

Here's what was wrong with the conversion: we essentially computed (eliding seconds)

	jiffies = usec * (NSEC_PER_USEC / TICK_NSEC)

by using scaling arithmetic, which took the best approximation of NSEC_PER_USEC / TICK_NSEC with a denominator of 2^USEC_JIFFIE_SC, i.e. x / (2^USEC_JIFFIE_SC), and computed:

	jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we can't necessarily exactly represent TICK_NSEC in usec.) But the scaling arithmetic is a (very slight) *over*approximation of the true value; that is, instead of dividing by (1 usec / 1 jiffie), we effectively divided by (1 usec / 1 jiffie) - epsilon (rounding down). This would normally be fine, but we want to round timeouts up, and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this would be fine if our division was exact, but dividing this by the slightly smaller factor was equivalent to adding just _over_ 1 to the final result (instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec was 11 jiffies; the same was true for any exact multiple of TICK_NSEC. We could possibly still round in the intermediate form, adding something less than 2^USEC_JIFFIE_SC - 1, but easier still is to convert usec->nsec, round in nanoseconds, and then convert using time*spec*_to_jiffies. This adds one constant multiplication, and is not observably slower in microbenchmarks on recent x86 hardware.
Tested: the following program:

	int main() {
		struct itimerval zero = {{0, 0}, {0, 0}};
		/* Initially set to 10 ms. */
		struct itimerval initial = zero;
		initial.it_interval.tv_usec = 10000;
		setitimer(ITIMER_PROF, &initial, NULL);
		/* Save and restore several times. */
		for (size_t i = 0; i < 10; ++i) {
			struct itimerval prev;
			setitimer(ITIMER_PROF, &zero, &prev);
			/* on old kernels, this goes up by TICK_USEC every iteration */
			printf("previous value: %ld %ld %ld %ld\n",
			       prev.it_interval.tv_sec, prev.it_interval.tv_usec,
			       prev.it_value.tv_sec, prev.it_value.tv_usec);
			setitimer(ITIMER_PROF, &prev, NULL);
		}
		return 0;
	}

Signed-off-by: Andrew Hunter
Reviewed-by: Paul Turner
Reported-by: Aaron Jacobs
Change-Id: I7cd0f0764847fd055d39531f54e6ea3dd3ce5453
---
 include/linux/jiffies.h | 12 ------------
 kernel/time.c           | 54 ++++++++++++++++++++++---------------------
 2 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 1f44466..c367cbd 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
 #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
 #endif
 #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
 #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
 #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
                                         TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION  \
-                    ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
-                                        TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion.  See there
- * for more details.  It is the scaled resolution rounding value.  Note
- * that it is a 64-bit value.  Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
 /*
  * The maximum jiffie value is (MAX_INT >> 1).  Here we translate that
  * into seconds.
 * The 64-bit case will overflow if we are not careful,

diff --git a/kernel/time.c b/kernel/time.c
index 7c7964c..3c49ab4 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
  * that a remainder subtract here would not do the right thing as the
  * resolution values don't fall on second boundries.  I.e. the line:
  * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
  *
  * Rather, we just shift
Re: [PATCH] sched: fix timeval conversion to jiffies
On Wed, Sep 3, 2014 at 5:06 PM, John Stultz wrote:
> Maybe with the next version of the patch, before you get into the
> unwinding the math, you might practically describe what is broken,
> then explain how its broken.

Done.

> My quick read here is that we're converting a timespec -> jiffies, and
> in doing so we round up by one jiffy.
>
> This seems actually perfectly normal, as we usually end up rounding up
> by a jiffy in many cases since we don't want to undershoot any
> timeout, and we're always allowed return later then specified.

Well, yes, timeouts can be longer than specified, but what you said technically applies just as much to code that arbitrarily multiplies jiffies by 10 before returning, no? :) The problem isn't the rounding, it's that the rounding is _wrong_: I'm fine with rounding 10100 usec / (1000 usec/jiffie) = 11 jiffies. The current code rounds 10000 usec / (1000 usec/jiffie) to 11. I've rewritten the description to make this clearer.

>> In particular, with HZ=1000, we consistently computed that 10000 usec
>> was 11 jiffies; the same was true for any exact multiple of
>> TICK_NSEC. This is obviously bad as a general rule, and caused
>> observable user problems with setitimer() at the very least:
>>
>> setitimer(ITIMER_PROF, &val, NULL);
>> setitimer(ITIMER_PROF, NULL, &val);
>>
>> would actually add a tick to val!
>
> So this looks like an issue. Since we convert and store the internal
> representation in jiffies, when we pull it back out, we get the
> rounded up value, which is larger then the timespec value originally
> submitted. This is really the core issue, correct?

For the particular user bug reported to me, yes, this was the core issue: some code that stopped and restarted an itimer found the interval growing by 1ms each time. But again, it wasn't that it was rounded: if we initially passed e.g.
10500 usec and got back 11000, that'd be annoying but workable, because if we then went through another cycle of enabling/disabling the itimer, we'd set it to 11000 usec and get back 11000 again. What we have now instead adds a full jiffie _every time_.

> Or does this
> manifest in other ways that are problematic (the patch seems to hint
> that it does)?

Um: hard to be sure. I can tell you that for PROF itimers we track the difference between the requested timeval in ns and the given number of jiffies (see the logic with it->error in check_cpu_itimer() in kernel/posix-cpu-timers.c), so over long periods we'll get the right sample rate (the only problem is the timeval reported back to userspace, and the samples might not quite be uniform.) It appears there are no other uses of timeval_to_jiffies, so we're probably safe. But in any case it's wrong as currently coded, so we should fix it.

(Actually, one could use the same code tracking it->error to always return the exact--unrounded--interval to userspace, but that's minor compared to a constantly growing interval, and I'd rather not complicate this patch any more.)

>> +	nsec = nsec + TICK_NSEC - 1;
>
> If we're still rounding up one tick here, does the same issue not persist?

Nope (at least not in tested configurations.) Since a jiffie is effectively defined as some integral number of nsec, this rounding DTRT to round up to an integral number of jiffies. (The scaled arithmetic might, for sufficiently large intervals/right values of timeval, produce some error, but I haven't been able to make it do so.)

> Overall the patch doesn't look too bad and does reduce the amount of
> ugly math with simpler code reuse. But I'd like the patch description
> to be more clear about why this is needed and the impact.

Rewritten; sending v2.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] sched: fix timeval conversion to jiffies
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly when repeatedly stopping/starting an itimer:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.) Doing this
repeatedly would cause unbounded growth in val. So fix the math.

Here's what was wrong with the conversion: we essentially computed
(eliding seconds)

jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding down). This
would normally be fine, but we want to round timeouts up, and we did
so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this would be fine
if our division was exact, but dividing this by the slightly smaller
factor was equivalent to adding just _over_ 1 to the final result
(instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of TICK_NSEC.

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies. This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.
Tested: the following program:

int main() {
    struct itimerval zero = {{0, 0}, {0, 0}};
    /* Initially set to 10 ms. */
    struct itimerval initial = zero;
    initial.it_interval.tv_usec = 10000;
    setitimer(ITIMER_PROF, &initial, NULL);
    /* Save and restore several times. */
    for (size_t i = 0; i < 10; ++i) {
        struct itimerval prev;
        setitimer(ITIMER_PROF, &zero, &prev);
        /* on old kernels, this goes up by TICK_USEC every iteration */
        printf("previous value: %ld %ld %ld %ld\n",
               prev.it_interval.tv_sec, prev.it_interval.tv_usec,
               prev.it_value.tv_sec, prev.it_value.tv_usec);
        setitimer(ITIMER_PROF, &prev, NULL);
    }
    return 0;
}

Signed-off-by: Andrew Hunter a...@google.com
Reviewed-by: Paul Turner p...@google.com
Reported-by: Aaron Jacobs jaco...@google.com
Change-Id: I7cd0f0764847fd055d39531f54e6ea3dd3ce5453
---
 include/linux/jiffies.h | 12 ---
 kernel/time.c           | 54 +++--
 2 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 1f44466..c367cbd 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
 #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
 #endif
 #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
 #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
 #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION \
-               ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
-                                TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion.  See there
- * for more details.  It is the scaled resolution rounding value.  Note
- * that it is a 64-bit value.  Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
 /*
  * The maximum jiffie value is (MAX_INT >> 1).
 Here we translate that
 * into seconds.  The 64-bit case will overflow if we are not careful,

diff --git a/kernel/time.c b/kernel/time.c
index 7c7964c..3c49ab4 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
  * that a remainder subtract here would not do the right thing as the
  * resolution values don't fall on second boundries.  I.e. the line:
  * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
  *
  * Rather, we just shift the bits
Re: [PATCH] sched: fix timeval conversion to jiffies
On Thu, Sep 4, 2014 at 2:36 PM, Paul Turner p...@google.com wrote:
> On Thu, Sep 4, 2014 at 2:30 PM, John Stultz john.stu...@linaro.org wrote:
>> This seems to be a quite old bug.. Do you think this is needed for
>> -stable?
>
> Seems reasonable to me.

I have no opinion: backport or don't at your preference.
[PATCH] sched: fix timeval conversion to jiffies
timeval_to_jiffies rounding was broken.  It essentially computed
(eliding seconds)

jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and it rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.)  But the
scaling arithmetic is a (very slight) *over*approximation of the true
value: rounding the division up by adding 2^USEC_JIFFIE_SC - 1, when
the error in scaling was added in, was sufficient to add one jiffie to
the final result.

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of TICK_NSEC.
This is obviously bad as a general rule, and caused observable user
problems with setitimer() at the very least:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would actually add a tick to val!

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies.  This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.

Tested: the following program:

int main() {
    struct itimerval zero = {{0, 0}, {0, 0}};
    /* Initially set to 10 ms. */
    struct itimerval initial = zero;
    initial.it_interval.tv_usec = 10000;
    setitimer(ITIMER_PROF, &initial, NULL);
    /* Save and restore several times.
 */
    for (size_t i = 0; i < 10; ++i) {
        struct itimerval prev;
        setitimer(ITIMER_PROF, &zero, &prev);
        /* on old kernels, this goes up by TICK_USEC every iteration */
        printf("previous value: %ld %ld %ld %ld\n",
               prev.it_interval.tv_sec, prev.it_interval.tv_usec,
               prev.it_value.tv_sec, prev.it_value.tv_usec);
        setitimer(ITIMER_PROF, &prev, NULL);
    }
    return 0;
}

Signed-off-by: Andrew Hunter
Reviewed-by: Paul Turner
Reported-by: Aaron Jacobs
Change-Id: I7cd0f0764847fd055d39531f54e6ea3dd3ce5453
---
 include/linux/jiffies.h | 12 ---
 kernel/time.c           | 54 +++--
 2 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 1f44466..c367cbd 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
 #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
 #endif
 #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
 #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
 #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
                                 TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION \
-               ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
-                                TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion.  See there
- * for more details.  It is the scaled resolution rounding value.  Note
- * that it is a 64-bit value.  Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
 /*
  * The maximum jiffie value is (MAX_INT >> 1).  Here we translate that
 * into seconds.
 The 64-bit case will overflow if we are not careful,

diff --git a/kernel/time.c b/kernel/time.c
index 7c7964c..3c49ab4 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
  * that a remainder subtract here would not do the right thing as the
  * resolution values don't fall on second boundries.  I.e. the line:
  * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
  *
  * Rather, we just shift the bits off the right.
  *
  * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
  * value to a scaled second value.
  */
-unsigned long
-timespec_to_jiffies(const struct timespec *value)
+static unsigned long
+__timespec_to_jiffies(unsigned long sec, long nsec)
 {
-	unsigned long sec = value->tv_sec;
-	long nsec = value->tv_nsec + TICK_NSEC - 1;
+	nsec = nsec + TICK_NSEC - 1;

 	if (sec >= MAX_SEC_IN_JIFFIES){
 		sec = MAX_SEC_IN_JIFFIES;
@@ -517,6 +520,13 @@ timespec_to_jiffies(const struct timespec *value)
 		(NSEC_JIFFIE_SC - SEC_JIFFIE_SC
[RFC] [PATCH] x86: avoid per_cpu for APIC id tables
Hi, I have a patch (following) that modifies handling of APIC id
tables, trading a small amount of space in the (NR_CPUS - nr_cpu_ids) > 0
case for faster accesses and slightly better cache layout (as APIC ids
are mostly used cross-cpu.) I'm not an APIC expert so I'd appreciate
some eyes on this, but it shouldn't change any behavior whatsoever.
Thoughts? (We're likely to merge this internally even if upstream
judges the space loss too much of a cost, so I'd like to know if
there's some other problem I've missed that this causes.) I've tested
this cursorily in most of our internal configurations but not in any
particularly exotic hardware/config.

From e6bf354c05d98651e8c27f96582f0ab56992e58a Mon Sep 17 00:00:00 2001
From: Andrew Hunter
Date: Tue, 16 Jul 2013 16:50:36 -0700
Subject: [PATCH] x86: avoid per_cpu for APIC id tables

DEFINE_PER_CPU(var) and friends go to lengths to arrange all of cpu
i's per cpu variables as contiguous with each other; this requires a
double indirection to reference a variable.

For data that is logically per-cpu but a) rarely modified b) commonly
accessed from other CPUs this is bad: no writes means we don't have to
worry about cache ping pong, and cross-CPU access means there's no
cache savings from not pulling in remote entries. (Actually, it's
worse than "no" cache savings: instead of one cache line containing 32
useful APIC ids, it will contain 3 useful APIC ids and much other
percpu data from the remote CPU we don't want.) It's also slower to
access, due to the indirection.

So instead use a flat array for APIC ids, most commonly used for IPIs
and the like. This makes a measurable improvement (up to 10%) in some
benchmarks that heavily stress remote wakeups.

The one disadvantage is that we waste 8 bytes per unused CPU (NR_CPUS
- actual). But this is a fairly small amount of memory for reasonable
values of NR_CPUS.

Tested: builds and boots, runs a suite of wakeup-intensive test
without failure.
---
 arch/x86/include/asm/apic.h           |  5 ++---
 arch/x86/include/asm/smp.h            |  8 +++
 arch/x86/kernel/acpi/boot.c           |  4 ++--
 arch/x86/kernel/apic/apic.c           | 42 ---
 arch/x86/kernel/apic/apic_numachip.c  |  2 +-
 arch/x86/kernel/apic/bigsmp_32.c      | 11 +
 arch/x86/kernel/apic/es7000_32.c      | 12 +-
 arch/x86/kernel/apic/ipi.c            | 14 ++--
 arch/x86/kernel/apic/numaq_32.c       |  2 +-
 arch/x86/kernel/apic/summit_32.c      | 12 +-
 arch/x86/kernel/apic/x2apic_cluster.c | 12 +-
 arch/x86/kernel/apic/x2apic_phys.c    |  2 +-
 arch/x86/kernel/apic/x2apic_uv_x.c    |  4 ++--
 arch/x86/kernel/setup_percpu.c        | 17 --
 arch/x86/kernel/smpboot.c             |  2 +-
 arch/x86/mm/numa.c                    |  5 +
 arch/x86/platform/uv/tlb_uv.c         |  2 +-
 17 files changed, 60 insertions(+), 96 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index f8119b5..9a80f49 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -547,8 +547,7 @@ static inline const struct cpumask *online_target_cpus(void)
 	return cpu_online_mask;
 }

-DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
-
+extern u16 x86_bios_cpu_apicid[NR_CPUS];

 static inline unsigned int read_apic_id(void)
 {
@@ -660,7 +659,7 @@ static inline void default_ioapic_phys_id_map(physid_mask_t *phys_map, physid_ma
 static inline int __default_cpu_present_to_apicid(int mps_cpu)
 {
 	if (mps_cpu < nr_cpu_ids && cpu_present(mps_cpu))
-		return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu);
+		return (int)x86_bios_cpu_apicid[mps_cpu];
 	else
 		return BAD_APICID;
 }

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index b073aae..25deeac0 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -53,10 +53,10 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 	return per_cpu(cpu_llc_shared_map, cpu);
 }

-DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
-DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
+extern u16 x86_cpu_to_apicid[NR_CPUS];
+extern u16 x86_bios_cpu_apicid[NR_CPUS];
 #if
 defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32)
-DECLARE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid);
+extern u16 x86_cpu_to_logical_apicid[NR_CPUS];
 #endif

 /* Static state in head.S used to set up a CPU */
@@ -168,7 +168,7 @@ void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);

-#define cpu_physical_id(cpu)	per_cpu(x86_cpu_to_apicid, cpu)
+#define cpu_physical_id(cpu)	x86_cpu_to_apicid[cpu]

 #else /* !CONFIG_SMP */
 #define wbinvd_on_cpu(cpu)	wbinvd()

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index d81a972
[tip:perf/core] perf/x86: Reduce stack usage of x86_schedule_events()
Commit-ID:  43b4578071c0e6d87761e113e05d45776cc75437
Gitweb:     http://git.kernel.org/tip/43b4578071c0e6d87761e113e05d45776cc75437
Author:     Andrew Hunter
AuthorDate: Thu, 23 May 2013 11:07:03 -0700
Committer:  Ingo Molnar
CommitDate: Wed, 19 Jun 2013 12:50:44 +0200

perf/x86: Reduce stack usage of x86_schedule_events()

x86_schedule_events() caches event constraints on the stack during
scheduling. Given the number of possible events, this is 512 bytes of
stack; since it can be invoked under schedule() under god-knows-what,
this is causing stack blowouts.

Trade some space usage for stack safety: add a place to cache the
constraint pointer to struct perf_event. For 8 bytes per event (1% of
its size) we can save the giant stack frame. This shouldn't change any
aspect of scheduling whatsoever and while in theory the locality's a
tiny bit worse, I doubt we'll see any performance impact either.

Tested: `perf stat whatever` does not blow up and produces results
that aren't hugely obviously wrong. I'm not sure how to run
particularly good tests of perf code, but this should not produce any
functional change whatsoever.
Signed-off-by: Andrew Hunter
Reviewed-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1369332423-4400-1-git-send-email-...@google.com
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/cpu/perf_event.c              | 28 ++-
 arch/x86/kernel/cpu/perf_event.h              |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 10 ++
 include/linux/perf_event.h                    |  4
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1025f3c..e52a9e5 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -568,7 +568,7 @@ struct sched_state {
 struct perf_sched {
 	int			max_weight;
 	int			max_events;
-	struct event_constraint **constraints;
+	struct perf_event	**events;
 	struct sched_state	state;
 	int			saved_states;
 	struct sched_state	saved[SCHED_STATES_MAX];
@@ -577,7 +577,7 @@ struct perf_sched {
 /*
  * Initialize iterator that runs through all events and counters.
  */
-static void perf_sched_init(struct perf_sched *sched, struct event_constraint **c,
+static void perf_sched_init(struct perf_sched *sched, struct perf_event **events,
 			    int num, int wmin, int wmax)
 {
 	int idx;
@@ -585,10 +585,10 @@ static void perf_sched_init(struct perf_sched **
 	memset(sched, 0, sizeof(*sched));
 	sched->max_events	= num;
 	sched->max_weight	= wmax;
-	sched->constraints	= c;
+	sched->events		= events;

 	for (idx = 0; idx < num; idx++) {
-		if (c[idx]->weight == wmin)
+		if (events[idx]->hw.constraint->weight == wmin)
 			break;
 	}

@@ -635,8 +635,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
 	if (sched->state.event >= sched->max_events)
 		return false;

-	c = sched->constraints[sched->state.event];
-
+	c = sched->events[sched->state.event]->hw.constraint;
 	/* Prefer fixed purpose counters */
 	if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
 		idx = INTEL_PMC_IDX_FIXED;
@@ -694,7 +693,7 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 		if
 (sched->state.weight > sched->max_weight)
 			return false;
 	}

-	c = sched->constraints[sched->state.event];
+	c = sched->events[sched->state.event]->hw.constraint;
 	} while (c->weight != sched->state.weight);

 	sched->state.counter = 0;	/* start with first counter */

@@ -705,12 +704,12 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 /*
  * Assign a counter for each event.
  */
-int perf_assign_events(struct event_constraint **constraints, int n,
+int perf_assign_events(struct perf_event **events, int n,
 			int wmin, int wmax, int *assign)
 {
 	struct perf_sched sched;

-	perf_sched_init(&sched, constraints, n, wmin, wmax);
+	perf_sched_init(&sched, events, n, wmin, wmax);

 	do {
 		if (!perf_sched_find_counter(&sched))
@@ -724,7 +723,7 @@ int perf_assign_events(struct event_constraint **constraints, int n,

 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
+	struct event_constraint *c;
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	int i, wmin, wmax, num = 0;
 	struct hw_perf_ev
[tip:perf/core] perf/x86: Reduce stack usage of x86_schedule_events()
Commit-ID:  43b4578071c0e6d87761e113e05d45776cc75437
Gitweb:     http://git.kernel.org/tip/43b4578071c0e6d87761e113e05d45776cc75437
Author:     Andrew Hunter <a...@google.com>
AuthorDate: Thu, 23 May 2013 11:07:03 -0700
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 19 Jun 2013 12:50:44 +0200

perf/x86: Reduce stack usage of x86_schedule_events()

x86_schedule_events() caches event constraints on the stack during
scheduling. Given the number of possible events, this is 512 bytes of
stack; since it can be invoked under schedule() under god-knows-what,
this is causing stack blowouts.

Trade some space usage for stack safety: add a place to cache the
constraint pointer to struct perf_event. For 8 bytes per event (1% of
its size) we can save the giant stack frame.

This shouldn't change any aspect of scheduling whatsoever and while in
theory the locality's a tiny bit worse, I doubt we'll see any
performance impact either.

Tested: `perf stat whatever` does not blow up and produces results that
aren't hugely obviously wrong. I'm not sure how to run particularly
good tests of perf code, but this should not produce any functional
change whatsoever.
Signed-off-by: Andrew Hunter <a...@google.com>
Reviewed-by: Stephane Eranian <eran...@google.com>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
Link: http://lkml.kernel.org/r/1369332423-4400-1-git-send-email-...@google.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c              | 28 ++-
 arch/x86/kernel/cpu/perf_event.h              |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 10 ++
 include/linux/perf_event.h                    |  4
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1025f3c..e52a9e5 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -568,7 +568,7 @@ struct sched_state {
 struct perf_sched {
 	int			max_weight;
 	int			max_events;
-	struct event_constraint	**constraints;
+	struct perf_event	**events;
 	struct sched_state	state;
 	int			saved_states;
 	struct sched_state	saved[SCHED_STATES_MAX];
@@ -577,7 +577,7 @@ struct perf_sched {
 /*
  * Initialize interator that runs through all events and counters.
  */
-static void perf_sched_init(struct perf_sched *sched, struct event_constraint **c,
+static void perf_sched_init(struct perf_sched *sched, struct perf_event **events,
 			    int num, int wmin, int wmax)
 {
 	int idx;
@@ -585,10 +585,10 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
 	memset(sched, 0, sizeof(*sched));
 	sched->max_events	= num;
 	sched->max_weight	= wmax;
-	sched->constraints	= c;
+	sched->events		= events;

 	for (idx = 0; idx < num; idx++) {
-		if (c[idx]->weight == wmin)
+		if (events[idx]->hw.constraint->weight == wmin)
 			break;
 	}
@@ -635,8 +635,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
 	if (sched->state.event >= sched->max_events)
 		return false;

-	c = sched->constraints[sched->state.event];
-
+	c = sched->events[sched->state.event]->hw.constraint;
 	/* Prefer fixed purpose counters */
 	if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
 		idx = INTEL_PMC_IDX_FIXED;
@@ -694,7 +693,7 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 		if (sched->state.weight > sched->max_weight)
 			return false;
 	}
-	c = sched->constraints[sched->state.event];
+	c = sched->events[sched->state.event]->hw.constraint;
 	} while (c->weight != sched->state.weight);

 	sched->state.counter = 0;	/* start with first counter */
@@ -705,12 +704,12 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 /*
  * Assign a counter for each event.
  */
-int perf_assign_events(struct event_constraint **constraints, int n,
+int perf_assign_events(struct perf_event **events, int n,
 			int wmin, int wmax, int *assign)
 {
 	struct perf_sched sched;

-	perf_sched_init(&sched, constraints, n, wmin, wmax);
+	perf_sched_init(&sched, events, n, wmin, wmax);

 	do {
 		if (!perf_sched_find_counter(&sched))
@@ -724,7 +723,7 @@ int perf_assign_events(struct event_constraint **constraints, int n,
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
+	struct event_constraint *c;
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	int i, wmin, wmax, num = 0;
 	struct hw_perf_event *hwc
[PATCH] perf: reduce stack usage of schedule_events
schedule_events caches event constraints on the stack during
scheduling. Given the number of possible events, this is 512 bytes of
stack; since it can be invoked under schedule() under god-knows-what,
this is causing stack blowouts.

Trade some space usage for stack safety: add a place to cache the
constraint pointer to struct perf_event. For 8 bytes per event (1% of
its size) we can save the giant stack frame.

This shouldn't change any aspect of scheduling whatsoever and while in
theory the locality's a tiny bit worse, I doubt we'll see any
performance impact either.

Tested: `perf stat whatever` does not blow up and produces results that
aren't hugely obviously wrong. I'm not sure how to run particularly
good tests of perf code, but this should not produce any functional
change whatsoever.

Signed-off-by: Andrew Hunter
Reviewed-by: Stephane Eranian
---
 arch/x86/kernel/cpu/perf_event.c              | 28 ++-
 arch/x86/kernel/cpu/perf_event.h              |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 10 ++
 include/linux/perf_event.h                    |  4
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index bf0f01a..e4bfc2b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -562,7 +562,7 @@ struct sched_state {
 struct perf_sched {
 	int			max_weight;
 	int			max_events;
-	struct event_constraint	**constraints;
+	struct perf_event	**events;
 	struct sched_state	state;
 	int			saved_states;
 	struct sched_state	saved[SCHED_STATES_MAX];
@@ -571,7 +571,7 @@ struct perf_sched {
 /*
  * Initialize interator that runs through all events and counters.
  */
-static void perf_sched_init(struct perf_sched *sched, struct event_constraint **c,
+static void perf_sched_init(struct perf_sched *sched, struct perf_event **events,
 			    int num, int wmin, int wmax)
 {
 	int idx;
@@ -579,10 +579,10 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
 	memset(sched, 0, sizeof(*sched));
 	sched->max_events	= num;
 	sched->max_weight	= wmax;
-	sched->constraints	= c;
+	sched->events		= events;

 	for (idx = 0; idx < num; idx++) {
-		if (c[idx]->weight == wmin)
+		if (events[idx]->hw.constraint->weight == wmin)
 			break;
 	}
@@ -629,8 +629,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
 	if (sched->state.event >= sched->max_events)
 		return false;

-	c = sched->constraints[sched->state.event];
-
+	c = sched->events[sched->state.event]->hw.constraint;
 	/* Prefer fixed purpose counters */
 	if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
 		idx = INTEL_PMC_IDX_FIXED;
@@ -688,7 +687,7 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 		if (sched->state.weight > sched->max_weight)
 			return false;
 	}
-	c = sched->constraints[sched->state.event];
+	c = sched->events[sched->state.event]->hw.constraint;
 	} while (c->weight != sched->state.weight);

 	sched->state.counter = 0;	/* start with first counter */
@@ -699,12 +698,12 @@ static bool perf_sched_next_event(struct perf_sched *sched)
 /*
  * Assign a counter for each event.
  */
-int perf_assign_events(struct event_constraint **constraints, int n,
+int perf_assign_events(struct perf_event **events, int n,
 			int wmin, int wmax, int *assign)
 {
 	struct perf_sched sched;

-	perf_sched_init(&sched, constraints, n, wmin, wmax);
+	perf_sched_init(&sched, events, n, wmin, wmax);

 	do {
 		if (!perf_sched_find_counter(&sched))
@@ -718,7 +717,7 @@ int perf_assign_events(struct event_constraint **constraints, int n,
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
+	struct event_constraint *c;
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	int i, wmin, wmax, num = 0;
 	struct hw_perf_event *hwc;
@@ -726,8 +725,10 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);

 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
+		hwc = &cpuc->event_list[i]->hw;
 		c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
-		constraints[i] = c;
+		hwc->constraint = c;
+
 		wmin = min(wmin, c
Re: [PATCH 1/1] core-kernel: use multiply instead of shifts in hash_64
On Tue, Jul 10, 2012 at 6:35 AM, Michael Tokarev wrote:
> On 03.07.2012 00:25, Andrew Hunter wrote:
>> diff --git a/include/linux/hash.h b/include/linux/hash.h
>> index b80506b..daabc3d 100644
>> --- a/include/linux/hash.h
>> +++ b/include/linux/hash.h
>> @@ -34,7 +34,9 @@
>>  static inline u64 hash_64(u64 val, unsigned int bits)
>>  {
>>  	u64 hash = val;
>> -
>> +#if BITS_PER_LONG == 64
>> +	hash *= GOLDEN_RATIO_PRIME_64;
>> +#else
>>  	/* Sigh, gcc can't optimise this alone like it does for 32 bits. */
>
> Hmm. Does this comment make sense here now?

I haven't checked what output gcc provides for 32-bit kernels with this
or a literal multiply. It's not even clear what optimization is _asked_
for here (possibly the reduction of strength that we probably don't
even want.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/