Re: [PATCH 1/3] sched: Create sched_select_cpu() to give preferred CPU for power saving

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 16:06 +0530, Viresh Kumar wrote:
 +/* sched-domain levels */
 +#define SD_SIBLING	0x01	/* Only for CONFIG_SCHED_SMT */
 +#define SD_MC		0x02	/* Only for CONFIG_SCHED_MC */
 +#define SD_BOOK	0x04	/* Only for CONFIG_SCHED_BOOK */
 +#define SD_CPU		0x08	/* Always enabled */
 +#define SD_NUMA	0x10	/* Only for CONFIG_NUMA */

Urgh, no, not more of that nonsense.. I want to get rid of that
hardcoded stuff, not add more of it.


Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
  We do this because nr_node_ids changed, right? This means the entire
  distance table grew/shrunk, which means we should do the level scan
  again.
 
 It seems that nr_node_ids will not change once the system is up.
 I'm not quite sure. If I am wrong, please tell me. :) 

Ah, right you are..

So the problem is that cpumask_of_node() doesn't contain offline cpus,
and we might not boot the machine with all cpus present.

In that case I guess the suggested hotplug hooks are fine. It's just that
your changelog was confusing.


Re: [BUG] perf/x86: Intel uncore_pmu_to_box() local variable typo

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 12:44 +0200, Stephane Eranian wrote:
 Hi,
 
 I don't understand why the local variable box needs to
 be declared static here:
 
 static struct intel_uncore_box *
 uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu)
 {
 static struct intel_uncore_box *box;
 
 box = *per_cpu_ptr(pmu->box, cpu);
 if (box)
 return box;
 
 }

Uhmm.. me neither.. that looks like a bug indeed.
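
A sketch of the presumed fix, assuming the elided remainder of the
function is unchanged: drop the errant 'static' so 'box' is an ordinary
automatic variable rather than shared state.

static struct intel_uncore_box *
uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu)
{
	struct intel_uncore_box *box;	/* no 'static' */

	box = *per_cpu_ptr(pmu->box, cpu);
	if (box)
		return box;
	/* ... slow path elided in the quote above ... */
	return NULL;
}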


Re: [PATCH 3/3] workqueue: Schedule work on non-idle cpu instead of current one

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 16:06 +0530, Viresh Kumar wrote:
 @@ -1066,8 +1076,9 @@ int queue_work(struct workqueue_struct *wq,
 struct work_struct *work)
  {
 int ret;
  
 -   ret = queue_work_on(get_cpu(), wq, work);
 -   put_cpu();
 +   preempt_disable();
 +   ret = queue_work_on(wq_select_cpu(), wq, work);
 +   preempt_enable();
  
 return ret;
  }

Right, so the problem I see here is that wq_select_cpu() is horridly
expensive..

 @@ -1102,7 +1113,7 @@ static void delayed_work_timer_fn(unsigned long
 __data)
 struct delayed_work *dwork = (struct delayed_work *)__data;
 	struct cpu_workqueue_struct *cwq = get_work_cwq(&dwork->work);
 
 -	__queue_work(smp_processor_id(), cwq->wq, &dwork->work);
 +	__queue_work(wq_select_cpu(), cwq->wq, &dwork->work);
  } 

Shouldn't timer migration have sorted this one?


Re: [PATCH 1/1] perf, Add support for Xeon-Phi PMU

2012-09-25 Thread Peter Zijlstra
On Thu, 2012-09-20 at 13:03 -0400, Vince Weaver wrote:
 One additional complication:  some of the cache events map to 
 event 0.  This causes problems because the generic events code
 assumes 0 means not-available.  I'm not sure the best way to address
 that problem. 

For all except P4 we could remap the 0 value to -2, that has all high
bits set (like the -1) which aren't used by hardware.

P4 is stuffing two registers in the 64bit config space and actually has
them all in use I think.. Cyrill?


Re: [PATCH 3/3] workqueue: Schedule work on non-idle cpu instead of current one

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 17:00 +0530, Viresh Kumar wrote:
 But this is what the initial idea during LPC we had.

Yeah.. that's true.

  Any improvements here you can suggest? 

We could uhm... /me tries thinking ... reuse some of the NOHZ magic?
Would that be sufficient, not waking a NOHZ cpu, or do you really want
not waking any idle cpu?




Re: [PATCH 3/3] workqueue: Schedule work on non-idle cpu instead of current one

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 13:40 +0200, Peter Zijlstra wrote:
 On Tue, 2012-09-25 at 17:00 +0530, Viresh Kumar wrote:
  But this is what the initial idea during LPC we had.
 
 Yeah.. that's true.
 
   Any improvements here you can suggest? 
 
 We could uhm... /me tries thinking ... reuse some of the NOHZ magic?
 Would that be sufficient, not waking a NOHZ cpu, or do you really want
 not waking any idle cpu?

Depending on the trade-off we could have the NOHZ stuff track a
non-NOHZ-idle cpu and avoid having to compute one every time we need it.
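
A rough sketch of that idea; every name here is made up for illustration
(no such hooks exist as such). The point is just that the NOHZ enter/exit
paths already run at the right moments to maintain a cached non-idle CPU,
turning wq_select_cpu() into a lookup instead of a scan:

static atomic_t nohz_nonidle_cpu = ATOMIC_INIT(0);	/* hypothetical */

static void nohz_enter_idle_hook(int cpu)		/* hypothetical hook */
{
	/* our cached choice just went idle; pick any active CPU instead */
	if (atomic_read(&nohz_nonidle_cpu) == cpu)
		atomic_set(&nohz_nonidle_cpu, cpumask_any(cpu_active_mask));
}

static int wq_select_cpu(void)
{
	int cpu = atomic_read(&nohz_nonidle_cpu);

	return cpu_online(cpu) ? cpu : raw_smp_processor_id();
}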






Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:


 Let's have an example here.
 
 sched_init_numa()
 {
   ...
   // A loop that sets sched_domains_numa_levels to level.	(1)
 
   // I set sched_domains_numa_levels to 0.
   sched_domains_numa_levels = 0;				(2)

Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
thought it was at the start of the function.

   // A loop allocating memory for sched_domains_numa_masks[][]
   for (i = 0; i < level; i++) {
   	..
   	// Allocate memory for sched_domains_numa_masks[i]	(3)
   	..
   }
   ..
 
   // I reset sched_domains_numa_levels to level.
   sched_domains_numa_levels = level;				(4)
 }

Yes this makes sense, it might be best to have this as a separate patch.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Peter Zijlstra
On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
 In the not-so-distant past, we had the intel Dunnington Xeon, which
 was iirc basically three Core 2 duo's bolted together (ie three
 clusters of two cores sharing L2, and a fully shared L3). So that was
 a true multi-core with fairly big shared L2, and it really would be
 sad to not use the second core aggressively. 

Ah indeed. My Core2Quad didn't have an L3 afaik (it's sitting around
without a PSU atm so checking gets a little hard) so the LLC level was
the L2 and all worked out right (it also not having SMT helped of
course).

But if there was a Xeon chip that did add a package L3 then yes, all
this would become more interesting still. We'd need to extend the
scheduler topology a bit as well, I don't think it can currently handle
this well.

So I guess we get to do some work for steamroller.


Re: [PATCH 1/1] perf, Add support for Xeon-Phi PMU

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 15:42 +0400, Cyrill Gorcunov wrote:

 Guys, letme re-read this whole mail thread first since I have no clue
 what this remapping about ;)

x86_setup_perfctr() / set_ext_hw_attr() treat the 0 and -1 config values
as special, meaning -ENOENT and -EINVAL respectively.

This means neither config value can be a 'real' event. Now it turns out
Xeon-Phi has an actual event 0, which is masked by these special case
thingies.
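
A hedged sketch of that remap (the sentinel names are invented here): the
cache-event tables store a real event code of 0 as -2, which like -1 has
high bits no hardware event uses, and translate it back before use, so 0
can keep meaning -ENOENT and -1 -EINVAL.

#define C_EVENT_NONE	(0ULL)		/* table: no event  -> -ENOENT */
#define C_EVENT_INVAL	(-1ULL)		/* table: invalid   -> -EINVAL */
#define C_EVENT_ZERO	(-2ULL)		/* a real event code of 0      */

static u64 decode_cache_event(u64 v)
{
	return v == C_EVENT_ZERO ? 0 : v;
}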


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 14:23 +0100, Mel Gorman wrote:
 It crashes on boot due to the fact that you created a function-scope variable
 called sd_llc in select_idle_sibling() and shadowed the actual sd_llc you
 were interested in. 

D'0h!


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Peter Zijlstra
On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
 Wouldn't a clean solution be to promote a task's scheduler
 class to the spinner class when we PLE (or come from some special
 syscall
 for userspace spinlocks?)? 

Userspace spinlocks are typically employed to avoid syscalls..

 That class would be higher priority than the
 fair class and would schedule in FIFO order, but it would only run its
 tasks for short periods before switching. 

Since lock hold times aren't limited, esp. for things like userspace
'spin' locks, you've got a very good denial of service / opportunity for
abuse right there.




Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Peter Zijlstra
On Wed, 2012-09-26 at 15:39 +0200, Andrew Jones wrote:
 On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
  On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
   Wouldn't a clean solution be to promote a task's scheduler
   class to the spinner class when we PLE (or come from some special
   syscall
   for userspace spinlocks?)? 
  
  Userspace spinlocks are typically employed to avoid syscalls..
 
 I'm guessing there could be a slow path - spin N times and then give
 up and yield.

Much better that they do a blocking futex call or so; once you do the
syscall you're in kernel space anyway and have paid the transition cost.

  
   That class would be higher priority than the
   fair class and would schedule in FIFO order, but it would only run its
   tasks for short periods before switching. 
  
  Since lock hold times aren't limited, esp. for things like userspace
  'spin' locks, you've got a very good denial of service / opportunity for
  abuse right there.
 
 Maybe add some throttling to avoid overuse/maliciousness?

At which point you're pretty much back to where you started.

A much better approach is using things like priority inheritance, which
can be extended to cover the fair class just fine..

Also note that user-space spinning is inherently prone to live-locks
when combined with the static priority RT scheduling classes.

In general it's a very bad idea..
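
For illustration, a minimal userspace spin-then-futex slow path of the
kind suggested above (Drepper-style, error handling omitted; purely a
sketch, not anyone's actual implementation):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* lock states: 0 = free, 1 = locked, 2 = locked with waiters */
static void lock_slow(atomic_int *l)
{
	for (int i = 0; i < 100; i++) {		/* bounded spin */
		int expected = 0;
		if (atomic_compare_exchange_strong(l, &expected, 1))
			return;
	}
	while (atomic_exchange(l, 2) != 0)	/* then block in the kernel */
		syscall(SYS_futex, l, FUTEX_WAIT, 2, NULL, NULL, 0);
}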


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
 
 For example, it starts with the maximum target scheduling domain, and
 works its way in over the scheduling groups within that domain. What
 the f*ck is the logic of that kind of crazy thing? It never makes
 sense to look at a biggest domain first. 

That's about SMT, it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain; if you want more we'll need to fix that. For
now it's a fixed:

 SMT
 MC (llc)
 CPU (package/machine-for-!numa)
 NUMA

So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.
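
For reference, a bottom-up walk over those levels looks roughly like this
in v3.6-era code, for a given cpu; for_each_domain() iterates from the
smallest domain outwards, and SD_SHARE_PKG_RESOURCES is set on everything
up to and including the LLC:

struct sched_domain *sd;

rcu_read_lock();
for_each_domain(cpu, sd) {		/* SMT, then MC, then CPU, ... */
	if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
		break;			/* past the LLC, stop looking */
	/* scan sd's groups for an idle CPU here */
}
rcu_read_unlock();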


Re: [tip:core/rcu] sched: Fix load avg vs cpu-hotplug

2012-09-27 Thread Peter Zijlstra
On Wed, 2012-09-26 at 22:12 -0700, tip-bot for Peter Zijlstra wrote:
 Commit-ID:  5d18023294abc22984886bd7185344e0c2be0daf
 Gitweb: http://git.kernel.org/tip/5d18023294abc22984886bd7185344e0c2be0daf
 Author: Peter Zijlstra pet...@infradead.org
 AuthorDate: Mon, 20 Aug 2012 11:26:57 +0200
 Committer:  Paul E. McKenney paul...@linux.vnet.ibm.com
 CommitDate: Sun, 23 Sep 2012 07:43:56 -0700
 
 sched: Fix load avg vs cpu-hotplug
 
 Rabik and Paul reported two different issues related to the same few
 lines of code.
 
 Rabik's issue is that the nr_uninterruptible migration code is wrong in
 that he sees artifacts due to this (Rabik please do expand in more
 detail).
 
 Paul's issue is that this code as it stands relies on us using
 stop_machine() for unplug, we all would like to remove this assumption
 so that eventually we can remove this stop_machine() usage altogether.
 
 The only reason we'd have to migrate nr_uninterruptible is so that we
 could use for_each_online_cpu() loops in favour of
 for_each_possible_cpu() loops, however since nr_uninterruptible() is the
 only such loop and it's using possible, let's not bother at all.
 
 The problem Rabik sees is (probably) caused by the fact that by
 migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
 involved.
 
 So don't bother with fancy migration schemes (meaning we now have to
 keep using for_each_possible_cpu()) and instead fold any nr_active delta
 after we migrate all tasks away to make sure we don't have any skewed
 nr_active accounting.
 
 [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
 miscounting noted by Rakib. ]
 
 Reported-by: Rakib Mullick rakib.mull...@gmail.com
 Reported-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
 Signed-off-by: Paul E. McKenney paul.mcken...@linaro.org
 ---
  kernel/sched/core.c |   41 -
  1 files changed, 20 insertions(+), 21 deletions(-)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index fbf1fd0..8c38b5e 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -5304,27 +5304,17 @@ void idle_task_exit(void)
  }
  
  /*
 - * While a dead CPU has no uninterruptible tasks queued at this point,
 - * it might still have a nonzero ->nr_uninterruptible counter, because
 - * for performance reasons the counter is not strictly tracking tasks to
 - * their home CPUs. So we just add the counter to another CPU's counter,
 - * to keep the global sum constant after CPU-down:
 - */
 -static void migrate_nr_uninterruptible(struct rq *rq_src)
 -{
 -	struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
 -
 -	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
 -	rq_src->nr_uninterruptible = 0;
 -}
 -
 -/*
 - * remove the tasks which were accounted by rq from calc_load_tasks.
 + * Since this CPU is going 'away' for a while, fold any nr_active delta
 + * we might have. Assumes we're called after migrate_tasks() so that the
 + * nr_active count is stable.
 + *
 + * Also see the comment "Global load-average calculations".
   */
 -static void calc_global_load_remove(struct rq *rq)
 +static void calc_load_migrate(struct rq *rq)
  {
 -	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
 -	rq->calc_load_active = 0;
 +	long delta = calc_load_fold_active(rq);
 +	if (delta)
 +		atomic_long_add(delta, &calc_load_tasks);
  }
  
  /*
 @@ -5617,9 +5607,18 @@ migration_call(struct notifier_block *nfb, unsigned 
 long action, void *hcpu)
   migrate_tasks(cpu);
   BUG_ON(rq-nr_running != 1); /* the migration thread */
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 + break;
  
 - migrate_nr_uninterruptible(rq);
 - calc_global_load_remove(rq);
 + case CPU_DEAD:
 + {
 + struct rq *dest_rq;
 +
 + local_irq_save(flags);
 + dest_rq = cpu_rq(smp_processor_id());
 +	raw_spin_lock(&dest_rq->lock);
 +	calc_load_migrate(rq);
 +	raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
 + }
   break;
  #endif
   }


Huh, what is this patch doing??! Didn't we merge my version of this? 


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
 I think you are being too smart for your own good. You don't know if it's 
 best to move them further apart or not. 

Well yes and no.. You're right, however in general the load-balancer has
always tried to not use (SMT) siblings whenever possible, in that regard
not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we
have, the 'big' periodic load-balancer will 'never' see them, making the
multiple moves argument hard.

Measuring resource contention on the various levels is a fun research
subject, I've spoken to various people who are/were doing so, I've
always encouraged them to send their code just so we can see/learn, even
if not integrate, sadly I can't remember ever having seen any of it :/

And yeah, all the load-balancing stuff is very near to scrying or
tealeaf reading. We can't know all current state (too expensive) nor can
we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 10:45 -0700, da...@lang.hm wrote:
 But I thought that this conversation (pgbench) was dealing with long 
 running processes, 

Ah, I think we've got a confusion on long vs short.. yes pgbench is a
long-running process, however the tasks might not be long in runnable
state. Ie it receives a request, computes a bit, blocks on IO, computes
a bit, replies, goes idle to wait for a new request.

If all those runnable sections are short enough, it will 'never' be
around when the periodic load-balancer does its thing, since that only
looks at the tasks in runnable state at that moment in time.

I say 'never' because while it will occasionally show up due to pure
chance, it will unlikely be a very big player in placement.

Once a cpu is overloaded enough to get real queueing they'll show up,
get dispersed and then it's back to wakeup stuff.

Then again, it might be completely irrelevant to pgbench, it's been a
while since I looked at how it schedules.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 11:19 -0700, Linus Torvalds wrote:
 On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov b...@alien8.de wrote:
  On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
  Or could we just improve the heuristics. What happens if the
  scheduling granularity is increased, for example? It's set to 1ms
  right now, with a logarithmic scaling by number of cpus.
 
  /proc/sys/kernel/sched_wakeup_granularity_ns=1000 (10ms)
  --
  tps = 4994.730809 (including connections establishing)
  tps = 5000.260764 (excluding connections establishing)
 
  A bit better over the default NO_WAKEUP_PREEMPTION setting.
 
 Ok, so this gives us something possible to actually play with.
 
 For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
 than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

Don't forget to run the desktop interactivity benchmarks after you're
done wriggling with this knob... wakeup preemption is important for most
of those.




Re: [PATCHv4] perf x86_64: Fix rsp register for system call fast path

2012-10-03 Thread Peter Zijlstra
On Wed, 2012-10-03 at 15:13 +0200, Jiri Olsa wrote:
 @@ -1190,8 +1191,8 @@ static inline void perf_sample_data_init(struct
 perf_sample_data *data,
 	data->raw  = NULL;
 	data->br_stack = NULL;
 	data->period = period;
 -	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 -	data->regs_user.regs = NULL;
 +	/* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */
 +	memset(&data->regs_user, 0, sizeof(data->regs_user));
 	data->stack_user_size = 0;
  } 

Hmm, this will slow down all events, regardless of whether they use any
of that stuff or not. Since the one user actually does something like:

  data->regs_user = *pt_regs;

except it does a memcpy() for some obscure reason, it really doesn't
matter what is in there when uninitialized, right?
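
For comparison, the pre-patch initializer (reconstructed from the hunk
above; the addr assignment is an assumption about the surrounding code)
touches only the fields that need defined values, avoiding the full
memset() on every event:

static inline void perf_sample_data_init(struct perf_sample_data *data,
					 u64 addr, u64 period)
{
	data->addr	= addr;
	data->raw	= NULL;
	data->br_stack	= NULL;
	data->period	= period;
	/* only abi/regs need defined values; the rest is written on use */
	data->regs_user.abi  = PERF_SAMPLE_REGS_ABI_NONE;
	data->regs_user.regs = NULL;
	data->stack_user_size = 0;
}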


Re: [PATCH 7/10] compiler{,-gcc4}.h: Introduce __flatten function attribute

2012-10-03 Thread Peter Zijlstra
On Wed, 2012-10-03 at 11:14 -0400, Steven Rostedt wrote:
 
 Yep. I personally never use the get_maintainers script. I first check
 the MAINTAINERS file. If the subsystem I'm working on exists there, I
 only email those that are listed there, including any mailing lists that
 are mentioned (as well as LKML). If it's not listed, I then do a git log
 and see who does the most sign offs to changes there, and to what kind
 of changes. I usually ignore the trivial stuff. 

I also tend to suggest doing git-blame to see who touched the code being
changed last.

As a maintainer I frequently get to fwd/bounce patches because of
missing CCs like that.


Re: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

2012-10-04 Thread Peter Zijlstra
On Thu, 2012-10-04 at 01:05 +0200, Andrea Righi wrote:
 +++ b/kernel/sched/core.c
 @@ -727,15 +727,17 @@ static void dequeue_task(struct rq *rq, struct 
 task_struct *p, int flags)
  void activate_task(struct rq *rq, struct task_struct *p, int flags)
  {
 if (task_contributes_to_load(p))
 -	rq->nr_uninterruptible--;
 +	cpu_rq(p->on_cpu_uninterruptible)->nr_uninterruptible--;
  
 enqueue_task(rq, p, flags);
  }

That's completely broken, you cannot do non-atomic cross-cpu
modifications like that. Also, adding an atomic op to the wakeup/sleep
paths isn't going to be popular at all.

  void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
  {
 -	if (task_contributes_to_load(p))
 -		rq->nr_uninterruptible++;
 +	if (task_contributes_to_load(p)) {
 +		task_rq(p)->nr_uninterruptible++;
 +		p->on_cpu_uninterruptible = task_cpu(p);
 +   }
  
 dequeue_task(rq, p, flags);
  } 

This looks pointless, at deactivate time task_rq() had better be rq or
something is terribly broken.
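
For reference, making the cross-CPU decrement in the first hunk safe
would take an atomic op; assuming a hypothetical conversion of
rq->nr_uninterruptible to atomic_long_t, the activate side would read:

	/* hypothetical: only valid if nr_uninterruptible were atomic_long_t */
	if (task_contributes_to_load(p))
		atomic_long_dec(&cpu_rq(p->on_cpu_uninterruptible)->nr_uninterruptible);

which is exactly the extra atomic in the wakeup/sleep paths objected to
above.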


uncore doing kfree() from CPU_{STARTING,DYING}

2012-10-04 Thread Peter Zijlstra
Hi Zheng,

Can you have a look at 7fdba1ca and do a similar patch for uncore?
Thomas is sad because uncore breaks -rt, the problem (as described in
that patch) is that kfree() must schedule on -rt while
CPU_{STARTING,DYING} really are atomic.

~ Peter


---
commit 7fdba1ca10462f42ad2246b918fe6368f5ce488e
Author: Peter Zijlstra a.p.zijls...@chello.nl
Date:   Fri Jul 22 13:41:54 2011 +0200

perf, x86: Avoid kfree() in CPU_STARTING

On -rt kfree() can schedule, but CPU_STARTING is before the CPU is
fully up and running. These are contradictory, so avoid it. Instead
push the kfree() to CPU_ONLINE where we're free to schedule.

Reported-by: Thomas Gleixner t...@linutronix.de
Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Link: http://lkml.kernel.org/n/tip-kwd4j6ayld5thrscvaxgj...@git.kernel.org
Signed-off-by: Ingo Molnar mi...@elte.hu

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4ee3abf..594d425 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -129,6 +129,8 @@ struct cpu_hw_events {
 * AMD specific bits
 */
struct amd_nb   *amd_nb;
+
+   void*kfree_on_online;
 };
 
 #define __EVENT_CONSTRAINT(c, n, m, w) {\
@@ -1466,10 +1468,12 @@ static int __cpuinit
 x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
 {
unsigned int cpu = (long)hcpu;
+   struct cpu_hw_events *cpuc = per_cpu(cpu_hw_events, cpu);
int ret = NOTIFY_OK;
 
	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:
+		cpuc->kfree_on_online = NULL;
if (x86_pmu.cpu_prepare)
ret = x86_pmu.cpu_prepare(cpu);
break;
@@ -1479,6 +1483,10 @@ x86_pmu_notifier(struct notifier_block *self, unsigned 
long action, void *hcpu)
x86_pmu.cpu_starting(cpu);
break;
 
+   case CPU_ONLINE:
+		kfree(cpuc->kfree_on_online);
+   break;
+
case CPU_DYING:
if (x86_pmu.cpu_dying)
x86_pmu.cpu_dying(cpu);
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
b/arch/x86/kernel/cpu/perf_event_amd.c
index 941caa2..ee9436c 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -350,7 +350,7 @@ static void amd_pmu_cpu_starting(int cpu)
continue;
 
		if (nb->nb_id == nb_id) {
-			kfree(cpuc->amd_nb);
+			cpuc->kfree_on_online = cpuc->amd_nb;
			cpuc->amd_nb = nb;
break;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index f88af2c..3751494 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1362,7 +1362,7 @@ static void intel_pmu_cpu_starting(int cpu)
 
pc = per_cpu(cpu_hw_events, i).shared_regs;
		if (pc && pc->core_id == core_id) {
-			kfree(cpuc->shared_regs);
+			cpuc->kfree_on_online = cpuc->shared_regs;
			cpuc->shared_regs = pc;
break;
}



Re: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

2012-10-04 Thread Peter Zijlstra
On Thu, 2012-10-04 at 11:43 +0200, Andrea Righi wrote:
 
 Right, the update must be atomic to have a coherent nr_uninterruptible
 value. And AFAICS the only way to account a coherent
 nr_uninterruptible
 value per-cpu is to go with atomic ops... mmh... I'll think more on
 this. 

You could stick it in the cpu controller instead of cpuset, add a
per-cpu nr_uninterruptible counter to struct task_group and update it
from the enqueue/dequeue paths. Those already are per-cgroup (through
cfs_rq, which has a tg pointer).

That would also give you better semantics since it would really be the
load of the tasks of the cgroup, not whatever happened to run on a
particular cpu regardless of groups. Then again, it might be 'fun' to
get the hierarchical semantics right :-)

OTOH it would also make calculating the load-avg O(nr_cgroups) and since
we do this from the tick and people are known to create a shitload (on
the order of 1e3 and upwards) of those this might not actually be a very
good idea.

Also, your patch 2 relies on the load avg function to be additive yet
you completely fail to mention this and state whether this is so or
not.

Furthermore, please look at PER_CPU() and friends as alternatives to
[NR_CPUS] arrays.
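
A sketch of that suggestion using the dynamic per-cpu API instead of a
[NR_CPUS] array; the field and helper below are illustrative, not
existing kernel code:

struct task_group {
	/* ... existing fields ... */
	long __percpu *nr_uninterruptible;	/* from alloc_percpu(long) */
};

/* called from the enqueue/dequeue paths, rq lock held */
static inline void tg_nr_uninterruptible_add(struct task_group *tg, long d)
{
	__this_cpu_add(*tg->nr_uninterruptible, d);
}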


Re: [PATCH v2 2/2] Update sched_domains_numa_masks when new cpus are onlined.

2012-10-04 Thread Peter Zijlstra
On Tue, 2012-09-25 at 21:12 +0800, Tang Chen wrote:
 +static int sched_domains_numa_masks_update(struct notifier_block
 *nfb,
 +  unsigned long action,
 +  void *hcpu)
 +{
 +   int cpu = (int)hcpu; 

kernel/sched/core.c: In function ‘sched_domains_numa_masks_update’:
kernel/sched/core.c:6299:12: warning: cast from pointer to integer of different 
size [-Wpointer-to-int-cast]

I've changed it to:

int cpu = (long)hcpu;





Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Peter Zijlstra
On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
 Again the numbers are ridiculously high for arch_local_irq_restore.
 Maybe there's a bad perf/kvm interaction when we're injecting an
 interrupt, I can't believe we're spending 84% of the time running the
 popf instruction. 

Smells like a software fallback that doesn't do NMI, hrtimer based
sampling typically hits popf where we re-enable interrupts.


Re: Seems like "sched: Add missing call to calc_load_exit_idle()" should be reverted in 3.5 branch

2012-10-04 Thread Peter Zijlstra
On Thu, 2012-10-04 at 10:46 -0700, Greg Kroah-Hartman wrote:
 On Thu, Oct 04, 2012 at 12:11:01PM +0800, Huacai Chen wrote:
  Hi, Greg
  
  I found that Linux-3.5.5 accepted this commit "sched: Add missing call
  to calc_load_exit_idle()" but I think this isn't needed. Because
  "5167e8d5417b (sched/nohz: Rewrite and fix load-avg computation --
  again) not fully applied" is true for the 3.6 branch, but not for the
  3.5 branch.
 
 But 5167e8d5417b is in 3.5, so shouldn't this commit still be necessary?
 
  In the 3.5 branch, calc_load_exit_idle() is already called in
  tick_nohz_idle_exit(); it doesn't need to be called from
  tick_nohz_update_jiffies() again. In the 3.6 branch, some code of
  tick_nohz_idle_exit() was split out into tick_nohz_restart_sched_tick()
  and calc_load_exit_idle() went missing by accident, so commit "sched:
  Add missing call to calc_load_exit_idle()" is needed.
 
 So this really should be dropped from 3.5?  Charles, Peter, Ingo, any
 thoughts here?

Bah, lots of code movement there recently.. let me try and untangle all
that afresh.. /me checks out v3.5.5.

OK, assuming ->tick_stopped means what the label says, then we only want
to call calc_load_enter_idle() when it flips to 1 and
calc_load_exit_idle() when it flips back to 0, such that when an actual
tick happens it's got correct state.

Now the actual patch 5167e8d5417b ("sched/nohz: Rewrite and fix load-avg
computation -- again", the one not fully applied) modifies
tick_nohz_restart_sched_tick(), which doesn't appear to exist in v3.5.5,
and the patch fobbed it into tick_nohz_update_jiffies() which is called
from interrupt entry when nohz-idle so that the interrupt (and possible
trailing softirq) see a valid jiffies count.

However, since we don't restart the tick, we won't be sampling load muck
and calling calc_load_exit_idle() from there is bound to confuse state.

I hope.. damn this code ;-)

I can't find wtf went wrong either, the initial patch 5167e8d5417bf5c
contains both hunks, but in that case the fixup 749c8814f0 doesn't make
sense, nor can I find anything in merge commits using:

  git log -S calc_load_exit_idle kernel/time/tick-sched.c

/me puzzled


Re: Seems like "sched: Add missing call to calc_load_exit_idle()" should be reverted in 3.5 branch

2012-10-05 Thread Peter Zijlstra
On Thu, 2012-10-04 at 15:27 -0700, Greg Kroah-Hartman wrote:
 I'm puzzled as well.  Any ideas if I should do anything here or not?

So I think the current v3.5.5 code is fine. I'm just not smart enough to
figure out how 3.6 got fuzzed, this git thing is confusing as hell.


Re: [RFC] perf: perf_event_attr anon unions and static initializer issue

2012-10-05 Thread Peter Zijlstra
On Fri, 2012-10-05 at 12:36 +0200, Stephane Eranian wrote:

 struct perf_event_attr attr = { .config = 0x1234, .config1 = 0x456 };

 Does anyone have a better solution to propose?


  struct perf_event_attr attr = { 
.config = 0x1234, 
{ .config1 = 0x5678 },
  };

sometimes works, but apparently not in this case.. it's a bit unfortunate
indeed. EDG based compilers and the latest C std use your version --
hence also 4.6 supporting it.

Yeah, I'm afraid we're stuck with this until a future where modern C
isn't modern anymore.


Re: Seems like "sched: Add missing call to calc_load_exit_idle()" should be reverted in 3.5 branch

2012-10-05 Thread Peter Zijlstra
On Fri, 2012-10-05 at 10:10 -0700, Jonathan Nieder wrote:
 Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 15:27 -0700, Greg Kroah-Hartman wrote:
 
  I'm puzzled as well.  Any ideas if I should do anything here or not?
 
  So I think the current v3.5.5 code is fine.
 
 Now I'm puzzled.  You wrote:
 
 | However, since we don't restart the tick, we won't be sampling load muck
 | and calling calc_load_exit_idle() from there is bound to confuse state.
 
 Doesn't that mean 900404e5d201 "sched: Add missing call to
 calc_load_exit_idle()" which is part of 3.5.5 was problematic?  Or
 did I just miscount the number of nots?


Argh, yeah, so now I've managed to confuse everyone I'm afraid.

You are right, v3.5.5 has one calc_load_exit_idle() too many, the one in
tick_nohz_update_jiffies() needs to go.

Sorry.. I got entirely confused figuring out wth happened with 3.6.



Re: sched: per-entity load-tracking

2012-10-08 Thread Peter Zijlstra
On Sat, 2012-10-06 at 09:39 +0200, Ingo Molnar wrote:

Thanks Ingo! Paul,

  tip/kernel/sched/fair.c |   28 ++--
  1 file changed, 18 insertions(+), 10 deletions(-)
 
 Index: tip/kernel/sched/fair.c
 ===
 --- tip.orig/kernel/sched/fair.c
 +++ tip/kernel/sched/fair.c
  
 @@ -1151,7 +1156,7 @@ static inline void update_cfs_shares(str
   * Note: The tables below are dependent on this value.
   */
  #define LOAD_AVG_PERIOD 32
 -#define LOAD_AVG_MAX 47765 /* maximum possible load avg */
 +#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
  #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG 
 */
  
  /* Precomputed fixed inverse multiplies for multiplication by y^n */
 @@ -1203,7 +1208,8 @@ static __always_inline u64 decay_load(u6
   }
  
   val *= runnable_avg_yN_inv[local_n];
 - return SRR(val, 32);
 + /* We don't use SRR here since we always want to round down. */
 +	return val >> 32;
  }
  
  /*

This is from patches/sched-fast_decay.patch and I changed both the
patch and the changelog to reflect your changes.


@@ -262,6 +262,9 @@ static inline struct cfs_rq *group_cfs_r
   return grp->my_q;
  }
  
 +static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
 +int force_update);
 +
  static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
  {
   if (!cfs_rq->on_list) {
 @@ -281,6 +284,8 @@ static inline void list_add_leaf_cfs_rq(
   }
  
   cfs_rq->on_list = 1;
 + /* We should have no load, but we need to update last_decay. */
 + update_cfs_rq_blocked_load(cfs_rq, 0);
   }
  }


This seems to have come from patches/sched-cfs_rq_blocked_load.patch, and I
merged it in there, with subsequent updates in
patches/sched-wakeup_load.patch. I didn't find any changes to the
changelogs for either.


 @@ -4236,13 +4242,15 @@ static void __update_blocked_averages_cp
   if (se) {
   update_entity_load_avg(se, 1);
   /*
 -  * We can pivot on the runnable average decaying to zero for
 -	 * list removal since the parent average will always be >=
 -	 * child.
 +  * We pivot on our runnable average having decayed to zero for
 +  * list removal.  This generally implies that all our children
 +  * have also been removed (modulo rounding error or bandwidth
 +  * control); however, such cases are rare and we can fix these
 +  * at enqueue.
 +  *
 +  * TODO: fix up out-of-order children on enqueue.
*/
 -	if (se->avg.runnable_avg_sum)
 -		update_cfs_shares(cfs_rq);
 -	else
 +	if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
   list_del_leaf_cfs_rq(cfs_rq);
   } else {
   struct rq *rq = rq_of(cfs_rq);


This hunk seems to have come from
patches/sched-swap_update_shares.patch, merged in there, no Changelog
changes afaict.


I've updated my queue (at the usual location:
http://programming.kicks-ass.net/sekrit/patches.tar.bz2), Paul please
verify I didn't miss anything.




Re: [PATCH] Do not use cpu_to_node() to find an offlined cpu's node.

2012-10-19 Thread Peter Zijlstra
On Wed, 2012-10-17 at 20:29 -0700, David Rientjes wrote:
 
 Ok, thanks for the update.  I agree that we should be clearing the mapping 
 at node hot-remove since any cpu that would subsequently get onlined and 
 assume one of the previous cpu's ids is not guaranteed to have the same 
 affinity.

Would this mean we have to remap (and memcpy) per-cpu memory on
node-plug?

 I'm just really hoping that we don't touch the acpi code and that we can 
 remove both cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch 
 and cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch from 
 -mm.

Yeah, none of this should be anywhere near ACPI, it's got nothing to do
with ACPI. Furthermore it should be the same across all archs, not just
weird and wonderful for x86.




Re: [PATCH 2/2] rename NUMA fault handling functions

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 17:20 -0400, Rik van Riel wrote:
 Having the function name indicate what the function is used
 for makes the code a little easier to read.  Furthermore,
 the fault handling code largely consists of do_*_page
 functions. 

I don't much care either way, but I was thinking walken might want to
use something similar to do WSS estimation, in which case the NUMA name
is just as wrong.




Re: [PATCH 1/2] add credits for NUMA placement

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 17:19 -0400, Rik van Riel wrote:
 The NUMA placement code has been rewritten several times, but
 the basic ideas took a lot of work to develop. The people who
 put in the work deserve credit for it. Thanks Andrea  Peter :)
 
 The Documentation/scheduler/numa-problem.txt file should
 probably be rewritten once we figure out the final details of
 what the NUMA code needs to do, and why.
 
 Signed-off-by: Rik van Riel r...@redhat.com

Acked-by: Peter Zijlstra a.p.zijls...@chello.nl

Thanks Rik!


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
 
 On Thu, 18 Oct 2012, Oleg Nesterov wrote:
 
  Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
  something similar. Certainly it was not in my tree when I started
  this patch... percpu_down_write() doesn't allow multiple writers,
 but the main problem is that it uses msleep(1). It should not, I think.
 
 synchronize_rcu() can sleep for hundred milliseconds, so msleep(1) is not 
 a big problem.

That code is beyond ugly though.. it should really not have been merged.

There's absolutely no reason for it to use RCU except to make it more
complicated. And as Oleg pointed out that msleep() is very ill
considered.

The very worst part of it seems to be that nobody who's usually involved
with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo,
tglx, dhowells and me). It doesn't even have lockdep annotations :/

So the only reason you appear to use RCU is because you don't actually
have a sane way to wait for count==0. And I'm contesting rcu_sync() is
sane here -- for the very simple reason you still need a while (count)
loop right after it.

So it appears you want a totally reader-biased, sleepable rw-lock-like
thing?

So did you consider keeping the inc/dec on the same per-cpu variable?
Yes this adds a potential remote access to dec and requires you to use
atomics, but I would not be surprised if the inc/dec were mostly on the
same cpu most of the times -- which might be plenty fast for what you
want.

If you've got coherent per-cpu counts, you can better do the
waitqueue/wake condition for write_down.

It might also make sense to do away with the mutex, there's no point in
serializing the wakeups in the p->locked case of down_read. Furthermore,
p->locked seems a complete duplicate of the mutex state, so removing the
mutex also removes that duplication.

Also, that CONFIG_x86 thing.. *shudder*...
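
A sketch of the same-per-cpu-variable idea (hypothetical type and names):
the reader remembers which CPU's counter it incremented and decrements
that very counter later, possibly from another CPU, hence the atomics.
Each counter stays individually coherent, so a writer can wait for them
all to reach zero without any RCU machinery:

struct brw_sem {
	atomic_t __percpu	*cnt;
	wait_queue_head_t	 writer;
};

static int brw_down_read(struct brw_sem *b)
{
	int cpu = raw_smp_processor_id();

	atomic_inc(per_cpu_ptr(b->cnt, cpu));
	return cpu;			/* caller hands this to up_read */
}

static void brw_up_read(struct brw_sem *b, int cpu)
{
	/* possibly a remote CPU's counter by now, hence the atomic */
	if (atomic_dec_and_test(per_cpu_ptr(b->cnt, cpu)))
		wake_up(&b->writer);	/* writer waits for all-zero */
}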


Re: MAX_LOCKDEP_ENTRIES too low (called from ioc_release_fn)

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 01:21 -0400, Dave Jones wrote:
   Not sure why you are CC'ing a call site, rather than the maintainers of
   the code. Just looks like lockdep is using too small a static value.
   Though it is pretty darn large...
 
 You're right, it's a huge chunk of memory.
 It looks like I can trigger this from multiple callsites..
 Another different trace below.
 
 Not sure why this suddenly got a lot worse in 3.7 

Did we add a static array of structures with locks in somewhere? Doing
that is a great way of blowing up the number of lock classes and the
resulting amount of lock dependency chains.

From Documentation/lockdep-design.txt; it talks about overflowing
MAX_LOCKDEP_KEYS, but I suppose it's a good start for overflowing the
dependency entries too, more classes means more dependencies after all.

---
Of course, if you do run out of lock classes, the next thing to do is
to find the offending lock classes.  First, the following command gives
you the number of lock classes currently in use along with the maximum:

grep "lock-classes" /proc/lockdep_stats

This command produces the following output on a modest system:

 lock-classes:  748 [max: 8191]

If the number allocated (748 above) increases continually over time,
then there is likely a leak.  The following command can be used to
identify the leaking lock classes:

grep "BD" /proc/lockdep

Run the command and save the output, then compare against the output from
a later run of this command to identify the leakers.  This same output
can also help you find situations where runtime lock initialization has
been omitted.



Re: [PATCH] sched, autogroup: fix kernel crashes caused by runtime disable autogroup

2012-10-19 Thread Peter Zijlstra
Always try and CC people who wrote the code..

On Fri, 2012-10-19 at 16:36 +0800, Xiaotian Feng wrote:
 There's a regression from commit 800d4d30, in autogroup_move_group()
 
   p->signal->autogroup = autogroup_kref_get(ag);
 
   if (!ACCESS_ONCE(sysctl_sched_autogroup_enabled))
   goto out;
   ...
 out:
   autogroup_kref_put(prev);
 
 So the kernel changed p's autogroup to ag, but never called sched_move_task(p).
 Then the previous autogroup of p is released, which may release the task_group
 related with p. After commit 8323f26ce, p->sched_task_group might point
 to this stale value, and thus cause kernel crashes.
 
 This is very easy to reproduce: add kernel.sched_autogroup_enabled = 0
 to your /etc/sysctl.conf and your system will never boot up. It is not reasonable
 to put the sysctl enabled check in autogroup_move_group(); the kernel should check
 it before autogroup_create() in sched_autogroup_create_attach().
 
 Reported-by: cwillu cwi...@cwillu.com
 Reported-by: Luis Henriques luis.henriq...@canonical.com
 Signed-off-by: Xiaotian Feng dannyf...@tencent.com
 Cc: Ingo Molnar mi...@redhat.com
 Cc: Peter Zijlstra pet...@infradead.org
 ---
  kernel/sched/auto_group.c |   10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)
 
 diff --git a/kernel/sched/auto_group.c b/kernel/sched/auto_group.c
 index 0984a21..ac62415 100644
 --- a/kernel/sched/auto_group.c
 +++ b/kernel/sched/auto_group.c
 @@ -143,15 +143,11 @@ autogroup_move_group(struct task_struct *p, struct 
 autogroup *ag)
  
 	p->signal->autogroup = autogroup_kref_get(ag);
  
 - if (!ACCESS_ONCE(sysctl_sched_autogroup_enabled))
 - goto out;
 -
   t = p;
   do {
   sched_move_task(t);
   } while_each_thread(p, t);
  
 -out:
 	unlock_task_sighand(p, &flags);
   autogroup_kref_put(prev);
  }

So I've looked at this for all of 1 minute, but why isn't moving that
check up one line to be above the p-signal-autogroup assignment
enough?

 @@ -159,8 +155,12 @@ out:
  /* Allocates GFP_KERNEL, cannot be called under any spinlock */
  void sched_autogroup_create_attach(struct task_struct *p)
  {
 - struct autogroup *ag = autogroup_create();
 + struct autogroup *ag;
 +
 + if (!ACCESS_ONCE(sysctl_sched_autogroup_enabled))
 + return;
  
 + ag = autogroup_create();
   autogroup_move_group(p, ag);
   /* drop extra reference added by autogroup_create() */
   autogroup_kref_put(ag);

Man,.. so on memory allocation failure we'll put the group in
autogroup_default, which I think ends up being the root cgroup.

But what happens when sysctl_sched_autogroup_enabled is false?

It looks like sched_autogroup_fork() is effective in that case, which
would mean we'll stay in whatever group our parent is in, which is not
the same as being disabled.




Re: perf: p6 PMU working by accident, should we fix it and KNC?

2012-10-19 Thread Peter Zijlstra
On Wed, 2012-10-17 at 11:35 -0400, Vince Weaver wrote:
 
 This is by accident; it looks like the code does 
val |= ARCH_PERFMON_EVENTSEL_ENABLE;
 in p6_pmu_disable_event() so that events are never truly disabled
 (is this a bug?  should it be &= ~ instead?).  

I think that's on purpose.. from what I can remember p6 only has a
single EN bit (on PMC0) that acts for both counters. So what I did was
treat that as a global enable/disable (which it is) and did the local
enable/disable by using the NOP events.

There really might be bugs in there, it's not like I use this class of
hardware very frequently (nor does anybody much, it seems).
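
Roughly what the p6 disable path does, reconstructed from memory (hedged;
see arch/x86/kernel/cpu/perf_event_p6.c for the real thing): the event
selector is rewritten to a NOP event, and the shared EN bit is left in
whatever state the global enable put it:

static void p6_pmu_disable_event(struct perf_event *event)
{
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
	struct hw_perf_event *hwc = &event->hw;
	u64 val = P6_NOP_EVENT;		/* count a do-nothing event */

	if (cpuc->enabled)		/* keep the shared EN bit set */
		val |= ARCH_PERFMON_EVENTSEL_ENABLE;

	(void)wrmsrl_safe(hwc->config_base, val);
}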




Re: [PATCH RFC] sched: boost throttled entities on wakeups

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 11:32 +0400, Vladimir Davydov wrote:
 
 1) Do you agree that the problem exists and should be sorted out?

This is two questions.. yes it exists, I'm absolutely sure I pointed it
out as soon as people even started talking about this nonsense (bw
cruft).

Should it be sorted, dunno, in general !PREEMPT_RT is very susceptible
to all this and in general we don't fix it.

 2) If so, does the general approach proposed (unthrottling on wakeups) suit
 you? Why or why not?

it's a quick hack similar to existing hacks done for rt, preferably we'd
do smarter things though.

 3) If you think that the approach proposed is sane, what do you dislike about
 the patch? 

it's not inlined, it's got coding style issues, but worst of all, you
added yet another callback from the schedule() path and did it wrong ;-)

Also, it adds even more bw cruft overhead to regular scheduling paths,
we took some pains to limit that when we introduced the fail^Wfeature.


Re: [PATCH RT] slab: Fix up stable merge of slab init_lock_keys()

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 09:40 -0400, Steven Rostedt wrote:
 Peter,
 
 There was a little conflict with my merge of 3.4.14 due to the backport
 of this patch:
 
 commit 947ca1856a7e60aa6d20536785e6a42dff25aa6e
 Author: Michael Wang wang...@linux.vnet.ibm.com
 Date:   Wed Sep 5 10:33:18 2012 +0800
 
 slab: fix the DEADLOCK issue on l3 alien lock
 
 Could you just confirm that my fix is correct.

We forever keep getting that issue wrong it seems.. so I'd not put too
much trust into whatever my sleep-addled brain thinks today :/

Looks about right, but who knows ;-)


Re: [tip:numa/core] sched/numa/mm: Improve migration

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 09:51 -0400, Johannes Weiner wrote:
 Of course I'm banging my head into a wall for not seeing earlier
 through the existing migration path how easy this could be.  

There's a reason I keep promoting the idea of 'someone' rewriting all
that page-migration code :-) I forever get lost in there.

Also note that the proposed code will do 'wasted' work in case the THP
page gets split from under us; given that splits are relatively rare
(and if they're not, we should make them so) this didn't seem a problem.

Also, this code very much relies on our PROT_NONE marking; it avoids the
whole migration-PTE dance usually done. Further, the assumption that THP
pages are anonymous-only did help keep it simpler -- if someone 'fixes'
that, this needs more TLC.




Re: [tip:numa/core] sched/numa/mm: Improve migration

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 09:51 -0400, Johannes Weiner wrote:
 Right now, unlike the traditional migration path, this breaks COW for
 every migration, but maybe you don't care about shared pages in the
 first place.  And fixing that should be nothing more than grabbing the
 anon_vma lock and using rmap to switch more than one pmd over, right?
 

This patch was very much about getting _something_ rather than caring
(too much) about all the weird corner cases.

And yes, your suggestion sounds about right.


Re: [tip:numa/core] sched/numa/mm: Improve migration

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 09:51 -0400, Johannes Weiner wrote:
 It's slightly ugly that migrate_page_copy() actually modifies the
 existing page (deactivation, munlock) when you end up having to revert
 back to it. 

The worst is actually calling copy_huge_page() on a THP.. it seems to
work though ;-)




Re: [PATCH 1/2] perf tools: add event modifier to request exclusive PMU access

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 16:52 +0200, Stephane Eranian wrote:
 -modifier_event [ukhpGH]{1,8}
 +modifier_event [ukhpGHx]{1,8} 

wouldn't the max modifier sting length grow by adding another possible
modifier?


Re: [PATCH 2/2] perf: SNB exclusive PMU access for INST_RETIRED:PREC_DIST

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 16:52 +0200, Stephane Eranian wrote:
 +static int intel_pebs_aliases_snb(struct perf_event *event)
 +{
 +	u64 cfg = event->hw.config;
 +	/*
 +	 * for INST_RETIRED.PREC_DIST to work correctly with PEBS, it must
 +	 * be measured alone on SNB (exclusive PMU access) as per Intel SDM.
 +	 */
 +	if ((cfg & INTEL_ARCH_EVENT_MASK) == 0x01c0 && !event->attr.exclusive) {
 +		pr_info("perf: INST_RETIRED.PREC_DIST only works in exclusive mode\n");
 +   return -EINVAL;

This isn't limited to admin, right? So the above turns into a DoS on the
console.

 +   }
 +
 +   return intel_pebs_aliases_ivb(event);
  } 


Re: question on NUMA page migration

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 11:53 -0400, Rik van Riel wrote:
 
 If we do need the extra refcount, why is normal
 page migration safe? :) 

Its mostly a matter of how convoluted you make the code, regular page
migration is about as bad as you can get

Normal does:

  follow_page(FOLL_GET) +1

  isolate_lru_page() +1

  put_page() -1

ending up with a page with a single reference (for anon, or one extra
each for the mapping and buffer).

And while I suppose I could do a put_page() in migrate_misplaced_page()
that makes the function frob the page-count depending on the return
value.

I always try and avoid conditional locks/refs, therefore the code ends
up doing:

  page = vm_normal_page()
  if (page) {
get_page()

migrate_misplaced_page()

put_page()
  }


where migrate_misplaced_page() does isolate_lru_page()/putback_lru_page,
and this leaves the page-count invariant.

We got a ref, therefore we must put a ref, is easier than we got a ref
and must put except when...
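
To make the invariant concrete, a minimal sketch of the shape described
above -- do_the_migration() is a hypothetical placeholder, not a real
kernel function:

	/* Sketch only: caller already did vm_normal_page() + get_page() */
	static int migrate_misplaced_page_sketch(struct page *page, int node)
	{
		int err;

		err = isolate_lru_page(page);	/* +1 ref on success */
		if (err)
			return err;		/* page count unchanged */

		err = do_the_migration(page, node);	/* hypothetical helper */

		putback_lru_page(page);		/* -1, undoes the isolate ref */
		return err;			/* caller's ref still held */
	}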


Re: [PATCH 2/2] perf: SNB exclusive PMU access for INST_RETIRED:PREC_DIST

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 18:31 +0200, Stephane Eranian wrote:
 On Fri, Oct 19, 2012 at 6:27 PM, Peter Zijlstra pet...@infradead.org wrote:
  On Fri, 2012-10-19 at 16:52 +0200, Stephane Eranian wrote:
  +static int intel_pebs_aliases_snb(struct perf_event *event)
  +{
  +	u64 cfg = event->hw.config;
  +	/*
  +	 * for INST_RETIRED.PREC_DIST to work correctly with PEBS, it must
  +	 * be measured alone on SNB (exclusive PMU access) as per Intel SDM.
  +	 */
  +	if ((cfg & INTEL_ARCH_EVENT_MASK) == 0x01c0 && !event->attr.exclusive) {
  +		pr_info("perf: INST_RETIRED.PREC_DIST only works in exclusive mode\n");
  +		return -EINVAL;
 
  This isn't limited to admin, right? So the above turns into a DoS on the
  console.
 
 Ok, so how about a WARN_ON_ONCE() instead?

That should be fine I guess ;-)


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 11:32 -0400, Mikulas Patocka wrote:

 So if you can do an alternative implementation without RCU, show it.

Uhm,,. no that's not how it works. You just don't push through crap like
this and then demand someone else does it better.

But using preempt_{disable,enable} and using synchronize_sched() would
be better (for PREEMPT_RCU) although it wouldn't fix anything
fundamental.

  The goal is - there should be no LOCK instructions on the read path and
 as few barriers as possible.

Fine goal, although somewhat arch specific. Also note that there's a
relation between atomics and memory barriers, one isn't necessarily
worse than the other, they all require synchronization of sorts.  

  So did you consider keeping the inc/dec on the same per-cpu variable?
  Yes this adds a potential remote access to dec and requires you to use
  atomics, but I would not be surprised if the inc/dec were mostly on the
  same cpu most of the times -- which might be plenty fast for what you
  want.
 
 Yes, I tried this approach - it involves doing LOCK instruction on read 
 lock, remembering the cpu and doing another LOCK instruction on read 
 unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
 happens in the common case). It was slower than the approach without any 
 LOCK instructions (43.3 seconds seconds for the implementation with 
 per-cpu LOCKed access, 42.7 seconds for this implementation without atomic 
 instruction; the benchmark involved doing 512-byte direct-io reads and 
 writes on a ramdisk with 8 processes on 8-core machine).

So why is that a problem? Surely that's already tons better then what
you've currently got. Also uncontended LOCK is something all x86 vendors
keep optimizing, they'll have to if they want to keep adding CPUs.

  If you've got coherent per-cpu counts, you can better do the
  waitqueue/wake condition for write_down.
 
 synchronize_rcu() is way slower than msleep(1) - so I don't see a reason 
 why should it be complicated to avoid msleep(1).

Its not about slow, a polling write side is just fscking ugly. Also, if
you're already polling that *_sync() is bloody pointless.

  It might also make sense to do away with the mutex, there's no point in
  serializing the wakeups in the p->locked case of down_read.
 
 The advantage of a mutex is that it is already protected against 
 starvation. If I replace the mutex with a wait queue and retry, there is 
 no starvation protection.

Which starvation? writer-writer order? What stops you from adding a list
there yourself? Also, writers had better be rare for this thing, so who
gives a crap?

  Furthermore,
  p->locked seems a complete duplicate of the mutex state, so removing the
  mutex also removes that duplication.
 
 We could replace if (p->locked) with if (mutex_is_locked(&p->mtx))

Quite so.. 

You're also still lacking lockdep annotations...


Re: question on NUMA page migration

2012-10-19 Thread Peter Zijlstra
On Fri, 2012-10-19 at 13:13 -0400, Rik van Riel wrote:

 Would it make sense to have the normal page migration code always
 work with the extra refcount, so we do not have to introduce a new
 MIGRATE_FAULT migration mode?
 
 On the other hand, compaction does not take the extra reference...

Right, it appears to not do this, it gets pages from the pfn under
zone->lock and the isolate_lru_page() call is the first reference.

 Another alternative might be to do the put_page inside
 do_prot_none_numa().  That would be analogous to do_wp_page
 disposing of the old page for the caller.

It'd have to be inside migrate_misplaced_page(), can't do before
isolate_lru_page() or the page might disappear. Doing it after is
(obviously) too late.

 I am not real happy about NUMA migration introducing its own
 migration mode...

You didn't seem to mind too much earlier, but I can remove it if you
want.


Re: linux-next: build failure after merge of the final tree (tip/s390 trees related)

2012-10-19 Thread Peter Zijlstra
On Thu, 2012-10-18 at 17:02 +0200, Ralf Baechle wrote:
   CC  mm/huge_memory.o
 mm/huge_memory.c: In function ‘do_huge_pmd_prot_none’:
 mm/huge_memory.c:789:3: error: incompatible type for argument 3 of ‘update_mmu_cache’

That appears to have become update_mmu_cache_pmd(), which makes sense
given that there's now architectures that care about it.


Re: [PATCH 1/6] perf, x86: Basic Haswell LBR call stack support

2012-10-22 Thread Peter Zijlstra
On Mon, 2012-10-22 at 14:11 +0800, Yan, Zheng wrote:
 +	/* LBR callstack does not work well with FREEZE_LBRS_ON_PMI */
 +	if (!cpuc->lbr_sel || !(cpuc->lbr_sel->config & LBR_CALL_STACK))
 +		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;

How useful it is without this? How many calls between PMI and us getting
to intel_pmu_lbr_read()?


Re: [PATCH 1/6] perf, x86: Basic Haswell LBR call stack support

2012-10-22 Thread Peter Zijlstra
On Mon, 2012-10-22 at 14:11 +0800, Yan, Zheng wrote:
 --- a/include/uapi/linux/perf_event.h
 +++ b/include/uapi/linux/perf_event.h
 @@ -160,8 +160,9 @@ enum perf_branch_sample_type {
 	PERF_SAMPLE_BRANCH_ABORT	= 1U << 7,  /* transaction aborts */
 	PERF_SAMPLE_BRANCH_INTX		= 1U << 8,  /* in transaction (flag) */
 	PERF_SAMPLE_BRANCH_NOTX		= 1U << 9,  /* not in transaction (flag) */
 +	PERF_SAMPLE_BRANCH_CALL_STACK	= 1U << 10, /* call stack */
 
 -	PERF_SAMPLE_BRANCH_MAX		= 1U << 10, /* non-ABI */
 +	PERF_SAMPLE_BRANCH_MAX		= 1U << 11, /* non-ABI */
  };

You add an ABI sample type without mentioning it in your changelog.. I
think I'll stop reading here.



Re: [RFC PATCH 5/8] irq_work: Make self-IPIs optable

2012-10-22 Thread Peter Zijlstra
On Sat, 2012-10-20 at 12:22 -0400, Frederic Weisbecker wrote:
 +	if (empty) {
 +		/*
 +		 * If an IPI is requested, raise it right away. Otherwise wait
 +		 * for the next tick unless it's stopped. Now if the arch uses
 +		 * some other obscure way than IPI to raise an irq work, just
 +		 * raise and don't think further.
 +		 */
 +		if (ipi || !arch_irq_work_has_ipi() || tick_nohz_tick_stopped())
 +			arch_irq_work_raise();
 +	}
 	preempt_enable();
  }

Doesn't this have a problem where we enqueue the first lazy and then one
with ipi? In that case it appears we won't send the IPI because the
queue wasn't empty.
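
Spelled out, the interleaving in question (a sketch of the suspected
ordering, not verified against the full patch):

	/*
	 * CPU enqueues work A (lazy):
	 *   llist_add(A)  -> queue was empty, but !ipi -> no IPI, wait for tick
	 *
	 * CPU enqueues work B (ipi):
	 *   llist_add(B)  -> queue now non-empty -> the whole if (empty)
	 *                    block is skipped, so no IPI is raised for B either
	 *
	 * B ends up waiting for the next tick although it asked for an IPI.
	 */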




Re: [tip:numa/core] x86, mm: Prevent gcc to re-read the pagetables

2012-10-22 Thread Peter Zijlstra
On Sun, 2012-10-21 at 05:56 -0700, tip-bot for Andrea Arcangeli wrote:

 In get_user_pages_fast() the TLB shootdown code can clear the pagetables
 before firing any TLB flush (the page can't be freed until the TLB
 flushing IPI has been delivered but the pagetables will be cleared well
 before sending any TLB flushing IPI).

I think we want to do this for all gup_fast() implementations. When I
reported this issue I also proposed adding something like
page_table_deref() which we could use through-out. Not sure we want to,
but at least all archs need an audit for this.


 ---
  arch/x86/mm/gup.c |   23 ++++++++++++++++++++---
  mm/memory.c       |    2 +-
  2 files changed, 21 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
 index dd74e46..6dc9921 100644
 --- a/arch/x86/mm/gup.c
 +++ b/arch/x86/mm/gup.c
 @@ -150,7 +150,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 
  	pmdp = pmd_offset(&pud, addr);
  	do {
 -		pmd_t pmd = *pmdp;
 +		/*
 +		 * With THP and hugetlbfs the pmd can change from
 +		 * under us and it can be cleared as well by the TLB
 +		 * shootdown, so read it with ACCESS_ONCE to do all
 +		 * computations on the same sampling.
 +		 */
 +		pmd_t pmd = ACCESS_ONCE(*pmdp);
 
  		next = pmd_addr_end(addr, end);
  		/*
 @@ -220,7 +226,13 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 
  	pudp = pud_offset(&pgd, addr);
  	do {
 -		pud_t pud = *pudp;
 +		/*
 +		 * With hugetlbfs giga pages the pud can change from
 +		 * under us and it can be cleared as well by the TLB
 +		 * shootdown, so read it with ACCESS_ONCE to do all
 +		 * computations on the same sampling.
 +		 */
 +		pud_t pud = ACCESS_ONCE(*pudp);
 
  		next = pud_addr_end(addr, end);
  		if (pud_none(pud))
 @@ -280,7 +292,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  	local_irq_save(flags);
  	pgdp = pgd_offset(mm, addr);
  	do {
 -		pgd_t pgd = *pgdp;
 +		/*
 +		 * The pgd could be cleared by the TLB shootdown from
 +		 * under us so read it with ACCESS_ONCE to do all
 +		 * computations on the same sampling.
 +		 */
 +		pgd_t pgd = ACCESS_ONCE(*pgdp);
 
  		next = pgd_addr_end(addr, end);
  		if (pgd_none(pgd))
 diff --git a/mm/memory.c b/mm/memory.c
 index cc8e280..c0de477 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -3555,7 +3555,7 @@ int handle_pte_fault(struct mm_struct *mm,
  	pte_t entry;
  	spinlock_t *ptl;
 
 -	entry = *pte;
 +	entry = ACCESS_ONCE(*pte);
  	if (!pte_present(entry)) {
  		if (pte_none(entry)) {
  			if (vma->vm_ops) {


Re: [PATCH v2 1/3] sched: introduce distinct per-cpu load average

2012-10-22 Thread Peter Zijlstra
On Sat, 2012-10-20 at 21:06 +0200, Andrea Righi wrote:
 @@ -383,13 +383,7 @@ struct rq {
 struct list_head leaf_rt_rq_list;
  #endif
  

 +   unsigned long __percpu *nr_uninterruptible; 

This is O(nr_cpus^2) memory..


 +unsigned long nr_uninterruptible_cpu(int cpu)
 +{
 +	struct rq *this = cpu_rq(cpu);
 +	unsigned long val = 0;
 +	int i;
 +
 +	for_each_online_cpu(i)
 +		val += per_cpu(*this->nr_uninterruptible, i);
 +
 +	return val;
 +}
 
 
I suspect you've got an accounting leak here on hot-plug.
 
  unsigned long nr_uninterruptible(void)
  {
 unsigned long i, sum = 0;
  
 for_each_possible_cpu(i)
 -	sum += cpu_rq(i)->nr_uninterruptible;
 +   sum += nr_uninterruptible_cpu(i);
  
 /*
  * Since we read the counters lockless, it might be slightly

And this makes O(n^2) runtime!
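
Spelled out, the call structure as posted:

	/*
	 * nr_uninterruptible()
	 *   for_each_possible_cpu(i)		-- n iterations
	 *     nr_uninterruptible_cpu(i)
	 *       for_each_online_cpu(j)		-- n iterations each
	 *         val += per_cpu(...)
	 *
	 * => O(n^2) additions per call, on top of the O(n^2) storage.
	 */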




re: sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate

2012-10-22 Thread Peter Zijlstra
On Mon, 2012-10-22 at 14:55 +0300, Dan Carpenter wrote:
 Hello Peter Zijlstra,
 
 The patch 3d049f8a5398: "sched, numa, mm: Implement constant, per
 task Working Set Sampling (WSS) rate" from Oct 14, 2012, leads to the
 following warning:
 kernel/sched/fair.c:954 task_numa_work()
error: we previously assumed 'vma' could be null (see line 948)
 
943		if (!vma) {
944			ACCESS_ONCE(mm->numa_scan_seq)++;
945			offset = 0;
946			vma = mm->mmap;
947		}
948		while (vma && !vma_migratable(vma)) {
		       ^^^
 If this is NULL,
949			vma = vma->vm_next;
950			if (!vma)
951				goto again;
952		}
953
954		offset = max(offset, vma->vm_start);
		^
 then it leads to a NULL dereference here.

Ah.. indeed so. There's also what looks like an infinite loop in there
if nothing is migratable or if length is stupid large. The below should
avoid the reported NULL deref as well as break out when we've reached
the end of the address space.


---
 kernel/sched/fair.c |   23 ++++++++---------------
 1 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c68b877..5a6d8f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -939,29 +939,22 @@ void task_numa_work(struct callback_head *work)
 
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, offset);
-again:
 	if (!vma) {
 		ACCESS_ONCE(mm->numa_scan_seq)++;
 		offset = 0;
 		vma = mm->mmap;
 	}
-	while (vma && !vma_migratable(vma)) {
-		vma = vma->vm_next;
-		if (!vma)
-			goto again;
-	}
-
-	offset = max(offset, vma->vm_start);
-	end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
-	length -= end - offset;
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
 
-	change_prot_none(vma, offset, end);
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
 
-	offset = end;
+		change_prot_none(vma, offset, end);
 
-	if (length > 0) {
-		vma = vma->vm_next;
-		goto again;
+		offset = end;
 	}
 	mm->numa_scan_offset = offset;
 	up_read(&mm->mmap_sem);



Re: [PATCH v2 1/2] perf tools: add event modifier to request exclusive PMU access

2012-10-22 Thread Peter Zijlstra
On Mon, 2012-10-22 at 17:44 +0200, Stephane Eranian wrote:
 
 I know the answer, because I know what's going on under the
 hood. But what about the average user? 

I'm still wondering if the avg user really thinks 'instructions' is a
useful metric for other than obtaining ipc measurements.

The avg user wants time domain measurements, like what's the most
expensive function, or most expensive cache-miss etc..




Re: [PATCH v2 1/2] perf tools: add event modifier to request exclusive PMU access

2012-10-22 Thread Peter Zijlstra
On Mon, 2012-10-22 at 18:08 +0200, Stephane Eranian wrote:
  I'm still wondering if the avg user really thinks 'instructions' is a
  useful metric for other than obtaining ipc measurements.
 
 Yeah, for many users CPI (or IPC) is a useful metric. 

Right but you don't get that using instruction sampling, right? Or you
need to pair it with a cycle count one way or another. And I don't
really see PDIST helping with that.


Re: [PATCH 1/6] perf, x86: Basic Haswell LBR call stack support

2012-10-23 Thread Peter Zijlstra
On Tue, 2012-10-23 at 13:41 +0800, Yan, Zheng wrote:
 On 10/22/2012 06:35 PM, Peter Zijlstra wrote:
  On Mon, 2012-10-22 at 14:11 +0800, Yan, Zheng wrote:
  --- a/include/uapi/linux/perf_event.h
  +++ b/include/uapi/linux/perf_event.h
  @@ -160,8 +160,9 @@ enum perf_branch_sample_type {
  	PERF_SAMPLE_BRANCH_ABORT	= 1U << 7,  /* transaction aborts */
  	PERF_SAMPLE_BRANCH_INTX		= 1U << 8,  /* in transaction (flag) */
  	PERF_SAMPLE_BRANCH_NOTX		= 1U << 9,  /* not in transaction (flag) */
  +	PERF_SAMPLE_BRANCH_CALL_STACK	= 1U << 10, /* call stack */
  
  -	PERF_SAMPLE_BRANCH_MAX		= 1U << 10, /* non-ABI */
  +	PERF_SAMPLE_BRANCH_MAX		= 1U << 11, /* non-ABI */
   };
  
  You add an ABI sample type without mentioning it in your changelog.. I
  think I'll stop reading here.
  
 Ok, I will add the ABI change to the change log. Do you think we should
 hide this branch sample type from the user?

Possibly not, I haven't really thought about it. But never hide an ABI
change. Explicitly mention it and preferably explain why you felt it
needed to extend the ABI.


[PATCH] perf, stat: Add --pre and --post command

2012-10-23 Thread Peter Zijlstra
In order to measure kernel builds, one has to do some pre/post cleanup
work in order to do the repeat build.

So provide --pre and --post command hooks to allow doing just that.
  
  perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/ clean' -- make -s -j64 O=defconfig-build/ bzImage

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
---
 tools/perf/builtin-stat.c |   42 ++++++++++++++++++++++++++++++++++++------
 1 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 93b9011..6888960 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -57,6 +57,7 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 
+#include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
 
@@ -83,6 +84,9 @@ static const char	*csv_sep	= NULL;
 static bool		csv_output	= false;
 static bool		group		= false;
 static FILE		*output		= NULL;
+static const char	*pre_cmd	= NULL;
+static const char	*post_cmd	= NULL;
+static bool		sync_run	= false;
 
 static volatile int done = 0;
 
@@ -265,7 +269,7 @@ static int read_counter(struct perf_evsel *counter)
return 0;
 }
 
-static int run_perf_stat(int argc __maybe_unused, const char **argv)
+static int __run_perf_stat(int argc __maybe_unused, const char **argv)
 {
unsigned long long t0, t1;
struct perf_evsel *counter, *first;
@@ -405,6 +409,32 @@ static int run_perf_stat(int argc __maybe_unused, const char **argv)
return WEXITSTATUS(status);
 }
 
+static int run_perf_stat(int argc __maybe_unused, const char **argv)
+{
+   int ret;
+
+   if (pre_cmd) {
+   ret = system(pre_cmd);
+   if (ret)
+   return ret;
+   }
+
+   if (sync_run)
+   sync();
+
+   ret = __run_perf_stat(argc, argv);
+   if (ret)
+   return ret;
+
+   if (post_cmd) {
+   ret = system(post_cmd);
+   if (ret)
+   return ret;
+   }
+
+   return ret;
+}
+
 static void print_noise_pct(double total, double avg)
 {
double pct = rel_stddev_stats(total, avg);
@@ -1069,8 +1099,7 @@ static int add_default_attributes(void)
 
 int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
 {
-	bool append_file = false,
-	     sync_run = false;
+	bool append_file = false;
 	int output_fd = 0;
 	const char *output_name = NULL;
 	const struct option options[] = {
@@ -1114,6 +1143,10 @@ int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_BOOLEAN(0, "append", &append_file, "append to the output file"),
 	OPT_INTEGER(0, "log-fd", &output_fd,
 		    "log output to fd, instead of stderr"),
+	OPT_STRING(0, "pre", &pre_cmd, "command",
+		   "command to run prior to the measured command"),
+	OPT_STRING(0, "post", &post_cmd, "command",
+		   "command to run after to the measured command"),
OPT_END()
};
const char * const stat_usage[] = {
@@ -1238,9 +1271,6 @@ int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
 			fprintf(output, "[ perf stat: executing run #%d ... ]\n",
 				run_idx + 1);
-   if (sync_run)
-   sync();
-
status = run_perf_stat(argc, argv);
}
 



Re: [PATCH 03/34] perf, x86: Basic Haswell PEBS support v2

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 +struct event_constraint intel_hsw_pebs_event_constraints[] = {
 +	INTEL_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PRECDIST */
 +	INTEL_UEVENT_CONSTRAINT(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
 +	INTEL_UEVENT_CONSTRAINT(0x02c2, 0xf), /* UOPS_RETIRED.RETIRE_SLOTS */
 +	INTEL_EVENT_CONSTRAINT(0xc4, 0xf),    /* BR_INST_RETIRED.* */
 +	INTEL_EVENT_CONSTRAINT(0x01c5, 0xf),  /* BR_MISP_RETIRED.CONDITIONAL */
 +	INTEL_EVENT_CONSTRAINT(0x04c5, 0xf),  /* BR_MISP_RETIRED.ALL_BRANCHES */
 +	INTEL_EVENT_CONSTRAINT(0x20c5, 0xf),  /* BR_MISP_RETIRED.NEAR_TAKEN */
 +	INTEL_EVENT_CONSTRAINT(0xcd, 0x8),    /* MEM_TRANS_RETIRED.* */
 +	INTEL_UEVENT_CONSTRAINT(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
 +	INTEL_UEVENT_CONSTRAINT(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
 +	INTEL_UEVENT_CONSTRAINT(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
 +	INTEL_UEVENT_CONSTRAINT(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
 +	INTEL_UEVENT_CONSTRAINT(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */
 +	INTEL_UEVENT_CONSTRAINT(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
 +	INTEL_UEVENT_CONSTRAINT(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
 +	INTEL_UEVENT_CONSTRAINT(0x01d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L1_HIT */
 +	INTEL_UEVENT_CONSTRAINT(0x02d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L2_HIT */
 +	INTEL_UEVENT_CONSTRAINT(0x04d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L3_HIT */
 +	INTEL_UEVENT_CONSTRAINT(0x40d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.HIT_LFB */
 +	INTEL_UEVENT_CONSTRAINT(0x01d2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS */
 +	INTEL_UEVENT_CONSTRAINT(0x02d2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT */
 +	INTEL_UEVENT_CONSTRAINT(0x02d3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM */
 +	INTEL_UEVENT_CONSTRAINT(0x04c8, 0xf), /* HLE_RETIRED.Abort */
 +	INTEL_UEVENT_CONSTRAINT(0x04c9, 0xf), /* RTM_RETIRED.Abort */
 +
 +	EVENT_CONSTRAINT_END
 +};

ISTR asking this before, but I cannot find an answer in the previous
threads just now.

Do things like the 0xd0 event really need all the UEVENT things spelled
out? And 0x22d0 (LOCK_STORES) still appears missing going by the pattern
in there.

Also, it looks like the 0xc5 things want to be UEVENT, the current EVENT
thing will mask the umask.


Re: [PATCH 04/34] perf, x86: Support the TSX intx/intx_cp qualifiers

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 @@ -826,7 +827,8 @@ static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
 		return true;
 
 	/* implicit branch sampling to correct PEBS skid */
 -	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
 +	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
 +	    x86_pmu.intel_cap.pebs_format < 2)
 		return true;
 
 	return false;

This seems unrelated to TSX, I suppose it belongs in the previous patch?


Re: [PATCH 05/34] perf, x86: Report PEBS event in a raw format

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 +	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
 +		raw.size = x86_pmu.pebs_record_size;
 +		raw.data = __pebs;
 +		data.raw = &raw;
 +	}

The Changelog babbles about registers, yet you export the entire record.

There's the PERF_SAMPLE_REGS thing which has been pointed out to you,
but you fail to include this in your arguments.

Also, there's an alignment issue there, the raw.data is 32bit offset,
the record is u64 aligned, leaving the output stream offset, wrecking
things.

And as with any ABI extension, it should come with useful userspace to
make use of it.

I'll hold off on this one for now.


Re: [PATCH 06/34] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v2

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 This is not arch perfmon, but older CPUs will just ignore it. This makes
 it possible to do at least some TSX measurements from a KVM guest

Please, always CC people who wrote the code as well, in this case that's
Gleb.

 Cc: a...@redhat.com
 v2: Various fixes to address review feedback
 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  arch/x86/kvm/pmu.c |   15 +++
  1 files changed, 11 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
 index cfc258a..81c1632 100644
 --- a/arch/x86/kvm/pmu.c
 +++ b/arch/x86/kvm/pmu.c

 @@ -173,6 +173,11 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.exclude_kernel = exclude_kernel,
 		.config = config,
 	};
 +	/* Will be ignored on CPUs that don't support this. */
 +	if (intx)
 +		attr.config |= HSW_INTX;
 +	if (intx_cp)
 +		attr.config |= HSW_INTX_CHECKPOINTED;
 
 	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);

So I forgot how all this worked, but will the KVM emulation not pass
this straight down to the hardware?

Don't we have a problem where we're emulating arch perfmon v1 on AMD
hardware? On AMD hardware those bits do have meaning.



Re: [PATCH 08/34] perf, x86: Support Haswell v4 LBR format

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 Haswell has two additional LBR from flags for TSX: intx and abort, implemented
 as a new v4 version of the PEBS record.

s/PEBS record/LBR format/ I presume ;-)


Re: [PATCH 11/34] perf, tools: Add abort,notx,intx branch filter options to perf report -j

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 +   BRANCH_OPT(abort, PERF_SAMPLE_BRANCH_ABORT),
 +   BRANCH_OPT(intx, PERF_SAMPLE_BRANCH_INTX),
 +   BRANCH_OPT(notx, PERF_SAMPLE_BRANCH_NOTX), 

I think we want tx in the abort name, its very much a transaction abort,
not any other kind of abort.




Re: [PATCH 14/34] perf, x86: Avoid checkpointed counters causing excessive TSX aborts

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 @@ -1079,6 +1079,17 @@ static void intel_pmu_enable_event(struct perf_event *event)
  int intel_pmu_save_and_restart(struct perf_event *event)
  {
 	x86_perf_event_update(event);
 +	/*
 +	 * For a checkpointed counter always reset back to 0.  This
 +	 * avoids a situation where the counter overflows, aborts the
 +	 * transaction and is then set back to shortly before the
 +	 * overflow, and overflows and aborts again.
 +	 */
 +	if (event->hw.config & HSW_INTX_CHECKPOINTED) {

Would an unlikely() make sense there? Most events won't have this set.

 +		/* No race with NMIs because the counter should not be armed */
 +		wrmsrl(event->hw.event_base, 0);
 +		local64_set(&event->hw.prev_count, 0);
 +	}
 	return x86_perf_event_set_period(event);
  }
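
I.e. something like (sketch):

	if (unlikely(event->hw.config & HSW_INTX_CHECKPOINTED)) {
		/* No race with NMIs because the counter should not be armed */
		wrmsrl(event->hw.event_base, 0);
		local64_set(&event->hw.prev_count, 0);
	}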


Re: [PATCH 14/34] perf, x86: Avoid checkpointed counters causing excessive TSX aborts

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 +	/* XXX move somewhere else. */
 +	if (cpuc->events[2] && (cpuc->events[2]->hw.config & HSW_INTX_CHECKPOINTED))
 +		status |= (1ULL << 2);

A comment explaining about those 'spurious' PMIs would go along with
this nicely, no? I'm very sure I'd keep having to look up the commit
introducing this to figure out wtf its for.


Re: [PATCH 15/34] perf, core: Add a concept of a weightened sample

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 @@ -601,6 +602,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 	data->regs_user.regs = NULL;
 	data->stack_user_size = 0;
 +	data->weight = 0;
  }
  
  extern void perf_output_sample(struct perf_output_handle *handle,

 @@ -562,6 +565,7 @@ enum perf_event_type {
  *	{ u64		stream_id;} && PERF_SAMPLE_STREAM_ID
  *	{ u32		cpu, res; } && PERF_SAMPLE_CPU
  *	{ u64		period;   } && PERF_SAMPLE_PERIOD
 +*	{ u64		weight;   } && PERF_SAMPLE_WEIGHT
  *
  *	{ struct read_format	values;	  } && PERF_SAMPLE_READ
  *

So the only issues I have are that this makes every sample more expensive
by having to 0 out that weight data, and the sample placement.

I don't think avoiding that weight init is really worth the pain that'll
cause, so we'll just leave it there.

As to the placement, I suppose it makes sense, although Stephane once
complained about these fields not being in PERF_SAMPLE numeric order.
Since we're not that anyway, he'll have to deal with it.. Stephane, any
strong arguments against this placement?

Also, Stephane, you said you had something similar in you LL patches, do
you mean to re-use this or should we re-base this on top of your
patches.. ?


Re: [PATCH 16/34] perf, x86: Support weight samples for PEBS

2012-10-23 Thread Peter Zijlstra
On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 When a weighted sample is requested, first try to report the TSX abort cost
 on Haswell. If that is not available report the memory latency. This
 allows profiling both by abort cost and by memory latencies.
 
 Memory latencies requires enabling a different PEBS mode (LL).
 When both address and weight is requested address wins.
 
 The LL mode only works for memory related PEBS events, so add a
 separate event constraint table for those.
 
 I only did this for Haswell for now, but it could be added
 for several other Intel CPUs too by just adding the right
 table for them.

This looks like it will interfere with Stephane's LL patches -- which
should be out any day now ;-)

Stephane, any comments on how we should deal with all that?


Re: [PATCH 05/34] perf, x86: Report PEBS event in a raw format

2012-10-23 Thread Peter Zijlstra
On Tue, 2012-10-23 at 15:30 +0200, Andi Kleen wrote:
  Also, there's an alignment issue there, the raw.data is 32bit offset,
  the record is u64 aligned, leaving the output stream offset, wrecking
  things.
 
 Can you explain more? Not sure I understand.

PERF_SAMPLE_RAW has a u32 size header and starts the data after that.
This means your PERF_SAMPLE_RAW output ends up on a u32 aligned end
address -- assuming the data is a u64 multiple, this is not good.
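
As a sketch of the layout problem (sizes assumed; the PEBS record is a
multiple of 8 bytes):

	/*
	 * PERF_SAMPLE_RAW in the output stream:
	 *
	 *	u32 size;	 <- 4 bytes
	 *	u8  data[size];	 <- 8*k bytes of PEBS record
	 *
	 * 4 + 8*k == 4 (mod 8), so whatever follows this sample is no
	 * longer u64 aligned unless the raw payload is padded such that
	 * size % 8 == 4.
	 */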

 It appears to work at least.

It would on x86, I'm fairly sure it'll break on things like SPARC. We
used to have checks in the userspace code to warn for this on x86 as
well. Not sure if that's still there.

Hmm, so in kernel/events/core.c:perf_prepare_sample() there's a
WARN_ON_ONCE() in the PERF_SAMPLE_RAW branch that should trigger with
this.

  
  And as with any ABI extension, it should come with useful userspace to
  make use of it.
 
 There are already scripts available to use it, see Feng's patchkit.

That's a whole other patchkit, meaning this doesn't belong here.

I never can seem to figure out how to use the scripts mess :/ acme, is
there anything we could do to make that stuff usable? There's a ton of
crap under scripts/ but I don't even know how to get that stuff to run.

What's more, all that nonsense is in weird languages I don't do either,
making it less useful than usual.

In short, I don't consider script/ to be proper userspace since its
bloody useless.


Re: [PATCHv4 0/8] perf, tool: Allow to use hw events in PMU syntax

2012-10-23 Thread Peter Zijlstra
On Wed, 2012-10-10 at 14:53 +0200, Jiri Olsa wrote:
  arch/x86/kernel/cpu/perf_event.c       | 121 ++++++++++++++++++++++++++++++++++++++++
  arch/x86/kernel/cpu/perf_event.h       |   2 ++
  arch/x86/kernel/cpu/perf_event_amd.c   |   9 +++
  arch/x86/kernel/cpu/perf_event_intel.c |   9 +++
  arch/x86/kernel/cpu/perf_event_p6.c    |   2 ++
  include/linux/perf_event.h             |   3 +++

These bits look like they want to live in
arch/x86/kernel/cpu/perf_event.h, they're rather x86 specific. I moved
them over.

  tools/perf/util/parse-events-test.c    |  68 ++++++++++++++++++++++++
  tools/perf/util/parse-events.c         |  18 ++++++
  tools/perf/util/parse-events.h         |   2 ++
  tools/perf/util/parse-events.y         |  18 ++++++
  tools/perf/util/pmu.c                  |   7 +++---
  11 files changed, 256 insertions(+), 3 deletions(-)

Acme, you ok with these bits?




Re: [PATCH 02/11] perf: Do not get values from disabled counters in group format read

2012-10-23 Thread Peter Zijlstra
On Sat, 2012-10-20 at 16:33 +0200, Jiri Olsa wrote:
 It's possible some of the counters in the group could be
 disabled when sampling member of the event group is reading
 the rest via PERF_SAMPLE_READ sample type processing. Disabled
 counters could then produce wrong numbers.
 
 Fixing that by reading only enabled counters for PERF_SAMPLE_READ
 sample type processing.
 

However did you run into this?

 ---
  kernel/events/core.c | 18 +++---
  1 file changed, 15 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/events/core.c b/kernel/events/core.c
 index 32aec40..5220d01 100644
 --- a/kernel/events/core.c
 +++ b/kernel/events/core.c
 @@ -4012,12 +4012,24 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 	__output_copy(handle, values, n * sizeof(u64));
 
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 +		u64 value = 0;
 		n = 0;
 
 -		if (sub != event)
 -			sub->pmu->read(sub);
 +		/*
 +		 * We are NOT interested in disabled counters,
 +		 * giving us strange values and keeping us from
 +		 * good night sleep!!!
 +		 */
 +		if (sub->state > PERF_EVENT_STATE_OFF) {
 +

superfluous whitespace there... ;-)

 +			if (sub != event)
 +				sub->pmu->read(sub);
 +
 +			value = perf_event_count(sub);
 +		}
 +
 +		values[n++] = value;
 
 -		values[n++] = perf_event_count(sub);
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);
 



Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers

2012-10-23 Thread Peter Zijlstra
On Mon, 2012-10-22 at 19:37 -0400, Mikulas Patocka wrote:
 -	/*
 -	 * On X86, write operation in this_cpu_dec serves as a memory unlock
 -	 * barrier (i.e. memory accesses may be moved before the write, but
 -	 * no memory accesses are moved past the write).
 -	 * On other architectures this may not be the case, so we need smp_mb()
 -	 * there.
 -	 */
 -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
 -	barrier();
 -#else
 -	smp_mb();
 -#endif
 +	light_mb(); /* B, between read of the data and write to p->counter, paired with C */

If we're going to invent new primitives for this, shouldn't we call
this: smp_unlock_barrier() or something? That at least has well defined
semantics.




Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers

2012-10-23 Thread Peter Zijlstra
On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote:
 I have to admit, I have
 no idea how much cli/sti is slower compared to preempt_disable/enable.
 
A lot.. esp on stupid hardware (insert pentium-4 reference), but I think
its more expensive for pretty much all hardware, preempt_disable() is
only a non-atomic cpu local increment and a compiler barrier, enable the
same and a single conditional.
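
Roughly, for kernels of this era (a sketch; the exact definitions vary
with config options):

	#define preempt_disable() \
	do { \
		inc_preempt_count();	/* per-cpu increment, no LOCK prefix */ \
		barrier();		/* compiler barrier only */ \
	} while (0)

	#define preempt_enable() \
	do { \
		barrier(); \
		dec_preempt_count(); \
		preempt_check_resched();	/* the single conditional */ \
	} while (0)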


Re: [PATCH 1/2] percpu-rw-semaphores: use light/heavy barriers

2012-10-23 Thread Peter Zijlstra
On Tue, 2012-10-23 at 21:23 +0200, Oleg Nesterov wrote:
 
 static void mb_ipi(void *arg)
 {
 smp_mb(); /* unneeded ? */
 }
 
 static void force_mb_on_each_cpu(void)
 {
 smp_mb();
 smp_call_function(mb_ipi, NULL, 1);
 } 

You know we're spending an awful lot of time and effort to get rid of
such things, right? RT and HPC people absolutely hate these random IPI
things.


Re: [rfc 0/2] Introducing VmFlags field into smaps output

2012-10-23 Thread Peter Zijlstra
On Wed, 2012-10-24 at 01:59 +0400, Cyrill Gorcunov wrote:
 [ilog2(VM_WRITE)]   = { {'w', 'r'} },

since we're being awfully positive about crazy late night ideas, how
about something like:

#define MNEM(_VM, _mn) [ilog2(_VM)] = {(const char [2]){_mn}}

MNEM(VM_WRITE, "wr"),




Re: + procfs-add-vmflags-field-in-smaps-output-v3-fix-2.patch added to -mm tree

2012-10-24 Thread Peter Zijlstra
On Wed, 2012-10-24 at 12:45 +0400, Cyrill Gorcunov wrote:
 	for (i = 0; i < BITS_PER_LONG; i++) {
 -		if (vma->vm_flags & (1 << i))
 +		if (vma->vm_flags & (1ul << i)) {

for_each_set_bit(i, &vma->vm_flags, BITS_PER_LONG) {

 			seq_printf(m, "%c%c ",
 				   mnemonics[i].l[0],
 				   mnemonics[i].l[1]);
 +	}
 	}
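
Put together, the loop would look something like this (a sketch, assuming
the mnemonics[] table from the patch under discussion):

	static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
	{
		unsigned long i;

		seq_puts(m, "VmFlags: ");
		for_each_set_bit(i, &vma->vm_flags, BITS_PER_LONG)
			seq_printf(m, "%c%c ", mnemonics[i].l[0], mnemonics[i].l[1]);
		seq_putc(m, '\n');
	}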


Re: [RFC][PATCH] sched: Fix a deadlock of cpu-hotplug

2012-10-24 Thread Peter Zijlstra
On Wed, 2012-10-24 at 17:25 +0800, Huacai Chen wrote:
 We found poweroff sometimes fails on our computers, so we have the
 lock debug options configured. Then, when we do poweroff or take a
 cpu down via cpu-hotplug, the kernel complains as below. To resolve this,
 we modify sched_ttwu_pending(), disabling the local irq when acquiring
 rq->lock.
 
 [   83.066406] =
 [   83.066406] [ INFO: inconsistent lock state ]
 [   83.066406] 3.5.0-3.lemote #428 Not tainted
 [   83.066406] -
 [   83.066406] inconsistent {IN-HARDIRQ-W} - {HARDIRQ-ON-W} usage.
 [   83.066406] migration/1/7 [HC0[0]:SC0[0]:HE1:SE1] takes:
 [   83.066406]  (&rq->lock){?.-.-.}, at: [802585ac] sched_ttwu_pending+0x64/0x98
 [   83.066406] {IN-HARDIRQ-W} state was registered at:
 [   83.066406]   [8027c9ac] __lock_acquire+0x80c/0x1cc0
 [   83.066406]   [8027e3d0] lock_acquire+0x60/0x9c
 [   83.066406]   [8074ba04] _raw_spin_lock+0x3c/0x50
 [   83.066406]   [8025a2fc] scheduler_tick+0x48/0x178
 [   83.066406]   [8023b334] update_process_times+0x54/0x70
 [   83.066406]   [80277568] tick_handle_periodic+0x2c/0x9c
 [   83.066406]   [8020a818] c0_compare_interrupt+0x8c/0x94
 [   83.066406]   [8029ec8c] handle_irq_event_percpu+0x7c/0x248
 [   83.066406]   [802a2774] handle_percpu_irq+0x8c/0xc0
 [   83.066406]   [8029e2c8] generic_handle_irq+0x48/0x58
 [   83.066406]   [80205c04] do_IRQ+0x18/0x24
 [   83.066406]   [802016e4] mach_irq_dispatch+0xe4/0x124
 [   83.066406]   [80203ca0] ret_from_irq+0x0/0x4
 [   83.066406]   [8022d114] console_unlock+0x3e8/0x4c0
 [   83.066406]   [811ff0d0] con_init+0x370/0x398
 [   83.066406]   [811fe3e0] console_init+0x34/0x50
 [   83.066406]   [811e4844] start_kernel+0x2f8/0x4e0
 [   83.066406] irq event stamp: 971
 [   83.066406] hardirqs last  enabled at (971): [8021c384] 
 local_flush_tlb_all+0x134/0x17c
 [   83.066406] hardirqs last disabled at (970): [8021c298] 
 local_flush_tlb_all+0x48/0x17c
 [   83.066406] softirqs last  enabled at (0): [802298a4] 
 copy_process+0x510/0x117c
 [   83.066406] softirqs last disabled at (0): [  (null)] (null)
 [   83.066406]
 [   83.066406] other info that might help us debug this:
 [   83.066406]  Possible unsafe locking scenario:
 [   83.066406]
 [   83.066406]CPU0
 [   83.066406]
 [   83.066406]   lock(&rq->lock);
 [   83.066406]   <Interrupt>
 [   83.066406] lock(&rq->lock);
 [   83.066406]
 [   83.066406]  *** DEADLOCK ***
 [   83.066406]
 [   83.066406] no locks held by migration/1/7.
 [   83.066406]
 [   83.066406] stack backtrace:
 [   83.066406] Call Trace:
 [   83.066406] [80747544] dump_stack+0x8/0x34
 [   83.066406] [8027ba04] print_usage_bug+0x2ec/0x314
 [   83.066406] [8027be28] mark_lock+0x3fc/0x774
 [   83.066406] [8027ca48] __lock_acquire+0x8a8/0x1cc0
 [   83.066406] [8027e3d0] lock_acquire+0x60/0x9c
 [   83.066406] [8074ba04] _raw_spin_lock+0x3c/0x50
 [   83.066406] [802585ac] sched_ttwu_pending+0x64/0x98
 [   83.066406] [80745ff4] migration_call+0x10c/0x2e0
 [   83.066406] [80253110] notifier_call_chain+0x44/0x94
 [   83.066406] [8022eae0] __cpu_notify+0x30/0x5c
 [   83.066406] [8072b598] take_cpu_down+0x5c/0x70
 [   83.066406] [80299ba4] stop_machine_cpu_stop+0x104/0x1e8
 [   83.066406] [802997cc] cpu_stopper_thread+0x110/0x1ac
 [   83.066406] [8024c940] kthread+0x88/0x90
 [   83.066406] [80205ee4] kernel_thread_helper+0x10/0x18

Weird, that's from a CPU_DYING call, I thought those were with IRQs
disabled. 

Look at how __stop_machine() calls the function with IRQs disabled for
!stop_machine_initialized or !SMP. Also stop_machine_cpu_stop() seems to
disable interrupts, so how do we end up calling take_cpu_down() with
IRQs enabled?

That simply doesn't make any sense.
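
For reference, the ordering stop_machine is supposed to guarantee, as a
sketch (the real code is a state machine; this is not the actual
implementation):

	static int stop_machine_cpu_stop_sketch(int (*fn)(void *), void *data)
	{
		int err;

		local_irq_disable();	/* STOPMACHINE_DISABLE_IRQ, on every cpu */
		err = fn(data);		/* STOPMACHINE_RUN: take_cpu_down() here */
		local_irq_enable();	/* STOPMACHINE_EXIT */

		return err;
	}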

 Signed-off-by: Huacai Chen che...@lemote.com
 ---
  kernel/sched/core.c |5 +++--
  1 files changed, 3 insertions(+), 2 deletions(-)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 36e2666..703754a 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
  @@ -1468,9 +1468,10 @@ static void sched_ttwu_pending(void)
   {
  	struct rq *rq = this_rq();
  	struct llist_node *llist = llist_del_all(&rq->wake_list);
  +	unsigned long flags;
  	struct task_struct *p;
  
  -	raw_spin_lock(&rq->lock);
  +	raw_spin_lock_irqsave(&rq->lock, flags);
  
  	while (llist) {
  		p = llist_entry(llist, struct task_struct, wake_entry);
  @@ -1478,7 +1479,7 @@ static void sched_ttwu_pending(void)
  		ttwu_do_activate(rq, p, 0);
  	}
  
  -	raw_spin_unlock(&rq->lock);
  +	raw_spin_unlock_irqrestore(&rq->lock, flags);
   }
  
   void scheduler_ipi(void)
  
  void scheduler_ipi(void)


That's wrong though, you add the cost 

Re: [PATCH 02/11] perf: Do not get values from disabled counters in group format read

2012-10-24 Thread Peter Zijlstra
On Tue, 2012-10-23 at 18:50 +0200, Jiri Olsa wrote:
 On Tue, Oct 23, 2012 at 06:13:09PM +0200, Peter Zijlstra wrote:
  On Sat, 2012-10-20 at 16:33 +0200, Jiri Olsa wrote:
   It's possible some of the counters in the group could be
   disabled when sampling member of the event group is reading
   the rest via PERF_SAMPLE_READ sample type processing. Disabled
   counters could then produce wrong numbers.
   
   Fixing that by reading only enabled counters for PERF_SAMPLE_READ
   sample type processing.
   
  
  However did you run into this?
 
 yep, with perf record -a
 
 hm, I just checked and we enable/disable event groups atomicaly..
 I haven't checked that before because it seemed obvious :-/
 
 So, I'm not sure now about the exact code path that triggered it
 in my test..  however you could always disable child event from
 group and hit this issue, but thats not what happened in perf.
 
 might be some other bug... I'll check 

Right, so I don't object to the patch per-se, I was just curious how you
ran into it, because ISTR what you just said, we enable all this stuff
together.

Also, why would disabled counters give strange values? They'd simply
return the same old value time after time, right?


Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events

2012-10-24 Thread Peter Zijlstra
On Tue, 2012-10-16 at 11:31 -0700, Sukadev Bhattiprolu wrote:
 On a side note, how does the kernel on x86 use the 'config' information in 
 say /sys/bus/event_source/devices/cpu/format/cccr ? On Power7, the raw
 code encodes the information such as the PMC to use for the event. Is that
 how the 'config' info in Intel is used ?
 
 Does the 'config' info change from system to system or is it static for
 a given event on a given CPU ? 

Have a look at commits (tip/master):

  641cc938815dfd09f8fa1ec72deb814f0938ac33
  a47473939db20e3961b200eb00acf5fcf084d755
  43c032febde48aabcf6d59f47cdcb7b5debbdc63


So basically

 /sys/bus/event_source/devices/cpu/format/event

contains something like:

  config:0-7

Which says that for the 'cpu' PMU, field 'event' fills
perf_event_attr::config bits 0 through 7 (for type=PERF_TYPE_RAW).

The perf tool syntax for this is:

  perf stat -e 'cpu/event=0x3c/'

This basically allows you to expose bitfields in the 'raw' event format
for ease of writing raw events. I do not know if the Power PMU has such
or not.

Using this,

  /sys/bus/event_source/devices/cpu/events/cpu-cycles

would contain something like:

  event=0x3c

which one can use as:

  perf stat -e 'cpu/event=cpu-cycles/'
  perf stat -e 'cpu/cpu-cycles/'

The tool will then read the sysfs file, substitute the content to
obtain:

  perf stat -e 'cpu/event=0x3c/'

and run with that.

Within all this, the perf_event_attr::config* field names are hard-coded
special, so 'cpu/config=0x/' will always work, even without sysfs
format/ specification and is equivalent to the raw event stuff we had
before.


If the Power PMU lacks any structure to the raw config, you could simply
provide sysfs event/ files with:

  config=0xdeadbeef

like content.
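
Mechanically, the tool side amounts to something like this (an
illustrative sketch, not the actual parser code):

	/* "config:0-7" says: put the 'event' value in attr.config bits 0..7 */
	static void apply_format_term(struct perf_event_attr *attr,
				      unsigned int lo, unsigned int hi, __u64 val)
	{
		__u64 bits = hi - lo + 1;
		__u64 mask = (bits == 64) ? ~0ULL : (1ULL << bits) - 1;

		attr->config |= (val & mask) << lo;
	}

	/* 'cpu/event=0x3c/'  =>  apply_format_term(&attr, 0, 7, 0x3c) */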


Re: [PATCH 02/11] perf: Do not get values from disabled counters in group format read

2012-10-24 Thread Peter Zijlstra
On Wed, 2012-10-24 at 14:14 +0200, Jiri Olsa wrote:
 
 well, x86_pmu_read calls x86_perf_event_update, which expects the
 event
 is active.. if it's not it'll update the count from whatever left in
 event.hw.idx counter.. could be uninitialized or used by others..
 
Oh right, we shouldn't call ->read() unless ->state ==
PERF_EVENT_STATE_ACTIVE. Aside from that, perf_event_count() should
return the 'right' value regardless of state (although excluding ERROR
might make sense).
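
I.e. something along these lines in perf_output_read_group() (sketch):

	if (sub != event && sub->state == PERF_EVENT_STATE_ACTIVE)
		sub->pmu->read(sub);

	values[n++] = perf_event_count(sub);	/* valid for any state */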


Re: [RFC][PATCH] sched: Fix a deadlock of cpu-hotplug

2012-10-24 Thread Peter Zijlstra
On Wed, 2012-10-24 at 20:34 +0800, 陈华才 wrote:
 I see, this is an arch-specific bug, sorry for my carelessness and thank
 you for your tips. 

What arch are you using? And what exactly did the arch do wrong? Most of
the code involved seems to be common code.

Going by c0_compare_interrupt, this is some MIPS device.


Re: [PATCH] Do not use cpu_to_node() to find an offlined cpu's node.

2012-10-09 Thread Peter Zijlstra
On Mon, 2012-10-08 at 10:59 +0800, Tang Chen wrote:
 If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
 return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
 find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.

Hurm,. this is new, right? Who is changing all these semantics without
auditing the tree and informing all affected people?




Re: [PATCH] task_work: avoid unneeded cmpxchg() in task_work_run()

2012-10-09 Thread Peter Zijlstra
On Mon, 2012-10-08 at 14:38 +0200, Oleg Nesterov wrote:
 But the code looks more complex, and the only advantage is that
 non-exiting task does xchg() instead of cmpxchg(). Not sure this
 worth the trouble, in this case task_work_run() will likey run
 the callbacks (the caller checks ->task_works != NULL), I do not
 think this can add any noticeable speedup. 

Yeah, I agree, the patch doesn't seem worth the trouble. It makes tricky
code unreadable at best.



Re: [REPOST] RFC: sched: Prevent wakeup to enter critical section needlessly

2012-10-09 Thread Peter Zijlstra
On Tue, 2012-10-09 at 06:37 -0700, Andi Kleen wrote:
 Ivo Sieben meltedpiano...@gmail.com writes:
 
  Check the waitqueue task list to be non empty before entering the critical
  section. This prevents locking the spin lock needlessly in case the queue
  was empty, and therefor also prevent scheduling overhead on a PREEMPT_RT
  system.
 
  Signed-off-by: Ivo Sieben meltedpiano...@gmail.com
  ---
 
   REPOST:
   Request for comments:
   - Does this make any sense?
 
 Looks good to me. Avoiding any spinlock is good.  I don't even think you
 need careful, if you miss an update it was just as it happened a
 little later.

Yeah, so I've started looking at this several times, but never had the
time to actually think it through. I think I'll agree with you that
using the careful list empty check isn't needed.

In general there's already the race of doing a wakeup before the wait
and if the code using the waitqueue doesn't deal with that its already
broken, so in that respect you should be good, since you're simply
shifting the timing a little.

One thing you might need to consider is the memory ordering, will the
list_empty -- either careful or not -- observe the right list pointer,
or could it -- when racing with wait_event()/prepare_to_wait() --
observe a stale value. Or.. is that all already covered in on the use
site.

I started auditing a few users to see what they all assume, if they're
already broken and or if they would now be broken.. but like said, I
didn't get anywhere.


I'd like to see this patch/Changelog amended to at least cover these
points so that I feel all warm and fuzzy when I read it and not have to
think too hard ;-)
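
For reference, the pattern being proposed amounts to something like the sketch below; the barrier placement reflects the ordering question raised above and is an assumption, not a verdict:

/* Sketch of a lockless check on the wakeup side.  Whether the smp_mb()
 * is required depends on what the wait side already orders -- exactly
 * the open question above. */
static void wake_up_if_waiters(wait_queue_head_t *q)
{
	smp_mb();	/* pairs with set_current_state() in prepare_to_wait() */
	if (!waitqueue_active(q))	/* lockless list_empty() peek */
		return;
	wake_up(q);	/* takes q->lock only when someone actually waits */
}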


Re: [PATCH] x86/perf: Fix virtualization sanity check

2012-10-09 Thread Peter Zijlstra
On Tue, 2012-10-09 at 17:38 +0200, Andre Przywara wrote:
 First you need an AMD family 10h/12h CPU. These do not reset the
 PERF_CTR registers on a reboot.
 Now you boot bare metal Linux, which goes successfully through this
 check, but leaves the magic value of 0xabcd in the register. You
 don't use the performance counters, but do a reboot (warm reset).
 Then you choose to boot Xen. The check will be triggered with a
 recent Linux kernel as Dom0 again, trying to write 0xabcd into the
 MSR. Xen silently drops the write (expected), but the subsequent read
 will return the value in the register, which just happens to be the
 expected magic value. Thus the test misleadingly succeeds, leaving
 the kernel in the belief that the PMU is available 

Wow.. ! that's uhm.. shees!

Bit weird of Xen to trap writes but not reads of MSRs though.

The patch looks fine though, thanks!
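
One way to make such a probe robust, sketched under the assumption that deriving the test pattern from the current register contents is acceptable (illustrative only; see the actual patch for the merged fix):

/* A silently dropped write is guaranteed to read back differently if
 * the test pattern is derived from what the register already holds. */
static bool perf_ctr_writable(unsigned int msr)
{
	u64 old, new;

	rdmsrl(msr, old);
	if (wrmsrl_safe(msr, old ^ 0xffffUL))	/* flip some low bits */
		return false;
	rdmsrl(msr, new);
	wrmsrl(msr, old);			/* restore the old value */
	return new == (old ^ 0xffffUL);
}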


Re: [PATCH] Do not use cpu_to_node() to find an offlined cpu's node.

2012-10-09 Thread Peter Zijlstra
On Tue, 2012-10-09 at 13:36 -0700, David Rientjes wrote:
 On Tue, 9 Oct 2012, Peter Zijlstra wrote:
 
  On Mon, 2012-10-08 at 10:59 +0800, Tang Chen wrote:
   If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
   return -1. As a result, cpumask_of_node(nid) will return NULL. In this 
   case,
   find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.
  
  Hurm,. this is new, right? Who is changing all these semantics without
  auditing the tree and informing all affected people?
  
 
 I've nacked the patch that did it because I think it should be done from 
 the generic cpu hotplug code only at the CPU_DEAD level with a per-arch 
 callback to fixup whatever cpu-to-node mappings they maintain since 
 processes can reenter the scheduler at CPU_DYING.

Well the code they were patching is in the wakeup path. As I think Tang
said, we leave !runnable tasks on whatever cpu they ran on last, even if
that cpu is offlined; we try and fix up state when we get a wakeup.

On wakeup, it tries to find a cpu to run on and will try a cpu of the
same node first.

Now if that node's entirely gone away, it appears the cpu_to_node() map
will not return a valid node number.

I think that's a change in behaviour; it didn't use to do that afaik.
Certainly this code hasn't change in a while.


 The whole issue seems to be because alloc_{fair,rt}_sched_group() does an 
 iteration over all possible cpus (not all online cpus) and does 
 kzalloc_node() which references a now-offlined node.  Changing it to -1 
 makes the slab code fallback to any online node.

Right, that's because the rq structures are assumed always present. What
I cannot remember is why I'm not using per-cpu allocations there,
because that's exactly what it looks like it wants to be.

 What I think we need to do instead of hacking only the acpi code and not 
 standardizing this across the kernel is:

Right, what I don't understand is wtf ACPI has to do with anything. We
have plenty of cpu hotplug code; ACPI isn't involved in any of that last
time I checked.

  - reset cpu-to-node with a per-arch callback in generic cpu hotplug code 
at CPU_DEAD, and
 
  - do an iteration over all possible cpus for node hot-remove ensuring 
there are no stale references.

Why do we need to clear cpu-to-node maps? Are we going to change the
topology at runtime? What are you going to do with per-cpu stuff,
per-cpu memory isn't freed on hotplug, so its node relation is static.

/me confused..



Re: [PATCH] Do not use cpu_to_node() to find an offlined cpu's node.

2012-10-10 Thread Peter Zijlstra
On Tue, 2012-10-09 at 16:27 -0700, David Rientjes wrote:
 On Tue, 9 Oct 2012, Peter Zijlstra wrote:
 
  Well the code they were patching is in the wakeup path. As I think Tang
  said, we leave !runnable tasks on whatever cpu they ran on last, even if
  that cpu is offlined, we try and fix up state when we get a wakeup.
  
  On wakeup, it tries to find a cpu to run on and will try a cpu of the
  same node first.
  
  Now if that node's entirely gone away, it appears the cpu_to_node() map
  will not return a valid node number.
  
  I think that's a change in behaviour, it didn't used to do that afaik.
  Certainly this code hasn't change in a while.
  
 
 If cpu_to_node() always returns a valid node id even if all cpus on the 
 node are offline, then the cpumask_of_node() implementation, which the 
 sched code is using, should either return an empty cpumask (if 
 node_to_cpumask_map[nid] isn't freed) or cpu_online_mask.  The change in 
 behavior here occurred because 
 cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't 
 return a valid node id and forces it to return -1 so a kzalloc_node(..., 
 -1) fallsback to allocate anywhere.

I think that's broken semantics.. so far the entire cpu-node mapping
was invariant during hotplug. Changing that is going to be _very_
interesting and cannot be done lightly.

Because as I said, per-cpu memory is preserved over hotplug, and that
has numa affinity.

So for now, let me NACK that patch. You cannot go change stuff like
that.

 
 But if you only need cpu_to_node() when waking up to find a runnable cpu 
 for this NUMA information, then I think you can just change the 
 kzalloc_node() in alloc_{fair,rt}_sched_group() to do 
 kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).

That's a confusing statement, the wakeup stuff and the
alloc_{fair,rt}_sched_group() stuff are unrelated, although both sites
might need fixing if we're going to go ahead with this.
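
Spelled out, the suggested fallback would look like this sketch inside alloc_{fair,rt}_sched_group() -- whether it is the right fix is exactly what is being debated:

/* Only trust cpu_to_node() for online cpus; otherwise let slab pick
 * any node via NUMA_NO_NODE. */
int nid = cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE;

cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, nid);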


Re: [PATCH] Do not use cpu_to_node() to find an offlined cpu's node.

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 17:33 +0800, Wen Congyang wrote:
 
 Hmm, if per-cpu memory is preserved, we can't offline and remove
 this memory, so we can't offline the node.
 
 But if the node is hot-added and per-cpu memory doesn't use the
 memory on this node, we can hot-remove the cpu/memory on this node
 and then offline the node.
 
 Before the cpu is hot-added, the cpu's node is -1. We set the
 cpu-node mapping when it is hot-added. So the entire cpu-node mapping
 was not invariant during hotplug.
 
 That is why I try to clear it when the cpu is hot-removed.
 
 As we need the mapping to migrate a task to a cpu on the same node first,
 I think we can clear the mapping when the node is offlined.

Hmm maybe, but hardware that can hot-add is rare and nobody has it so
nobody cares ;-)

But by clearing cpu_to_node on every hotplug you change semantics for
all hardware and everybody gets to feel the pain.

I'm not saying you cannot change things, I'm only saying you should be
far more careful about it, not change it and wait for things to break.
Put in some effort to find things that might break and warn people --
sure, you'll always miss some, and that's ok.




Re: [PATCH] Do not use cpu_to_node() to find an offlined cpu's node.

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 18:10 +0800, Wen Congyang wrote:
 I use ./scripts/get_maintainer.pl, and it didn't tell me that I should cc
 you when I posted that patch. 

That script doesn't look at all usage sites of the code you modify, does
it?

You need to audit the entire tree for usage of the interfaces you change
and email all relevant people for whom you're planning to break stuff
-- preferably with a proposed fix.

That's manual work, no script will ever do this for you.


Re: Netperf UDP_STREAM regression due to not sending IPIs in ttwu_queue()

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 13:29 +0100, Mel Gorman wrote:
 Do we really switch more though?
 
 Look at the difference in interrupts vs context switch. IPIs are an interrupt
 so if TTWU_QUEUE wakes process B using an IPI, does that count as a context
 switch?

Nope. Nor would it for NO_TTWU_QUEUE. A process waking another is just
that, a wakeup.

A context switch is when we stop running a process and start running
another. A wakeup can lead to us deciding the newly woken task is a
better task to run; however, it's not a given.
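
In code terms the distinction is roughly this (a simplified sketch of the try_to_wake_up() tail, not the full function):

/* A wakeup marks the task runnable and at most *requests* preemption;
 * the context switch itself happens later, in schedule(), and only if
 * the woken task is actually preferred. */
static void ttwu_do_wakeup_sketch(struct rq *rq, struct task_struct *p)
{
	check_preempt_curr(rq, p, 0);	/* may set TIF_NEED_RESCHED */
	p->state = TASK_RUNNING;	/* runnable, not necessarily running */
}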



Re: [PATCH 4/8] perf x86: Adding hardware events translations for amd cpus

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 14:53 +0200, Jiri Olsa wrote:
 +static ssize_t amd_event_sysfs_show(char *page, u64 config)
 +{
 +   u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
 +   (config & AMD64_EVENTSEL_EVENT) >> 24;
 +
 +   return x86_event_sysfs_show(page, config, event);
 +} 

You'll need to filter out the 0xF << 32 bits before passing them on in
@config; Intel has a different meaning for them.
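
Concretely, the requested filtering could look like the sketch below; the mask name is invented for illustration:

/* Strip the AMD extended event-select bits (35:32) from @config before
 * handing it to the shared routine -- Intel reuses those bit positions
 * (in_tx/in_tx_cp). */
#define AMD64_EXT_EVENT_BITS	(0xFULL << 32)

static ssize_t amd_event_sysfs_show(char *page, u64 config)
{
	u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
		    (config & AMD64_EVENTSEL_EVENT) >> 24;

	return x86_event_sysfs_show(page, config & ~AMD64_EXT_EVENT_BITS, event);
}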




Re: [PATCH 4/8] perf x86: Adding hardware events translations for amd cpus

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 16:25 +0200, Jiri Olsa wrote:
 On Wed, Oct 10, 2012 at 04:11:42PM +0200, Peter Zijlstra wrote:
  On Wed, 2012-10-10 at 14:53 +0200, Jiri Olsa wrote:
   +static ssize_t amd_event_sysfs_show(char *page, u64 config)
   +{
   +   u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
   +   (config & AMD64_EVENTSEL_EVENT) >> 24;
   +
   +   return x86_event_sysfs_show(page, config, event);
   +} 
  
  You'll need to filter out the 0xF << 32 bits before passing them on in
  @config; Intel has a different meaning for them.
 
 Right, that would be those 'intx and intx_cp' bits we discussed, right?

Right.

 My thinking was to customize this once those bits are introduced and
 part of the format stuff. Until that time the x86_event_sysfs_show
 function shows proper data for both amd and intel. Or is it already
 on its way in?

No, that's fine, just something we shouldn't forget about. They're in Andi
Kleen's HSW patches, I need to go over the v2 of that.


Re: Meaningless load?

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 17:44 +0200, Simon Klinkert wrote:
 I'm just wondering if the 'load' is really meaningful in this
 scenario. The machine is fully responsive the whole time and looks
 fine to me, but maybe I didn't understand correctly what the load
 should mean. Is there any sensible interpretation of the load?

I'll leave meaningful aside, but uninterruptible (D state) is part of
how the load thing is defined, so your 500 result is correct.
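
In code terms, the load samples count D-state tasks alongside runnable ones, paraphrasing the kernel's calc_load_fold_active():

/* Tasks in TASK_UNINTERRUPTIBLE count just like runnable ones, so 500
 * processes stuck in D state legitimately produce a load of ~500. */
long nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;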


Re: [PATCH V2] task_work: avoid unneeded cmpxchg() in task_work_run()

2012-10-10 Thread Peter Zijlstra
On Wed, 2012-10-10 at 19:50 +0200, Oleg Nesterov wrote:
 
 But you did not answer, and I am curious. What was your original
 motivation? Is xchg really faster than cmpxchg? 

And is this true over multiple architectures? Or are we optimizing for
x86_64 (again)?


Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps

2012-10-25 Thread Peter Zijlstra
On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
 Ok, this looks the same but it's actually a different issue: 
 mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
 calls get_vma_policy() which may take the shared policy mutex.  This 
 happens while holding page_table_lock from do_huge_pmd_numa_page() but 
 also from do_numa_page() while holding a spinlock on the ptl, which is 
 coming from the sched/numa branch.
 
 Is there anyway that we can avoid changing the shared policy mutex back 
 into a spinlock (it was converted in b22d127a39dd [mempolicy: fix a race 
 in shared_policy_replace()])?
 
 Adding Peter, Rik, and Mel to the cc. 

Urgh, crud I totally missed that.

So the problem is that we need to compute if the current page is placed
'right' while holding pte_lock in order to avoid multiple pte_lock
acquisitions on the 'fast' path.

I'll look into this in a bit, but one thing that comes to mind is having
both a spinlock and a mutex and require holding both for modification
while either one is sufficient for read.

That would allow sp_lookup() to use the spinlock, while insert and
replace can hold both.

Not sure it will work for this, need to stare at this code a little
more.
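
The dual-lock idea, sketched with illustrative names (not a tested patch):

/* Writers take both locks; a reader takes whichever suits its context.
 * Any reader therefore excludes any writer: atomic readers via the
 * spinlock, sleeping readers via the mutex. */
struct shared_policy {
	struct rb_root	root;
	spinlock_t	lock;	/* atomic-context readers, e.g. sp_lookup() */
	struct mutex	mutex;	/* readers that may sleep */
};

static void sp_write_lock(struct shared_policy *sp)
{
	mutex_lock(&sp->mutex);
	spin_lock(&sp->lock);
}

static void sp_write_unlock(struct shared_policy *sp)
{
	spin_unlock(&sp->lock);
	mutex_unlock(&sp->mutex);
}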


[PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390

2012-10-25 Thread Peter Zijlstra
From: Gerald Schaefer gerald.schae...@de.ibm.com

This patch adds an implementation of pmd_pgprot() for s390,
in preparation to future THP changes.

Reported-by: Stephen Rothwell s...@canb.auug.org.au
Signed-off-by: Gerald Schaefer gerald.schae...@de.ibm.com
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ralf Baechle r...@linux-mips.org
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 arch/s390/include/asm/pgtable.h |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: tip/arch/s390/include/asm/pgtable.h
===
--- tip.orig/arch/s390/include/asm/pgtable.h
+++ tip/arch/s390/include/asm/pgtable.h
@@ -1240,6 +1240,19 @@ static inline void set_pmd_at(struct mm_
*pmdp = entry;
 }
 
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+   pgprot_t prot = PAGE_RW;
+
+   if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) {
+   if (pmd_val(pmd) & _SEGMENT_ENTRY_INV)
+   prot = PAGE_NONE;
+   else
+   prot = PAGE_RO;
+   }
+   return prot;
+}
+
 static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
 {
unsigned long pgprot_pmd = 0;




[PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally

2012-10-25 Thread Peter Zijlstra
This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: H. Peter Anvin h...@zytor.com
Cc: Mike Galbraith efa...@gmx.de
Cc: Rik van Riel r...@redhat.com
[ Next step: generate the kernel source from such formal descriptions and 
retire to a tropical island! ]
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 Documentation/scheduler/numa-problem.txt |  230 +++
 1 file changed, 230 insertions(+)
 create mode 100644 Documentation/scheduler/numa-problem.txt

Index: tip/Documentation/scheduler/numa-problem.txt
===
--- /dev/null
+++ tip/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].  
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, lets define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+   because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly 
+  restrict p/s above to be the working-set. (It also makes explicit the 
+  requirement for C0,M0 to change about a change in the working set.)
+
+  Doing this does have the nice property that it lets you use your frequency
+  measurement as a weak-ordering for the benefit a task would receive when
+  we can't fit everything.
+
+  e.g. task1 has working set 10mb, f=90%
+   task2 has working set 90mb, f=10%
+
+  Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+  from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+  C: t_i - {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+  T_k,l(C0, M0) <= T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+   on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+  usual systems given factors like Haswell's enormous 35mb l3
+  cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+  L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+  L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+  L_n,m(C0) <= L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+  Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+  the worst partition we should accept; but having it gives us a useful 
+  bound on how much we can reasonably adjust L_n/L_m at a Pareto point to 
+  favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+  min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+   particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+   traffic, the more complicated solution could pick another Pareto point using
+   an aggregate objective function such that we balance the loss of work
+   efficiency against the gain of running, we'd want to more or less suggest
+   there to be a fixed bound on the error from the Pareto line for any
+   such solution.
+
+References:
+
+  http://en.wikipedia.org/wiki/Mathematical_optimization
+  http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements
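
For readability, the two objectives above can be restated in LaTeX using the same definitions (a transcription of the ASCII formulas, not new content):

% Interconnect traffic between nodes k and l:
T_{k,l} = \sum_{i : n_i = k} \left( bp_{i,l} + bs_{i,l} \right)
        + \sum_{j : n_j = l} \left( bp_{j,k} + bs_{j,k} \right)

% Per-node load and the fairness measure:
L_n = \frac{1}{P_n} \sum_{i : c_i = n} w_i , \qquad
L_{n,m} = \left| L_n - L_m \right|

% The combined multi-objective problem:
\min_{C,M} \left[ \, L_{n,m}(C), \; T_{k,l}(C,M) \, \right]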

[PATCH 06/31] mm: Only flush the TLB when clearing an accessible pte

2012-10-25 Thread Peter Zijlstra
From: Rik van Riel r...@redhat.com

If ptep_clear_flush() is called to clear a page table entry that is
accessible anyway by the CPU, eg. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Andrew Morton a...@linux-foundation.org
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 mm/pgtable-generic.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: tip/mm/pgtable-generic.c
===
--- tip.orig/mm/pgtable-generic.c
+++ tip/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_st
 {
pte_t pte;
pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-   flush_tlb_page(vma, address);
+   if (pte_accessible(pte))
+   flush_tlb_page(vma, address);
return pte;
 }
 #endif




[PATCH 04/31] x86/mm: Introduce pte_accessible()

2012-10-25 Thread Peter Zijlstra
From: Rik van Riel r...@redhat.com

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page.  This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 arch/x86/include/asm/pgtable.h |6 ++++++
 include/asm-generic/pgtable.h  |4 ++++
 2 files changed, 10 insertions(+)

Index: tip/arch/x86/include/asm/pgtable.h
===
--- tip.orig/arch/x86/include/asm/pgtable.h
+++ tip/arch/x86/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define __HAVE_ARCH_PTE_ACCESSIBLE
+static inline int pte_accessible(pte_t a)
+{
+   return pte_flags(a) & _PAGE_PRESENT;
+}
+
 static inline int pte_hidden(pte_t pte)
 {
return pte_flags(pte) & _PAGE_HIDDEN;
Index: tip/include/asm-generic/pgtable.h
===
--- tip.orig/include/asm-generic/pgtable.h
+++ tip/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a,
 #define move_pte(pte, prot, old_addr, new_addr)    (pte)
 #endif
 
+#ifndef __HAVE_ARCH_PTE_ACCESSIBLE
+#define pte_accessible(pte)    pte_present(pte)
+#endif
+
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif



