Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-28 Thread Borislav Petkov
On Fri, Sep 28, 2012 at 05:50:20AM +0200, Mike Galbraith wrote:
> And wakeup preemption is still disabled as well, correct?

Yes it is by default anyway:

$ cat /mnt/dbg/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY 
NO_WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS OWNER_SPIN 
NONTASK_POWER TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN

NO_WAKEUP_PREEMPTION brings 9% improvement with pgbench, btw:

http://marc.info/?l=linux-kernel&m=134876312310048
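
In that listing, the NO_ prefix marks a disabled feature, so the state can be
decoded mechanically; a small illustrative Python sketch (not kernel code, and
the feature list is abbreviated):

```python
def parse_sched_features(listing):
    """Map each scheduler feature name to True (enabled) or False (disabled).

    Features printed with a "NO_" prefix in the debugfs sched_features file
    (mounted at /mnt/dbg above) are currently disabled; writing a name with
    or without the prefix back to the file toggles it.
    """
    features = {}
    for token in listing.split():
        if token.startswith("NO_"):
            features[token[len("NO_"):]] = False
        else:
            features[token] = True
    return features

listing = ("GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY "
           "CACHE_HOT_BUDDY NO_WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK")
print(parse_sched_features(listing)["WAKEUP_PREEMPTION"])  # False: disabled
```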

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-28 Thread Peter Zijlstra
On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote:
> I wonder about this comment, for example:
> 
>  * By using 'se' instead of 'curr' we penalize light tasks, so
>  * they get preempted easier. That is, if 'se' < 'curr' then
>  * the resulting gran will be larger, therefore penalizing the
>  * lighter, if otoh 'se' > 'curr' then the resulting gran will
>  * be smaller, again penalizing the lighter task.
> 
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

No, weight is nice: nicing a task doesn't make it want to run less.
So preempting light tasks sooner means they disturb the heavier ones
less, which is, I think, what you want with nice.
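
The mechanism under discussion can be sketched numerically; this is a
simplified Python model of how wakeup_gran() scales the granularity by the
*wakee's* load weight in the spirit of calc_delta_fair(), not the kernel's
actual code (constants and the preemption condition are simplified):

```python
NICE_0_LOAD = 1024  # load weight of a nice-0 task

def wakeup_gran(base_gran_ns, wakee_weight):
    """Scale the base granularity by NICE_0_LOAD relative to the wakee's
    weight ('se', not 'curr'): a lighter wakee sees a larger gran."""
    return base_gran_ns * NICE_0_LOAD / wakee_weight

def should_preempt(curr_vruntime, wakee_vruntime, base_gran_ns, wakee_weight):
    """Preempt curr only if the wakee leads by more than the scaled gran."""
    vdiff = curr_vruntime - wakee_vruntime
    return vdiff > wakeup_gran(base_gran_ns, wakee_weight)

# A light (heavily niced) wakee needs a much bigger vruntime lead to preempt,
# so it disturbs heavier tasks less -- the behaviour the comment describes.
base = 1_000_000  # 1ms base wakeup granularity, in ns
print(should_preempt(5_000_000, 3_000_000, base, NICE_0_LOAD))  # nice-0 wakee
print(should_preempt(5_000_000, 3_000_000, base, 110))          # light wakee
```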

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption? 

No, the vruntime comparison between cpus is dubious; it's not complete
nonsense, but it's not 'correct' either. PJT has patches to improve that
based on his per-entity tracking stuff.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote: 
> On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra  
> wrote:
> >
> > Don't forget to run the desktop interactivity benchmarks after you're
> > done wriggling with this knob... wakeup preemption is important for most
> > those.
> 
> So I don't think we want to *just* wiggle that knob per se. We
> definitely don't want to hurt latency on actual interactive tasks. But
> it's interesting that it helps psql so much, and that there seems to
> be some interaction with the select_idle_sibling().
> 
> So I do have a few things I react to when looking at that wakeup granularity..
> 
> I wonder about this comment, for example:
> 
>  * By using 'se' instead of 'curr' we penalize light tasks, so
>  * they get preempted easier. That is, if 'se' < 'curr' then
>  * the resulting gran will be larger, therefore penalizing the
>  * lighter, if otoh 'se' > 'curr' then the resulting gran will
>  * be smaller, again penalizing the lighter task.
> 
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

Ah, that particular 'light' refers to se->load.weight.

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption?

I think vruntime should be fine.  We take the delta between the
task's vruntime when it went to sleep and its previous rq's min_vruntime
to capture progress made while it slept, and apply that relative offset
in the task's new home, so a task can migrate and still have a chance to
preempt on wakeup.
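
That scheme can be sketched as follows; a simplified Python model of the
migrate-with-relative-vruntime idea (invented numbers, not kernel code):

```python
class RQ:
    """Minimal stand-in for a runqueue: just its min_vruntime watermark."""
    def __init__(self, min_vruntime):
        self.min_vruntime = min_vruntime

def dequeue_sleeping(task_vruntime, old_rq):
    """On leaving the old runqueue, keep the task's vruntime relative to
    that rq's min_vruntime, capturing how far ahead/behind it was."""
    return task_vruntime - old_rq.min_vruntime

def enqueue_migrated(relative_vruntime, new_rq):
    """On the new runqueue, re-apply the offset against the new rq's
    min_vruntime, so the sleep credit survives the migration."""
    return relative_vruntime + new_rq.min_vruntime

old_rq, new_rq = RQ(min_vruntime=10_000), RQ(min_vruntime=7_000)
rel = dequeue_sleeping(9_500, old_rq)   # task was 500 behind the old min
print(enqueue_migrated(rel, new_rq))    # still 500 behind the new min: 6500
```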

> The fact that *either* changing select_idle_sibling() *or* changing
> the wakeup preemption granularity seems to have such a huge impact
> does seem to tie them together somehow for this particular load. No?

The way I read it, Boris had wakeup preemption disabled.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 21:24 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > > >> Or could we just improve the heuristics. What happens if the
> > > >> scheduling granularity is increased, for example? It's set to 1ms
> > > >> right now, with a logarithmic scaling by number of cpus.
> > > >
> > > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > > > --
> > > > tps = 4994.730809 (including connections establishing)
> > > > tps = 5000.260764 (excluding connections establishing)
> > > >
> > > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> > > 
> > > Ok, so this gives us something possible to actually play with.
> > > 
> > > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
> > 
> > Don't forget to run the desktop interactivity benchmarks after you're
> > done wriggling with this knob... wakeup preemption is important for most
> > those.
> 
> Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
> wakeup_granularity go to 4ms:
> 
> sched_autogroup_enabled:1
> sched_child_runs_first:0
> sched_latency_ns:24000000
> sched_migration_cost_ns:500000
> sched_min_granularity_ns:3000000
> sched_nr_migrate:32
> sched_rt_period_us:1000000
> sched_rt_runtime_us:950000
> sched_shares_window_ns:10000000
> sched_time_avg_ms:1000
> sched_tunable_scaling:2
> sched_wakeup_granularity_ns:4000000
> 
> pgbench results look good:
> 
> tps = 4997.675331 (including connections establishing)
> tps = 5003.256870 (excluding connections establishing)
> 
> This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

And wakeup preemption is still disabled as well, correct?

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra  wrote:
>
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for most
> those.

So I don't think we want to *just* wiggle that knob per se. We
definitely don't want to hurt latency on actual interactive tasks. But
it's interesting that it helps psql so much, and that there seems to
be some interaction with the select_idle_sibling().

So I do have a few things I react to when looking at that wakeup granularity..

I wonder about this comment, for example:

 * By using 'se' instead of 'curr' we penalize light tasks, so
 * they get preempted easier. That is, if 'se' < 'curr' then
 * the resulting gran will be larger, therefore penalizing the
 * lighter, if otoh 'se' > 'curr' then the resulting gran will
 * be smaller, again penalizing the lighter task.

why would we want to preempt light tasks easier? It sounds backwards
to me. If they are light, we have *less* reason to preempt them, since
they are more likely to just go to sleep on their own, no?

Another question is whether the fact that this same load interacts
with select_idle_sibling() is perhaps a sign that maybe the preemption
logic is all fine, but it interacts badly with the "pick new cpu"
code. In particular, after having changed rq's, is the vruntime really
comparable? IOW, maybe this is an interaction between "place_entity()"
and then the immediately following (?) call to check wakeup
preemption?

The fact that *either* changing select_idle_sibling() *or* changing
the wakeup preemption granularity seems to have such a huge impact
does seem to tie them together somehow for this particular load. No?

 Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > >> Or could we just improve the heuristics. What happens if the
> > >> scheduling granularity is increased, for example? It's set to 1ms
> > >> right now, with a logarithmic scaling by number of cpus.
> > >
> > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > > --
> > > tps = 4994.730809 (including connections establishing)
> > > tps = 5000.260764 (excluding connections establishing)
> > >
> > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> > 
> > Ok, so this gives us something possible to actually play with.
> > 
> > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
> 
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for most
> those.

Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
wakeup_granularity go to 4ms:

sched_autogroup_enabled:1
sched_child_runs_first:0
sched_latency_ns:24000000
sched_migration_cost_ns:500000
sched_min_granularity_ns:3000000
sched_nr_migrate:32
sched_rt_period_us:1000000
sched_rt_runtime_us:950000
sched_shares_window_ns:10000000
sched_time_avg_ms:1000
sched_tunable_scaling:2
sched_wakeup_granularity_ns:4000000

pgbench results look good:

tps = 4997.675331 (including connections establishing)
tps = 5003.256870 (excluding connections establishing)

This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 11:19 -0700, Linus Torvalds wrote:
> On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov  wrote:
> > On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> >> Or could we just improve the heuristics. What happens if the
> >> scheduling granularity is increased, for example? It's set to 1ms
> >> right now, with a logarithmic scaling by number of cpus.
> >
> > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > --
> > tps = 4994.730809 (including connections establishing)
> > tps = 5000.260764 (excluding connections establishing)
> >
> > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> 
> Ok, so this gives us something possible to actually play with.
> 
> For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

Don't forget to run the desktop interactivity benchmarks after you're
done wriggling with this knob... wakeup preemption is important for most
those.




Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 10:45:06AM -0700, da...@lang.hm wrote:
> On Thu, 27 Sep 2012, Peter Zijlstra wrote:
> 
> >On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> >>I think you are being too smart for your own good. You don't know if it's
> >>best to move them further apart or not.
> >
> >Well yes and no.. You're right, however in general the load-balancer has
> >always tried to not use (SMT) siblings whenever possible, in that regard
> >not using an idle sibling is consistent here.
> >
> >Also, for short running tasks the wakeup balancing is typically all we
> >have, the 'big' periodic load-balancer will 'never' see them, making the
> >multiple moves argument hard.
> 
> For the initial startup of a new process, finding as idle and remote
> a core to start on (minimum sharing with existing processes) is
> probably the smart thing to do.

Right,

but we don't schedule to the SMT siblings, as Peter says above. So we
can't get to the case where two SMT siblings are not overloaded and the
processes remain on the same L2.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov  wrote:
> On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
>> Or could we just improve the heuristics. What happens if the
>> scheduling granularity is increased, for example? It's set to 1ms
>> right now, with a logarithmic scaling by number of cpus.
>
> /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> --
> tps = 4994.730809 (including connections establishing)
> tps = 5000.260764 (excluding connections establishing)
>
> A bit better over the default NO_WAKEUP_PREEMPTION setting.

Ok, so this gives us something possible to actually play with.

For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

(Btw, "linear" right now looks like 1:1. That's linear, but it's a
very aggressive linearity. Something like "factor = (cpus+1)/2" would
also be linear, but by a less extreme factor.)
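
For concreteness, the scaling modes compare like this; a Python sketch of the
factor computation (modeled from memory on the SCHED_TUNABLESCALING_* handling,
plus the gentler linear variant suggested above; treat the exact kernel
formulas as an assumption):

```python
def ilog2(n):
    return n.bit_length() - 1  # floor(log2(n)) for n >= 1

def scaling_factor(mode, cpus):
    """Sketch of the sysctl scaling factor: base tunables such as
    sched_wakeup_granularity_ns are multiplied by this."""
    if mode == "none":
        return 1
    if mode == "log":
        return 1 + ilog2(cpus)
    if mode == "linear":
        return cpus                 # the aggressive 1:1 scaling
    if mode == "gentle-linear":     # the hypothetical (cpus+1)/2 variant
        return (cpus + 1) // 2
    raise ValueError(mode)

base_gran_ms = 1  # 1ms base wakeup granularity
for cpus in (4, 8, 64):
    print(cpus, {m: base_gran_ms * scaling_factor(m, cpus)
                 for m in ("log", "linear", "gentle-linear")})
```

Note how fast "linear" diverges from "log" as the cpu count grows, which is
why a 1:1 factor reads as aggressive.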

  Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 10:45 AM,   wrote:
>
> For the initial startup of a new process, finding as idle and remote a core
> to start on (minimum sharing with existing processes) is probably the smart
> thing to do.

Actually, no.

It's *exec* that should go remote. New processes (fork, vfork or
clone) absolutely should *not* go remote at all.

vfork() should stay on the same CPU (synchronous wakeup), fork()
should possibly go SMT (likely exec in the near future will spread it
out), and clone should likely just stay close too.

   Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 10:45 -0700, da...@lang.hm wrote:
> But I thought that this conversation (pgbench) was dealing with long 
> running processes, 

Ah, I think we've got a confusion on long vs short: yes, pgbench is a
long-running process, however the tasks might not be long in runnable
state. I.e. it receives a request, computes a bit, blocks on IO, computes
a bit, replies, goes idle to wait for a new request.

If all those runnable sections are short enough, it will 'never' be
around when the periodic load-balancer does its thing, since that only
looks at the tasks in runnable state at that moment in time.

I say 'never' because while it will occasionally show up due to pure
chance, it will unlikely be a very big player in placement.
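
That 'never' can be made concrete with a toy model (numbers invented for
illustration, and the independence assumption is a deliberate
simplification): a task runnable only a small fraction of the time is
simply unlikely to be on the runqueue at the instant a periodic balancer
samples it.

```python
def p_seen_at_least_once(runnable_fraction, samples):
    """Probability a periodic sampler observes the task runnable at least
    once in `samples` ticks, treating each tick as an independent
    Bernoulli trial -- crude, since real ticks correlate with workload."""
    return 1 - (1 - runnable_fraction) ** samples

# e.g. ~100us of work per 10ms request cycle -> runnable ~1% of the time
print(p_seen_at_least_once(0.01, 1))    # one balance tick: 1% chance
print(p_seen_at_least_once(0.01, 100))  # even 100 ticks miss it ~37% of the time
```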

Once a cpu is overloaded enough to get real queueing, they'll show up,
get dispersed, and then it's back to wakeup stuff.

Then again, it might be completely irrelevant to pgbench; it's been a
while since I looked at how it schedules.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> Or could we just improve the heuristics. What happens if the
> scheduling granularity is increased, for example? It's set to 1ms
> right now, with a logarithmic scaling by number of cpus.

/proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
--
tps = 4994.730809 (including connections establishing)
tps = 5000.260764 (excluding connections establishing)

A bit better over the default NO_WAKEUP_PREEMPTION setting.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> > I think you are being too smart for your own good. You don't know if it's
> > best to move them further apart or not.
>
> Well yes and no.. You're right, however in general the load-balancer has
> always tried to not use (SMT) siblings whenever possible, in that regard
> not using an idle sibling is consistent here.
>
> Also, for short running tasks the wakeup balancing is typically all we
> have, the 'big' periodic load-balancer will 'never' see them, making the
> multiple moves argument hard.

For the initial startup of a new process, finding as idle and remote a core
to start on (minimum sharing with existing processes) is probably the
smart thing to do.

But I thought that this conversation (pgbench) was dealing with long
running processes, and how to deal with the overload where one master
process is kicking off many child processes and the core that the master
process starts off on gets overloaded as a result, with the question being
how to spread the load out from this one core as it gets overloaded.


David Lang






Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 12:10 AM, Ingo Molnar  wrote:
>
> Just in case someone prefers patches to user-space approaches (I
> certainly do!), here's one that turns off wakeup driven
> preemption by default.

Ok, so apparently this fixes performance in a big way, and might allow
us to simplify select_idle_sibling(), which is clearly way too random.

That is, if we could make it automatic, some way. Not the "let the
user tune it" - that's just fundamentally broken.

What is the common pattern for the wakeups for psql?

Can we detect this somehow? Are they sync? It looks wrong to preempt
for sync wakeups, for example, but we seem to do that.

Or could we just improve the heuristics. What happens if the
scheduling granularity is increased, for example? It's set to 1ms
right now, with a logarithmic scaling by number of cpus.

Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> I think you are being too smart for your own good. You don't know if it's
> best to move them further apart or not.

Well yes and no.. You're right, however in general the load-balancer has
always tried to not use (SMT) siblings whenever possible, in that regard
not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we
have, the 'big' periodic load-balancer will 'never' see them, making the
multiple moves argument hard.

Measuring resource contention on the various levels is a fun research
subject, I've spoken to various people who are/were doing so, I've
always encouraged them to send their code just so we can see/learn, even
if not integrate, sadly I can't remember ever having seen any of it :/

And yeah, all the load-balancing stuff is very near to scrying or
tealeaf reading. We can't know all current state (too expensive) nor can
we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Thu, 27 Sep 2012, Borislav Petkov wrote:


> On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to
> > overload the L2 is an impossible task, so just assume that it will
> > all fit, and the worst case is you have one balancing cycle where
> > you can't do as much work and then the normal balancing will kick in
> > and move something anyway.
>
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2. But what do I do in the cases
> where the two tasks don't overload the L2 and it is actually beneficial
> to keep them there? How does the load balancer know that?


no, I'm saying that you should assume that the two tasks won't overload 
the L2, try it, and if they do overload the L2, move one of the tasks 
again the next balancing cycle.


there is a lot of possible sharing going on between 'cores'

shared everything (a single core)
different registers, shared everything else (HT core)
shared floating point, shared cache, different everything else
shared L2/L3/Memory, different everything else
shared L3/Memory, different everything else
shared Memory, different everything else
different everything

and just wait a couple of years and someone will add a new entry to this 
list (if I haven't already missed a few :-)


the more that is shared, the cheaper it is to move the process (the less 
cached state you throw away), so ideally you want to move the process as 
little as possible, just enough to relieve whatever the contended 
resource is. But since you really don't know the footprint of each process 
in each of these layers, all you can measure is what percentage of the 
total core time the process used, so just move it a little and see if that 
was enough.
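
That "move as little as possible" policy can be sketched as a hypothetical
helper. Everything here (the CPU names, the LEVELS list, the distance
table) is invented for illustration; it is not kernel code:

```python
# Sharing levels from most shared (cheapest move) to least (priciest),
# mirroring the hierarchy listed above.
LEVELS = ["smt", "l2", "l3", "memory", "remote"]

def closest_idle(src_cpu, idle_cpus, distance):
    """Pick the idle CPU whose sharing distance from src_cpu is smallest,
    i.e. the shortest (cheapest) move that still escapes the overload."""
    candidates = [(LEVELS.index(distance[(src_cpu, c)]), c) for c in idle_cpus]
    return min(candidates)[1] if candidates else None

distance = {("cpu0", "cpu1"): "smt",
            ("cpu0", "cpu2"): "l2",
            ("cpu0", "cpu8"): "remote"}
print(closest_idle("cpu0", {"cpu2", "cpu8"}, distance))  # cpu2: shares L2
```

If the contention persists, the next balancing cycle simply repeats the
same shortest-move choice from the new position, which is exactly the
incremental behaviour being argued for.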


David Lang


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
> > For example, it starts with the maximum target scheduling domain, and
> > works its way in over the scheduling groups within that domain. What
> > the f*ck is the logic of that kind of crazy thing? It never makes
> > sense to look at the biggest domain first.
>
> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
> only has the LLC domain, if you want more we'll need to fix that. For
> now it's a fixed:
>
>   SMT
>   MC (llc)
>   CPU (package/machine-for-!numa)
>   NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.


I think you are being too smart for your own good. You don't know if it's 
best to move them further apart or not. I'm arguing that you can't know.


so I'm saying do the simple thing.

if a core is overloaded, move to an idle core that is as close as possible 
to the core you start from (as much shared as possible).


if this does not overload the shared resource, you did the right thing.

if this does overload the shared resource, it's still no worse than 
leaving it on the original core (which was shared everything, so you've 
reduced the sharing a little bit)


the next balancing cycle you then work to move something again, and since 
both the original and new core show as overloaded (due to the contention 
on the shared resources), you move something to another core that shares 
just a little less.


Yes, this means that it may take more balancing cycles to move things far 
enough apart to reduce the sharing enough to avoid overload of the shared 
resource, but I don't see any way that you can possibly guess if two 
processes are going to overload the shared resource ahead of time.


It may be that simply moving to a HT core (and no longer contending for 
registers) is enough to let both processes fly, or it may be that the 
overload is in a shared floating point unit or L1 cache and you need to 
move further away, or you may find the contention is in the L2 cache and 
move further away, or it could be in the L3 cache, or it could be in the 
memory interface (NUMA)


Without being able to predict the future, you don't know how far away you 
need to move the tasks to have them operate at the optimal level. All that 
you do know is that the shorter the move, the less expensive the move. So 
make each move be as short as possible, and measure again to see if that 
was enough.


For some workloads, it will be. For many workloads the least expensive 
move won't be.


The question is whether doing multiple cheap moves (requiring simple 
checking for each move) ends up being a win compared to doing better 
guessing about when the more expensive moves are worth it.


Given how chips change from year to year, I don't see how the 'better 
guessing' is going to survive more than a couple of chip releases in any 
case.


David Lang


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 09:10:11AM +0200, Ingo Molnar wrote:
> The theory would be that this patch fixes psql performance, with CPU
> selection being a measurable but second order of magnitude effect. How
> well does practice match theory in this case?

Yeah, it looks a bit better than default linux. A whopping 9% perf delta
:-).

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance 
governor
==

plain
-
tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

kill select_idle_sibling

tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)

NO_WAKEUP_PREEMPTION

tps = 4991.206742 (including connections establishing)
tps = 4996.743622 (excluding connections establishing)

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 12:20 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to
> > overload the L2 is an impossible task, so just assume that it will
> > all fit, and the worst case is you have one balancing cycle where
> > you can't do as much work and then the normal balancing will kick in
> > and move something anyway.
> 
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2. But what do I do in the cases
> where the two tasks don't overload the L2 and it is actually beneficial
> to keep them there? How does the load balancer know that?

It doesn't, but it has task_hot().  A preempted buddy may be pulled, but
the next wakeup will try to bring buddies back together.

-Mike 




Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> It seems to me that trying to figure out if you are going to
> overload the L2 is an impossible task, so just assume that it will
> all fit, and the worst case is you have one balancing cycle where
> you can't do as much work and then the normal balancing will kick in
> and move something anyway.

Right, and this implies that when the load balancer runs, it will
definitely move the task away from the L2. But what do I do in the cases
where the two tasks don't overload the L2 and it is actually beneficial
to keep them there? How does the load balancer know that?

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
> 
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. 

That's about SMT: it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain; if you want more, we'll need to fix that. For
now it's a fixed:

 SMT
 MC (llc)
 CPU (package/machine-for-!numa)
 NUMA

So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 00:17 -0700, da...@lang.hm wrote:

> over the long term, the work lost due to not moving optimally right away 
> is probably much less than the work lost due to trying to figure out the 
> perfect thing to do.

Yeah, "Perfect is the enemy of good" definitely applies.  Once you're
ramped, less is more.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Wed, 26 Sep 2012, Borislav Petkov wrote:


>> It always selected target_cpu, but the fact is, that doesn't really
>> sound very sane. The target cpu is either the previous cpu or the
>> current cpu, depending on whether they should be balanced or not. But
>> that still doesn't make any *sense*.
>>
>> In fact, the whole select_idle_sibling() logic makes no sense
>> what-so-ever to me. It seems to be total garbage.
>>
>> For example, it starts with the maximum target scheduling domain, and
>> works its way in over the scheduling groups within that domain. What
>> the f*ck is the logic of that kind of crazy thing? It never makes
>> sense to look at a biggest domain first. If you want to be close to
>> something, you want to look at the *smallest* domain first. But
>> because it looks at things in the wrong order, it then needs to have
>> that inner loop saying "does this group actually cover the cpu I am
>> interested in?"
>>
>> Please tell me I am mis-reading this?
>
> First of all, I'm so *not* a scheduler guy so take this with a great
> pinch of salt.
>
> The way I understand it is, you either want to share L2 with a process,
> because, for example, both working sets fit in the L2 and/or there's
> some sharing which saves you moving everything over the L3. This is
> where selecting a core on the same L2 is actually a good thing.
>
> Or, they're too big to fit into the L2 and they start kicking each-other
> out. Then you want to spread them out to different L2s - i.e., different
> HT groups in Intel-speak.


an observation from an outsider here.

if you do overload an L2 cache, then the core will be busy all the time and 
you will end up migrating a task away from that core.


It seems to me that trying to figure out if you are going to overload the 
L2 is an impossible task, so just assume that it will all fit, and the 
worst case is you have one balancing cycle where you can't do as much work 
and then the normal balancing will kick in and move something anyway.


over the long term, the work lost due to not moving optimally right away 
is probably much less than the work lost due to trying to figure out the 
perfect thing to do.


and since the perfect thing to do is going to be both workload and chip 
specific, trying to model that in your decision making is a lost cause.


David Lang


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Ingo Molnar

* Mike Galbraith  wrote:

> > Do you have an easy-to-apply hack patch by chance that has 
> > the effect of turning off all such preemption, which people 
> > could try?
> 
> They don't need any hacks, all they have to do is start 
> postgreqsl SCHED_BATCH, then run pgbench the same way.
> 
> I use schedctl, but in chrt speak, chrt -b 0 
> /etc/init.d/postgresql start, and then the same for pgbench 
> itself.

Just in case someone prefers patches to user-space approaches (I 
certainly do!), here's one that turns off wakeup driven 
preemption by default.

It can be turned back on via:

  echo WAKEUP_PREEMPTION > /debug/sched_features

and off again via:

  echo NO_WAKEUP_PREEMPTION > /debug/sched_features 

(the patch is completely untested and such.)

The theory would be that this patch fixes psql performance, with 
CPU selection being a measurable but second order of magnitude 
effect. How well does practice match theory in this case?

Thanks,

Ingo

-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..f936552 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2907,7 +2907,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
 	 */
-	if (unlikely(p->policy != SCHED_NORMAL))
+	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
 		return;
 
 	find_matching_se(&se, &pse);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..e68e69a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,11 @@ SCHED_FEAT(LAST_BUDDY, true)
 SCHED_FEAT(CACHE_HOT_BUDDY, true)
 
 /*
+ * Allow wakeup-time preemption of the current task:
+ */
+SCHED_FEAT(WAKEUP_PREEMPTION, false)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, true)


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 08:41 +0200, Ingo Molnar wrote: 
> * Mike Galbraith  wrote:
> 
> > > Just to confirm, if you turn off all preemption via a hack 
> > > (basically if you turn SCHED_OTHER into SCHED_BATCH), does 
> > > psql perform and scale much better, with the quality of 
> > > sibling selection and spreading of processes only being a 
> > > secondary effect?
> > 
> > That has always been the case here.  Preemption dominates.
> 
> Yes, so we get the best psql performance if we allow the central 
> proxy process to dominate a single CPU (IIRC it can easily go up 
> to 100% CPU utilization on that CPU - it is what determines max 
> psql throughput), and not let any worker run there much, right?

Running the thing RT didn't cut it iirc (will try that again).  For RT,
we won't look for an empty spot on wakeup, we'll just squash an ant.

> > Others should play with it too, and let their boxen speak.
> 
> Do you have an easy-to-apply hack patch by chance that has the 
> effect of turning off all such preemption, which people could 
> try?

They don't need any hacks, all they have to do is start postgresql
SCHED_BATCH, then run pgbench the same way.

I use schedctl, but in chrt speak, chrt -b 0 /etc/init.d/postgresql
start, and then the same for pgbench itself.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Ingo Molnar

* Mike Galbraith  wrote:

> > Just to confirm, if you turn off all preemption via a hack 
> > (basically if you turn SCHED_OTHER into SCHED_BATCH), does 
> > psql perform and scale much better, with the quality of 
> > sibling selection and spreading of processes only being a 
> > secondary effect?
> 
> That has always been the case here.  Preemption dominates.

Yes, so we get the best psql performance if we allow the central 
proxy process to dominate a single CPU (IIRC it can easily go up 
to 100% CPU utilization on that CPU - it is what determines max 
psql throughput), and not let any worker run there much, right?

> Others should play with it too, and let their boxen speak.

Do you have an easy-to-apply hack patch by chance that has the 
effect of turning off all such preemption, which people could 
try?

Thanks,

Ingo


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 07:47 +0200, Ingo Molnar wrote: 
> * Mike Galbraith  wrote:
> 
> > I think the pgbench problem is more about latency for the 1 in 
> > 1:N than spinlocks.
> 
> So my understanding of the psql workload is that basically we've 
> got a central psql proxy process that is distributing work to 
> worker psql processes. If a freshly woken worker process ever 
> preempts the central proxy process then it is preventing a lot 
> of new work from getting distributed.
> 
> Correct?

Yeah, that's my understanding of the thing, and I played with it quite a
bit in the past (only refreshed memories briefly in present).

> So the central proxy psql process is 'much more important' to 
> run than any of the worker processes - an importance that is not 
> (currently) visible from the behavioral statistics the scheduler 
> keeps on tasks.

Yeah.  We had the adaptive waker thing, but it stopped being a winner at
the one load it originally did help quite a lot, and it didn't help
pgbench all that much in its then form anyway iirc.

> So the scheduler has the following problem here: a new wakee 
> might be starved enough and the proxy might have run long enough 
> to really justify the preemption here and now. The buddy 
> statistics help avoid some of these cases - but not all and the 
> difference is measurable.
> 
> Yet the 'best' way for psql to run is for this proxy process to 
> never be preempted. Your SCHED_BATCH experiments confirmed that.

Yes.

> The way remote CPU selection affects it is that if we ever get 
> more aggressive in selecting a remote CPU then we, as a side 
> effect, also reduce the chance of harmful preemption of the 
> central proxy psql process.

Right.

> So in that sense sibling selection is somewhat of an indirect 
> red herring: it really only helps psql indirectly by preventing 
> the harmful preemption. It also, somewhat paradoxically argues 
> for suboptimal code: for example tearing apart buddies is 
> beneficial in the psql workload, because it also allows the more 
> important part of the buddy to run more (the proxy).

Yes, I believe preemption dominates, but it's not alone, you can see
that in the numbers.

> In that sense the *real* problem isn't even parallelism (although 
> we obviously should improve the decisions there - and the logic 
> has suffered in the past from the psql dilemma outlined above), 
> but whether the scheduler can (and should) identify the central 
> proxy and keep it running as much as possible, deprioritizing 
> fairness, wakeup buddies, runtime overlap and cache affinity 
> considerations.
> 
> There's two broad solutions that I can see:
> 
>  - Add a kernel solution to somehow identify 'central' processes
>and bias them. Xorg is a similar kind of process, so it would
>help other workloads as well. That way lie dragons, but might
>be worth an attempt or two. We already try to do a couple of
>robust metrics, like overlap statistics to identify buddies.

What we do now works well for X and friends I think, because there
aren't so many buddies.  It might work better though, and for the same
reasons.  I've in fact [re]invented a SCHED_SERVER class a few times,
but never one that survived my own scrutiny for long.

Arrr, here there be dragons is true ;-)

> - Let user-space occasionally identify its important (and less
>important) tasks - say psql could mark it worker processes as
>SCHED_BATCH and keep its central process(es) higher prio. A
>single line of obvious code in 100 KLOCs of user-space code.
> 
> Just to confirm, if you turn off all preemption via a hack 
> (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
> perform and scale much better, with the quality of sibling 
> selection and spreading of processes only being a secondary 
> effect?

That has always been the case here.  Preemption dominates.  Others
should play with it too, and let their boxen speak.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 07:47 +0200, Ingo Molnar wrote: 
 * Mike Galbraith efa...@gmx.de wrote:
 
  I think the pgbench problem is more about latency for the 1 in 
  1:N than spinlocks.
 
 So my understanding of the psql workload is that basically we've 
 got a central psql proxy process that is distributing work to 
 worker psql processes. If a freshly woken worker process ever 
 preempts the central proxy process then it is preventing a lot 
 of new work from getting distributed.
 
 Correct?

Yeah, that's my understanding of the thing, and I played with it quite a
bit in the past (only refreshed memories briefly in present).

 So the central proxy psql process is 'much more important' to 
 run than any of the worker processes - an importance that is not 
 (currently) visible from the behavioral statistics the scheduler 
 keeps on tasks.

Yeah.  We had the adaptive waker thing, but it stopped being a winner at
the one load it originally did help quite a lot, and it didn't help
pgbench all that much in it's then form anyway iirc.

 So the scheduler has the following problem here: a new wakee 
 might be starved enough and the proxy might have run long enough 
 to really justify the preemption here and now. The buddy 
 statistics help avoid some of these cases - but not all and the 
 difference is measurable.
 
 Yet the 'best' way for psql to run is for this proxy process to 
 never be preempted. Your SCHED_BATCH experiments confirmed that.

Yes.

 The way remote CPU selection affects it is that if we ever get 
 more aggressive in selecting a remote CPU then we, as a side 
 effect, also reduce the chance of harmful preemption of the 
 central proxy psql process.

Right.

 So in that sense sibling selection is somewhat of an indirect 
 red herring: it really only helps psql indirectly by preventing 
 the harmful preemption. It also, somewhat paradoxially argues 
 for suboptimal code: for example tearing apart buddies is 
 beneficial in the psql workload, because it also allows the more 
 important part of the buddy to run more (the proxy).

Yes, I believe preemption dominates, but it's not alone, you can see
that in the numbers.

 In that sense the *real* problem isnt even parallelism (although 
 we obviously should improve the decisions there - and the logic 
 has suffered in the past from the psql dilemma outlined above), 
 but whether the scheduler can (and should) identify the central 
 proxy and keep it running as much as possible, deprioritizing 
 fairness, wakeup buddies, runtime overlap and cache affinity 
 considerations.
 
 There's two broad solutions that I can see:
 
  - Add a kernel solution to somehow identify 'central' processes
and bias them. Xorg is a similar kind of process, so it would
help other workloads as well. That way lie dragons, but might
be worth an attempt or two. We already try to do a couple of
robust metrics, like overlap statistics to identify buddies.

What we do now works well for X and friends I think, because there
aren't so many buddies  It might work better though, and for the same
reasons.  I've in fact [re]invented a SCHED_SERVER class a few times,
but never one that survived my own scrutiny for long.

Arrr, here there be dragons is true ;-)

 - Let user-space occasionally identify its important (and less
important) tasks - say psql could mark it worker processes as
SCHED_BATCH and keep its central process(es) higher prio. A
single line of obvious code in 100 KLOCs of user-space code.
 
 Just to confirm, if you turn off all preemption via a hack 
 (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
 perform and scale much better, with the quality of sibling 
 selection and spreading of processes only being a secondary 
 effect?

That has always been the case here.  Preemption dominates.  Others
should play with it too, and let their boxen speak.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Ingo Molnar

* Mike Galbraith efa...@gmx.de wrote:

  Just to confirm, if you turn off all preemption via a hack 
  (basically if you turn SCHED_OTHER into SCHED_BATCH), does 
  psql perform and scale much better, with the quality of 
  sibling selection and spreading of processes only being a 
  secondary effect?
 
 That has always been the case here.  Preemption dominates.

Yes, so we get the best psql performance if we allow the central 
proxy process to dominate a single CPU (IIRC it can easily go up 
to 100% CPU utilization on that CPU - it is what determines max 
psql throughput), and not let any worker run there much, right?

 Others should play with it too, and let their boxen speak.

Do you have an easy-to-apply hack patch by chance that has the 
effect of turning off all such preemption, which people could 
try?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 08:41 +0200, Ingo Molnar wrote: 
 * Mike Galbraith efa...@gmx.de wrote:
 
   Just to confirm, if you turn off all preemption via a hack 
   (basically if you turn SCHED_OTHER into SCHED_BATCH), does 
   psql perform and scale much better, with the quality of 
   sibling selection and spreading of processes only being a 
   secondary effect?
  
  That has always been the case here.  Preemption dominates.
 
 Yes, so we get the best psql performance if we allow the central 
 proxy process to dominate a single CPU (IIRC it can easily go up 
 to 100% CPU utilization on that CPU - it is what determines max 
 psql throughput), and not let any worker run there much, right?

Running the thing RT didn't cut it iirc (will try that again).  For RT,
we won't look for an empty spot on wakeup, we'll just squash an ant.

  Others should play with it too, and let their boxen speak.
 
 Do you have an easy-to-apply hack patch by chance that has the 
 effect of turning off all such preemption, which people could 
 try?

They don't need any hacks, all they have to do is start postgreqsl
SCHED_BATCH, then run pgbench the same way.

I use schedctl, but in chrt speak, chrt -b 0 /etc/init.d/postgresql
start, and then the same for pgbench itself.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Ingo Molnar

* Mike Galbraith efa...@gmx.de wrote:

  Do you have an easy-to-apply hack patch by chance that has 
  the effect of turning off all such preemption, which people 
  could try?
 
 They don't need any hacks, all they have to do is start 
 postgreqsl SCHED_BATCH, then run pgbench the same way.
 
 I use schedctl, but in chrt speak, chrt -b 0 
 /etc/init.d/postgresql start, and then the same for pgbench 
 itself.

Just in case someone prefers patches to user-space approaches (I 
certainly do!), here's one that turns off wakeup driven 
preemption by default.

It can be turned back on via:

  echo WAKEUP_PREEMPTION  /debug/sched_features

and off again via:

  echo NO_WAKEUP_PREEMPTION  /debug/sched_features 

(the patch is completely untested and such.)

The theory would be that this patch fixes psql performance, with 
CPU selection being a measurable but second order of magnitude 
effect. How well does practice match theory in this case?

Thanks,

Ingo

-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..f936552 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2907,7 +2907,7 @@ static void check_preempt_wakeup(struct rq *rq, struct 
task_struct *p, int wake_
 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 * is driven by the tick):
 */
-   if (unlikely(p-policy != SCHED_NORMAL))
+   if (unlikely(p-policy != SCHED_NORMAL) || 
!sched_feat(WAKEUP_PREEMPTION))
return;
 
find_matching_se(se, pse);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..e68e69a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,11 @@ SCHED_FEAT(LAST_BUDDY, true)
 SCHED_FEAT(CACHE_HOT_BUDDY, true)
 
 /*
+ * Allow wakeup-time preemption of the current task:
+ */
+SCHED_FEAT(WAKEUP_PREEMPTION, false)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, true)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Wed, 26 Sep 2012, Borislav Petkov wrote:


It always selected target_cpu, but the fact is, that doesn't really
sound very sane. The target cpu is either the previous cpu or the
current cpu, depending on whether they should be balanced or not. But
that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying does this group actually cover the cpu I am
interested in?

Please tell me I am mis-reading this?


First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a process,
because, for example, both working sets fit in the L2 and/or there's
some sharing which saves you moving everything over the L3. This is
where selecting a core on the same L2 is actually a good thing.

Or, they're too big to fit into the L2 and they start kicking each-other
out. Then you want to spread them out to different L2s - i.e., different
HT groups in Intel-speak.


an observation from an outsider here.

if you do overload a L2 cache, then the core will be busy all the time and 
you will end up migrating a task away from that core.


It seems to me that trying to figure out if you are going to overload the 
L2 is an impossible task, so just assume that it will all fit, and the 
worst case is you have one balancing cycle where you can't do as much work 
and then the normal balancing will kick in and move something anyway.


over the long term, the work lost due to not moving optimally right away 
is probably much less than the work lost due to trying to figure out the 
perfect thing to do.


and since the perfect thing to do is going to be both workload and chip 
specific, trying to model that in your decision making is a lost cause.


David Lang
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 00:17 -0700, da...@lang.hm wrote:

 over the long term, the work lost due to not moving optimally right away 
 is probably much less than the work lost due to trying to figure out the 
 perfect thing to do.

Yeah, Perfect is the enemy of good definitely applies.  Once you're
ramped, less is more.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
 
 For example, it starts with the maximum target scheduling domain, and
 works its way in over the scheduling groups within that domain. What
 the f*ck is the logic of that kind of crazy thing? It never makes
 sense to look at a biggest domain first. 

That's about SMT, it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain, if you want more we'll need to fix that. For
now its a fixed:

 SMT
 MC (llc)
 CPU (package/machine-for-!numa)
 NUMA

So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
 It seems to me that trying to figure out if you are going to
 overload the L2 is an impossible task, so just assume that it will
 all fit, and the worst case is you have one balancing cycle where
 you can't do as much work and then the normal balancing will kick in
 and move something anyway.

Right, and this implies that when the load balancer runs, it will
definitely move the task away from the L2. But what do I do in the cases
where the two tasks don't overload the L2 and it is actually beneficial
to keep them there? How does the load balancer know that?

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 12:20 +0200, Borislav Petkov wrote: 
 On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
  It seems to me that trying to figure out if you are going to
  overload the L2 is an impossible task, so just assume that it will
  all fit, and the worst case is you have one balancing cycle where
  you can't do as much work and then the normal balancing will kick in
  and move something anyway.
 
 Right, and this implies that when the load balancer runs, it will
 definitely move the task away from the L2. But what do I do in the cases
 where the two tasks don't overload the L2 and it is actually beneficial
 to keep them there? How does the load balancer know that?

It doesn't, but it has task_hot().  A preempted buddy may be pulled, but
the next wakeup will try to bring buddies back together.

-Mike 




Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 09:10:11AM +0200, Ingo Molnar wrote:
 The theory would be that this patch fixes psql performance, with CPU
 selection being a measurable but second order of magnitude effect. How
 well does practice match theory in this case?

Yeah, it looks a bit better than default linux. A whopping 9% perf delta
:-).

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance 
governor
==

plain
-
tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

kill select_idle_sibling

tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)

NO_WAKEUP_PREEMPTION

tps = 4991.206742 (including connections establishing)
tps = 4996.743622 (excluding connections establishing)

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Thu, 27 Sep 2012, Peter Zijlstra wrote:


On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:


For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first.


That's about SMT, it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain, if you want more we'll need to fix that. For
now it's a fixed:

SMT
MC (llc)
CPU (package/machine-for-!numa)
NUMA

So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.


I think you are being too smart for your own good. You don't know if it's 
best to move them further apart or not. I'm arguing that you can't know.


so I'm saying do the simple thing.

if a core is overloaded, move to an idle core that is as close as possible 
to the core you start from (as much shared as possible).


if this does not overload the shared resource, you did the right thing.

if this does overload the shared resource, it's still no worse than 
leaving it on the original core (which was shared everything, so you've 
reduced the sharing a little bit)


the next balancing cycle you then work to move something again, and since 
both the original and new core show as overloaded (due to the contention 
on the shared resources), you move something to another core that shares 
just a little less.


Yes, this means that it may take more balancing cycles to move things far 
enough apart to reduce the sharing enough to avoid overload of the shared 
resource, but I don't see any way that you can possibly guess if two 
processes are going to overload the shared resource ahead of time.


It may be that simply moving to a HT core (and no longer contending for 
registers) is enough to let both processes fly, or it may be that the 
overload is in a shared floating point unit or L1 cache and you need to 
move further away, or you may find the contention is in the L2 cache and 
move further away, or it could be in the L3 cache, or it could be in the 
memory interface (NUMA)


Without being able to predict the future, you don't know how far away you 
need to move the tasks to have them operate at the optimal level. All that 
you do know is that the shorter the move, the less expensive the move. So 
make each move be as short as possible, and measure again to see if that 
was enough.


For some workloads, it will be. For many workloads the least expensive 
move won't be.


The question is if doing multiple, cheap moves (requiring simple checking 
for each move) ends up being a win compared to doing better guessing about 
when the more expensive moves are worth it.


Given how chips change from year to year, I don't see how the 'better 
guessing' is going to survive more than a couple of chip releases in any 
case.
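David's move-a-little-and-re-measure loop can be sketched as a toy model; the sharing levels and per-level capacity numbers below are invented purely for illustration:

```c
#include <assert.h>

/* Toy model of "move as little as possible, then re-measure".
 * capacity[] is a made-up limit on combined load before the shared
 * resource at that sharing distance saturates. */
enum { SHARE_ALL, SHARE_HT, SHARE_L2, SHARE_L3, SHARE_MEM, NR_DIST };

static int overloaded(const int capacity[NR_DIST], int dist, int load)
{
    return load > capacity[dist];
}

/* One cheap move per balancing cycle: step one sharing level out and
 * re-check, instead of guessing the right distance up front. */
static int settle_distance(const int capacity[NR_DIST], int load)
{
    int dist = SHARE_ALL;
    while (dist < NR_DIST - 1 && overloaded(capacity, dist, load))
        dist++;   /* next cycle: a slightly more remote core */
    return dist;
}
```

A light load settles immediately; a heavy one walks out one level per cycle until the shared resource stops being the bottleneck (or it runs out of levels).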


David Lang


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Thu, 27 Sep 2012, Borislav Petkov wrote:


On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:

It seems to me that trying to figure out if you are going to
overload the L2 is an impossible task, so just assume that it will
all fit, and the worst case is you have one balancing cycle where
you can't do as much work and then the normal balancing will kick in
and move something anyway.


Right, and this implies that when the load balancer runs, it will
definitely move the task away from the L2. But what do I do in the cases
where the two tasks don't overload the L2 and it is actually beneficial
to keep them there? How does the load balancer know that?


no, I'm saying that you should assume that the two tasks won't overload 
the L2, try it, and if they do overload the L2, move one of the tasks 
again the next balancing cycle.


there is a lot of possible sharing going on between 'cores'

shared everything (a single core)
different registers, shared everything else (HT core)
shared floating point, shared cache, different everything else
shared L2/L3/Memory, different everything else
shared L3/Memory, different everything else
shared Memory, different everything else
different everything

and just wait a couple of years and someone will add a new entry to this 
list (if I haven't already missed a few :-)


the more that is shared, the cheaper it is to move the process (the less 
cached state you throw away), so ideally you want to move the process as 
little as possible, just enough to eliminate whatever the contended 
resource is. But since you really don't know the footprint of each process 
in each of these layers, all you can measure is what percentage of the 
total core time the process used, just move it a little and see if that 
was enough.


David Lang


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
 I think you are being too smart for your own good. You don't know if it's 
 best to move them further apart or not. 

Well yes and no.. You're right, however in general the load-balancer has
always tried to not use (SMT) siblings whenever possible, in that regard
not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we
have, the 'big' periodic load-balancer will 'never' see them, making the
multiple moves argument hard.

Measuring resource contention on the various levels is a fun research
subject, I've spoken to various people who are/were doing so, I've
always encouraged them to send their code just so we can see/learn, even
if not integrate, sadly I can't remember ever having seen any of it :/

And yeah, all the load-balancing stuff is very near to scrying or
tealeaf reading. We can't know all current state (too expensive) nor can
we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 12:10 AM, Ingo Molnar mi...@kernel.org wrote:

 Just in case someone prefers patches to user-space approaches (I
 certainly do!), here's one that turns off wakeup driven
 preemption by default.

Ok, so apparently this fixes performance in a big way, and might allow
us to simplify select_idle_sibling(), which is clearly way too random.

That is, if we could make it automatic, some way. Not the "let the
user tune it" - that's just fundamentally broken.

What is the common pattern for the wakeups for psql?

Can we detect this somehow? Are they sync? It looks wrong to preempt
for sync wakeups, for example, but we seem to do that.

Or could we just improve the heuristics. What happens if the
scheduling granularity is increased, for example? It's set to 1ms
right now, with a logarithmic scaling by number of cpus.

Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread david

On Thu, 27 Sep 2012, Peter Zijlstra wrote:


On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:

I think you are being too smart for your own good. You don't know if it's
best to move them further apart or not.


Well yes and no.. You're right, however in general the load-balancer has
always tried to not use (SMT) siblings whenever possible, in that regard
not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we
have, the 'big' periodic load-balancer will 'never' see them, making the
multiple moves argument hard.


For the initial startup of a new process, finding as idle and remote a core 
to start on (minimum sharing with existing processes) is probably the 
smart thing to do.


But I thought that this conversation (pgbench) was dealing with long 
running processes, and how to deal with the overload where one master 
process is kicking off many child processes and the core that the master 
process starts off on gets overloaded as a result, with the question being 
how to spread the load out from this one core as it gets overloaded.


David Lang


Measuring resource contention on the various levels is a fun research
subject, I've spoken to various people who are/were doing so, I've
always encouraged them to send their code just so we can see/learn, even
if not integrate, sadly I can't remember ever having seen any of it :/

And yeah, all the load-balancing stuff is very near to scrying or
tealeaf reading. We can't know all current state (too expensive) nor can
we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)




Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
 Or could we just improve the heuristics. What happens if the
 scheduling granularity is increased, for example? It's set to 1ms
 right now, with a logarithmic scaling by number of cpus.

/proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
--
tps = 4994.730809 (including connections establishing)
tps = 5000.260764 (excluding connections establishing)

A bit better over the default NO_WAKEUP_PREEMPTION setting.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 10:45 -0700, da...@lang.hm wrote:
 But I thought that this conversation (pgbench) was dealing with long 
 running processes, 

Ah, I think we've got a confusion on long vs short.. yes pgbench is a
long-running process, however the tasks might not be long in runnable
state. Ie it receives a request, computes a bit, blocks on IO, computes
a bit, replies, goes idle to wait for a new request.

If all those runnable sections are short enough, it will 'never' be
around when the periodic load-balancer does its thing, since that only
looks at the tasks in runnable state at that moment in time.

I say 'never' because while it will occasionally show up due to pure
chance, it will unlikely be a very big player in placement.

Once a cpu is overloaded enough to get real queueing they'll show up,
get dispersed and then its back to wakeup stuff.

Then again, it might be completely irrelevant to pgbench, its been a
while since I looked at how it schedules.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 10:45 AM,  da...@lang.hm wrote:

 For the initial startup of a new process, finding as idle and remote a core
 to start on (minimum sharing with existing processes) is probably the smart
 thing to do.

Actually, no.

It's *exec* that should go remote. New processes (fork, vfork or
clone) absolutely should *not* go remote at all.

vfork() should stay on the same CPU (synchronous wakeup), fork()
should possibly go SMT (likely exec in the near future will spread it
out), and clone should likely just stay close too.

   Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov b...@alien8.de wrote:
 On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
 Or could we just improve the heuristics. What happens if the
 scheduling granularity is increased, for example? It's set to 1ms
 right now, with a logarithmic scaling by number of cpus.

 /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
 --
 tps = 4994.730809 (including connections establishing)
 tps = 5000.260764 (excluding connections establishing)

 A bit better over the default NO_WAKEUP_PREEMPTION setting.

Ok, so this gives us something possible to actually play with.

For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

(Btw, linear right now looks like 1:1. That's linear, but it's a
very aggressive linearity. Something like factor = (cpus+1)/2 would
also be linear, but by a less extreme factor.)
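For comparison, the scaling variants under discussion can be sketched like this. The formulas mirror SCHED_TUNABLESCALING_NONE/LOG/LINEAR as I read the 3.6-era code, plus the gentler `(cpus+1)/2` factor Linus floats above; treat the exact arithmetic as an assumption:

```c
#include <assert.h>

/* Base granularities (e.g. 1ms wakeup granularity) get multiplied by
 * a factor derived from the cpu count. */
static unsigned int ilog2_u(unsigned int x)
{
    unsigned int r = 0;
    while (x >>= 1)
        r++;
    return r;
}

static unsigned int factor_none(unsigned int cpus)   { (void)cpus; return 1; }
static unsigned int factor_log(unsigned int cpus)    { return 1 + ilog2_u(cpus); }
static unsigned int factor_linear(unsigned int cpus) { return cpus; }
/* the "less extreme" linear factor suggested above */
static unsigned int factor_half(unsigned int cpus)   { return (cpus + 1) / 2; }
```

On an 8-cpu box, a 1ms base becomes 4ms with LOG scaling, 8ms with the current LINEAR, and 4ms with the `(cpus+1)/2` variant.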

  Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 10:45:06AM -0700, da...@lang.hm wrote:
 On Thu, 27 Sep 2012, Peter Zijlstra wrote:
 
 On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
 I think you are being too smart for your own good. You don't know if it's
 best to move them further apart or not.
 
 Well yes and no.. You're right, however in general the load-balancer has
 always tried to not use (SMT) siblings whenever possible, in that regard
 not using an idle sibling is consistent here.
 
 Also, for short running tasks the wakeup balancing is typically all we
 have, the 'big' periodic load-balancer will 'never' see them, making the
 multiple moves argument hard.
 
 For the initial startup of a new process, finding as idle and remote
 a core to start on (minimum sharing with existing processes) is
 probably the smart thing to do.

Right,

but we don't schedule to the SMT siblings, as Peter says above. So we
can't get to the case where two SMT siblings are not overloaded and the
processes remain on the same L2.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Peter Zijlstra
On Thu, 2012-09-27 at 11:19 -0700, Linus Torvalds wrote:
 On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov b...@alien8.de wrote:
  On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
  Or could we just improve the heuristics. What happens if the
  scheduling granularity is increased, for example? It's set to 1ms
  right now, with a logarithmic scaling by number of cpus.
 
  /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
  --
  tps = 4994.730809 (including connections establishing)
  tps = 5000.260764 (excluding connections establishing)
 
  A bit better over the default NO_WAKEUP_PREEMPTION setting.
 
 Ok, so this gives us something possible to actually play with.
 
 For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
 than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

Don't forget to run the desktop interactivity benchmarks after you're
done wriggling with this knob... wakeup preemption is important for most
those.




Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
   Or could we just improve the heuristics. What happens if the
   scheduling granularity is increased, for example? It's set to 1ms
   right now, with a logarithmic scaling by number of cpus.
  
   /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
   --
   tps = 4994.730809 (including connections establishing)
   tps = 5000.260764 (excluding connections establishing)
  
   A bit better over the default NO_WAKEUP_PREEMPTION setting.
  
  Ok, so this gives us something possible to actually play with.
  
  For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
  than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
 
 Don't forget to run the desktop interactivity benchmarks after you're
 done wriggling with this knob... wakeup preemption is important for most
 those.

Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
wakeup_granularity go to 4ms:

sched_autogroup_enabled:1
sched_child_runs_first:0
sched_latency_ns:24000000
sched_migration_cost_ns:500000
sched_min_granularity_ns:3000000
sched_nr_migrate:32
sched_rt_period_us:1000000
sched_rt_runtime_us:950000
sched_shares_window_ns:10000000
sched_time_avg_ms:1000
sched_tunable_scaling:2
sched_wakeup_granularity_ns:4000000

pgbench results look good:

tps = 4997.675331 (including connections establishing)
tps = 5003.256870 (excluding connections establishing)

This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Linus Torvalds
On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:

 Don't forget to run the desktop interactivity benchmarks after you're
 done wriggling with this knob... wakeup preemption is important for most
 those.

So I don't think we want to *just* wiggle that knob per se. We
definitely don't want to hurt latency on actual interactive tasks. But
it's interesting that it helps psql so much, and that there seems to
be some interaction with the select_idle_sibling().

So I do have a few things I react to when looking at that wakeup granularity..

I wonder about this comment, for example:

 * By using 'se' instead of 'curr' we penalize light tasks, so
 * they get preempted easier. That is, if 'se' < 'curr' then
 * the resulting gran will be larger, therefore penalizing the
 * lighter, if otoh 'se' > 'curr' then the resulting gran will
 * be smaller, again penalizing the lighter task.

why would we want to preempt light tasks easier? It sounds backwards
to me. If they are light, we have *less* reason to preempt them, since
they are more likely to just go to sleep on their own, no?
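For reference, the check that comment sits on can be sketched roughly as follows. It is modeled on the 3.6-era wakeup_preempt_entity()/wakeup_gran() pair but heavily simplified, so treat the details as assumptions; the point is that converting the granularity into the wakee's virtual time is what makes the wakee's weight matter:

```c
#include <assert.h>

#define NICE_0_LOAD 1024L

struct entity { long vruntime; long weight; };

/* Real-time granularity converted to the wakee's virtual time: a
 * light (low-weight) 'se' sees a *larger* virtual granularity and
 * therefore needs a bigger vruntime lead to preempt 'curr'. */
static long wakeup_gran_vtime(long gran_ns, const struct entity *se)
{
    return gran_ns * NICE_0_LOAD / se->weight;
}

/* 1: preempt curr, 0: lead too small, -1: wakee not even behind */
static int should_preempt(const struct entity *curr,
                          const struct entity *se, long gran_ns)
{
    long vdiff = curr->vruntime - se->vruntime;

    if (vdiff <= 0)
        return -1;
    if (vdiff > wakeup_gran_vtime(gran_ns, se))
        return 1;
    return 0;
}
```

With a 1ms granularity and the same vruntime lead, a heavy wakee preempts where a light one does not, which is the asymmetry the comment is describing.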

Another question is whether the fact that this same load interacts
with select_idle_sibling() is perhaps a sign that maybe the preemption
logic is all fine, but it interacts badly with the "pick new cpu"
code. In particular, after having changed rq's, is the vruntime really
comparable? IOW, maybe this is an interaction between place_entity()
and then the immediately following (?) call to check wakeup
preemption?

The fact that *either* changing select_idle_sibling() *or* changing
the wakeup preemption granularity seems to have such a huge impact
does seem to tie them together somehow for this particular load. No?

 Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 21:24 +0200, Borislav Petkov wrote: 
 On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
Or could we just improve the heuristics. What happens if the
scheduling granularity is increased, for example? It's set to 1ms
right now, with a logarithmic scaling by number of cpus.
   
 /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
--
tps = 4994.730809 (including connections establishing)
tps = 5000.260764 (excluding connections establishing)
   
A bit better over the default NO_WAKEUP_PREEMPTION setting.
   
   Ok, so this gives us something possible to actually play with.
   
   For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
   than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
  
  Don't forget to run the desktop interactivity benchmarks after you're
  done wriggling with this knob... wakeup preemption is important for most
  those.
 
 Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
 wakeup_granularity go to 4ms:
 
 sched_autogroup_enabled:1
 sched_child_runs_first:0
 sched_latency_ns:24000000
 sched_migration_cost_ns:500000
 sched_min_granularity_ns:3000000
 sched_nr_migrate:32
 sched_rt_period_us:1000000
 sched_rt_runtime_us:950000
 sched_shares_window_ns:10000000
 sched_time_avg_ms:1000
 sched_tunable_scaling:2
 sched_wakeup_granularity_ns:4000000
 
 pgbench results look good:
 
 tps = 4997.675331 (including connections establishing)
 tps = 5003.256870 (excluding connections establishing)
 
 This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

And wakeup preemption is still disabled as well, correct?

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-27 Thread Mike Galbraith
On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote: 
 On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra a.p.zijls...@chello.nl 
 wrote:
 
  Don't forget to run the desktop interactivity benchmarks after you're
  done wriggling with this knob... wakeup preemption is important for most
  those.
 
 So I don't think we want to *just* wiggle that knob per se. We
 definitely don't want to hurt latency on actual interactive tasks. But
 it's interesting that it helps psql so much, and that there seems to
 be some interaction with the select_idle_sibling().
 
 So I do have a few things I react to when looking at that wakeup granularity..
 
 I wonder about this comment, for example:
 
  * By using 'se' instead of 'curr' we penalize light tasks, so
  * they get preempted easier. That is, if 'se' < 'curr' then
  * the resulting gran will be larger, therefore penalizing the
  * lighter, if otoh 'se' > 'curr' then the resulting gran will
  * be smaller, again penalizing the lighter task.
 
 why would we want to preempt light tasks easier? It sounds backwards
 to me. If they are light, we have *less* reason to preempt them, since
 they are more likely to just go to sleep on their own, no?

Ah, that particular 'light' refers to se->load.weight.

 Another question is whether the fact that this same load interacts
 with select_idle_sibling() is perhaps a sign that maybe the preemption
 logic is all fine, but it interacts badly with the "pick new cpu"
 code. In particular, after having changed rq's, is the vruntime really
 comparable? IOW, maybe this is an interaction between place_entity()
 and then the immediately following (?) call to check wakeup
 preemption?

I think vruntime should be fine.  We take the delta between the
task's vruntime when it went to sleep and its previous rq min_vruntime
to capture progress made while it slept, and apply the relative offset
in the task's new home so a task can migrate and still have a chance to
preempt on wakeup.
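A minimal sketch of the relative-vruntime bookkeeping Mike describes (toy types only; the real code lives in CFS's dequeue/enqueue and migration paths):

```c
#include <assert.h>

struct toy_rq { long min_vruntime; };

/* On sleep/migration: keep only the task's offset from its old
 * queue's min_vruntime, i.e. the lag it built up. */
static long detach_vruntime(long vruntime, const struct toy_rq *old_rq)
{
    return vruntime - old_rq->min_vruntime;
}

/* On arrival: rebase the offset on the new queue's min_vruntime, so
 * the lag survives the move and can still earn wakeup preemption. */
static long attach_vruntime(long rel, const struct toy_rq *new_rq)
{
    return rel + new_rq->min_vruntime;
}
```

A task 100 virtual ns behind its old queue's minimum is still 100 behind after migrating to a queue whose minimum is entirely different, which is why cross-rq vruntime comparison can work at all.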

 The fact that *either* changing select_idle_sibling() *or* changing
 the wakeup preemption granularity seems to have such a huge impact
 does seem to tie them together somehow for this particular load. No?

The way I read it, Boris had wakeup preemption disabled.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Ingo Molnar

* Ingo Molnar  wrote:

> * Mike Galbraith  wrote:
> 
> > I think the pgbench problem is more about latency for the 1 
> > in 1:N than spinlocks.
> 
> So my understanding of the psql workload is that basically 
> we've got a central psql proxy process that is distributing 
> work to worker psql processes. If a freshly woken worker 
> process ever preempts the central proxy process then it is 
> preventing a lot of new work from getting distributed.

Also, I'd like to stress that despite the optimization dilemma, 
the psql workload is *important*. More important than tbench - 
because psql does some real SQL work and it also matches the 
design of many real desktop and server workloads.

So if indeed the above is the main problem of psql it would be 
nice to add a 'perf bench sched proxy' testcase that emulates it 
- that would remove psql version dependencies and would ease the 
difficulty of running the benchmarks.

We already have 'perf bench sched pipe' and 'perf bench sched 
messaging' - but neither shows the psql pattern currently.

I suspect a couple of udelay()s in the messaging benchmark would 
do the trick? The wakeup work there already matches much of how 
psql looks like.

Thanks,

Ingo


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Ingo Molnar

* Mike Galbraith  wrote:

> I think the pgbench problem is more about latency for the 1 in 
> 1:N than spinlocks.

So my understanding of the psql workload is that basically we've 
got a central psql proxy process that is distributing work to 
worker psql processes. If a freshly woken worker process ever 
preempts the central proxy process then it is preventing a lot 
of new work from getting distributed.

Correct?

So the central proxy psql process is 'much more important' to 
run than any of the worker processes - an importance that is not 
(currently) visible from the behavioral statistics the scheduler 
keeps on tasks.

So the scheduler has the following problem here: a new wakee 
might be starved enough and the proxy might have run long enough 
to really justify the preemption here and now. The buddy 
statistics help avoid some of these cases - but not all and the 
difference is measurable.

Yet the 'best' way for psql to run is for this proxy process to 
never be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get 
more aggressive in selecting a remote CPU then we, as a side 
effect, also reduce the chance of harmful preemption of the 
central proxy psql process.

So in that sense sibling selection is somewhat of an indirect 
red herring: it really only helps psql indirectly by preventing 
the harmful preemption. It also, somewhat paradoxically, argues 
for suboptimal code: for example tearing apart buddies is 
beneficial in the psql workload, because it also allows the more 
important part of the buddy to run more (the proxy).

In that sense the *real* problem isn't even parallelism (although 
we obviously should improve the decisions there - and the logic 
has suffered in the past from the psql dilemma outlined above), 
but whether the scheduler can (and should) identify the central 
proxy and keep it running as much as possible, deprioritizing 
fairness, wakeup buddies, runtime overlap and cache affinity 
considerations.

There's two broad solutions that I can see:

 - Add a kernel solution to somehow identify 'central' processes
   and bias them. Xorg is a similar kind of process, so it would
   help other workloads as well. That way lie dragons, but might
   be worth an attempt or two. We already maintain a couple of
   robust metrics, like overlap statistics, to identify buddies. 

 - Let user-space occasionally identify its important (and less
   important) tasks - say psql could mark its worker processes as
   SCHED_BATCH and keep its central process(es) higher prio. A
   single line of obvious code in 100 KLOCs of user-space code.
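The second option really is that small from user space; a hedged sketch using util-linux's chrt rather than a direct sched_setscheduler() call (the chrt tool and its --batch flag are assumed available):

```shell
# Demote a worker to SCHED_BATCH - lowering the policy needs no
# privileges - so it never wakeup-preempts the SCHED_OTHER proxy.
chrt --batch 0 sleep 2 &     # a stand-in for one psql worker
worker=$!
chrt -p "$worker"            # should report policy SCHED_BATCH
kill "$worker" 2>/dev/null
```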

Just to confirm, if you turn off all preemption via a hack 
(basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
perform and scale much better, with the quality of sibling 
selection and spreading of processes only being a secondary 
effect?

Thanks,

Ingo


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Mike Galbraith
On Thu, 2012-09-27 at 07:18 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > but how does that affect pgbench and ilk that must spread regardless
> > of footprints.
> 
> Well, how do you measure latency of the 1 process in the 1:N case? Maybe
> pipeline stalls of the 1 along with some way to recognize it is the 1 in
> the 1:N case.

Best is to let userland tell us it's critical.  Smarts are expensive.  A
class of its own (my wakees do _not_ preempt me, and I don't care that
you think this is unfair to the unwashed masses who will otherwise
_starve_ without me feeding them) makes sense for these guys.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > The way I understand it is, you either want to share L2 with a process,
> > because, for example, both working sets fit in the L2 and/or there's
> > some sharing which saves you moving everything over the L3. This is
> > where selecting a core on the same L2 is actually a good thing.
> 
> Yeah, and if the wakee can't get to the L2 hot data instantly, it may be
> better to let wakee drag the data to an instantly accessible spot.

Yep, then moving it to another L2 is the same.

[ … ]

> > A crazy thought: one could go and sample tasks while running their
> > timeslices with the perf counters to know exactly what type of workload
> > we're looking at. I.e., do I have a large number of L2 evictions? Yes,
> > then spread them out. No, then select the other core on the L2. And so
> > on.
> 
> Hm.  That sampling better be really cheap.  Might help...

Yeah, that's why I said sampling, and not running the perf counters
during every timeslice.

But if you count the proper events, you should be able to know exactly
what the workload is doing (compute-bound, io-bound, contention, etc...)

> but how does that affect pgbench and ilk that must spread regardless
> of footprints.

Well, how do you measure latency of the 1 process in the 1:N case? Maybe
pipeline stalls of the 1 along with some way to recognize it is the 1 in
the 1:N case.

Hmm.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Mike Galbraith
On Wed, 2012-09-26 at 23:37 +0200, Borislav Petkov wrote:

> The way I understand it is, you either want to share L2 with a process,
> because, for example, both working sets fit in the L2 and/or there's
> some sharing which saves you moving everything over the L3. This is
> where selecting a core on the same L2 is actually a good thing.

Yeah, and if the wakee can't get to the L2 hot data instantly, it may be
better to let wakee drag the data to an instantly accessible spot.

> Or, they're too big to fit into the L2 and they start kicking each other
> out. Then you want to spread them out to different L2s - i.e., different
> HT groups in Intel-speak.
> 
> Oh, and then there's the userspace spinlocks thingie where Mike's patch
> hurts us.
> 
> Btw, Mike, you can jump in anytime :-)

I think the pgbench problem is more about latency for the 1 in 1:N than
spinlocks.

> So I'd say, this is the hard scheduling problem where fitting the
> workload to the architecture doesn't make everyone happy.

Yup.  I find it hard at least.

> A crazy thought: one could go and sample tasks while running their
> timeslices with the perf counters to know exactly what type of workload
> we're looking at. I.e., do I have a large number of L2 evictions? Yes,
> then spread them out. No, then select the other core on the L2. And so
> on.

Hm.  That sampling better be really cheap.  Might help... but how does
that affect pgbench and ilk that must spread regardless of footprints.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Mike Galbraith
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote: 
> On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov  wrote:
> > On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> >> How does pgbench look? That's the one that apparently really wants to
> >> spread out, possibly due to user-level spinlocks. So I assume it will
> >> show the reverse pattern, with "kill select_idle_sibling" being the
> >> worst case. Sad, because it really would be lovely to just remove that
> >> thing ;)
> >
> > Yep, correct. It hurts.
> 
> I'm *so* not surprised.

Any other result would have induced mushroom cloud, glazed eyes, and jaw
meets floor here.

> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".
> 
> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
> 
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.

Oh, it's not _that_ bad.  It does have its troubles, but if it were
complete shite it wouldn't make the numbers that I showed, and wouldn't
make the even better numbers it does with some other loads. 

> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
> 
> Please tell me I am mis-reading this?

We start at MC to get the tbench win I showed (Intel) vs loss at SMT.
Riddle me this: why does that produce the wins I showed?  I'm still
hoping someone can shed some light on why the heck there's such a
disparity in processor behaviors.

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it finds
> an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No? So once again, we should
> start at the inner level and if we can't find something really close,
> we work our way out, rather than starting from the outer level and
> working our way in.

Domains on my E5620 look like so when SMT is enabled (seldom):

[0.473692] CPU0 attaching sched-domain: 
[0.477616]  domain 0: span 0,4 level SIBLING
[0.481982]   groups: 0 (cpu_power = 589) 4 (cpu_power = 589)
[0.487805]   domain 1: span 0-7 level MC
[0.491829]    groups: 0,4 (cpu_power = 1178) 1,5 (cpu_power = 1178) 2,6 (cpu_power = 1178) 3,7 (cpu_power = 1178)
...

I usually have SMT off, which gives me more oomph at the bottom end (SMT
affects the turboboost gizmo methinks), so I have only one domain.  Say I'm
waking from CPU0.  With the cross-wire thingy, we'll always wake to CPU1 if
idle.  That demonstrably works well despite it being L3.  Box coughs up
wins at fast movers I too would expect L3 to lose at.  If L2 is my only
viable target for fast movers, I'm stuck with SMT siblings, which I have
measured.  They aren't wonderful for this.  They do improve max
throughput markedly though, so aren't a complete waste of silicon ;-)

I wonder what domains look like on Bulldog. (boot w. sched_debug)

> If I read the code correctly, we can have both "prev" and "cpu" in the
> same L2 domain, but because we start looking at the L3 domain, we may
> end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.

Yup, and on Intel, it manages to not suck.

> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
> 
> For example, it uses 

Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
> I'm *so* not surprised.
> 
> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
> 
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.
> 
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
> 
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a process,
because, for example, both working sets fit in the L2 and/or there's
some sharing which saves you moving everything over the L3. This is
where selecting a core on the same L2 is actually a good thing.

Or, they're too big to fit into the L2 and they start kicking each other
out. Then you want to spread them out to different L2s - i.e., different
HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's patch
hurts us.

Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the
workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their
timeslices with the perf counters to know exactly what type of workload
we're looking at. I.e., do I have a large number of L2 evictions? Yes,
then spread them out. No, then select the other core on the L2. And so
on.

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it
> finds an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.

> So once again, we should start at the inner level and if we can't find
> something really close, we work our way out, rather than starting from
> the outer level and working our way in.
>
> If I read the code correctly, we can have both "prev" and "cpu" in
> the same L2 domain, but because we start looking at the L3 domain, we
> may end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses "cpu_idle(target)", but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?
>
> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.
>
> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, on a fresh head. Too late here now.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Linus Torvalds
On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov  wrote:
> On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case. Sad, because it really would be lovely to just remove that
>> thing ;)
>
> Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your "kill select_idle_sibling()" one was
interesting, but the wrong kind of "get rid of that logic".

It always selected target_cpu, but the fact is, that doesn't really
sound very sane. The target cpu is either the previous cpu or the
current cpu, depending on whether they should be balanced or not. But
that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying "does this group actually cover the cpu I am
interested in?"

Please tell me I am mis-reading this?

But starting from the biggest ("llc" group) is wrong *anyway*, since
it means that it starts looking at the L3 level, and then if it finds
an acceptable cpu inside that level, it's all done. But that's
*crazy*. Once again, it's much better to try to find an idle sibling
*closeby* rather than at the L3 level. No? So once again, we should
start at the inner level and if we can't find something really close,
we work our way out, rather than starting from the outer level and
working our way in.

If I read the code correctly, we can have both "prev" and "cpu" in the
same L2 domain, but because we start looking at the L3 domain, we may
end up picking another "affine" CPU that isn't even sharing L2's
*before* we pick one that actually *is* sharing L2's with the target
CPU. But that code is confusing enough with the scheduler groups inner
loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read
things correctly.

For example, it uses "cpu_idle(target)", but if we're actively trying
to move to the current CPU (ie wake_affine() returned true), then
target is the current cpu, which is certainly *not* going to be idle
for a sync wakeup. So it should actually check whether it's a sync
wakeup and the only thing pending is that synchronous waker, no?

Maybe I'm missing something really fundamental, but it all really does
look very odd to me.

Attached is a totally untested and probably very buggy patch, so
please consider it a "shouldn't we do something like this instead" RFC
rather than anything serious. So this RFC patch is more a "ok, the
patch tries to fix the above oddnesses, please tell me where I went
wrong" than anything else.

Comments?

Linus


patch.diff
Description: Binary data


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Wed, Sep 26, 2012 at 04:23:26AM +0200, Mike Galbraith wrote:
> On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:
> 
> > Right, so why did we need it all, in the first place? There has to be
> > some reason for it.
> 
> Easy.  Take two communicating tasks.  Is an affine wakeup a good idea?
> It depends on how much execution overlap there is.  Wake affine when
> there is overlap larger than cache miss cost, and you just tossed
> throughput into the bin.
> 
> select_idle_sibling() was originally about shared L2, where any overlap
> was salvageable.  On modern processors with no shared L2,

Oh, but we do have shared L2s in the Bulldozer uarch (a subset of the
modern AMD processors :)).

> you have to get past the cost, but the gain is still there. Intel
> wins with loads that AMD loses very badly on, so I can only guess that
> Intel must feed caches more efficiently. Dunno. It just doesn't matter
> though, point is that there is a win to be had in both cases, the
> breakeven just isn't at the same point.

Well, I guess selecting the proper core in the hierarchy depending on
the workload is one of those hard problems.

Teaching select_idle_sibling to detect the breakeven point and act
accordingly would not be that easy, then...

Thanks.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 07:22:22PM -0700, Linus Torvalds wrote:
> So I'm sure there are architecture differences (where HT in particular
> probably changes optimal scheduling strategy, although I'd expect
> the bulldozer approach to not be *that* different - but I don't know
> if BD shows up as "HT siblings" or not, so dissimilar topology
> interpretation may make it *look* very different).

Right, those cores sharing an L2 are thread siblings on BD:

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:03
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1

much like HT siblings on this single-socket Sandybridge: 

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:11
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4

Although I don't know whether those thread siblings on this SB box are
actual HT siblings, sharing almost all resources, judging by the core
ids.
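Whether siblings really share an L2 can be cross-checked against the cache topology, independent of the core ids (sysfs layout assumed; paths may differ on very old kernels):

```shell
# For each cache of cpu0, print its level, type, and the CPUs that
# share it; on Bulldozer the L2 (usually index2) should list both
# module siblings, matching thread_siblings_list above.
for d in /sys/devices/system/cpu/cpu0/cache/index*; do
    printf '%s: L%s %s, shared with CPUs %s\n' \
        "${d##*/}" "$(cat "$d/level")" "$(cat "$d/type")" \
        "$(cat "$d/shared_cpu_list")"
done
```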

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case. Sad, because it really would be lovely to just remove that
> thing ;)

Yep, correct. It hurts.

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance 
governor

tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance 
governor + kill select_idle_sibling

tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)


-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
 How does pgbench look? That's the one that apparently really wants to
 spread out, possibly due to user-level spinlocks. So I assume it will
 show the reverse pattern, with kill select_idle_sibling being the
 worst case. Sad, because it really would be lovely to just remove that
 thing ;)

Yep, correct. It hurts.

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance 
governor

tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance 
governor + kill select_idle_sibling

tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)


-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 07:22:22PM -0700, Linus Torvalds wrote:
 So I'm sure there are architecture differences (where HT in particular
 probably changes optimal scheduling strategy, although I'd expect
 the bulldozer approach to not be *that*different - but I don't know
 if BD shows up as HT siblings or not, so dissimilar topology
 interpretation may make it *look* very different).

Right, those cores sharing an L2 are thread siblings on BD:

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:03
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1

much like HT siblings on this single-socket Sandybridge: 

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:11
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4

Although I don't know whether those thread siblings on this SB box are
actual HT siblings, sharing almost all resources, judging by the core
ids.

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Wed, Sep 26, 2012 at 04:23:26AM +0200, Mike Galbraith wrote:
 On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:
 
  Right, so why did we need it all, in the first place? There has to be
  some reason for it.
 
 Easy.  Take two communicating tasks.  Is an affine wakeup a good idea?
 It depends on how much execution overlap there is.  Wake affine when
 there is overlap larger than cache miss cost, and you just tossed
 throughput into the bin.
 
 select_idle_sibling() was originally about shared L2, where any overlap
 was salvageable.  On modern processors with no shared L2,

Oh, but we do have shared L2s in the Bulldozer uarch (a subset of the
modern AMD processors :)).

 you have to get past the cost, but the gain is still there. Intel
 wins with loads that AMD loses very bady on, so I can only guess that
 Intel must feed caches more efficiently. Dunno. It just doesn't matter
 though, point is that there is a win to be had in both cases, the
 breakeven just isn't at the same point.

Well, I guess selecting the proper core in the hierarchy depending on
the workload is one of those hard problems.

Teaching select_idle_sibling to detect the breakeven point and act
accordingly would be not that easy then...

Thanks.

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Linus Torvalds
On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov b...@alien8.de wrote:
 On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
 How does pgbench look? That's the one that apparently really wants to
 spread out, possibly due to user-level spinlocks. So I assume it will
 show the reverse pattern, with kill select_idle_sibling being the
 worst case. Sad, because it really would be lovely to just remove that
 thing ;)

 Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your kill select_idle_sibling() one was
interesting, but the wrong kind of get rid of that logic.

It always selected target_cpu, but the fact is, that doesn't really
sound very sane. The target cpu is either the previous cpu or the
current cpu, depending on whether they should be balanced or not. But
that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying does this group actually cover the cpu I am
interested in?

Please tell me I am mis-reading this?

But starting from the biggest (llc group) is wrong *anyway*, since
it means that it starts looking at the L3 level, and then if it finds
an acceptable cpu inside that level, it's all done. But that's
*crazy*. Once again, it's much better to try to find an idle sibling
*closeby* rather than at the L3 level. No? So once again, we should
start at the inner level and if we can't find something really close,
we work our way out, rather than starting from the outer level and
working our way in.

If I read the code correctly, we can have both prev and cpu in the
same L2 domain, but because we start looking at the L3 domain, we may
end up picking another affine CPU that isn't even sharing L2's
*before* we pick one that actually *is* sharing L2's with the target
CPU. But that code is confusing enough with the scheduler groups inner
loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read
things correctly.

For example, it uses cpu_idle(target), but if we're actively trying
to move to the current CPU (ie wake_affine() returned true), then
target is the current cpu, which is certainly *not* going to be idle
for a sync wakeup. So it should actually check whether it's a sync
wakeup and the only thing pending is that synchronous waker, no?

Maybe I'm missing something really fundamental, but it all really does
look very odd to me.

Attached is a totally untested and probably very buggy patch, so
please consider it a shouldn't we do something like this instead RFC
rather than anything serious. So this RFC patch is more a ok, the
patch tries to fix the above oddnesses, please tell me where I went
wrong than anything else.

Comments?

Linus


patch.diff
Description: Binary data


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
 I'm *so* not surprised.
 
 That said, I think your kill select_idle_sibling() one was
 interesting, but the wrong kind of get rid of that logic.

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
>
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.
>
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in"?
>
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a process,
because, for example, both working sets fit in the L2 and/or there's
some sharing which saves you moving everything over the L3. This is
where selecting a core on the same L2 is actually a good thing.

Or, they're too big to fit into the L2 and they start kicking each other
out. Then you want to spread them out to different L2s - i.e., different
HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's patch
hurts us.

Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the
workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their
timeslices with the perf counters to know exactly what type of workload
we're looking at. I.e., do I have a large number of L2 evictions? Yes,
then spread them out. No, then select the other core on the L2. And so
on.

> But starting from the biggest (llc group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it
> finds an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.

> So once again, we should start at the inner level and if we can't find
> something really close, we work our way out, rather than starting from
> the outer level and working our way in.
>
> If I read the code correctly, we can have both prev and cpu in
> the same L2 domain, but because we start looking at the L3 domain, we
> may end up picking another affine CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses cpu_idle(target), but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?
>
> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.
>
> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, on a fresh head. Too late here now.

Thanks.

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Mike Galbraith
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote: 
> On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov b...@alien8.de wrote:
> > On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> > How does pgbench look? That's the one that apparently really wants to
> > spread out, possibly due to user-level spinlocks. So I assume it will
> > show the reverse pattern, with "kill select_idle_sibling" being the
> > worst case. Sad, because it really would be lovely to just remove that
> > thing ;)
> >
> > Yep, correct. It hurts.
>
> I'm *so* not surprised.

Any other result would have induced mushroom cloud, glazed eyes, and jaw
meets floor here.

> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".
>
> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
>
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.

Oh, it's not _that_ bad.  It does have its troubles, but if it were
complete shite it wouldn't make the numbers that I showed, and wouldn't
make the even better numbers it does with some other loads.

> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in"?
>
> Please tell me I am mis-reading this?

We start at MC to get the tbench win I showed (Intel) vs loss at SMT.
Riddle me this, why does that produce the wins I showed?  I'm still
hoping someone can shed some light on why the heck there's such a
disparity in processor behaviors.

> But starting from the biggest (llc group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it finds
> an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No? So once again, we should
> start at the inner level and if we can't find something really close,
> we work our way out, rather than starting from the outer level and
> working our way in.

Domains on my E5620 look like so when SMT is enabled (seldom):

[0.473692] CPU0 attaching sched-domain:
[0.477616]  domain 0: span 0,4 level SIBLING
[0.481982]   groups: 0 (cpu_power = 589) 4 (cpu_power = 589)
[0.487805]   domain 1: span 0-7 level MC
[0.491829]    groups: 0,4 (cpu_power = 1178) 1,5 (cpu_power = 1178) 2,6 (cpu_power = 1178) 3,7 (cpu_power = 1178)
...

I usually have SMT off, which gives me more oomph at the bottom end (smt
affects turboboost gizmo methinks), have only one domain, so say I'm
waking from CPU0.  With cross wire thingy, we'll always wake to CPU1 if
idle.  That demonstrably works well despite it being L3.  Box coughs up
wins at fast movers I too would expect L3 to lose at.  If L2 is my only
viable target for fast movers, I'm stuck with SMT siblings, which I have
measured.  They aren't wonderful for this.  They do improve max
throughput markedly though, so aren't a complete waste of silicon ;-)

I wonder what domains look like on Bulldog. (boot w. sched_debug)

> If I read the code correctly, we can have both prev and cpu in the
> same L2 domain, but because we start looking at the L3 domain, we may
> end up picking another affine CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.

Yup, and on Intel, it manages to not suck.

> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses cpu_idle(target), but if we're actively trying
> to move to the current 

Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Mike Galbraith
On Wed, 2012-09-26 at 23:37 +0200, Borislav Petkov wrote:

> The way I understand it is, you either want to share L2 with a process,
> because, for example, both working sets fit in the L2 and/or there's
> some sharing which saves you moving everything over the L3. This is
> where selecting a core on the same L2 is actually a good thing.

Yeah, and if the wakee can't get to the L2 hot data instantly, it may be
better to let wakee drag the data to an instantly accessible spot.

> Or, they're too big to fit into the L2 and they start kicking each other
> out. Then you want to spread them out to different L2s - i.e., different
> HT groups in Intel-speak.
>
> Oh, and then there's the userspace spinlocks thingie where Mike's patch
> hurts us.
>
> Btw, Mike, you can jump in anytime :-)

I think the pgbench problem is more about latency for the 1 in 1:N than
spinlocks.

> So I'd say, this is the hard scheduling problem where fitting the
> workload to the architecture doesn't make everyone happy.

Yup.  I find it hard at least.

> A crazy thought: one could go and sample tasks while running their
> timeslices with the perf counters to know exactly what type of workload
> we're looking at. I.e., do I have a large number of L2 evictions? Yes,
> then spread them out. No, then select the other core on the L2. And so
> on.

Hm.  That sampling better be really cheap.  Might help... but how does
that affect pgbench and ilk that must spread regardless of footprints.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Borislav Petkov
On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > The way I understand it is, you either want to share L2 with a process,
> > because, for example, both working sets fit in the L2 and/or there's
> > some sharing which saves you moving everything over the L3. This is
> > where selecting a core on the same L2 is actually a good thing.
>
> Yeah, and if the wakee can't get to the L2 hot data instantly, it may be
> better to let wakee drag the data to an instantly accessible spot.

Yep, then moving it to another L2 is the same.

[ … ]

> > A crazy thought: one could go and sample tasks while running their
> > timeslices with the perf counters to know exactly what type of workload
> > we're looking at. I.e., do I have a large number of L2 evictions? Yes,
> > then spread them out. No, then select the other core on the L2. And so
> > on.
>
> Hm.  That sampling better be really cheap.  Might help...

Yeah, that's why I said sampling and not run the perfcounters during
every timeslice.

But if you count the proper events, you should be able to know exactly
what the workload is doing (compute-bound, io-bound, contention, etc...)

> but how does that affect pgbench and ilk that must spread regardless
> of footprints.

Well, how do you measure latency of the 1 process in the 1:N case? Maybe
pipeline stalls of the 1 along with some way to recognize it is the 1 in
the 1:N case.

Hmm.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Mike Galbraith
On Thu, 2012-09-27 at 07:18 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > but how does that affect pgbench and ilk that must spread regardless
> > of footprints.
>
> Well, how do you measure latency of the 1 process in the 1:N case? Maybe
> pipeline stalls of the 1 along with some way to recognize it is the 1 in
> the 1:N case.

Best is to let userland tell us it's critical.  Smarts are expensive.  A
class of its own (my wakees do _not_ preempt me, and I don't care that
you think this is unfair to the unwashed masses who will otherwise
_starve_ without me feeding them) makes sense for these guys.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Ingo Molnar

* Mike Galbraith efa...@gmx.de wrote:

> I think the pgbench problem is more about latency for the 1 in
> 1:N than spinlocks.

So my understanding of the psql workload is that basically we've 
got a central psql proxy process that is distributing work to 
worker psql processes. If a freshly woken worker process ever 
preempts the central proxy process then it is preventing a lot 
of new work from getting distributed.

Correct?

So the central proxy psql process is 'much more important' to 
run than any of the worker processes - an importance that is not 
(currently) visible from the behavioral statistics the scheduler 
keeps on tasks.

So the scheduler has the following problem here: a new wakee 
might be starved enough and the proxy might have run long enough 
to really justify the preemption here and now. The buddy 
statistics help avoid some of these cases - but not all and the 
difference is measurable.

Yet the 'best' way for psql to run is for this proxy process to 
never be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get 
more aggressive in selecting a remote CPU then we, as a side 
effect, also reduce the chance of harmful preemption of the 
central proxy psql process.

So in that sense sibling selection is somewhat of an indirect 
red herring: it really only helps psql indirectly by preventing 
the harmful preemption. It also, somewhat paradoxically, argues 
for suboptimal code: for example tearing apart buddies is 
beneficial in the psql workload, because it also allows the more 
important part of the buddy to run more (the proxy).

In that sense the *real* problem isn't even parallelism (although 
we obviously should improve the decisions there - and the logic 
has suffered in the past from the psql dilemma outlined above), 
but whether the scheduler can (and should) identify the central 
proxy and keep it running as much as possible, deprioritizing 
fairness, wakeup buddies, runtime overlap and cache affinity 
considerations.

There's two broad solutions that I can see:

 - Add a kernel solution to somehow identify 'central' processes
   and bias them. Xorg is a similar kind of process, so it would
   help other workloads as well. That way lie dragons, but might
   be worth an attempt or two. We already try to do a couple of
   robust metrics, like overlap statistics to identify buddies. 

 - Let user-space occasionally identify its important (and less
   important) tasks - say psql could mark its worker processes as
   SCHED_BATCH and keep its central process(es) higher prio. A
   single line of obvious code in 100 KLOCs of user-space code.

Just to confirm, if you turn off all preemption via a hack 
(basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
perform and scale much better, with the quality of sibling 
selection and spreading of processes only being a secondary 
effect?

Thanks,

Ingo


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-26 Thread Ingo Molnar

* Ingo Molnar mi...@kernel.org wrote:

> * Mike Galbraith efa...@gmx.de wrote:
>
> > I think the pgbench problem is more about latency for the 1
> > in 1:N than spinlocks.
>
> So my understanding of the psql workload is that basically
> we've got a central psql proxy process that is distributing
> work to worker psql processes. If a freshly woken worker
> process ever preempts the central proxy process then it is
> preventing a lot of new work from getting distributed.

Also, I'd like to stress that despite the optimization dilemma, 
the psql workload is *important*. More important than tbench - 
because psql does some real SQL work and it also matches the 
design of many real desktop and server workloads.

So if indeed the above is the main problem of psql it would be 
nice to add a 'perf bench sched proxy' testcase that emulates it 
- that would remove psql version dependencies and would ease the 
difficulty of running the benchmarks.

We already have 'perf bench sched pipe' and 'perf bench sched 
messaging' - but neither shows the psql pattern currently.

I suspect a couple of udelay()s in the messaging benchmark would 
do the trick? The wakeup work there already matches much of how 
psql looks like.

Thanks,

Ingo


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Mike Galbraith
On Tue, 2012-09-25 at 19:22 -0700, Linus Torvalds wrote: 
> On Tue, Sep 25, 2012 at 7:00 PM, Mike Galbraith  wrote:
> >
> > Yes.  On AMD, the best thing you can do for fast switchers AFAIKT is
> > turn it off.  Different story on Intel.
> 
> I doubt it's all that different on Intel.

The behavioral difference is pretty large, question is why.


> Am I on the right track here? Or do you mean something completely
> different? Please explain it more verbosely.

A picture is worth a thousand words they say...

x3550 M3 E5620, SMT off, revert reverted, nohz off, zero knob twiddles,
governor=performance.

tbench   1     2     4
398   820  1574  -select_idle_sibling() 
454   902  1574  +select_idle_sibling()
397   737  1556  +select_idle_sibling() virgin source

netperf TCP_RR, one unbound pair
114674   -select_idle_sibling()
131422   +select_idle_sibling()
111551   +select_idle_sibling() virgin source

These 1:1 buddy pairs scheduled cross core on E5620 feel no pain once
you kill the bouncing.  The bounce pain with 4 cores is _tons_ less
intense than on the 10 core Westmere, but it's still quite visible.  The
point though is that cross core doesn't hurt Westmere, but demolishes
Opteron for some reason.  (OTOH, bounce _helps_ fugly 1:N load.. grr;)


> Your patch showed improvement for Intel too on this same benchmark
> (tbench). Borislav just went even further. I'd suggest testing that
> patch on Intel too, and wouldn't be surprised at all if it shows
> improvement there too.

See above.

> It's pgbench that then regressed with your patch, and I suspect it
> will regress with Borislav's too.

Yeah, strongly suspect you're right.

> You probably looked at the fact that the original report from Nikolay
> says that the Intel E6300 hadn't regressed on pgbench, but I suspect
> you didn't realize that E6300 is just a dual-core CPU without even HT.
> So I doubt it's about "Intel vs AMD", it's more about "six cores" vs
> "just two".

No, I knew, and yeah, it's about number of paths.

> And the thing is - with just two cores, the fact that your patch
> didn't change the Intel numbers is totally irrelevant. With two cores,
> the whole "buddy_cpu" was equivalent to the old code, since there was
> ever only one other core to begin with!
> 
> So AMD and Intel do have differences, but they aren't all that radical.

Looks fairly radical to me, but as noted in mail to Boris, it boils down
to "what does it cost, and where does the breakeven lie?".

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Mike Galbraith
On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:

> Right, so why did we need it all, in the first place? There has to be
> some reason for it.

Easy.  Take two communicating tasks.  Is an affine wakeup a good idea?
It depends on how much execution overlap there is.  Wake affine when
there is overlap larger than cache miss cost, and you just tossed
throughput into the bin.

select_idle_sibling() was originally about shared L2, where any overlap
was salvageable.  On modern processors with no shared L2, you have to
get past the cost, but the gain is still there.  Intel wins with loads
that AMD loses very badly on, so I can only guess that Intel must feed
caches more efficiently.  Dunno.  It just doesn't matter though, point
is that there is a win to be had in both cases, the breakeven just isn't
at the same point.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Linus Torvalds
On Tue, Sep 25, 2012 at 7:00 PM, Mike Galbraith  wrote:
>
> Yes.  On AMD, the best thing you can do for fast switchers AFAIKT is
> turn it off.  Different story on Intel.

I doubt it's all that different on Intel.

Your patch showed improvement for Intel too on this same benchmark
(tbench). Borislav just went even further. I'd suggest testing that
patch on Intel too, and wouldn't be surprised at all if it shows
improvement there too.

It's pgbench that then regressed with your patch, and I suspect it
will regress with Borislav's too.

So I'm sure there are architecture differences (where HT in particular
probably changes optimal scheduling strategy, although I'd expect the
bulldozer approach to not be *that* different - but I don't know if BD
shows up as "HT siblings" or not, so dissimilar topology
interpretation may make it *look* very different).

So I suspect the architectural differences are smaller than you claim,
and it's much more about the loads in question.

You probably looked at the fact that the original report from Nikolay
says that the Intel E6300 hadn't regressed on pgbench, but I suspect
you didn't realize that E6300 is just a dual-core CPU without even HT.
So I doubt it's about "Intel vs AMD", it's more about "six cores" vs
"just two".

And the thing is - with just two cores, the fact that your patch
didn't change the Intel numbers is totally irrelevant. With two cores,
the whole "buddy_cpu" was equivalent to the old code, since there was
ever only one other core to begin with!

So AMD and Intel do have differences, but they aren't all that radical.

  Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Mike Galbraith
On Tue, 2012-09-25 at 10:21 -0700, Linus Torvalds wrote: 
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov  wrote:
> >
> > 3.6-rc6+tip/auto-latest-kill select_idle_sibling()
> 
> Is this literally just removing it entirely? Because apart from the
> latency spike at 4 procs (and the latency numbers look very noisy, so
> that's probably just noise), it looks clearly superior to everything
> else. On that benchmark, at least.

Yes.  On AMD, the best thing you can do for fast switchers AFAIKT is
turn it off.  Different story on Intel.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case. Sad, because it really would be lovely to just remove that
> thing ;)

It _is_ irritating.  There's nohz, governors, and then comes radically
different cross cpu data blasting ability on top. On Intel, it wins at
the same fast movers it demolishes on AMD.  Throttle it, and that goes
away, along with some other issues.

Or just kill it, then integrate what it does for you into a smarter
lighter wakeup balance.. but then that has to climb those same hills.

-Mike



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Suresh Siddha
On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith  wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on my
> > E5620, cross core is a tbench win over affine, cross thread is not.
> 
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad.
> 
> It's more a question of "if we have real cores with separate L1's but
> shared L2's, go with those first, before we start distributing it out
> to separate L2's".

There is one issue though. If the tasks continue to run in this state
and the periodic balance notices an idle L2, it will force migrate
(using active migration) one of the tasks to the idle L2, as the
periodic balance tries to spread the load as far as possible to take
maximum advantage of the available resources (and the perf advantage of
this really depends on the workload, cache usage/memory bw, the upside
of turbo etc).

But I am not sure if this was the reason why we chose to spread it out
to separate L2's during wakeup.

Anyways, this is one of the places where Paul Turner's task load
average tracking patches will be useful. Depending on how long a task
typically runs, we can probably even choose an SMT sibling or a separate
L2 to run on.

thanks,
suresh



Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Linus Torvalds
On Tue, Sep 25, 2012 at 11:42 AM, Borislav Petkov  wrote:
>>
>> Is this literally just removing it entirely?
>
> Basically yes:

Ok, so you make it just always select 'target'. Fine. I wondered if
you just removed the calling logic entirely.

>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case.
>
> Let me run pgbench tomorrow (I had run it only on an older family 0x10
> single-node box) on Bulldozer to check that out. And we haven't started
> the multi-node measurements at all.

Ack, this clearly needs much more testing. That said, I really would
*love* to just get rid of the function entirely.

>> Sad, because it really would be lovely to just remove that thing ;)
>
> Right, so why did we need it all, in the first place? There has to be
> some reason for it.

I'm not entirely convinced.

Looking at the history of that thing, it's long and tortuous, and has
a few commits completely fixing the "logic" of it (eg see commit
99bd5e2f245d).

To the point where I don't think it necessarily even matches what the
original cause for it was. So it's *possible* that we have a case of
historical code that may have improved performance originally on at
least some machines, but that has (a) been changed due to it being
broken and (b) CPU's have changed too, so it may well be that it
simply doesn't help any more.

And we've had problems with this function before. See for example:
 - 4dcfe1025b51: sched: Avoid SMT siblings in select_idle_sibling() if possible
 - 518cd6234178: sched: Only queue remote wakeups when crossing cache boundaries

so we've basically had odd special-case "tuning" of this function from
the original. I do not think that there is any solid reason to believe
that it does what it used to do, or that what it used to do makes
sense any more.

It's entirely possible that "prev_cpu" basically ends up being the
better choice for spreading things out.

That said, my *guess* is that when you run pgbench, you'll see the
same regression that we saw due to Mike's patch too. It simply looks
like tbench wants to have minimal cpu selection and avoid moving
things around, while pgbench probably wants to spread out maximally.

 Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov  wrote:
> >
> > 3.6-rc6+tip/auto-latest-kill select_idle_sibling()
> 
> Is this literally just removing it entirely?

Basically yes:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a14b990..016ba387c7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2640,6 +2640,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_group *sg;
 	int i;
 
+	goto done;
+
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.

> Because apart from the latency spike at 4 procs (and the latency
> numbers look very noisy, so that's probably just noise), it looks
> clearly superior to everything else. On that benchmark, at least.

Yep, I need more results for a more reliable say here.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case.

Let me run pgbench tomorrow (I had run it only on an older family 0x10
single-node box) on Bulldozer to check that out. And we haven't started
the multi-node measurements at all.

> Sad, because it really would be lovely to just remove that thing ;)

Right, so why did we need it all, in the first place? There has to be
some reason for it.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Linus Torvalds
On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov  wrote:
>
> 3.6-rc6+tip/auto-latest-kill select_idle_sibling()

Is this literally just removing it entirely? Because apart from the
latency spike at 4 procs (and the latency numbers look very noisy, so
that's probably just noise), it looks clearly superior to everything
else. On that benchmark, at least.

How does pgbench look? That's the one that apparently really wants to
spread out, possibly due to user-level spinlocks. So I assume it will
show the reverse pattern, with "kill select_idle_sibling" being the
worst case. Sad, because it really would be lovely to just remove that
thing ;)

  Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 03:17:36PM +0200, Borislav Petkov wrote:
> For example, I did some measurements a couple of days ago on Bulldozer
> of tbench with and without select_idle_sibling:

Here are updated benchmark results with your patch here:
http://marc.info/?l=linux-kernel=134850871822587

I think this pretty much confirms Mel's results.

tbench runs single-socket OR-B (box has 8 cores, 4 CUs) (tbench_srv localhost), tbench default settings as in debian testing

# clients                                               1        2        4        8       12       16
3.6-rc6+tip/auto-latest                              115.91   238.571  469.606  1865.77  1863.08  1851.46
3.6-rc6+tip/auto-latest-kill select_idle_sibling():  354.619  534.714  900.069  1969.35  1955.91  1940.84
3.6-rc6+tip/auto-latest-revert-the-revert            114.001  223.171  408.507  1771.48  1757.08  1736.12
3.6-rc7+tip/auto-latest-select_idle_sibling-lists    107.39   222.439  435.255  1659.42  1697.43  1685.92

3.6-rc6+tip/auto-latest
---
Throughput 115.91 MB/sec   1 clients  1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec  2 clients  2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec  4 clients  4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec  8 clients  8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest-kill select_idle_sibling()
--
Throughput 354.619 MB/sec  1 clients  1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec  2 clients  2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec  4 clients  4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec  8 clients  8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

3.6-rc6+tip/auto-latest-revert-the-revert
-
Throughput 114.001 MB/sec  1 clients  1 procs  max_latency=0.352 ms
Throughput 223.171 MB/sec  2 clients  2 procs  max_latency=0.348 ms
Throughput 408.507 MB/sec  4 clients  4 procs  max_latency=0.388 ms
Throughput 1771.48 MB/sec  8 clients  8 procs  max_latency=0.280 ms
Throughput 1757.08 MB/sec  12 clients  12 procs  max_latency=3.280 ms
Throughput 1736.12 MB/sec  16 clients  16 procs  max_latency=0.333 ms

3.6-rc7+tip/auto-latest-select_idle_sibling-lists
-
Throughput 107.39 MB/sec  1 clients  1 procs  max_latency=0.372 ms
Throughput 222.439 MB/sec  2 clients  2 procs  max_latency=0.345 ms
Throughput 435.255 MB/sec  4 clients  4 procs  max_latency=0.346 ms
Throughput 1659.42 MB/sec  8 clients  8 procs  max_latency=3.497 ms
Throughput 1697.43 MB/sec  12 clients  12 procs  max_latency=3.205 ms
Throughput 1685.92 MB/sec  16 clients  16 procs  max_latency=0.331 ms

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 14:23 +0100, Mel Gorman wrote:
> It crashes on boot due to the fact that you created a function-scope variable
> called sd_llc in select_idle_sibling() and shadowed the actual sd_llc you
> were interested in. 

D'0h!


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Mel Gorman
On Mon, Sep 24, 2012 at 07:44:17PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:54 +0200, Peter Zijlstra wrote:
> > But let me try and come up with the list thing, I think we've
> > actually got that someplace as well. 
> 
> OK, I'm sure the below can be written better, but my brain is gone for
> the day...
> 

It crashes on boot due to the fact that you created a function-scope variable
called sd_llc in select_idle_sibling() and shadowed the actual sd_llc you
were interested in. Result: dereferenced uninitialised pointer and kaboom.
Trivial to fix so it boots at least.

This is a silly test for a scheduler patch but as "sched: Avoid SMT siblings
in select_idle_sibling() if possible" regressed 2% back in 3.2, it seemed
reasonable to retest with it.

KERNBENCH
                             3.6.0                3.6.0                 3.6.0
                       rc6-vanilla   rc6-mikebuddy-v1r1  rc6-idlesibling-v1r1
User    min       352.47 (  0.00%)     351.77 (  0.20%)     352.30 (  0.05%)
User    mean      353.10 (  0.00%)     352.78 (  0.09%)     352.77 (  0.09%)
User    stddev      0.41 (  0.00%)       0.56 (-36.13%)       0.35 ( 15.16%)
User    max       353.55 (  0.00%)     353.43 (  0.03%)     353.31 (  0.07%)
System  min        34.86 (  0.00%)      34.83 (  0.09%)      35.37 ( -1.46%)
System  mean       35.35 (  0.00%)      35.29 (  0.16%)      35.63 ( -0.80%)
System  stddev      0.41 (  0.00%)       0.40 (  0.10%)       0.15 ( 62.26%)
System  max        35.94 (  0.00%)      36.05 ( -0.31%)      35.81 (  0.36%)
Elapsed min       110.18 (  0.00%)     109.65 (  0.48%)     110.04 (  0.13%)
Elapsed mean      110.21 (  0.00%)     109.75 (  0.42%)     110.15 (  0.06%)
Elapsed stddev      0.03 (  0.00%)       0.07 (-167.83%)      0.09 (-207.56%)
Elapsed max       110.26 (  0.00%)     109.86 (  0.36%)     110.26 (  0.00%)
CPU     min       352.00 (  0.00%)     353.00 ( -0.28%)     352.00 (  0.00%)
CPU     mean      352.00 (  0.00%)     353.00 ( -0.28%)     352.00 (  0.00%)
CPU     stddev      0.00 (  0.00%)       0.00 (  0.00%)       0.00 (  0.00%)
CPU     max       352.00 (  0.00%)     353.00 ( -0.28%)     352.00 (  0.00%)

mikebuddy-v1r1 is Mike's patch that just got reverted. idlesibling is
Peter's patch. "Elapsed mean" time is the main value of interest. Mike's
patch gains 0.42%, which is less than the 2% lost, but at least the gain is
outside the noise. idlesibling makes very little difference. "System mean"
is also interesting because even though idlesibling shows a "regression", it
also shows that the variation between runs is reduced. That might indicate
that fewer cache misses are being incurred in the select_idle_sibling()
code, although that is a bit of a leap of faith.

The machine is in use at the moment, but I'll queue up a test this evening to
gather a profile and confirm that time is even being spent in
select_idle_sibling(). Just because 2% was lost in select_idle_sibling()
back in 3.2 does not mean squat now.

-- 
Mel Gorman
SUSE Labs


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Borislav Petkov
On Tue, Sep 25, 2012 at 01:58:06PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
> > In the not-so-distant past, we had the intel "Dunnington" Xeon, which
> > was iirc basically three Core 2 duo's bolted together (ie three
> > clusters of two cores sharing L2, and a fully shared L3). So that was
> > a true multi-core with fairly big shared L2, and it really would be
> > sad to not use the second core aggressively. 
> 
> Ah indeed. My Core2Quad didn't have an L3 afaik (its sitting around
> without a PSU atm so checking gets a little hard) so the LLC level was
> the L2 and all worked out right (it also not having SMT helped of
> course).
> 
> But if there was a Xeon chip that did add a package L3 then yes, all
> this would become more interesting still. We'd need to extend the
> scheduler topology a bit as well, I don't think it can currently handle
> this well.
> 
> So I guess we get to do some work for steamroller.

Right, but before that we can still do some experimenting on Bulldozer
- we have the shared 2M L2 there too and it would be nice to improve
select_idle_sibling there.

For example, I did some measurements a couple of days ago on Bulldozer
of tbench with and without select_idle_sibling:

tbench runs single-socket OR-B (box has 8 cores, 4 CUs) (tbench_srv localhost), tbench default settings as in debian testing

# clients                                               1        2        4        8       12       16
3.6-rc6+tip/auto-latest                              115.91   238.571  469.606  1865.77  1863.08  1851.46
3.6-rc6+tip/auto-latest-kill select_idle_sibling():  354.619  534.714  900.069  1969.35  1955.91  1940.84


3.6-rc6+tip/auto-latest
---
Throughput 115.91 MB/sec   1 clients  1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec  2 clients  2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec  4 clients  4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec  8 clients  8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest-kill select_idle_sibling()
--
Throughput 354.619 MB/sec  1 clients  1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec  2 clients  2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec  4 clients  4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec  8 clients  8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

So improving this select_idle_sibling thing wouldn't be such a bad
thing.

Btw, I'll run your patch at http://marc.info/?l=linux-kernel=134850571330618
with the same benchmark to see what it brings.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Hillf Danton
On Tue, Sep 25, 2012 at 12:54 AM, Peter Zijlstra  wrote:
> On Mon, 2012-09-24 at 09:33 -0700, Linus Torvalds wrote:
>> Sure, the "scan bits" bitops will return ">= nr_cpu_ids" for the "I
>> couldn't find a bit" thing, but that doesn't mean that everything else
>> should.
>
> Fair enough..
>
> ---
>  kernel/sched/fair.c | 42 +-
>  1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b800a1..329f78d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2634,25 +2634,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   */
>  static int select_idle_sibling(struct task_struct *p, int target)
>  {
> -   int cpu = smp_processor_id();
> -   int prev_cpu = task_cpu(p);
> struct sched_domain *sd;
> struct sched_group *sg;
> int i;
>
> -   /*
> -* If the task is going to be woken-up on this cpu and if it is
> -* already idle, then it is the right target.
> -*/
> -   if (target == cpu && idle_cpu(cpu))
> -   return cpu;
> -
> -   /*
> -* If the task is going to be woken-up on the cpu where it previously
> -* ran and if it is currently idle, then it the right target.
> -*/
> -   if (target == prev_cpu && idle_cpu(prev_cpu))
> -   return prev_cpu;
> +   if (idle_cpu(target))
> +   return target;
>
> /*
>  * Otherwise, iterate the domains and find an elegible idle cpu.
> @@ -2661,18 +2648,31 @@ static int select_idle_sibling(struct task_struct *p, int target)
> for_each_lower_domain(sd) {
> sg = sd->groups;
> do {
> -   if (!cpumask_intersects(sched_group_cpus(sg),
> -   tsk_cpus_allowed(p)))
> -   goto next;
> +   int candidate = -1;
>
> +   /*
> +* In the SMT case the groups are the SMT-siblings,
> +* otherwise they're singleton groups.
> +*/
> for_each_cpu(i, sched_group_cpus(sg)) {
> +   if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
> +   continue;
> +
> +   /*
> +* If any of the SMT-siblings are !idle, the
> +* core isn't idle.
> +*/
> if (!idle_cpu(i))
> goto next;
> +
> +   if (candidate < 0)
> +   candidate = i;


Any reason to determine candidate by scanning a non-idle core?
> }
>
> -   target = cpumask_first_and(sched_group_cpus(sg),
> -   tsk_cpus_allowed(p));
> -   goto done;
> +   if (candidate >= 0) {
> +   target = candidate;
> +   goto done;
> +   }
>  next:
> sg = sg->next;
> } while (sg != sd->groups);
>
> --


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Peter Zijlstra
On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
> In the not-so-distant past, we had the intel "Dunnington" Xeon, which
> was iirc basically three Core 2 duo's bolted together (ie three
> clusters of two cores sharing L2, and a fully shared L3). So that was
> a true multi-core with fairly big shared L2, and it really would be
> sad to not use the second core aggressively. 

Ah indeed. My Core2Quad didn't have an L3 afaik (it's sitting around
without a PSU atm so checking gets a little hard) so the LLC level was
the L2 and all worked out right (it also not having SMT helped of
course).

But if there was a Xeon chip that did add a package L3 then yes, all
this would become more interesting still. We'd need to extend the
scheduler topology a bit as well, I don't think it can currently handle
this well.

So I guess we get to do some work for steamroller.


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Linus Torvalds
On Tue, Sep 25, 2012 at 11:42 AM, Borislav Petkov b...@alien8.de wrote:

 Is this literally just removing it entirely?

 Basically yes:

Ok, so you make it just always select 'target'. Fine. I wondered if
you just removed the calling logic entirely.

>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case.

> Let me run pgbench tomorrow (I had run it only on an older family 0x10
> single-node box) on Bulldozer to check that out. And we haven't started
> the multi-node measurements at all.

Ack, this clearly needs much more testing. That said, I really would
*love* to just get rid of the function entirely.

>> Sad, because it really would be lovely to just remove that thing ;)

> Right, so why did we need it at all, in the first place? There has to be
> some reason for it.

I'm not entirely convinced.

Looking at the history of that thing, it's long and tortuous, and has
a few commits completely fixing the logic of it (e.g. see commit
99bd5e2f245d).

To the point where I don't think it necessarily even matches what the
original cause for it was. So it's *possible* that we have a case of
historical code that may have improved performance originally on at
least some machines, but that has (a) been changed due to it being
broken and (b) CPU's have changed too, so it may well be that it
simply doesn't help any more.

And we've had problems with this function before. See for example:
 - 4dcfe1025b51: sched: Avoid SMT siblings in select_idle_sibling() if possible
 - 518cd6234178: sched: Only queue remote wakeups when crossing cache boundaries

so we've basically had odd special-case tuning of this function from
the original. I do not think that there is any solid reason to believe
that it does what it used to do, or that what it used to do makes
sense any more.

It's entirely possible that prev_cpu basically ends up being the
better choice for spreading things out.

That said, my *guess* is that when you run pgbench, you'll see the
same regression that we saw due to Mike's patch too. It simply looks
like tbench wants to have minimal cpu selection and avoid moving
things around, while pgbench probably wants to spread out maximally.

 Linus


Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

2012-09-25 Thread Suresh Siddha
On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith efa...@gmx.de wrote:
>>
>> Aside from the cache pollution I recall having been mentioned, on my
>> E5620, cross core is a tbench win over affine, cross thread is not.
>
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad.
>
> It's more a question of if we have real cores with separate L1's but
> shared L2's, go with those first, before we start distributing it out
> to separate L2's.

There is one issue though. If the tasks continue to run in this state
and the periodic balance notices an idle L2, it will force migrate
(using active migration) one of the tasks to the idle L2, since the
periodic balance tries to spread the load as far as possible to take
maximum advantage of the available resources (the perf advantage of
this really depends on the workload: cache usage/memory bw, the upside
of turbo, etc.).

But I am not sure if this was the reason why we chose to spread it out
to separate L2's during wakeup.
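[Editor's note: the placement order Linus describes above — target first, then an idle real core behind the same L2, then an idle core behind a separate L2, and the HT sibling only as a last resort — can be sketched as a toy ranking over idle CPUs. Ordinary userspace C, not kernel code; the topology fields and ranking values are assumptions for illustration.]

```c
#include <assert.h>

struct cpu_info {
	int idle;	/* nonzero if the CPU is idle */
	int l2_id;	/* which L2 cache this CPU sits behind */
	int core_id;	/* CPUs with the same core_id are HT siblings */
};

/* Pick a wakeup CPU: target if idle, else the highest-ranked idle CPU.
 * Real core sharing target's L2 > real core behind a separate L2 >
 * target's HT sibling. */
static int pick_wakeup_cpu(const struct cpu_info *cpu, int ncpus, int target)
{
	int best = target, best_rank = 0;

	if (cpu[target].idle)
		return target;

	for (int i = 0; i < ncpus; i++) {
		int rank;

		if (i == target || !cpu[i].idle)
			continue;
		if (cpu[i].core_id == cpu[target].core_id)
			rank = 1;	/* HT sibling: resource contention */
		else if (cpu[i].l2_id == cpu[target].l2_id)
			rank = 3;	/* real core, shared L2: best */
		else
			rank = 2;	/* real core, separate L2 */
		if (rank > best_rank) {
			best_rank = rank;
			best = i;
		}
	}
	return best;		/* nothing idle: stay on target */
}
```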

Anyways, this is one of the places where Paul Turner's task load
average tracking patches will be useful. Depending on how long a task
typically runs, we can probably even choose an SMT sibling or a
separate L2 to run on.
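[Editor's note: a minimal sketch of the idea Suresh attributes to the load-tracking patches — let the task's tracked average runtime pick the placement. A short-running task can take the cheap SMT-sibling slot (it will likely sleep again before contention matters), while a long-running one is worth the separate L2. The threshold is a made-up illustrative number, not anything from the patches.]

```c
#include <assert.h>

enum placement { PLACE_SMT_SIBLING, PLACE_SEPARATE_L2 };

/* Decide placement from the task's tracked average runtime per
 * wakeup, in microseconds. Short runners stay close; long runners
 * get spread out to a separate L2. */
static enum placement place_by_runtime(unsigned long avg_runtime_us)
{
	const unsigned long threshold_us = 500;	/* illustrative cutoff */

	return avg_runtime_us < threshold_us ? PLACE_SMT_SIBLING
					     : PLACE_SEPARATE_L2;
}
```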

thanks,
suresh


