Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Fri, Sep 28, 2012 at 05:50:20AM +0200, Mike Galbraith wrote:
> And wakeup preemption is still disabled as well, correct?

Yes it is by default anyway:

$ cat /mnt/dbg/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY NO_WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS OWNER_SPIN NONTASK_POWER TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN

NO_WAKEUP_PREEMPTION brings a 9% improvement with pgbench, btw:

http://marc.info/?l=linux-kernel&m=134876312310048

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote:
> I wonder about this comment, for example:
>
>      * By using 'se' instead of 'curr' we penalize light tasks, so
>      * they get preempted easier. That is, if 'se' < 'curr' then
>      * the resulting gran will be larger, therefore penalizing the
>      * lighter, if otoh 'se' > 'curr' then the resulting gran will
>      * be smaller, again penalizing the lighter task.
>
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

No, weight is nice: nicing a task doesn't make it want to run less. So
preempting lighter tasks sooner means they disturb the heavier ones
less, which I think is what you want from nice.

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption?

No, the vruntime comparison between cpus is dubious; it's not complete
nonsense, but it's not 'correct' either. PJT has patches to improve
that based on his per-entity tracking stuff.
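For reference, the granularity scaling Peter describes lives in
wakeup_gran() in kernel/sched/fair.c. A simplified sketch of the
v3.6-era logic (modeled on the kernel source, not quoted verbatim; the
_sketch suffix marks the helpers as illustrations):

/*
 * Sketch of the wakeup-preemption check under discussion, modeled on
 * kernel/sched/fair.c circa v3.6 (simplified, not the verbatim source).
 *
 * The granularity is scaled by the *wakee's* weight: a lighter wakee
 * gets a larger gran (harder for it to preempt), a heavier wakee a
 * smaller one -- either way the lighter task of the pair loses out,
 * which is what the comment quoted above is describing.
 */
static u64 wakeup_gran_sketch(struct sched_entity *se)
{
	u64 gran = sysctl_sched_wakeup_granularity;	/* 1ms base, scaled by cpus */

	/* calc_delta_fair(): gran * NICE_0_LOAD / se->load.weight */
	return div_u64(gran * NICE_0_LOAD, se->load.weight);
}

/* returns true if the freshly woken 'se' should preempt 'curr' */
static bool should_preempt_sketch(struct sched_entity *curr,
				  struct sched_entity *se)
{
	s64 vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return false;	/* curr is at least as entitled to run */

	/* the wakee must lead by more than one (scaled) granularity */
	return vdiff > (s64)wakeup_gran_sketch(se);
}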
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote:
> On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra wrote:
> >
> > Don't forget to run the desktop interactivity benchmarks after you're
> > done wriggling with this knob... wakeup preemption is important for
> > most of those.
>
> So I don't think we want to *just* wiggle that knob per se. We
> definitely don't want to hurt latency on actual interactive tasks. But
> it's interesting that it helps psql so much, and that there seems to
> be some interaction with select_idle_sibling().
>
> So I do have a few things I react to when looking at that wakeup
> granularity..
>
> I wonder about this comment, for example:
>
>      * By using 'se' instead of 'curr' we penalize light tasks, so
>      * they get preempted easier. That is, if 'se' < 'curr' then
>      * the resulting gran will be larger, therefore penalizing the
>      * lighter, if otoh 'se' > 'curr' then the resulting gran will
>      * be smaller, again penalizing the lighter task.
>
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

Ah, that particular 'light' refers to se->load.weight.

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption?

I think vruntime should be fine. We take the delta between the task's
vruntime when it went to sleep and its previous rq's min_vruntime to
capture progress made while it slept, and apply the relative offset in
the task's new home, so a task can migrate and still have a chance to
preempt on wakeup.

> The fact that *either* changing select_idle_sibling() *or* changing
> the wakeup preemption granularity seems to have such a huge impact
> does seem to tie them together somehow for this particular load. No?

The way I read it, Boris had wakeup preemption disabled.

	-Mike
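The relative-offset trick Mike describes corresponds roughly to the
following (a sketch modeled on the v3.6-era CFS dequeue/enqueue paths,
from memory of the code rather than quoted from it):

/*
 * Sketch of how CFS keeps a migrating task's vruntime meaningful,
 * modeled on the v3.6-era (de)enqueue paths (simplified, not verbatim).
 */

/* on sleep/dequeue: keep only the offset from the old queue's clock */
static void normalize_on_dequeue_sketch(struct cfs_rq *cfs_rq,
					struct sched_entity *se)
{
	se->vruntime -= cfs_rq->min_vruntime;
}

/*
 * on wakeup/enqueue (possibly on another cpu): re-apply the new
 * queue's baseline, so the task keeps the same relative lag/lead
 * and can still win or lose the wakeup-preemption check.
 */
static void renormalize_on_enqueue_sketch(struct cfs_rq *cfs_rq,
					  struct sched_entity *se)
{
	se->vruntime += cfs_rq->min_vruntime;
}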
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 21:24 +0200, Borislav Petkov wrote:
> On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > > >> Or could we just improve the heuristics. What happens if the
> > > >> scheduling granularity is increased, for example? It's set to 1ms
> > > >> right now, with a logarithmic scaling by number of cpus.
> > > >
> > > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > > > --
> > > > tps = 4994.730809 (including connections establishing)
> > > > tps = 5000.260764 (excluding connections establishing)
> > > >
> > > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> > >
> > > Ok, so this gives us something possible to actually play with.
> > >
> > > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
> >
> > Don't forget to run the desktop interactivity benchmarks after you're
> > done wriggling with this knob... wakeup preemption is important for
> > most of those.
>
> Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
> wakeup_granularity go to 4ms:
>
> sched_autogroup_enabled:1
> sched_child_runs_first:0
> sched_latency_ns:24000000
> sched_migration_cost_ns:500000
> sched_min_granularity_ns:3000000
> sched_nr_migrate:32
> sched_rt_period_us:1000000
> sched_rt_runtime_us:950000
> sched_shares_window_ns:10000000
> sched_time_avg_ms:1000
> sched_tunable_scaling:2
> sched_wakeup_granularity_ns:4000000
>
> pgbench results look good:
>
> tps = 4997.675331 (including connections establishing)
> tps = 5003.256870 (excluding connections establishing)
>
> This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

And wakeup preemption is still disabled as well, correct?

	-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra wrote:
>
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for
> most of those.

So I don't think we want to *just* wiggle that knob per se. We
definitely don't want to hurt latency on actual interactive tasks. But
it's interesting that it helps psql so much, and that there seems to be
some interaction with select_idle_sibling().

So I do have a few things I react to when looking at that wakeup
granularity..

I wonder about this comment, for example:

     * By using 'se' instead of 'curr' we penalize light tasks, so
     * they get preempted easier. That is, if 'se' < 'curr' then
     * the resulting gran will be larger, therefore penalizing the
     * lighter, if otoh 'se' > 'curr' then the resulting gran will
     * be smaller, again penalizing the lighter task.

why would we want to preempt light tasks easier? It sounds backwards to
me. If they are light, we have *less* reason to preempt them, since
they are more likely to just go to sleep on their own, no?

Another question is whether the fact that this same load interacts with
select_idle_sibling() is perhaps a sign that maybe the preemption logic
is all fine, but it interacts badly with the "pick new cpu" code. In
particular, after having changed rq's, is the vruntime really
comparable? IOW, maybe this is an interaction between "place_entity()"
and then the immediately following (?) call to check wakeup preemption?

The fact that *either* changing select_idle_sibling() *or* changing the
wakeup preemption granularity seems to have such a huge impact does
seem to tie them together somehow for this particular load. No?

	Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > >> Or could we just improve the heuristics. What happens if the
> > >> scheduling granularity is increased, for example? It's set to 1ms
> > >> right now, with a logarithmic scaling by number of cpus.
> > >
> > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > > --
> > > tps = 4994.730809 (including connections establishing)
> > > tps = 5000.260764 (excluding connections establishing)
> > >
> > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> >
> > Ok, so this gives us something possible to actually play with.
> >
> > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
>
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for
> most of those.

Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
wakeup_granularity go to 4ms:

sched_autogroup_enabled:1
sched_child_runs_first:0
sched_latency_ns:24000000
sched_migration_cost_ns:500000
sched_min_granularity_ns:3000000
sched_nr_migrate:32
sched_rt_period_us:1000000
sched_rt_runtime_us:950000
sched_shares_window_ns:10000000
sched_time_avg_ms:1000
sched_tunable_scaling:2
sched_wakeup_granularity_ns:4000000

pgbench results look good:

tps = 4997.675331 (including connections establishing)
tps = 5003.256870 (excluding connections establishing)

This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

Thanks.

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 11:19 -0700, Linus Torvalds wrote:
> On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov wrote:
> > On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> >> Or could we just improve the heuristics. What happens if the
> >> scheduling granularity is increased, for example? It's set to 1ms
> >> right now, with a logarithmic scaling by number of cpus.
> >
> > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > --
> > tps = 4994.730809 (including connections establishing)
> > tps = 5000.260764 (excluding connections establishing)
> >
> > A bit better over the default NO_WAKEUP_PREEMPTION setting.
>
> Ok, so this gives us something possible to actually play with.
>
> For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

Don't forget to run the desktop interactivity benchmarks after you're
done wriggling with this knob... wakeup preemption is important for
most of those.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 10:45:06AM -0700, da...@lang.hm wrote:
> On Thu, 27 Sep 2012, Peter Zijlstra wrote:
> > On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> > > I think you are being too smart for your own good. you don't know
> > > if it's best to move them further apart or not.
> >
> > Well yes and no.. You're right, however in general the load-balancer
> > has always tried to not use (SMT) siblings whenever possible, in that
> > regard not using an idle sibling is consistent here.
> >
> > Also, for short running tasks the wakeup balancing is typically all
> > we have, the 'big' periodic load-balancer will 'never' see them,
> > making the multiple moves argument hard.
>
> For the initial startup of a new process, finding as idle and remote
> a core to start on (minimum sharing with existing processes) is
> probably the smart thing to do.

Right, but we don't schedule to the SMT siblings, as Peter says above.
So we can't get to the case where two SMT siblings are not overloaded
and the processes remain on the same L2.

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov wrote:
> On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
>> Or could we just improve the heuristics. What happens if the
>> scheduling granularity is increased, for example? It's set to 1ms
>> right now, with a logarithmic scaling by number of cpus.
>
> /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> --
> tps = 4994.730809 (including connections establishing)
> tps = 5000.260764 (excluding connections establishing)
>
> A bit better over the default NO_WAKEUP_PREEMPTION setting.

Ok, so this gives us something possible to actually play with.

For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate than
SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

(Btw, "linear" right now looks like 1:1. That's linear, but it's a very
aggressive linearity. Something like "factor = (cpus+1)/2" would also
be linear, but by a less extreme factor.)

	Linus
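The scaling knob Linus refers to is computed along these lines (a
sketch modeled on get_update_sysctl_factor() in kernel/sched/core.c of
that era, with his hypothetical gentler linear factor added for
comparison; IIRC the real code also caps the cpu count at 8):

#include <linux/log2.h>

/*
 * Sketch of the per-cpu-count scaling of the sched tunables, modeled
 * on get_update_sysctl_factor() (simplified; the real code IIRC caps
 * cpus at 8). The 1ms wakeup granularity base gets multiplied by the
 * returned factor.
 */
static unsigned int tunable_factor_sketch(unsigned int cpus)
{
	switch (sysctl_sched_tunable_scaling) {
	case SCHED_TUNABLESCALING_NONE:
		return 1;
	case SCHED_TUNABLESCALING_LINEAR:
		return cpus;			/* aggressive 1:1 scaling */
	case SCHED_TUNABLESCALING_LOG:
	default:
		return 1 + ilog2(cpus);		/* e.g. 4 cpus -> 3 */
	}
}

/* Linus's hypothetical "less extreme" linear factor from above: */
static unsigned int tunable_factor_half_linear_sketch(unsigned int cpus)
{
	return (cpus + 1) / 2;
}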
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 10:45 AM, da...@lang.hm wrote:
>
> For the initial startup of a new process, finding as idle and remote a
> core to start on (minimum sharing with existing processes) is probably
> the smart thing to do.

Actually, no. It's *exec* that should go remote. New processes (fork,
vfork or clone) absolutely should *not* go remote at all.

vfork() should stay on the same CPU (synchronous wakeup), fork() should
possibly go SMT (likely exec in the near future will spread it out),
and clone should likely just stay close too.

	Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 10:45 -0700, da...@lang.hm wrote:
> But I thought that this conversation (pgbench) was dealing with long
> running processes,

Ah, I think we've got a confusion on long vs short.. yes, pgbench is a
long-running process, however the tasks might not be long in runnable
state. I.e., it receives a request, computes a bit, blocks on IO,
computes a bit, replies, goes idle to wait for a new request.

If all those runnable sections are short enough, it will 'never' be
around when the periodic load-balancer does its thing, since that only
looks at the tasks in runnable state at that moment in time.

I say 'never' because while it will occasionally show up due to pure
chance, it will unlikely be a very big player in placement. Once a cpu
is overloaded enough to get real queueing they'll show up, get
dispersed, and then it's back to wakeup stuff.

Then again, it might be completely irrelevant to pgbench, it's been a
while since I looked at how it schedules.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> Or could we just improve the heuristics. What happens if the
> scheduling granularity is increased, for example? It's set to 1ms
> right now, with a logarithmic scaling by number of cpus.

/proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
--
tps = 4994.730809 (including connections establishing)
tps = 5000.260764 (excluding connections establishing)

A bit better over the default NO_WAKEUP_PREEMPTION setting.

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> > I think you are being too smart for your own good. you don't know if
> > it's best to move them further apart or not.
>
> Well yes and no.. You're right, however in general the load-balancer
> has always tried to not use (SMT) siblings whenever possible, in that
> regard not using an idle sibling is consistent here.
>
> Also, for short running tasks the wakeup balancing is typically all we
> have, the 'big' periodic load-balancer will 'never' see them, making
> the multiple moves argument hard.

For the initial startup of a new process, finding as idle and remote a
core to start on (minimum sharing with existing processes) is probably
the smart thing to do.

But I thought that this conversation (pgbench) was dealing with long
running processes, and how to deal with the overload where one master
process is kicking off many child processes and the core that the
master process starts off on gets overloaded as a result, with the
question being how to spread the load out from this one core as it
gets overloaded.

David Lang

> Measuring resource contention on the various levels is a fun research
> subject, I've spoken to various people who are/were doing so, I've
> always encouraged them to send their code just so we can see/learn,
> even if not integrate; sadly I can't remember ever having seen any of
> it :/
>
> And yeah, all the load-balancing stuff is very near to scrying or
> tea-leaf reading. We can't know all current state (too expensive) nor
> can we know the future.
>
> That said, I'm all for less/simpler code, pesky benchmarks aside ;-)
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 12:10 AM, Ingo Molnar wrote:
>
> Just in case someone prefers patches to user-space approaches (I
> certainly do!), here's one that turns off wakeup driven
> preemption by default.

Ok, so apparently this fixes performance in a big way, and might allow
us to simplify select_idle_sibling(), which is clearly way too random.

That is, if we could make it automatic, some way. Not the "let the user
tune it" approach - that's just fundamentally broken.

What is the common pattern for the wakeups for psql? Can we detect this
somehow? Are they sync? It looks wrong to preempt for sync wakeups, for
example, but we seem to do that.

Or could we just improve the heuristics. What happens if the scheduling
granularity is increased, for example? It's set to 1ms right now, with
a logarithmic scaling by number of cpus.

	Linus
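On the "sync" question: the wakeup path does carry a WF_SYNC flag, so
suppressing preemption for sync wakeups could hypothetically look like
this (purely an illustration of the question being asked, not existing
kernel behaviour):

/*
 * Hypothetical sketch only -- NOT what the kernel did at the time.
 * Illustrates the question above: skip wakeup preemption when the
 * waker has flagged the wakeup as synchronous (WF_SYNC), i.e. it
 * promises to block right after waking us.
 */
static void check_preempt_wakeup_sync_sketch(struct rq *rq,
					     struct task_struct *p,
					     int wake_flags)
{
	if (wake_flags & WF_SYNC) {
		/*
		 * The waker is about to sleep anyway (e.g. a blocking
		 * send); letting it finish may beat preempting it now.
		 */
		return;
	}

	/* ... fall through to the usual vruntime/granularity check ... */
}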
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> I think you are being too smart for your own good. you don't know if
> it's best to move them further apart or not.

Well yes and no.. You're right, however in general the load-balancer
has always tried to not use (SMT) siblings whenever possible; in that
regard not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we
have, the 'big' periodic load-balancer will 'never' see them, making
the multiple moves argument hard.

Measuring resource contention on the various levels is a fun research
subject, I've spoken to various people who are/were doing so, I've
always encouraged them to send their code just so we can see/learn,
even if not integrate; sadly I can't remember ever having seen any of
it :/

And yeah, all the load-balancing stuff is very near to scrying or
tea-leaf reading. We can't know all current state (too expensive) nor
can we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Borislav Petkov wrote:

> On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to overload
> > the L2 is an impossible task, so just assume that it will all fit,
> > and the worst case is you have one balancing cycle where you can't do
> > as much work and then the normal balancing will kick in and move
> > something anyway.
>
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2. But what do I do in the
> cases where the two tasks don't overload the L2 and it is actually
> beneficial to keep them there? How does the load balancer know that?

no, I'm saying that you should assume that the two tasks won't overload
the L2, try it, and if they do overload the L2, move one of the tasks
again the next balancing cycle.

there is a lot of possible sharing going on between 'cores':

  shared everything (a single core)
  different registers, shared everything else (HT core)
  shared floating point, shared cache, different everything else
  shared L2/L3/Memory, different everything else
  shared L3/Memory, different everything else
  shared Memory, different everything else
  different everything

and just wait a couple of years and someone will add a new entry to
this list (if I haven't already missed a few :-)

the more that is shared, the cheaper it is to move the process (the
less cached state you throw away), so ideally you want to move the
process as little as possible, just enough to eliminate whatever the
contended resource is. But since you really don't know the footprint of
each process in each of these layers, all you can measure is what
percentage of the total core time the process used, so just move it a
little and see if that was enough.

David Lang
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
> > For example, it starts with the maximum target scheduling domain, and
> > works its way in over the scheduling groups within that domain. What
> > the f*ck is the logic of that kind of crazy thing? It never makes
> > sense to look at a biggest domain first.
>
> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains,
> it only has the LLC domain; if you want more we'll need to fix that.
> For now it's a fixed:
>
>   SMT
>   MC (llc)
>   CPU (package/machine-for-!numa)
>   NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.

I think you are being too smart for your own good. you don't know if
it's best to move them further apart or not.

I'm arguing that you can't know. so I'm saying do the simple thing.

if a core is overloaded, move to an idle core that is as close as
possible to the core you start from (as much shared as possible).

if this does not overload the shared resource, you did the right thing.

if this does overload the shared resource, it's still no worse than
leaving it on the original core (which was shared everything, so you've
reduced the sharing a little bit)

the next balancing cycle you then work to move something again, and
since both the original and new core show as overloaded (due to the
contention on the shared resources), you move something to another core
that shares just a little less.

Yes, this means that it may take more balancing cycles to move things
far enough apart to reduce the sharing enough to avoid overload of the
shared resource, but I don't see any way that you can possibly guess if
two processes are going to overload the shared resource ahead of time.

It may be that simply moving to a HT core (and no longer contending for
registers) is enough to let both processes fly, or it may be that the
overload is in a shared floating point unit or L1 cache and you need to
move further away, or you may find the contention is in the L2 cache
and move further away, or it could be in the L3 cache, or it could be
in the memory interface (NUMA)

Without being able to predict the future, you don't know how far away
you need to move the tasks to have them operate at the optimal level.
All that you do know is that the shorter the move, the less expensive
the move. So make each move be as short as possible, and measure again
to see if that was enough. For some workloads, it will be. For many
workloads the least expensive move won't be.

The question is if doing multiple, cheap moves (requiring simple
checking for each move) ends up being a win compared to doing better
guessing over when the more expensive moves are worth it. Given how
chips change from year to year, I don't see how the 'better guessing'
is going to survive more than a couple of chip releases in any case.

David Lang
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 09:10:11AM +0200, Ingo Molnar wrote:
> The theory would be that this patch fixes psql performance, with CPU
> selection being a measurable but second order of magnitude effect. How
> well does practice match theory in this case?

Yeah, it looks a bit better than default linux. A whopping 9% perf
delta :-).

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest)
+ performance governor
==

plain
-----
tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

kill select_idle_sibling
------------------------
tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)

NO_WAKEUP_PREEMPTION
--------------------
tps = 4991.206742 (including connections establishing)
tps = 4996.743622 (excluding connections establishing)

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 12:20 +0200, Borislav Petkov wrote:
> On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to overload
> > the L2 is an impossible task, so just assume that it will all fit,
> > and the worst case is you have one balancing cycle where you can't do
> > as much work and then the normal balancing will kick in and move
> > something anyway.
>
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2. But what do I do in the
> cases where the two tasks don't overload the L2 and it is actually
> beneficial to keep them there? How does the load balancer know that?

It doesn't, but it has task_hot(). A preempted buddy may be pulled, but
the next wakeup will try to bring buddies back together.

	-Mike
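The task_hot() test Mike mentions works roughly as follows (a
simplified sketch of the v3.6-era kernel/sched/fair.c helper):

/*
 * Sketch of task_hot(), modeled on kernel/sched/fair.c circa v3.6
 * (simplified -- the real one also special-cases buddies and
 * non-fair classes): a task that ran very recently is assumed
 * cache-hot, and the periodic balancer avoids pulling it if it can.
 */
static int task_hot_sketch(struct task_struct *p, u64 now)
{
	s64 delta;

	if (sysctl_sched_migration_cost == (unsigned int)-1)
		return 1;		/* tuned: everything is hot */
	if (sysctl_sched_migration_cost == 0)
		return 0;		/* tuned: nothing is hot */

	delta = now - p->se.exec_start;	/* ns since it last ran */

	return delta < (s64)sysctl_sched_migration_cost;
}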
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> It seems to me that trying to figure out if you are going to overload
> the L2 is an impossible task, so just assume that it will all fit, and
> the worst case is you have one balancing cycle where you can't do as
> much work and then the normal balancing will kick in and move
> something anyway.

Right, and this implies that when the load balancer runs, it will
definitely move the task away from the L2. But what do I do in the
cases where the two tasks don't overload the L2 and it is actually
beneficial to keep them there? How does the load balancer know that?

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
>
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first.

That's about SMT, it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain; if you want more we'll need to fix that. For
now it's a fixed:

  SMT
  MC (llc)
  CPU (package/machine-for-!numa)
  NUMA

So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.
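A for_each_domain() walk over that fixed hierarchy goes
smallest-domain-first, so finding the widest cache-sharing level can be
sketched like this (the helper name is made up for illustration;
modeled on the SD_SHARE_PKG_RESOURCES flag of that era):

/*
 * Sketch: walk the domain hierarchy bottom-up (SMT -> MC -> CPU ->
 * NUMA) and remember the widest level that still shares a last-level
 * cache. Illustration only; the caller would hold rcu_read_lock().
 */
static struct sched_domain *find_llc_domain_sketch(int cpu)
{
	struct sched_domain *sd, *llc = NULL;

	for_each_domain(cpu, sd) {	/* smallest domain first */
		if (sd->flags & SD_SHARE_PKG_RESOURCES)
			llc = sd;	/* SMT/MC: still cache-shared */
		else
			break;		/* crossed the LLC boundary */
	}

	return llc;
}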
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 00:17 -0700, da...@lang.hm wrote:
> over the long term, the work lost due to not moving optimally right
> away is probably much less than the work lost due to trying to figure
> out the perfect thing to do.

Yeah, "Perfect is the enemy of good" definitely applies. Once you're
ramped, less is more.

	-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 26 Sep 2012, Borislav Petkov wrote:

> > It always selected target_cpu, but the fact is, that doesn't really
> > sound very sane. The target cpu is either the previous cpu or the
> > current cpu, depending on whether they should be balanced or not. But
> > that still doesn't make any *sense*.
> >
> > In fact, the whole select_idle_sibling() logic makes no sense
> > what-so-ever to me. It seems to be total garbage.
> >
> > For example, it starts with the maximum target scheduling domain, and
> > works its way in over the scheduling groups within that domain. What
> > the f*ck is the logic of that kind of crazy thing? It never makes
> > sense to look at a biggest domain first. If you want to be close to
> > something, you want to look at the *smallest* domain first. But
> > because it looks at things in the wrong order, it then needs to have
> > that inner loop saying "does this group actually cover the cpu I am
> > interested in?"
> >
> > Please tell me I am mis-reading this?
>
> First of all, I'm so *not* a scheduler guy so take this with a great
> pinch of salt.
>
> The way I understand it is, you either want to share L2 with a
> process, because, for example, both working sets fit in the L2 and/or
> there's some sharing which saves you moving everything over the L3.
> This is where selecting a core on the same L2 is actually a good
> thing.
>
> Or, they're too big to fit into the L2 and they start kicking
> each-other out. Then you want to spread them out to different L2s -
> i.e., different HT groups in Intel-speak.

an observation from an outsider here.

if you do overload a L2 cache, then the core will be busy all the time
and you will end up migrating a task away from that core.

It seems to me that trying to figure out if you are going to overload
the L2 is an impossible task, so just assume that it will all fit, and
the worst case is you have one balancing cycle where you can't do as
much work and then the normal balancing will kick in and move something
anyway.

over the long term, the work lost due to not moving optimally right
away is probably much less than the work lost due to trying to figure
out the perfect thing to do.

and since the perfect thing to do is going to be both workload and chip
specific, trying to model that in your decision making is a lost cause.

David Lang
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Mike Galbraith wrote:

> > Do you have an easy-to-apply hack patch by chance that has
> > the effect of turning off all such preemption, which people
> > could try?
>
> They don't need any hacks, all they have to do is start
> postgresql SCHED_BATCH, then run pgbench the same way.
>
> I use schedctl, but in chrt speak, chrt -b 0
> /etc/init.d/postgresql start, and then the same for pgbench
> itself.

Just in case someone prefers patches to user-space approaches (I
certainly do!), here's one that turns off wakeup driven preemption by
default. It can be turned back on via:

  echo WAKEUP_PREEMPTION > /debug/sched_features

and off again via:

  echo NO_WAKEUP_PREEMPTION > /debug/sched_features

(the patch is completely untested and such.)

The theory would be that this patch fixes psql performance, with CPU
selection being a measurable but second order of magnitude effect. How
well does practice match theory in this case?

Thanks,

	Ingo

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..f936552 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2907,7 +2907,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
 	 */
-	if (unlikely(p->policy != SCHED_NORMAL))
+	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
 		return;
 
 	find_matching_se(&se, &pse);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..e68e69a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,11 @@ SCHED_FEAT(LAST_BUDDY, true)
 SCHED_FEAT(CACHE_HOT_BUDDY, true)
 
 /*
+ * Allow wakeup-time preemption of the current task:
+ */
+SCHED_FEAT(WAKEUP_PREEMPTION, false)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, true)
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 08:41 +0200, Ingo Molnar wrote:
> * Mike Galbraith wrote:
>
> > > Just to confirm, if you turn off all preemption via a hack
> > > (basically if you turn SCHED_OTHER into SCHED_BATCH), does
> > > psql perform and scale much better, with the quality of
> > > sibling selection and spreading of processes only being a
> > > secondary effect?
> >
> > That has always been the case here. Preemption dominates.
>
> Yes, so we get the best psql performance if we allow the central
> proxy process to dominate a single CPU (IIRC it can easily go up
> to 100% CPU utilization on that CPU - it is what determines max
> psql throughput), and not let any worker run there much, right?

Running the thing RT didn't cut it iirc (will try that again). For RT,
we won't look for an empty spot on wakeup, we'll just squash an ant.

> > Others should play with it too, and let their boxen speak.
>
> Do you have an easy-to-apply hack patch by chance that has the
> effect of turning off all such preemption, which people could
> try?

They don't need any hacks, all they have to do is start postgresql
SCHED_BATCH, then run pgbench the same way.

I use schedctl, but in chrt speak: chrt -b 0 /etc/init.d/postgresql
start, and then the same for pgbench itself.

	-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Mike Galbraith wrote:

> > Just to confirm, if you turn off all preemption via a hack
> > (basically if you turn SCHED_OTHER into SCHED_BATCH), does
> > psql perform and scale much better, with the quality of
> > sibling selection and spreading of processes only being a
> > secondary effect?
>
> That has always been the case here. Preemption dominates.

Yes, so we get the best psql performance if we allow the central proxy
process to dominate a single CPU (IIRC it can easily go up to 100% CPU
utilization on that CPU - it is what determines max psql throughput),
and not let any worker run there much, right?

> Others should play with it too, and let their boxen speak.

Do you have an easy-to-apply hack patch by chance that has the effect
of turning off all such preemption, which people could try?

Thanks,

	Ingo
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 07:47 +0200, Ingo Molnar wrote:
> * Mike Galbraith wrote:
>
> > I think the pgbench problem is more about latency for the 1 in
> > 1:N than spinlocks.
>
> So my understanding of the psql workload is that basically we've
> got a central psql proxy process that is distributing work to
> worker psql processes. If a freshly woken worker process ever
> preempts the central proxy process then it is preventing a lot
> of new work from getting distributed.
>
> Correct?

Yeah, that's my understanding of the thing, and I played with it quite
a bit in the past (only refreshed memories briefly in present).

> So the central proxy psql process is 'much more important' to
> run than any of the worker processes - an importance that is not
> (currently) visible from the behavioral statistics the scheduler
> keeps on tasks.

Yeah. We had the adaptive waker thing, but it stopped being a winner at
the one load it originally did help quite a lot, and it didn't help
pgbench all that much in its then-form anyway iirc.

> So the scheduler has the following problem here: a new wakee
> might be starved enough and the proxy might have run long enough
> to really justify the preemption here and now. The buddy
> statistics help avoid some of these cases - but not all and the
> difference is measurable.
>
> Yet the 'best' way for psql to run is for this proxy process to
> never be preempted. Your SCHED_BATCH experiments confirmed that.

Yes.

> The way remote CPU selection affects it is that if we ever get
> more aggressive in selecting a remote CPU then we, as a side
> effect, also reduce the chance of harmful preemption of the
> central proxy psql process.

Right.

> So in that sense sibling selection is somewhat of an indirect
> red herring: it really only helps psql indirectly by preventing
> the harmful preemption. It also, somewhat paradoxically, argues
> for suboptimal code: for example tearing apart buddies is
> beneficial in the psql workload, because it also allows the more
> important part of the buddy to run more (the proxy).

Yes, I believe preemption dominates, but it's not alone, you can see
that in the numbers.

> In that sense the *real* problem isn't even parallelism (although
> we obviously should improve the decisions there - and the logic
> has suffered in the past from the psql dilemma outlined above),
> but whether the scheduler can (and should) identify the central
> proxy and keep it running as much as possible, deprioritizing
> fairness, wakeup buddies, runtime overlap and cache affinity
> considerations.
>
> There's two broad solutions that I can see:
>
>  - Add a kernel solution to somehow identify 'central' processes
>    and bias them. Xorg is a similar kind of process, so it would
>    help other workloads as well. That way lie dragons, but might
>    be worth an attempt or two. We already try to do a couple of
>    robust metrics, like overlap statistics to identify buddies.

What we do now works well for X and friends I think, because there
aren't so many buddies. It might work better though, and for the same
reasons. I've in fact [re]invented a SCHED_SERVER class a few times,
but never one that survived my own scrutiny for long. Arrr, here there
be dragons is true ;-)

>  - Let user-space occasionally identify its important (and less
>    important) tasks - say psql could mark its worker processes as
>    SCHED_BATCH and keep its central process(es) higher prio. A
>    single line of obvious code in 100 KLOCs of user-space code.
>
> Just to confirm, if you turn off all preemption via a hack
> (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql
> perform and scale much better, with the quality of sibling
> selection and spreading of processes only being a secondary
> effect?

That has always been the case here. Preemption dominates.

Others should play with it too, and let their boxen speak.

	-Mike
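For completeness, the "single line of obvious code" Ingo has in mind
would amount to something like this in user-space (an illustrative
sketch, not actual PostgreSQL source):

/*
 * Illustrative user-space sketch: a server marks a worker process
 * SCHED_BATCH so it cannot wakeup-preempt the central proxy.
 * Not actual PostgreSQL code.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

static int make_worker_batch(pid_t worker)
{
	struct sched_param sp = { .sched_priority = 0 };	/* must be 0 */

	if (sched_setscheduler(worker, SCHED_BATCH, &sp) != 0) {
		perror("sched_setscheduler(SCHED_BATCH)");
		return -1;
	}
	return 0;
}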
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 07:47 +0200, Ingo Molnar wrote: * Mike Galbraith efa...@gmx.de wrote: I think the pgbench problem is more about latency for the 1 in 1:N than spinlocks. So my understanding of the psql workload is that basically we've got a central psql proxy process that is distributing work to worker psql processes. If a freshly woken worker process ever preempts the central proxy process then it is preventing a lot of new work from getting distributed. Correct? Yeah, that's my understanding of the thing, and I played with it quite a bit in the past (only refreshed memories briefly in present). So the central proxy psql process is 'much more important' to run than any of the worker processes - an importance that is not (currently) visible from the behavioral statistics the scheduler keeps on tasks. Yeah. We had the adaptive waker thing, but it stopped being a winner at the one load it originally did help quite a lot, and it didn't help pgbench all that much in it's then form anyway iirc. So the scheduler has the following problem here: a new wakee might be starved enough and the proxy might have run long enough to really justify the preemption here and now. The buddy statistics help avoid some of these cases - but not all and the difference is measurable. Yet the 'best' way for psql to run is for this proxy process to never be preempted. Your SCHED_BATCH experiments confirmed that. Yes. The way remote CPU selection affects it is that if we ever get more aggressive in selecting a remote CPU then we, as a side effect, also reduce the chance of harmful preemption of the central proxy psql process. Right. So in that sense sibling selection is somewhat of an indirect red herring: it really only helps psql indirectly by preventing the harmful preemption. It also, somewhat paradoxially argues for suboptimal code: for example tearing apart buddies is beneficial in the psql workload, because it also allows the more important part of the buddy to run more (the proxy). Yes, I believe preemption dominates, but it's not alone, you can see that in the numbers. In that sense the *real* problem isnt even parallelism (although we obviously should improve the decisions there - and the logic has suffered in the past from the psql dilemma outlined above), but whether the scheduler can (and should) identify the central proxy and keep it running as much as possible, deprioritizing fairness, wakeup buddies, runtime overlap and cache affinity considerations. There's two broad solutions that I can see: - Add a kernel solution to somehow identify 'central' processes and bias them. Xorg is a similar kind of process, so it would help other workloads as well. That way lie dragons, but might be worth an attempt or two. We already try to do a couple of robust metrics, like overlap statistics to identify buddies. What we do now works well for X and friends I think, because there aren't so many buddies It might work better though, and for the same reasons. I've in fact [re]invented a SCHED_SERVER class a few times, but never one that survived my own scrutiny for long. Arrr, here there be dragons is true ;-) - Let user-space occasionally identify its important (and less important) tasks - say psql could mark it worker processes as SCHED_BATCH and keep its central process(es) higher prio. A single line of obvious code in 100 KLOCs of user-space code. 
Just to confirm, if you turn off all preemption via a hack (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql perform and scale much better, with the quality of sibling selection and spreading of processes only being a secondary effect? That has always been the case here. Preemption dominates. Others should play with it too, and let their boxen speak. -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Mike Galbraith efa...@gmx.de wrote: Just to confirm, if you turn off all preemption via a hack (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql perform and scale much better, with the quality of sibling selection and spreading of processes only being a secondary effect? That has always been the case here. Preemption dominates. Yes, so we get the best psql performance if we allow the central proxy process to dominate a single CPU (IIRC it can easily go up to 100% CPU utilization on that CPU - it is what determines max psql throughput), and not let any worker run there much, right? Others should play with it too, and let their boxen speak. Do you have an easy-to-apply hack patch by chance that has the effect of turning off all such preemption, which people could try? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 08:41 +0200, Ingo Molnar wrote: * Mike Galbraith efa...@gmx.de wrote: Just to confirm, if you turn off all preemption via a hack (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql perform and scale much better, with the quality of sibling selection and spreading of processes only being a secondary effect? That has always been the case here. Preemption dominates. Yes, so we get the best psql performance if we allow the central proxy process to dominate a single CPU (IIRC it can easily go up to 100% CPU utilization on that CPU - it is what determines max psql throughput), and not let any worker run there much, right? Running the thing RT didn't cut it iirc (will try that again). For RT, we won't look for an empty spot on wakeup, we'll just squash an ant. Others should play with it too, and let their boxen speak. Do you have an easy-to-apply hack patch by chance that has the effect of turning off all such preemption, which people could try? They don't need any hacks, all they have to do is start postgreqsl SCHED_BATCH, then run pgbench the same way. I use schedctl, but in chrt speak, chrt -b 0 /etc/init.d/postgresql start, and then the same for pgbench itself. -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Mike Galbraith efa...@gmx.de wrote: Do you have an easy-to-apply hack patch by chance that has the effect of turning off all such preemption, which people could try? They don't need any hacks, all they have to do is start postgreqsl SCHED_BATCH, then run pgbench the same way. I use schedctl, but in chrt speak, chrt -b 0 /etc/init.d/postgresql start, and then the same for pgbench itself. Just in case someone prefers patches to user-space approaches (I certainly do!), here's one that turns off wakeup driven preemption by default. It can be turned back on via: echo WAKEUP_PREEMPTION /debug/sched_features and off again via: echo NO_WAKEUP_PREEMPTION /debug/sched_features (the patch is completely untested and such.) The theory would be that this patch fixes psql performance, with CPU selection being a measurable but second order of magnitude effect. How well does practice match theory in this case? Thanks, Ingo - diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6b800a1..f936552 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2907,7 +2907,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ * Batch and idle tasks do not preempt non-idle tasks (their preemption * is driven by the tick): */ - if (unlikely(p-policy != SCHED_NORMAL)) + if (unlikely(p-policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) return; find_matching_se(se, pse); diff --git a/kernel/sched/features.h b/kernel/sched/features.h index eebefca..e68e69a 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -32,6 +32,11 @@ SCHED_FEAT(LAST_BUDDY, true) SCHED_FEAT(CACHE_HOT_BUDDY, true) /* + * Allow wakeup-time preemption of the current task: + */ +SCHED_FEAT(WAKEUP_PREEMPTION, false) + +/* * Use arch dependent cpu power functions */ SCHED_FEAT(ARCH_POWER, true) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 26 Sep 2012, Borislav Petkov wrote: It always selected target_cpu, but the fact is, that doesn't really sound very sane. The target cpu is either the previous cpu or the current cpu, depending on whether they should be balanced or not. But that still doesn't make any *sense*. In fact, the whole select_idle_sibling() logic makes no sense what-so-ever to me. It seems to be total garbage. For example, it starts with the maximum target scheduling domain, and works its way in over the scheduling groups within that domain. What the f*ck is the logic of that kind of crazy thing? It never makes sense to look at a biggest domain first. If you want to be close to something, you want to look at the *smallest* domain first. But because it looks at things in the wrong order, it then needs to have that inner loop saying does this group actually cover the cpu I am interested in? Please tell me I am mis-reading this? First of all, I'm so *not* a scheduler guy so take this with a great pinch of salt. The way I understand it is, you either want to share L2 with a process, because, for example, both working sets fit in the L2 and/or there's some sharing which saves you moving everything over the L3. This is where selecting a core on the same L2 is actually a good thing. Or, they're too big to fit into the L2 and they start kicking each-other out. Then you want to spread them out to different L2s - i.e., different HT groups in Intel-speak. an observation from an outsider here. if you do overload a L2 cache, then the core will be busy all the time and you will end up migrating a task away from that core. It seems to me that trying to figure out if you are going to overload the L2 is an impossible task, so just assume that it will all fit, and the worst case is you have one balancing cycle where you can't do as much work and then the normal balancing will kick in and move something anyway. over the long term, the work lost due to not moving optimally right away is probably much less than the work lost due to trying to figure out the perfect thing to do. and since the perfect thing to do is going to be both workload and chip specific, trying to model that in your decision making is a lost cause. David Lang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 00:17 -0700, da...@lang.hm wrote:

> over the long term, the work lost due to not moving optimally right
> away is probably much less than the work lost due to trying to figure
> out the perfect thing to do.

Yeah, "Perfect is the enemy of good" definitely applies. Once you're ramped, less is more.

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:

> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first.

That's about SMT, it was felt that you don't want SMT siblings first because typically SMT siblings are somewhat under-powered compared to actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it only has the LLC domain, if you want more we'll need to fix that. For now its a fixed:

  SMT
  MC (llc)
  CPU (package/machine-for-!numa)
  NUMA

So in your patch, your for_each_domain() loop will really only do the SMT/MC levels and prefer an SMT sibling over an idle core.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:

> It seems to me that trying to figure out if you are going to overload
> the L2 is an impossible task, so just assume that it will all fit, and
> the worst case is you have one balancing cycle where you can't do as
> much work and then the normal balancing will kick in and move
> something anyway.

Right, and this implies that when the load balancer runs, it will definitely move the task away from the L2.

But what do I do in the cases where the two tasks don't overload the L2 and it is actually beneficial to keep them there? How does the load balancer know that?

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 12:20 +0200, Borislav Petkov wrote:

> On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to
> > overload the L2 is an impossible task, so just assume that it will
> > all fit, and the worst case is you have one balancing cycle where
> > you can't do as much work and then the normal balancing will kick in
> > and move something anyway.
>
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2.
>
> But what do I do in the cases where the two tasks don't overload the
> L2 and it is actually beneficial to keep them there? How does the load
> balancer know that?

It doesn't, but it has task_hot(). A preempted buddy may be pulled, but the next wakeup will try to bring buddies back together.

-Mike
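[For readers following along: task_hot() boils down to a recency test. The following is a paraphrased, self-contained sketch of its core idea, not the exact kernel code — the real function in kernel/sched/fair.c also special-cases buddies and reads its timestamps off the runqueue clock.]

#include <stdint.h>

/*
 * Paraphrased sketch of the task_hot() idea: a task is presumed
 * cache-hot, and thus expensive to migrate, if it ran within the
 * last sched_migration_cost_ns nanoseconds.
 */
static int task_hot(uint64_t now_ns, uint64_t last_ran_ns,
                    int64_t migration_cost_ns)
{
	if (migration_cost_ns == -1)
		return 1;	/* -1 means: always considered hot */
	if (migration_cost_ns == 0)
		return 0;	/* 0 means: never considered hot */

	return (int64_t)(now_ns - last_ran_ns) < migration_cost_ns;
}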
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 09:10:11AM +0200, Ingo Molnar wrote:

> The theory would be that this patch fixes psql performance, with CPU
> selection being a measurable but second order of magnitude effect.
>
> How well does practice match theory in this case?

Yeah, it looks a bit better than default linux. A whopping 9% perf delta :-).

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor

plain:
tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

kill select_idle_sibling:
tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)

NO_WAKEUP_PREEMPTION:
tps = 4991.206742 (including connections establishing)
tps = 4996.743622 (excluding connections establishing)

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains,
> it only has the LLC domain, if you want more we'll need to fix that.
> For now its a fixed:
>
>   SMT
>   MC (llc)
>   CPU (package/machine-for-!numa)
>   NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.

I think you are being too smart for your own good. you don't know if it's best to move them further apart or not. I'm arguing that you can't know. so I'm saying do the simple thing.

if a core is overloaded, move to an idle core that is as close as possible to the core you start from (as much shared as possible). if this does not overload the shared resource, you did the right thing. if this does overload the shared resource, it's still no worse than leaving it on the original core (which was shared everything, so you've reduced the sharing a little bit)

the next balancing cycle you then work to move something again, and since both the original and new core show as overloaded (due to the contention on the shared resources), you move something to another core that shares just a little less.

Yes, this means that it may take more balancing cycles to move things far enough apart to reduce the sharing enough to avoid overload of the shared resource, but I don't see any way that you can possibly guess if two processes are going to overload the shared resource ahead of time.

It may be that simply moving to a HT core (and no longer contending for registers) is enough to let both processes fly, or it may be that the overload is in a shared floating point unit or L1 cache and you need to move further away, or you may find the contention is in the L2 cache and move further away, or it could be in the L3 cache, or it could be in the memory interface (NUMA)

Without being able to predict the future, you don't know how far away you need to move the tasks to have them operate at the optimal level. All that you do know is that the shorter the move, the less expensive the move. So make each move be as short as possible, and measure again to see if that was enough. For some workloads, it will be. For many workloads the least expensive move won't be.

The question is if doing multiple, cheap moves (requiring simple checking for each move) ends up being a win compared to doing better guessing over when the more expensive moves are worth it. Given how chips change from year to year, I don't see how the 'better guessing' is going to survive more than a couple of chip releases in any case.

David Lang
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Borislav Petkov wrote:

> On Thu, Sep 27, 2012 at 12:17:22AM -0700, da...@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to
> > overload the L2 is an impossible task, so just assume that it will
> > all fit, and the worst case is you have one balancing cycle where
> > you can't do as much work and then the normal balancing will kick in
> > and move something anyway.
>
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2.
>
> But what do I do in the cases where the two tasks don't overload the
> L2 and it is actually beneficial to keep them there? How does the load
> balancer know that?

no, I'm saying that you should assume that the two tasks won't overload the L2, try it, and if they do overload the L2, move one of the tasks again the next balancing cycle.

there is a lot of possible sharing going on between 'cores':

  shared everything (a single core)
  different registers, shared everything else (HT core)
  shared floating point, shared cache, different everything else
  shared L2/L3/Memory, different everything else
  shared L3/Memory, different everything else
  shared Memory, different everything else
  different everything

and just wait a couple of years and someone will add a new entry to this list (if I haven't already missed a few :-)

the more that is shared, the cheaper it is to move the process (the less cached state you throw away), so ideally you want to move the process as little as possible, just enough to eliminate whatever the contended resource is. But since you really don't know the footprint of each process in each of these layers, all you can measure is what percentage of the total core time the process used, just move it a little and see if that was enough.

David Lang
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:

> I think you are being too smart for your own good. you don't know if
> it's best to move them further apart or not.

Well yes and no.. You're right, however in general the load-balancer has always tried to not use (SMT) siblings whenever possible, in that regard not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we have, the 'big' periodic load-balancer will 'never' see them, making the multiple moves argument hard.

Measuring resource contention on the various levels is a fun research subject, I've spoken to various people who are/were doing so, I've always encouraged them to send their code just so we can see/learn, even if not integrate, sadly I can't remember ever having seen any of it :/

And yeah, all the load-balancing stuff is very near to scrying or tealeaf reading. We can't know all current state (too expensive) nor can we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 12:10 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> Just in case someone prefers patches to user-space approaches (I
> certainly do!), here's one that turns off wakeup driven preemption by
> default.

Ok, so apparently this fixes performance in a big way, and might allow us to simplify select_idle_sibling(), which is clearly way too random.

That is, if we could make it automatic, some way. Not let the user tune it - that's just fundamentally broken.

What is the common pattern for the wakeups for psql? Can we detect this somehow? Are they sync? It looks wrong to preempt for sync wakeups, for example, but we seem to do that.

Or could we just improve the heuristics. What happens if the scheduling granularity is increased, for example? It's set to 1ms right now, with a logarithmic scaling by number of cpus.

       Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> > I think you are being too smart for your own good. you don't know if
> > it's best to move them further apart or not.
>
> Well yes and no.. You're right, however in general the load-balancer
> has always tried to not use (SMT) siblings whenever possible, in that
> regard not using an idle sibling is consistent here.
>
> Also, for short running tasks the wakeup balancing is typically all we
> have, the 'big' periodic load-balancer will 'never' see them, making
> the multiple moves argument hard.

For the initial startup of a new process, finding as idle and remote a core to start on (minimum sharing with existing processes) is probably the smart thing to do.

But I thought that this conversation (pgbench) was dealing with long running processes, and how to deal with the overload where one master process is kicking off many child processes and the core that the master process starts off on gets overloaded as a result, with the question being how to spread the load out from this one core as it gets overloaded.

David Lang

> Measuring resource contention on the various levels is a fun research
> subject, I've spoken to various people who are/were doing so, I've
> always encouraged them to send their code just so we can see/learn,
> even if not integrate, sadly I can't remember ever having seen any of
> it :/
>
> And yeah, all the load-balancing stuff is very near to scrying or
> tealeaf reading. We can't know all current state (too expensive) nor
> can we know the future.
>
> That said, I'm all for less/simpler code, pesky benchmarks aside ;-)
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:

> Or could we just improve the heuristics. What happens if the
> scheduling granularity is increased, for example? It's set to 1ms
> right now, with a logarithmic scaling by number of cpus.

/proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms):

tps = 4994.730809 (including connections establishing)
tps = 5000.260764 (excluding connections establishing)

A bit better over the default NO_WAKEUP_PREEMPTION setting.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 10:45 -0700, da...@lang.hm wrote:

> But I thought that this conversation (pgbench) was dealing with long
> running processes,

Ah, I think we've got a confusion on long vs short.. yes pgbench is a long-running process, however the tasks might not be long in runnable state. Ie it receives a request, computes a bit, blocks on IO, computes a bit, replies, goes idle to wait for a new request.

If all those runnable sections are short enough, it will 'never' be around when the periodic load-balancer does its thing, since that only looks at the tasks in runnable state at that moment in time.

I say 'never' because while it will occasionally show up due to pure chance, it will unlikely be a very big player in placement. Once a cpu is overloaded enough to get real queueing they'll show up, get dispersed and then its back to wakeup stuff.

Then again, it might be completely irrelevant to pgbench, its been a while since I looked at how it schedules.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 10:45 AM, <da...@lang.hm> wrote:
>
> For the initial startup of a new process, finding as idle and remote a
> core to start on (minimum sharing with existing processes) is probably
> the smart thing to do.

Actually, no. It's *exec* that should go remote. New processes (fork, vfork or clone) absolutely should *not* go remote at all.

vfork() should stay on the same CPU (synchronous wakeup), fork() should possibly go SMT (likely exec in the near future will spread it out), and clone should likely just stay close too.

       Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov <b...@alien8.de> wrote:
> On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> >
> > Or could we just improve the heuristics. What happens if the
> > scheduling granularity is increased, for example? It's set to 1ms
> > right now, with a logarithmic scaling by number of cpus.
>
> /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms):
>
> tps = 4994.730809 (including connections establishing)
> tps = 5000.260764 (excluding connections establishing)
>
> A bit better over the default NO_WAKEUP_PREEMPTION setting.

Ok, so this gives us something possible to actually play with.

For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

(Btw, linear right now looks like 1:1. That's linear, but it's a very aggressive linearity. Something like factor = (cpus+1)/2 would also be linear, but by a less extreme factor.)

       Linus
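[For reference, a self-contained sketch of the scaling being debated — a paraphrase of get_update_sysctl_factor() in kernel/sched/core.c with the gentler linear variant suggested above added; the function names and the standalone ilog2 stand-in are illustrative, not kernel code.]

#include <stdint.h>

/* illustrative stand-in for the kernel's ilog2() */
static unsigned int ilog2_u(unsigned int x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/*
 * Paraphrase of get_update_sysctl_factor(): base tunables such as
 * sched_wakeup_granularity_ns (1ms) are multiplied by this factor.
 */
static unsigned int sysctl_factor(unsigned int cpus, int scaling)
{
	switch (scaling) {
	case 0:			/* SCHED_TUNABLESCALING_NONE */
		return 1;
	case 2:			/* SCHED_TUNABLESCALING_LINEAR */
		return cpus;
	default:		/* SCHED_TUNABLESCALING_LOG */
		return 1 + ilog2_u(cpus);
	}
}

/* the less aggressive linear factor floated above */
static unsigned int sysctl_factor_gentle(unsigned int cpus)
{
	return (cpus + 1) / 2;
}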
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 10:45:06AM -0700, da...@lang.hm wrote:
> On Thu, 27 Sep 2012, Peter Zijlstra wrote:
> > On Thu, 2012-09-27 at 09:48 -0700, da...@lang.hm wrote:
> > > I think you are being too smart for your own good. you don't know
> > > if it's best to move them further apart or not.
> >
> > Well yes and no.. You're right, however in general the load-balancer
> > has always tried to not use (SMT) siblings whenever possible, in
> > that regard not using an idle sibling is consistent here.
> >
> > Also, for short running tasks the wakeup balancing is typically all
> > we have, the 'big' periodic load-balancer will 'never' see them,
> > making the multiple moves argument hard.
>
> For the initial startup of a new process, finding as idle and remote a
> core to start on (minimum sharing with existing processes) is probably
> the smart thing to do.

Right, but we don't schedule to the SMT siblings, as Peter says above. So we can't get to the case where two SMT siblings are not overloaded and the processes remain on the same L2.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 11:19 -0700, Linus Torvalds wrote:
> On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov <b...@alien8.de> wrote:
> > On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> > >
> > > Or could we just improve the heuristics. What happens if the
> > > scheduling granularity is increased, for example? It's set to 1ms
> > > right now, with a logarithmic scaling by number of cpus.
> >
> > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms):
> >
> > tps = 4994.730809 (including connections establishing)
> > tps = 5000.260764 (excluding connections establishing)
> >
> > A bit better over the default NO_WAKEUP_PREEMPTION setting.
>
> Ok, so this gives us something possible to actually play with.
>
> For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

Don't forget to run the desktop interactivity benchmarks after you're done wriggling with this knob... wakeup preemption is important for most of those.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > > > Or could we just improve the heuristics. What happens if the
> > > > scheduling granularity is increased, for example? It's set to
> > > > 1ms right now, with a logarithmic scaling by number of cpus.
> > >
> > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms):
> > >
> > > tps = 4994.730809 (including connections establishing)
> > > tps = 5000.260764 (excluding connections establishing)
> > >
> > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> >
> > Ok, so this gives us something possible to actually play with.
> >
> > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
>
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for
> most of those.

Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made wakeup_granularity go to 4ms:

sched_autogroup_enabled:1
sched_child_runs_first:0
sched_latency_ns:24000000
sched_migration_cost_ns:500000
sched_min_granularity_ns:3000000
sched_nr_migrate:32
sched_rt_period_us:1000000
sched_rt_runtime_us:950000
sched_shares_window_ns:10000000
sched_time_avg_ms:1000
sched_tunable_scaling:2
sched_wakeup_granularity_ns:4000000

pgbench results look good:

tps = 4997.675331 (including connections establishing)
tps = 5003.256870 (excluding connections establishing)

This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

Thanks.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
>
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for
> most of those.

So I don't think we want to *just* wiggle that knob per se. We definitely don't want to hurt latency on actual interactive tasks.

But it's interesting that it helps psql so much, and that there seems to be some interaction with the select_idle_sibling(). So I do have a few things I react to when looking at that wakeup granularity..

I wonder about this comment, for example:

   * By using 'se' instead of 'curr' we penalize light tasks, so
   * they get preempted easier. That is, if 'se' < 'curr' then
   * the resulting gran will be larger, therefore penalizing the
   * lighter, if otoh 'se' > 'curr' then the resulting gran will
   * be smaller, again penalizing the lighter task.

why would we want to preempt light tasks easier? It sounds backwards to me. If they are light, we have *less* reason to preempt them, since they are more likely to just go to sleep on their own, no?

Another question is whether the fact that this same load interacts with select_idle_sibling() is perhaps a sign that maybe the preemption logic is all fine, but it interacts badly with the "pick new cpu" code. In particular, after having changed rq's, is the vruntime really comparable? IOW, maybe this is an interaction between "place_entity()" and then the immediately following (?) call to check wakeup preemption?

The fact that *either* changing select_idle_sibling() *or* changing the wakeup preemption granularity seems to have such a huge impact does seem to tie them together somehow for this particular load. No?

       Linus
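[For context, the test being read is roughly the following — a self-contained paraphrase of wakeup_gran()/wakeup_preempt_entity() from kernel/sched/fair.c of this era, with struct and constant names simplified and the NICE_0 weight hard-coded as 1024; not the exact kernel code.]

#include <stdint.h>

struct entity {
	int64_t  vruntime;	/* weight-scaled virtual runtime, ns */
	uint64_t weight;	/* nice-level load weight, NICE_0 == 1024 */
};

/*
 * Paraphrase of wakeup_gran(): the granularity is scaled by the
 * *wakee's* weight ('se'), so a light wakee sees a larger gran and
 * preempts less easily, while a heavy wakee sees a smaller gran and
 * preempts a light current task more easily -- the "penalize light
 * tasks" behavior questioned above.
 */
static int64_t wakeup_gran(const struct entity *se, uint64_t gran_ns)
{
	return (int64_t)(gran_ns * 1024 / se->weight);
}

/*
 * Paraphrase of wakeup_preempt_entity():
 *   -1: curr is not ahead, no preemption
 *    0: within the granularity, too close to call
 *    1: wakee lags by more than the granularity -> preempt curr
 */
static int wakeup_preempt_entity(const struct entity *curr,
				 const struct entity *se, uint64_t gran_ns)
{
	int64_t vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;
	if (vdiff > wakeup_gran(se, gran_ns))
		return 1;
	return 0;
}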
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 21:24 +0200, Borislav Petkov wrote:
> On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > > > > Or could we just improve the heuristics. What happens if the
> > > > > scheduling granularity is increased, for example? It's set to
> > > > > 1ms right now, with a logarithmic scaling by number of cpus.
> > > >
> > > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms):
> > > >
> > > > tps = 4994.730809 (including connections establishing)
> > > > tps = 5000.260764 (excluding connections establishing)
> > > >
> > > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> > >
> > > Ok, so this gives us something possible to actually play with.
> > >
> > > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION.
> > > Hmm?
> >
> > Don't forget to run the desktop interactivity benchmarks after
> > you're done wriggling with this knob... wakeup preemption is
> > important for most of those.
>
> Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
> wakeup_granularity go to 4ms:
>
> sched_autogroup_enabled:1
> sched_child_runs_first:0
> sched_latency_ns:24000000
> sched_migration_cost_ns:500000
> sched_min_granularity_ns:3000000
> sched_nr_migrate:32
> sched_rt_period_us:1000000
> sched_rt_runtime_us:950000
> sched_shares_window_ns:10000000
> sched_time_avg_ms:1000
> sched_tunable_scaling:2
> sched_wakeup_granularity_ns:4000000
>
> pgbench results look good:
>
> tps = 4997.675331 (including connections establishing)
> tps = 5003.256870 (excluding connections establishing)
>
> This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

And wakeup preemption is still disabled as well, correct?

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote:
> On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
> >
> > Don't forget to run the desktop interactivity benchmarks after
> > you're done wriggling with this knob... wakeup preemption is
> > important for most of those.
>
> So I don't think we want to *just* wiggle that knob per se. We
> definitely don't want to hurt latency on actual interactive tasks.
>
> But it's interesting that it helps psql so much, and that there seems
> to be some interaction with the select_idle_sibling(). So I do have a
> few things I react to when looking at that wakeup granularity..
>
> I wonder about this comment, for example:
>
>    * By using 'se' instead of 'curr' we penalize light tasks, so
>    * they get preempted easier. That is, if 'se' < 'curr' then
>    * the resulting gran will be larger, therefore penalizing the
>    * lighter, if otoh 'se' > 'curr' then the resulting gran will
>    * be smaller, again penalizing the lighter task.
>
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

Ah, that particular 'light' refers to se->load.weight.

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption?

I think vruntime should be fine. We take the delta between the task's vruntime when it went to sleep and its previous rq min_vruntime to capture progress made while it slept, and apply the relative offset in the task's new home so a task can migrate and still have a chance to preempt on wakeup.

> The fact that *either* changing select_idle_sibling() *or* changing
> the wakeup preemption granularity seems to have such a huge impact
> does seem to tie them together somehow for this particular load. No?

The way I read it, Boris had wakeup preemption disabled.

-Mike
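[The offsetting described here can be sketched like so — a paraphrase of what task_waking_fair() and enqueue_entity() do in fair.c; the struct and function names below are made up for illustration.]

#include <stdint.h>

struct rq_model { uint64_t min_vruntime; };	/* per-cpu CFS clock floor */
struct se_model { uint64_t vruntime; };

/* going to sleep / waking for migration: keep progress as an offset
 * relative to the old runqueue's min_vruntime */
static void save_relative(struct se_model *se, const struct rq_model *old_rq)
{
	se->vruntime -= old_rq->min_vruntime;
}

/* enqueueing on the destination: rebase the offset onto the new
 * runqueue's clock, so cross-cpu vruntime comparisons stay roughly
 * meaningful and the task keeps its chance to preempt on wakeup */
static void restore_relative(struct se_model *se, const struct rq_model *new_rq)
{
	se->vruntime += new_rq->min_vruntime;
}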
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Ingo Molnar wrote:

> * Mike Galbraith wrote:
>
> > I think the pgbench problem is more about latency for the 1
> > in 1:N than spinlocks.
>
> So my understanding of the psql workload is that basically
> we've got a central psql proxy process that is distributing
> work to worker psql processes. If a freshly woken worker
> process ever preempts the central proxy process then it is
> preventing a lot of new work from getting distributed.

Also, I'd like to stress that despite the optimization dilemma, the psql workload is *important*. More important than tbench - because psql does some real SQL work and it also matches the design of many real desktop and server workloads.

So if indeed the above is the main problem of psql it would be nice to add a 'perf bench sched proxy' testcase that emulates it - that would remove psql version dependencies and would ease the difficulty of running the benchmarks.

We already have 'perf bench sched pipe' and 'perf bench sched messaging' - but neither shows the psql pattern currently. I suspect a couple of udelay()s in the messaging benchmark would do the trick? The wakeup work there already matches much of how psql looks like.

Thanks,

	Ingo
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Mike Galbraith wrote:

> I think the pgbench problem is more about latency for the 1 in
> 1:N than spinlocks.

So my understanding of the psql workload is that basically we've got a central psql proxy process that is distributing work to worker psql processes. If a freshly woken worker process ever preempts the central proxy process then it is preventing a lot of new work from getting distributed.

Correct?

So the central proxy psql process is 'much more important' to run than any of the worker processes - an importance that is not (currently) visible from the behavioral statistics the scheduler keeps on tasks.

So the scheduler has the following problem here: a new wakee might be starved enough and the proxy might have run long enough to really justify the preemption here and now. The buddy statistics help avoid some of these cases - but not all and the difference is measurable.

Yet the 'best' way for psql to run is for this proxy process to never be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get more aggressive in selecting a remote CPU then we, as a side effect, also reduce the chance of harmful preemption of the central proxy psql process. So in that sense sibling selection is somewhat of an indirect red herring: it really only helps psql indirectly by preventing the harmful preemption.

It also, somewhat paradoxically, argues for suboptimal code: for example tearing apart buddies is beneficial in the psql workload, because it also allows the more important part of the buddy to run more (the proxy).

In that sense the *real* problem isn't even parallelism (although we obviously should improve the decisions there - and the logic has suffered in the past from the psql dilemma outlined above), but whether the scheduler can (and should) identify the central proxy and keep it running as much as possible, deprioritizing fairness, wakeup buddies, runtime overlap and cache affinity considerations.

There's two broad solutions that I can see:

 - Add a kernel solution to somehow identify 'central' processes and
   bias them. Xorg is a similar kind of process, so it would help other
   workloads as well. That way lie dragons, but might be worth an
   attempt or two. We already try to do a couple of robust metrics,
   like overlap statistics to identify buddies.

 - Let user-space occasionally identify its important (and less
   important) tasks - say psql could mark its worker processes as
   SCHED_BATCH and keep its central process(es) higher prio. A single
   line of obvious code in 100 KLOCs of user-space code.

Just to confirm, if you turn off all preemption via a hack (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql perform and scale much better, with the quality of sibling selection and spreading of processes only being a secondary effect?

Thanks,

	Ingo
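[That single line of user-space code could look roughly like this — a sketch using the standard sched_setscheduler(2) interface; the helper name is made up and error handling is kept minimal.]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* a worker demotes itself to SCHED_BATCH right after being spawned,
 * so it can never wakeup-preempt the central proxy process;
 * sched_priority must be 0 for SCHED_BATCH */
static void demote_self_to_batch(void)
{
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(0 /* self */, SCHED_BATCH, &param) != 0)
		perror("sched_setscheduler(SCHED_BATCH)");
}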
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 07:18 +0200, Borislav Petkov wrote:
> On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > but how does that affect pgbench and ilk that must spread regardless
> > of footprints.
>
> Well, how do you measure latency of the 1 process in the 1:N case?
> Maybe pipeline stalls of the 1 along with some way to recognize it is
> the 1 in the 1:N case.

Best is to let userland tell us it's critical. Smarts are expensive. A class of its own (my wakees do _not_ preempt me, and I don't care that you think this is unfair to the unwashed masses who will otherwise _starve_ without me feeding them) makes sense for these guys.

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > The way I understand it is, you either want to share L2 with a
> > process, because, for example, both working sets fit in the L2
> > and/or there's some sharing which saves you moving everything over
> > the L3. This is where selecting a core on the same L2 is actually a
> > good thing.
>
> Yeah, and if the wakee can't get to the L2 hot data instantly, it may
> be better to let wakee drag the data to an instantly accessible spot.

Yep, then moving it to another L2 is the same.

[ … ]

> > A crazy thought: one could go and sample tasks while running their
> > timeslices with the perf counters to know exactly what type of
> > workload we're looking at. I.e., do I have a large number of L2
> > evictions? Yes, then spread them out. No, then select the other core
> > on the L2. And so on.
>
> Hm. That sampling better be really cheap. Might help...

Yeah, that's why I said sampling and not run the perfcounters during every timeslice. But if you count the proper events, you should be able to know exactly what the workload is doing (compute-bound, io-bound, contention, etc...)

> but how does that affect pgbench and ilk that must spread regardless
> of footprints.

Well, how do you measure latency of the 1 process in the 1:N case? Maybe pipeline stalls of the 1 along with some way to recognize it is the 1 in the 1:N case.

Hmm.

--
Regards/Gruss,
Boris.
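[As a user-space illustration of that sampling idea — a kernel-side implementation would look entirely different — here is a minimal self-contained sketch that counts last-level-cache read misses around a stretch of work via perf_event_open(2).]

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* glibc provides no wrapper for this syscall */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	uint64_t count;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	attr.config = PERF_COUNT_HW_CACHE_LL |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	fd = perf_event_open(&attr, 0 /* this task */, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... the workload slice being sampled would run here ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("LL-cache read misses: %llu\n",
		       (unsigned long long)count);
	close(fd);
	return 0;
}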
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 2012-09-26 at 23:37 +0200, Borislav Petkov wrote:

> The way I understand it is, you either want to share L2 with a
> process, because, for example, both working sets fit in the L2 and/or
> there's some sharing which saves you moving everything over the L3.
> This is where selecting a core on the same L2 is actually a good
> thing.

Yeah, and if the wakee can't get to the L2 hot data instantly, it may be better to let wakee drag the data to an instantly accessible spot.

> Or, they're too big to fit into the L2 and they start kicking
> each-other out. Then you want to spread them out to different L2s -
> i.e., different HT groups in Intel-speak.
>
> Oh, and then there's the userspace spinlocks thingie where Mike's
> patch hurts us.
>
> Btw, Mike, you can jump in anytime :-)

I think the pgbench problem is more about latency for the 1 in 1:N than spinlocks.

> So I'd say, this is the hard scheduling problem where fitting the
> workload to the architecture doesn't make everyone happy.

Yup. I find it hard at least.

> A crazy thought: one could go and sample tasks while running their
> timeslices with the perf counters to know exactly what type of
> workload we're looking at. I.e., do I have a large number of L2
> evictions? Yes, then spread them out. No, then select the other core
> on the L2. And so on.

Hm. That sampling better be really cheap. Might help... but how does that affect pgbench and ilk that must spread regardless of footprints.

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
> On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov wrote:
> > On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> >> How does pgbench look? That's the one that apparently really wants
> >> to spread out, possibly due to user-level spinlocks. So I assume it
> >> will show the reverse pattern, with "kill select_idle_sibling"
> >> being the worst case. Sad, because it really would be lovely to
> >> just remove that thing ;)
> >
> > Yep, correct. It hurts.
>
> I'm *so* not surprised.

Any other result would have induced mushroom cloud, glazed eyes, and jaw meets floor here.

> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".
>
> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
>
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.

Oh, it's not _that_ bad. It does have its troubles, but if it were complete shite it wouldn't make the numbers that I showed, and wouldn't make the even better numbers it does with some other loads.

> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
>
> Please tell me I am mis-reading this?

We start at MC to get the tbench win I showed (Intel) vs loss at SMT. Riddle me this, why does that produce the wins I showed? I'm still hoping someone can shed some light on why the heck there's such a disparity in processor behaviors.

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it finds
> an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No? So once again, we should
> start at the inner level and if we can't find something really close,
> we work our way out, rather than starting from the outer level and
> working our way in.

Domains on my E5620 look like so when SMT is enabled (seldom):

[    0.473692] CPU0 attaching sched-domain:
[    0.477616]  domain 0: span 0,4 level SIBLING
[    0.481982]   groups: 0 (cpu_power = 589) 4 (cpu_power = 589)
[    0.487805]   domain 1: span 0-7 level MC
[    0.491829]    groups: 0,4 (cpu_power = 1178) 1,5 (cpu_power = 1178) 2,6 (cpu_power = 1178) 3,7 (cpu_power = 1178)
...

I usually have SMT off, which gives me more oomph at the bottom end (smt affects turboboost gizmo methinks), have only one domain, so say I'm waking from CPU0. With cross wire thingy, we'll always wake to CPU1 if idle. That demonstrably works well despite it being L3. Box coughs up wins at fast movers I too would expect L3 to lose at. If L2 is my only viable target for fast movers, I'm stuck with SMT siblings, which I have measured. They aren't wonderful for this. They do improve max throughput markedly though, so aren't a complete waste of silicon ;-)

I wonder what domains look like on Bulldog. (boot w. sched_debug)

> If I read the code correctly, we can have both "prev" and "cpu" in the
> same L2 domain, but because we start looking at the L3 domain, we may
> end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.

Yup, and on Intel, it manages to not suck.

> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
> I'm *so* not surprised.
>
> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
>
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.
>
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
>
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great pinch of salt.

The way I understand it is, you either want to share L2 with a process, because, for example, both working sets fit in the L2 and/or there's some sharing which saves you moving everything over the L3. This is where selecting a core on the same L2 is actually a good thing.

Or, they're too big to fit into the L2 and they start kicking each-other out. Then you want to spread them out to different L2s - i.e., different HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's patch hurts us.

Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their timeslices with the perf counters to know exactly what type of workload we're looking at. I.e., do I have a large number of L2 evictions? Yes, then spread them out. No, then select the other core on the L2. And so on.

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it finds
> an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.

> So once again, we should start at the inner level and if we can't find
> something really close, we work our way out, rather than starting from
> the outer level and working our way in.
>
> If I read the code correctly, we can have both "prev" and "cpu" in
> the same L2 domain, but because we start looking at the L3 domain, we
> may end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses "cpu_idle(target)", but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?
>
> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.
>
> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, on a fresh head. Too late here now.

Thanks.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov wrote:
> On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case. Sad, because it really would be lovely to just remove that
>> thing ;)
>
> Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your "kill select_idle_sibling()" one was interesting, but the wrong kind of "get rid of that logic".

It always selected target_cpu, but the fact is, that doesn't really sound very sane. The target cpu is either the previous cpu or the current cpu, depending on whether they should be balanced or not. But that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and works its way in over the scheduling groups within that domain. What the f*ck is the logic of that kind of crazy thing? It never makes sense to look at a biggest domain first. If you want to be close to something, you want to look at the *smallest* domain first. But because it looks at things in the wrong order, it then needs to have that inner loop saying "does this group actually cover the cpu I am interested in?"

Please tell me I am mis-reading this?

But starting from the biggest ("llc" group) is wrong *anyway*, since it means that it starts looking at the L3 level, and then if it finds an acceptable cpu inside that level, it's all done. But that's *crazy*. Once again, it's much better to try to find an idle sibling *closeby* rather than at the L3 level. No? So once again, we should start at the inner level and if we can't find something really close, we work our way out, rather than starting from the outer level and working our way in.

If I read the code correctly, we can have both "prev" and "cpu" in the same L2 domain, but because we start looking at the L3 domain, we may end up picking another "affine" CPU that isn't even sharing L2's *before* we pick one that actually *is* sharing L2's with the target CPU. But that code is confusing enough with the scheduler groups inner loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read things correctly.

For example, it uses "cpu_idle(target)", but if we're actively trying to move to the current CPU (ie wake_affine() returned true), then target is the current cpu, which is certainly *not* going to be idle for a sync wakeup. So it should actually check whether it's a sync wakeup and the only thing pending is that synchronous waker, no?

Maybe I'm missing something really fundamental, but it all really does look very odd to me.

Attached is a totally untested and probably very buggy patch, so please consider it a "shouldn't we do something like this instead" RFC rather than anything serious. So this RFC patch is more a "ok, the patch tries to fix the above oddnesses, please tell me where I went wrong" than anything else.

Comments?

       Linus

[attachment: patch.diff]
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, Sep 26, 2012 at 04:23:26AM +0200, Mike Galbraith wrote:
> On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:
>
> > Right, so why did we need it all, in the first place? There has to
> > be some reason for it.
>
> Easy. Take two communicating tasks. Is an affine wakeup a good idea?
> It depends on how much execution overlap there is. Wake affine when
> there is overlap larger than cache miss cost, and you just tossed
> throughput into the bin.
>
> select_idle_sibling() was originally about shared L2, where any
> overlap was salvageable. On modern processors with no shared L2,

Oh, but we do have shared L2s in the Bulldozer uarch (a subset of the modern AMD processors :)).

> you have to get past the cost, but the gain is still there. Intel
> wins with loads that AMD loses very badly on, so I can only guess that
> Intel must feed caches more efficiently. Dunno. It just doesn't matter
> though, point is that there is a win to be had in both cases, the
> breakeven just isn't at the same point.

Well, I guess selecting the proper core in the hierarchy depending on the workload is one of those hard problems. Teaching select_idle_sibling to detect the breakeven point and act accordingly would be not that easy then...

Thanks.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 07:22:22PM -0700, Linus Torvalds wrote:
> So I'm sure there are architecture differences (where HT in particular
> probably changes optimal scheduling strategy, although I'd expect
> the bulldozer approach to not be *that* different - but I don't know
> if BD shows up as "HT siblings" or not, so dissimilar topology
> interpretation may make it *look* very different).

Right, those cores sharing an L2 are thread siblings on BD:

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:03
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1

much like HT siblings on this single-socket Sandybridge:

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:11
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4

Although I don't know whether those thread siblings on this SB box are actual HT siblings, sharing almost all resources, judging by the core ids.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case. Sad, because it really would be lovely to just remove that
> thing ;)

Yep, correct. It hurts.

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor:

tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor + kill select_idle_sibling:

tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote: How does pgbench look? That's the one that apparently really wants to spread out, possibly due to user-level spinlocks. So I assume it will show the reverse pattern, with kill select_idle_sibling being the worst case. Sad, because it really would be lovely to just remove that thing ;) Yep, correct. It hurts. v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor tps = 4574.570857 (including connections establishing) tps = 4579.166159 (excluding connections establishing) v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor + kill select_idle_sibling tps = 2230.354093 (including connections establishing) tps = 2231.412169 (excluding connections establishing) -- Regards/Gruss, Boris. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, Sep 26, 2012 at 04:23:26AM +0200, Mike Galbraith wrote:
> On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:
> > Right, so why did we need it all, in the first place? There has to
> > be some reason for it.
>
> Easy. Take two communicating tasks. Is an affine wakeup a good idea?
> It depends on how much execution overlap there is. Wake affine when
> there is overlap larger than cache miss cost, and you just tossed
> throughput into the bin.
>
> select_idle_sibling() was originally about shared L2, where any
> overlap was salvageable. On modern processors with no shared L2,

Oh, but we do have shared L2s in the Bulldozer uarch (a subset of the
modern AMD processors :)).

> you have to get past the cost, but the gain is still there. Intel
> wins with loads that AMD loses very badly on, so I can only guess
> that Intel must feed caches more efficiently. Dunno. It just doesn't
> matter though, point is that there is a win to be had in both cases,
> the breakeven just isn't at the same point.

Well, I guess selecting the proper core in the hierarchy depending on
the workload is one of those hard problems. Teaching
select_idle_sibling() to detect the breakeven point and act
accordingly wouldn't be that easy then...

Thanks.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <b...@alien8.de> wrote:
> On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> > How does pgbench look? That's the one that apparently really wants
> > to spread out, possibly due to user-level spinlocks. So I assume it
> > will show the reverse pattern, with "kill select_idle_sibling"
> > being the worst case. Sad, because it really would be lovely to
> > just remove that thing ;)
>
> Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your "kill select_idle_sibling()" one was
interesting, but the wrong kind of "get rid of that logic". It always
selected target_cpu, but the fact is, that doesn't really sound very
sane. The target cpu is either the previous cpu or the current cpu,
depending on whether they should be balanced or not. But that still
doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at the biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying "does this group actually cover the cpu I am
interested in?"

Please tell me I am mis-reading this?

But starting from the biggest (llc group) is wrong *anyway*, since it
means that it starts looking at the L3 level, and then if it finds an
acceptable cpu inside that level, it's all done. But that's *crazy*.
Once again, it's much better to try to find an idle sibling *closeby*
rather than at the L3 level. No?

So once again, we should start at the inner level and if we can't find
something really close, we work our way out, rather than starting from
the outer level and working our way in.

If I read the code correctly, we can have both prev and cpu in the
same L2 domain, but because we start looking at the L3 domain, we may
end up picking another affine CPU that isn't even sharing L2's
*before* we pick one that actually *is* sharing L2's with the target
CPU. But that code is confusing enough with the scheduler groups inner
loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read
things correctly. For example, it uses idle_cpu(target), but if we're
actively trying to move to the current CPU (ie wake_affine() returned
true), then target is the current cpu, which is certainly *not* going
to be idle for a sync wakeup. So it should actually check whether it's
a sync wakeup and the only thing pending is that synchronous waker,
no?

Maybe I'm missing something really fundamental, but it all really does
look very odd to me. Attached is a totally untested and probably very
buggy patch, so please consider it a "shouldn't we do something like
this instead" RFC rather than anything serious. So this RFC patch is
more a "ok, the patch tries to fix the above oddnesses, please tell me
where I went wrong" than anything else.

Comments?

            Linus

[Attachment: patch.diff]
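For reference, the "smallest domain first" search described above
would look roughly like the sketch below: walk target's domain
hierarchy from the innermost level outward, staying within the
cache-sharing levels, and take the first idle allowed CPU. This is
only an illustration of the argument using era-appropriate helpers; it
is not the attached patch, which is not reproduced in the archive.

/*
 * Sketch only, not the attached patch: inner-to-outer idle CPU
 * search. Walk target's domains from the innermost level outward,
 * staying within the cache-sharing (LLC) levels, and pick the first
 * idle CPU the task is allowed on. A real version would also skip
 * CPUs already scanned at the inner levels instead of rescanning.
 */
static int select_idle_cpu_inner_out(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	int i;

	if (idle_cpu(target))
		return target;

	for_each_domain(target, sd) {
		/* Don't search beyond the last-level cache. */
		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
			break;

		for_each_cpu_and(i, sched_domain_span(sd),
				 tsk_cpus_allowed(p)) {
			if (i != target && idle_cpu(i))
				return i;
		}
	}

	return target;
}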
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
> I'm *so* not surprised.
>
> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*. In fact, the whole
> select_idle_sibling() logic makes no sense what-so-ever to me. It
> seems to be total garbage.
>
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at the biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
>
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a
process, because, for example, both working sets fit in the L2 and/or
there's some sharing which saves you moving everything over the L3.
This is where selecting a core on the same L2 is actually a good
thing.

Or, they're too big to fit into the L2 and they start kicking each
other out. Then you want to spread them out to different L2s - i.e.,
different HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's
patch hurts us. Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the
workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their
timeslices with the perf counters to know exactly what type of
workload we're looking at. I.e., do I have a large number of L2
evictions? Yes, then spread them out. No, then select the other core
on the L2. And so on.

> But starting from the biggest (llc group) is wrong *anyway*, since it
> means that it starts looking at the L3 level, and then if it finds an
> acceptable cpu inside that level, it's all done. But that's *crazy*.
> Once again, it's much better to try to find an idle sibling *closeby*
> rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.

> So once again, we should start at the inner level and if we can't
> find something really close, we work our way out, rather than
> starting from the outer level and working our way in.
>
> If I read the code correctly, we can have both prev and cpu in the
> same L2 domain, but because we start looking at the L3 domain, we may
> end up picking another affine CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups
> inner loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly. For example, it uses idle_cpu(target), but if we're
> actively trying to move to the current CPU (ie wake_affine() returned
> true), then target is the current cpu, which is certainly *not* going
> to be idle for a sync wakeup. So it should actually check whether
> it's a sync wakeup and the only thing pending is that synchronous
> waker, no?
>
> Maybe I'm missing something really fundamental, but it all really
> does look very odd to me. Attached is a totally untested and probably
> very buggy patch, so please consider it a "shouldn't we do something
> like this instead" RFC rather than anything serious. So this RFC
> patch is more a "ok, the patch tries to fix the above oddnesses,
> please tell me where I went wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, on a fresh head. Too late here now.

Thanks.

--
Regards/Gruss,
Boris.
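For what it's worth, the kind of event counting suggested above can be
prototyped from user space with perf_event_open(2). A hedged sketch
that counts cache misses over a window of work; an in-kernel version
sampling timeslices, as proposed, would use different machinery:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/*
 * User-space illustration only: count cache misses for the calling
 * task over a measurement window, the raw ingredient of a "cache
 * hungry or not" classification.
 */
static long perf_open(struct perf_event_attr *attr)
{
	return syscall(__NR_perf_event_open, attr, 0 /* self */,
		       -1 /* any cpu */, -1 /* no group */, 0);
}

int main(void)
{
	struct perf_event_attr attr;
	uint64_t misses;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CACHE_MISSES;
	attr.disabled = 1;

	fd = perf_open(&attr);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... run a slice of the workload here ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
		printf("cache misses in window: %llu\n",
		       (unsigned long long)misses);
	close(fd);
	return 0;
}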
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
> On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <b...@alien8.de> wrote:
> > On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> > > How does pgbench look? That's the one that apparently really
> > > wants to spread out, possibly due to user-level spinlocks. So I
> > > assume it will show the reverse pattern, with "kill
> > > select_idle_sibling" being the worst case. Sad, because it really
> > > would be lovely to just remove that thing ;)
> >
> > Yep, correct. It hurts.
>
> I'm *so* not surprised.

Any other result would have induced mushroom cloud, glazed eyes, and
jaw meets floor here.

> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic". It always
> selected target_cpu, but the fact is, that doesn't really sound very
> sane. The target cpu is either the previous cpu or the current cpu,
> depending on whether they should be balanced or not. But that still
> doesn't make any *sense*. In fact, the whole select_idle_sibling()
> logic makes no sense what-so-ever to me. It seems to be total
> garbage.

Oh, it's not _that_ bad. It does have its troubles, but if it were
complete shite it wouldn't make the numbers that I showed, and
wouldn't make the even better numbers it does with some other loads.

> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at the biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
>
> Please tell me I am mis-reading this?

We start at MC to get the tbench win I showed (Intel) vs loss at SMT.
Riddle me this, why does that produce the wins I showed? I'm still
hoping someone can shed some light on why the heck there's such a
disparity in processor behaviors.

> But starting from the biggest (llc group) is wrong *anyway*, since it
> means that it starts looking at the L3 level, and then if it finds an
> acceptable cpu inside that level, it's all done. But that's *crazy*.
> Once again, it's much better to try to find an idle sibling *closeby*
> rather than at the L3 level. No?
>
> So once again, we should start at the inner level and if we can't
> find something really close, we work our way out, rather than
> starting from the outer level and working our way in.

Domains on my E5620 look like so when SMT is enabled (seldom):

[0.473692] CPU0 attaching sched-domain:
[0.477616]  domain 0: span 0,4 level SIBLING
[0.481982]   groups: 0 (cpu_power = 589) 4 (cpu_power = 589)
[0.487805]   domain 1: span 0-7 level MC
[0.491829]    groups: 0,4 (cpu_power = 1178) 1,5 (cpu_power = 1178)
              2,6 (cpu_power = 1178) 3,7 (cpu_power = 1178)
...

I usually have SMT off, which gives me more oomph at the bottom end
(smt affects turboboost gizmo methinks), have only one domain, so say
I'm waking from CPU0. With cross wire thingy, we'll always wake to
CPU1 if idle. That demonstrably works well despite it being L3. Box
coughs up wins at fast movers I too would expect L3 to lose at. If L2
is my only viable target for fast movers, I'm stuck with SMT siblings,
which I have measured. They aren't wonderful for this. They do improve
max throughput markedly though, so aren't a complete waste of silicon
;-)

I wonder what domains look like on Bulldog. (boot w. sched_debug)

> If I read the code correctly, we can have both prev and cpu in the
> same L2 domain, but because we start looking at the L3 domain, we may
> end up picking another affine CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups
> inner loop that maybe I am mis-reading it entirely.

Yup, and on Intel, it manages to not suck.

> There are other oddities in select_idle_sibling() too, if I read
> things correctly. For example, it uses idle_cpu(target), but if we're
> actively trying to move to the current CPU (ie wake_affine() returned
> true), then target is the current cpu, which is certainly *not* going
> to be idle for a sync wakeup.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Wed, 2012-09-26 at 23:37 +0200, Borislav Petkov wrote:
> The way I understand it is, you either want to share L2 with a
> process, because, for example, both working sets fit in the L2 and/or
> there's some sharing which saves you moving everything over the L3.
> This is where selecting a core on the same L2 is actually a good
> thing.

Yeah, and if the wakee can't get to the L2 hot data instantly, it may
be better to let wakee drag the data to an instantly accessible spot.

> Or, they're too big to fit into the L2 and they start kicking each
> other out. Then you want to spread them out to different L2s - i.e.,
> different HT groups in Intel-speak.
>
> Oh, and then there's the userspace spinlocks thingie where Mike's
> patch hurts us. Btw, Mike, you can jump in anytime :-)

I think the pgbench problem is more about latency for the 1 in 1:N
than spinlocks.

> So I'd say, this is the hard scheduling problem where fitting the
> workload to the architecture doesn't make everyone happy.

Yup. I find it hard at least.

> A crazy thought: one could go and sample tasks while running their
> timeslices with the perf counters to know exactly what type of
> workload we're looking at. I.e., do I have a large number of L2
> evictions? Yes, then spread them out. No, then select the other core
> on the L2. And so on.

Hm. That sampling better be really cheap. Might help... but how does
that affect pgbench and ilk that must spread regardless of footprints?

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > The way I understand it is, you either want to share L2 with a
> > process, because, for example, both working sets fit in the L2
> > and/or there's some sharing which saves you moving everything over
> > the L3. This is where selecting a core on the same L2 is actually a
> > good thing.
>
> Yeah, and if the wakee can't get to the L2 hot data instantly, it may
> be better to let wakee drag the data to an instantly accessible spot.

Yep, then moving it to another L2 is the same.

[ … ]

> > A crazy thought: one could go and sample tasks while running their
> > timeslices with the perf counters to know exactly what type of
> > workload we're looking at. I.e., do I have a large number of L2
> > evictions? Yes, then spread them out. No, then select the other
> > core on the L2. And so on.
>
> Hm. That sampling better be really cheap. Might help...

Yeah, that's why I said sampling and not "run the perfcounters during
every timeslice". But if you count the proper events, you should be
able to know exactly what the workload is doing (compute-bound,
io-bound, contention, etc...)

> but how does that affect pgbench and ilk that must spread regardless
> of footprints.

Well, how do you measure latency of the 1 process in the 1:N case?
Maybe pipeline stalls of the 1 along with some way to recognize it is
the 1 in the 1:N case. Hmm.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Thu, 2012-09-27 at 07:18 +0200, Borislav Petkov wrote:
> On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > but how does that affect pgbench and ilk that must spread
> > regardless of footprints.
>
> Well, how do you measure latency of the 1 process in the 1:N case?
> Maybe pipeline stalls of the 1 along with some way to recognize it is
> the 1 in the 1:N case.

Best is to let userland tell us it's critical. Smarts are expensive.
A class of its own ("my wakees do _not_ preempt me, and I don't care
that you think this is unfair to the unwashed masses who will
otherwise _starve_ without me feeding them") makes sense for these
guys.

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Mike Galbraith <efa...@gmx.de> wrote:
> I think the pgbench problem is more about latency for the 1 in 1:N
> than spinlocks.

So my understanding of the psql workload is that basically we've got a
central psql proxy process that is distributing work to worker psql
processes. If a freshly woken worker process ever preempts the central
proxy process then it is preventing a lot of new work from getting
distributed. Correct?

So the central proxy psql process is 'much more important' to run than
any of the worker processes - an importance that is not (currently)
visible from the behavioral statistics the scheduler keeps on tasks.

So the scheduler has the following problem here: a new wakee might be
starved enough, and the proxy might have run long enough, to really
justify the preemption here and now. The buddy statistics help avoid
some of these cases - but not all, and the difference is measurable.
Yet the 'best' way for psql to run is for this proxy process to never
be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get more
aggressive in selecting a remote CPU then we, as a side effect, also
reduce the chance of harmful preemption of the central proxy psql
process. So in that sense sibling selection is somewhat of an indirect
red herring: it really only helps psql indirectly, by preventing the
harmful preemption.

It also, somewhat paradoxically, argues for suboptimal code: for
example tearing apart buddies is beneficial in the psql workload,
because it also allows the more important part of the buddy (the
proxy) to run more.

In that sense the *real* problem isn't even parallelism (although we
obviously should improve the decisions there - and the logic has
suffered in the past from the psql dilemma outlined above), but
whether the scheduler can (and should) identify the central proxy and
keep it running as much as possible, deprioritizing fairness, wakeup
buddies, runtime overlap and cache affinity considerations.

There's two broad solutions that I can see:

 - Add a kernel solution to somehow identify 'central' processes and
   bias them. Xorg is a similar kind of process, so it would help
   other workloads as well. That way lie dragons, but might be worth
   an attempt or two. We already try to do a couple of robust metrics,
   like overlap statistics to identify buddies.

 - Let user-space occasionally identify its important (and less
   important) tasks - say psql could mark its worker processes as
   SCHED_BATCH and keep its central process(es) higher prio. A single
   line of obvious code in 100 KLOCs of user-space code.

Just to confirm: if you turn off all preemption via a hack (basically
if you turn SCHED_OTHER into SCHED_BATCH), does psql perform and scale
much better, with the quality of sibling selection and spreading of
processes only being a secondary effect?

Thanks,

	Ingo
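The second option really is nearly a one-liner. A minimal C sketch of
what a worker process would run; where psql would actually place such
a call is an assumption here, not something specified in the thread:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Sketch of the user-space option above: a worker demotes itself to
 * SCHED_BATCH so it no longer wakeup-preempts the central proxy,
 * which stays at the default policy (or higher prio).
 */
static int demote_self_to_batch(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0 /* self */, SCHED_BATCH, &sp)) {
		perror("sched_setscheduler(SCHED_BATCH)");
		return -1;
	}
	return 0;
}

SCHED_BATCH keeps normal fair-share bandwidth; it only tells the
scheduler the task is non-interactive, which is exactly the "my wakees
do _not_ preempt me" property discussed earlier in the thread.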
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
* Ingo Molnar <mi...@kernel.org> wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:
> > I think the pgbench problem is more about latency for the 1 in 1:N
> > than spinlocks.
>
> So my understanding of the psql workload is that basically we've got
> a central psql proxy process that is distributing work to worker psql
> processes. If a freshly woken worker process ever preempts the
> central proxy process then it is preventing a lot of new work from
> getting distributed.

Also, I'd like to stress that despite the optimization dilemma, the
psql workload is *important*. More important than tbench - because
psql does some real SQL work and it also matches the design of many
real desktop and server workloads.

So if indeed the above is the main problem of psql, it would be nice
to add a 'perf bench sched proxy' testcase that emulates it - that
would remove psql version dependencies and would ease the difficulty
of running the benchmarks.

We already have 'perf bench sched pipe' and 'perf bench sched
messaging' - but neither shows the psql pattern currently. I suspect a
couple of udelay()s in the messaging benchmark would do the trick? The
wakeup work there already matches much of how psql looks.

Thanks,

	Ingo
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, 2012-09-25 at 19:22 -0700, Linus Torvalds wrote:
> On Tue, Sep 25, 2012 at 7:00 PM, Mike Galbraith wrote:
> >
> > Yes. On AMD, the best thing you can do for fast switchers AFAIKT is
> > turn it off. Different story on Intel.
>
> I doubt it's all that different on Intel.

The behavioral difference is pretty large, question is why.

> Am I on the right track here? Or do you mean something completely
> different? Please explain it more verbosely.

A picture is worth a thousand words they say...

x3550 M3 E5620, SMT off, revert reverted, nohz off, zero knob
twiddles, governor=performance.

tbench          1       2       4
              398     820    1574   -select_idle_sibling()
              454     902    1574   +select_idle_sibling()
              397     737    1556   +select_idle_sibling() virgin source

netperf TCP_RR, one unbound pair
           114674   -select_idle_sibling()
           131422   +select_idle_sibling()
           111551   +select_idle_sibling() virgin source

These 1:1 buddy pairs scheduled cross core on E5620 feel no pain once
you kill the bouncing. The bounce pain with 4 cores is _tons_ less
intense than on the 10 core Westmere, but it's still quite visible.
The point though is that cross core doesn't hurt Westmere, but
demolishes Opteron for some reason. (OTOH, bounce _helps_ fugly 1:N
load.. grr;)

> Your patch showed improvement for Intel too on this same benchmark
> (tbench). Borislav just went even further. I'd suggest testing that
> patch on Intel too, and wouldn't be surprised at all if it shows
> improvement there too.

See above.

> It's pgbench that then regressed with your patch, and I suspect it
> will regress with Borislav's too.

Yeah, strongly suspect you're right.

> You probably looked at the fact that the original report from Nikolay
> says that the Intel E6300 hadn't regressed on pgbench, but I suspect
> you didn't realize that E6300 is just a dual-core CPU without even
> HT. So I doubt it's about "Intel vs AMD", it's more about "six cores"
> vs "just two".

No, I knew, and yeah, it's about number of paths.

> And the thing is - with just two cores, the fact that your patch
> didn't change the Intel numbers is totally irrelevant. With two
> cores, the whole "buddy_cpu" was equivalent to the old code, since
> there was ever only one other core to begin with!
>
> So AMD and Intel do have differences, but they aren't all that
> radical.

Looks fairly radical to me, but as noted in mail to Boris, it boils
down to "what does it cost, and where does the breakeven lie?".

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:
> Right, so why did we need it all, in the first place? There has to be
> some reason for it.

Easy. Take two communicating tasks. Is an affine wakeup a good idea?
It depends on how much execution overlap there is. Wake affine when
there is overlap larger than cache miss cost, and you just tossed
throughput into the bin.

select_idle_sibling() was originally about shared L2, where any
overlap was salvageable. On modern processors with no shared L2, you
have to get past the cost, but the gain is still there. Intel wins
with loads that AMD loses very badly on, so I can only guess that
Intel must feed caches more efficiently. Dunno. It just doesn't matter
though, point is that there is a win to be had in both cases, the
breakeven just isn't at the same point.

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 7:00 PM, Mike Galbraith wrote:
>
> Yes. On AMD, the best thing you can do for fast switchers AFAIKT is
> turn it off. Different story on Intel.

I doubt it's all that different on Intel. Your patch showed
improvement for Intel too on this same benchmark (tbench). Borislav
just went even further. I'd suggest testing that patch on Intel too,
and wouldn't be surprised at all if it shows improvement there too.

It's pgbench that then regressed with your patch, and I suspect it
will regress with Borislav's too.

So I'm sure there are architecture differences (where HT in particular
probably changes optimal scheduling strategy, although I'd expect the
bulldozer approach to not be *that* different - but I don't know if BD
shows up as "HT siblings" or not, so dissimilar topology
interpretation may make it *look* very different).

So I suspect the architectural differences are smaller than you claim,
and it's much more about the loads in question.

You probably looked at the fact that the original report from Nikolay
says that the Intel E6300 hadn't regressed on pgbench, but I suspect
you didn't realize that E6300 is just a dual-core CPU without even HT.
So I doubt it's about "Intel vs AMD", it's more about "six cores" vs
"just two".

And the thing is - with just two cores, the fact that your patch
didn't change the Intel numbers is totally irrelevant. With two cores,
the whole "buddy_cpu" was equivalent to the old code, since there was
ever only one other core to begin with!

So AMD and Intel do have differences, but they aren't all that
radical.

            Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, 2012-09-25 at 10:21 -0700, Linus Torvalds wrote:
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov wrote:
> >
> > 3.6-rc6+tip/auto-latest-kill select_idle_sibling()
>
> Is this literally just removing it entirely? Because apart from the
> latency spike at 4 procs (and the latency numbers look very noisy, so
> that's probably just noise), it looks clearly superior to everything
> else. On that benchmark, at least.

Yes. On AMD, the best thing you can do for fast switchers AFAIKT is
turn it off. Different story on Intel.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case. Sad, because it really would be lovely to just remove
> that thing ;)

It _is_ irritating. There's nohz, governors, and then comes radically
different cross cpu data blasting ability on top. On Intel, it wins at
the same fast movers it demolishes on AMD. Throttle it, and that goes
away, along with some other issues. Or just kill it, then integrate
what it does for you into a smarter lighter wakeup balance.. but then
that has to climb those same hills.

-Mike
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on
> > my E5620, cross core is a tbench win over affine, cross thread is
> > not.
>
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad.
>
> It's more a question of "if we have real cores with separate L1's but
> shared L2's, go with those first, before we start distributing it out
> to separate L2's".

There is one issue though. If the tasks continue to run in this state
and the periodic balance notices an idle L2, it will force migrate
(using active migration) one of the tasks to the idle L2, as the
periodic balance tries to spread the load as far as possible to take
maximum advantage of the available resources (and the perf advantage
of this really depends on the workload, cache usage/memory bw, the
upside of turbo etc). But I am not sure if this was the reason why we
chose to spread it out to separate L2's during wakeup.

Anyways, this is one of the places where Paul Turner's task load
average tracking patches will be useful. Depending on how long a task
typically runs, we can probably even choose between an SMT sibling and
a separate L2 to run on.

thanks,
suresh
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 11:42 AM, Borislav Petkov wrote:
> >
> > Is this literally just removing it entirely?
>
> Basically yes:

Ok, so you make it just always select 'target'. Fine. I wondered if
you just removed the calling logic entirely.

> > How does pgbench look? That's the one that apparently really wants
> > to spread out, possibly due to user-level spinlocks. So I assume it
> > will show the reverse pattern, with "kill select_idle_sibling"
> > being the worst case.
>
> Let me run pgbench tomorrow (I had run it only on an older family
> 0x10 single-node box) on Bulldozer to check that out. And we haven't
> started the multi-node measurements at all.

Ack, this clearly needs much more testing. That said, I really would
*love* to just get rid of the function entirely.

> > Sad, because it really would be lovely to just remove that thing ;)
>
> Right, so why did we need it all, in the first place? There has to be
> some reason for it.

I'm not entirely convinced. Looking at the history of that thing, it's
long and tortuous, and has a few commits completely fixing the "logic"
of it (eg see commit 99bd5e2f245d). To the point where I don't think
it necessarily even matches what the original cause for it was.

So it's *possible* that we have a case of historical code that may
have improved performance originally on at least some machines, but
that has (a) been changed due to it being broken and (b) CPU's have
changed too, so it may well be that it simply doesn't help any more.

And we've had problems with this function before. See for example:

 - 4dcfe1025b51: sched: Avoid SMT siblings in select_idle_sibling()
   if possible

 - 518cd6234178: sched: Only queue remote wakeups when crossing cache
   boundaries

so we've basically had odd special-case "tuning" of this function from
the original. I do not think that there is any solid reason to believe
that it does what it used to do, or that what it used to do makes
sense any more. It's entirely possible that "prev_cpu" basically ends
up being the better choice for spreading things out.

That said, my *guess* is that when you run pgbench, you'll see the
same regression that we saw due to Mike's patch too. It simply looks
like tbench wants to have minimal cpu selection and avoid moving
things around, while pgbench probably wants to spread out maximally.

            Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov wrote:
> >
> > 3.6-rc6+tip/auto-latest-kill select_idle_sibling()
>
> Is this literally just removing it entirely?

Basically yes:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a14b990..016ba387c7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2640,6 +2640,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_group *sg;
 	int i;
 
+	goto done;
+
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */

> Because apart from the latency spike at 4 procs (and the latency
> numbers look very noisy, so that's probably just noise), it looks
> clearly superior to everything else. On that benchmark, at least.

Yep, I need more results for a more reliable say here.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case.

Let me run pgbench tomorrow (I had run it only on an older family 0x10
single-node box) on Bulldozer to check that out. And we haven't
started the multi-node measurements at all.

> Sad, because it really would be lovely to just remove that thing ;)

Right, so why did we need it all, in the first place? There has to be
some reason for it.

Thanks.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov wrote:
>
> 3.6-rc6+tip/auto-latest-kill select_idle_sibling()

Is this literally just removing it entirely?

Because apart from the latency spike at 4 procs (and the latency
numbers look very noisy, so that's probably just noise), it looks
clearly superior to everything else. On that benchmark, at least.

How does pgbench look? That's the one that apparently really wants to
spread out, possibly due to user-level spinlocks. So I assume it will
show the reverse pattern, with "kill select_idle_sibling" being the
worst case. Sad, because it really would be lovely to just remove that
thing ;)

            Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 03:17:36PM +0200, Borislav Petkov wrote:
> For example, I did some measurements a couple of days ago on
> Bulldozer of tbench with and without select_idle_sibling:

Here are updated benchmark results with your patch here:

http://marc.info/?l=linux-kernel&m=134850871822587

I think this pretty much confirms Mel's results.

tbench runs single-socket OR-B (box has 8 cores, 4 CUs)
(tbench_srv localhost), tbench default settings as in debian testing

# clients                 1        2        4        8       12       16
3.6-rc6+tip/auto-latest
                     115.91  238.571  469.606  1865.77  1863.08  1851.46
3.6-rc6+tip/auto-latest-kill select_idle_sibling():
                    354.619  534.714  900.069  1969.35  1955.91  1940.84
3.6-rc6+tip/auto-latest-revert-the-revert
                    114.001  223.171  408.507  1771.48  1757.08  1736.12
3.6-rc7+tip/auto-latest-select_idle_sibling-lists
                     107.39  222.439  435.255  1659.42  1697.43  1685.92

3.6-rc6+tip/auto-latest
-----------------------
Throughput  115.91 MB/sec   1 clients   1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec   2 clients   2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec   4 clients   4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec   8 clients   8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest-kill select_idle_sibling()
--------------------------------------------------
Throughput 354.619 MB/sec   1 clients   1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec   2 clients   2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec   4 clients   4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec   8 clients   8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

3.6-rc6+tip/auto-latest-revert-the-revert
-----------------------------------------
Throughput 114.001 MB/sec   1 clients   1 procs  max_latency=0.352 ms
Throughput 223.171 MB/sec   2 clients   2 procs  max_latency=0.348 ms
Throughput 408.507 MB/sec   4 clients   4 procs  max_latency=0.388 ms
Throughput 1771.48 MB/sec   8 clients   8 procs  max_latency=0.280 ms
Throughput 1757.08 MB/sec  12 clients  12 procs  max_latency=3.280 ms
Throughput 1736.12 MB/sec  16 clients  16 procs  max_latency=0.333 ms

3.6-rc7+tip/auto-latest-select_idle_sibling-lists
-------------------------------------------------
Throughput  107.39 MB/sec   1 clients   1 procs  max_latency=0.372 ms
Throughput 222.439 MB/sec   2 clients   2 procs  max_latency=0.345 ms
Throughput 435.255 MB/sec   4 clients   4 procs  max_latency=0.346 ms
Throughput 1659.42 MB/sec   8 clients   8 procs  max_latency=3.497 ms
Throughput 1697.43 MB/sec  12 clients  12 procs  max_latency=3.205 ms
Throughput 1685.92 MB/sec  16 clients  16 procs  max_latency=0.331 ms

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, 2012-09-25 at 14:23 +0100, Mel Gorman wrote:
> It crashes on boot due to the fact that you created a function-scope
> variable called sd_llc in select_idle_sibling() and shadowed the
> actual sd_llc you were interested in.

D'0h!
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, Sep 24, 2012 at 07:44:17PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:54 +0200, Peter Zijlstra wrote:
> > But let me try and come up with the list thing, I think we've
> > actually got that someplace as well.
>
> OK, I'm sure the below can be written better, but my brain is gone
> for the day...

It crashes on boot due to the fact that you created a function-scope
variable called sd_llc in select_idle_sibling() and shadowed the
actual sd_llc you were interested in. Result: dereferenced
uninitialised pointer and kaboom. Trivial to fix so it boots at least.

This is a silly test for a scheduler patch but as "sched: Avoid SMT
siblings in select_idle_sibling() if possible" regressed 2% back in
3.2, it seemed reasonable to retest with it.

KERNBENCH
                           3.6.0               3.6.0                3.6.0
                     rc6-vanilla  rc6-mikebuddy-v1r1 rc6-idlesibling-v1r1
User    min      352.47 ( 0.00%)    351.77 (   0.20%)    352.30 (   0.05%)
User    mean     353.10 ( 0.00%)    352.78 (   0.09%)    352.77 (   0.09%)
User    stddev     0.41 ( 0.00%)      0.56 ( -36.13%)      0.35 (  15.16%)
User    max      353.55 ( 0.00%)    353.43 (   0.03%)    353.31 (   0.07%)
System  min       34.86 ( 0.00%)     34.83 (   0.09%)     35.37 (  -1.46%)
System  mean      35.35 ( 0.00%)     35.29 (   0.16%)     35.63 (  -0.80%)
System  stddev     0.41 ( 0.00%)      0.40 (   0.10%)      0.15 (  62.26%)
System  max       35.94 ( 0.00%)     36.05 (  -0.31%)     35.81 (   0.36%)
Elapsed min      110.18 ( 0.00%)    109.65 (   0.48%)    110.04 (   0.13%)
Elapsed mean     110.21 ( 0.00%)    109.75 (   0.42%)    110.15 (   0.06%)
Elapsed stddev     0.03 ( 0.00%)      0.07 (-167.83%)      0.09 (-207.56%)
Elapsed max      110.26 ( 0.00%)    109.86 (   0.36%)    110.26 (   0.00%)
CPU     min      352.00 ( 0.00%)    353.00 (  -0.28%)    352.00 (   0.00%)
CPU     mean     352.00 ( 0.00%)    353.00 (  -0.28%)    352.00 (   0.00%)
CPU     stddev     0.00 ( 0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CPU     max      352.00 ( 0.00%)    353.00 (  -0.28%)    352.00 (   0.00%)

mikebuddy-v1r1 is Mike's patch that just got reverted. idlesibling is
Peter's patch. "Elapsed mean" time is the main value of interest.
Mike's patch gains 0.42%, which is less than the 2% lost, but at least
the gain is outside the noise. idlesibling makes very little
difference.

"System mean" is also interesting because even though idlesibling
shows a "regression", it also shows that the variation between runs is
reduced. That might indicate that fewer cache misses are being
incurred in the select_idle_sibling() code, although that is a bit of
a leap of faith. The machine is in use at the moment but I'll queue up
a test this evening to gather a profile to confirm time is even being
spent in select_idle_sibling(). Just because 2% was lost in
select_idle_sibling() back in 3.2 does not mean squat now.

--
Mel Gorman
SUSE Labs
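The bug class Mel describes is easy to reproduce outside the
scheduler. A contrived C sketch with hypothetical names echoing the
report; this is not the actual fair.c code:

#include <stdio.h>

/* Contrived illustration of the shadowing bug described above. */

struct sched_domain { int level; };

struct sched_domain llc = { 2 };
struct sched_domain *sd_llc = &llc;	/* the pointer you meant to use */

static int buggy_lookup(void)
{
	/*
	 * BUG: this declaration shadows the file-scope sd_llc above
	 * and is never initialised; dereferencing it is the boot-time
	 * "kaboom". Deleting this one line is the trivial fix.
	 */
	struct sched_domain *sd_llc;

	return sd_llc->level;
}

int main(void)
{
	printf("%d\n", buggy_lookup());	/* undefined behaviour */
	return 0;
}

Building with -Wshadow (and -Wuninitialized) makes the compiler flag
exactly this pattern, which is why such crashes are usually trivial to
diagnose once spotted.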
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 01:58:06PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
> > In the not-so-distant past, we had the intel "Dunnington" Xeon,
> > which was iirc basically three Core 2 duo's bolted together (ie
> > three clusters of two cores sharing L2, and a fully shared L3). So
> > that was a true multi-core with fairly big shared L2, and it really
> > would be sad to not use the second core aggressively.
>
> Ah indeed. My Core2Quad didn't have an L3 afaik (its sitting around
> without a PSU atm so checking gets a little hard) so the LLC level
> was the L2 and all worked out right (it also not having SMT helped of
> course).
>
> But if there was a Xeon chip that did add a package L3 then yes, all
> this would become more interesting still. We'd need to extend the
> scheduler topology a bit as well, I don't think it can currently
> handle this well.
>
> So I guess we get to do some work for steamroller.

Right, but before that we can still do some experimenting on Bulldozer
- we have the shared 2M L2 there too and it would be nice to improve
select_idle_sibling there.

For example, I did some measurements a couple of days ago on Bulldozer
of tbench with and without select_idle_sibling:

tbench runs single-socket OR-B (box has 8 cores, 4 CUs)
(tbench_srv localhost), tbench default settings as in debian testing

# clients                 1        2        4        8       12       16
3.6-rc6+tip/auto-latest
                     115.91  238.571  469.606  1865.77  1863.08  1851.46
3.6-rc6+tip/auto-latest-kill select_idle_sibling():
                    354.619  534.714  900.069  1969.35  1955.91  1940.84

3.6-rc6+tip/auto-latest
-----------------------
Throughput  115.91 MB/sec   1 clients   1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec   2 clients   2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec   4 clients   4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec   8 clients   8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest-kill select_idle_sibling()
--------------------------------------------------
Throughput 354.619 MB/sec   1 clients   1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec   2 clients   2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec   4 clients   4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec   8 clients   8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

So improving this select_idle_sibling thing wouldn't be such a bad
thing.

Btw, I'll run your patch at

http://marc.info/?l=linux-kernel&m=134850571330618

with the same benchmark to see what it brings.

Thanks.

--
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 12:54 AM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 09:33 -0700, Linus Torvalds wrote:
> > Sure, the "scan bits" bitops will return ">= nr_cpu_ids" for the "I
> > couldn't find a bit" thing, but that doesn't mean that everything
> > else should.
>
> Fair enough..
>
> ---
>  kernel/sched/fair.c | 42 +-
>  1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b800a1..329f78d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2634,25 +2634,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   */
>  static int select_idle_sibling(struct task_struct *p, int target)
>  {
> -	int cpu = smp_processor_id();
> -	int prev_cpu = task_cpu(p);
>  	struct sched_domain *sd;
>  	struct sched_group *sg;
>  	int i;
>
> -	/*
> -	 * If the task is going to be woken-up on this cpu and if it is
> -	 * already idle, then it is the right target.
> -	 */
> -	if (target == cpu && idle_cpu(cpu))
> -		return cpu;
> -
> -	/*
> -	 * If the task is going to be woken-up on the cpu where it previously
> -	 * ran and if it is currently idle, then it the right target.
> -	 */
> -	if (target == prev_cpu && idle_cpu(prev_cpu))
> -		return prev_cpu;
> +	if (idle_cpu(target))
> +		return target;
>
>  	/*
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
> @@ -2661,18 +2648,31 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	for_each_lower_domain(sd) {
>  		sg = sd->groups;
>  		do {
> -			if (!cpumask_intersects(sched_group_cpus(sg),
> -					tsk_cpus_allowed(p)))
> -				goto next;
> +			int candidate = -1;
>
> +			/*
> +			 * In the SMT case the groups are the SMT-siblings,
> +			 * otherwise they're singleton groups.
> +			 */
>  			for_each_cpu(i, sched_group_cpus(sg)) {
> +				if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
> +					continue;
> +
> +				/*
> +				 * If any of the SMT-siblings are !idle, the
> +				 * core isn't idle.
> +				 */
>  				if (!idle_cpu(i))
>  					goto next;
> +
> +				if (candidate < 0)
> +					candidate = i;

Any reason to determine candidate by scanning a non-idle core?

>  			}
>
> -			target = cpumask_first_and(sched_group_cpus(sg),
> -					tsk_cpus_allowed(p));
> -			goto done;
> +			if (candidate >= 0) {
> +				target = candidate;
> +				goto done;
> +			}
>  next:
>  			sg = sg->next;
>  		} while (sg != sd->groups);
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
> In the not-so-distant past, we had the intel "Dunnington" Xeon, which
> was iirc basically three Core 2 duo's bolted together (ie three
> clusters of two cores sharing L2, and a fully shared L3). So that was
> a true multi-core with fairly big shared L2, and it really would be
> sad to not use the second core aggressively.

Ah indeed. My Core2Quad didn't have an L3 afaik (it's sitting around
without a PSU atm so checking gets a little hard) so the LLC level was
the L2 and all worked out right (it also not having SMT helped of
course).

But if there was a Xeon chip that did add a package L3 then yes, all
this would become more interesting still. We'd need to extend the
scheduler topology a bit as well, I don't think it can currently
handle this well.

So I guess we get to do some work for steamroller.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, Sep 24, 2012 at 07:44:17PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:54 +0200, Peter Zijlstra wrote:
>> But let me try and come up with the list thing, I think we've
>> actually got that someplace as well.
>
> OK, I'm sure the below can be written better, but my brain is gone for
> the day...

It crashes on boot due to the fact that you created a function-scope
variable called sd_llc in select_idle_sibling() and shadowed the actual
sd_llc you were interested in. Result: dereferenced uninitialised
pointer and kaboom. Trivial to fix so it boots at least.

This is a silly test for a scheduler patch, but as "sched: Avoid SMT
siblings in select_idle_sibling() if possible" regressed 2% back in
3.2, it seemed reasonable to retest with it.

KERNBENCH
                            3.6.0                3.6.0                 3.6.0
                      rc6-vanilla   rc6-mikebuddy-v1r1  rc6-idlesibling-v1r1
User    min      352.47 (  0.00%)     351.77 (  0.20%)      352.30 (  0.05%)
User    mean     353.10 (  0.00%)     352.78 (  0.09%)      352.77 (  0.09%)
User    stddev     0.41 (  0.00%)       0.56 (-36.13%)        0.35 ( 15.16%)
User    max      353.55 (  0.00%)     353.43 (  0.03%)      353.31 (  0.07%)
System  min       34.86 (  0.00%)      34.83 (  0.09%)       35.37 ( -1.46%)
System  mean      35.35 (  0.00%)      35.29 (  0.16%)       35.63 ( -0.80%)
System  stddev     0.41 (  0.00%)       0.40 (  0.10%)        0.15 ( 62.26%)
System  max       35.94 (  0.00%)      36.05 ( -0.31%)       35.81 (  0.36%)
Elapsed min      110.18 (  0.00%)     109.65 (  0.48%)      110.04 (  0.13%)
Elapsed mean     110.21 (  0.00%)     109.75 (  0.42%)      110.15 (  0.06%)
Elapsed stddev     0.03 (  0.00%)       0.07 (-167.83%)       0.09 (-207.56%)
Elapsed max      110.26 (  0.00%)     109.86 (  0.36%)      110.26 (  0.00%)
CPU     min      352.00 (  0.00%)     353.00 ( -0.28%)      352.00 (  0.00%)
CPU     mean     352.00 (  0.00%)     353.00 ( -0.28%)      352.00 (  0.00%)
CPU     stddev     0.00 (  0.00%)       0.00 (  0.00%)        0.00 (  0.00%)
CPU     max      352.00 (  0.00%)     353.00 ( -0.28%)      352.00 (  0.00%)

mikebuddy-v1r1 is Mike's patch that just got reverted. idlesibling is
Peter's patch.

Elapsed mean time is the main value of interest. Mike's patch gains
0.42%, which is less than the 2% lost, but at least the gain is outside
the noise. idlesibling makes very little difference.

System mean is also interesting: even though idlesibling shows a
regression there, it also shows that the variation between runs is
reduced. That might indicate that fewer cache misses are being incurred
in the select_idle_sibling() code, although that is a bit of a leap of
faith. The machine is in use at the moment, but I'll queue up a test
this evening to gather a profile to confirm time is even being spent in
select_idle_sibling(). Just because 2% was lost in select_idle_sibling()
back in 3.2 does not mean squat now.

-- 
Mel Gorman
SUSE Labs
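The crash Mel describes is a classic C shadowing pitfall. A minimal
userspace illustration, deliberately broken and with invented names
(this is not the kernel code in question):

    /* The outer pointer is what the function meant to use. */
    static int the_value = 42;
    static int *sd_llc = &the_value;

    static int broken_lookup(void)
    {
            int *sd_llc;    /* BUG (intentional): shadows the outer sd_llc */

            return *sd_llc; /* dereferences an uninitialised pointer: kaboom */
    }

Building with gcc -Wshadow flags exactly this pattern, and modern
compilers at -O1 and above will usually also warn that the inner sd_llc
is used uninitialised.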
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, 2012-09-25 at 14:23 +0100, Mel Gorman wrote:
> It crashes on boot due to the fact that you created a function-scope
> variable called sd_llc in select_idle_sibling() and shadowed the
> actual sd_llc you were interested in.

D'0h!
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 03:17:36PM +0200, Borislav Petkov wrote:
> For example, I did some measurements a couple of days ago on Bulldozer
> of tbench with and without select_idle_sibling:

Here are updated benchmark results with your patch here:

  http://marc.info/?l=linux-kernel&m=134850871822587

I think this pretty much confirms Mel's results.

tbench runs, single-socket OR-B (box has 8 cores, 4 CUs), tbench_srv on
localhost, tbench default settings as in debian testing.

# clients                     1        2        4        8       12       16

3.6-rc6+tip/auto-latest
                         115.91  238.571  469.606  1865.77  1863.08  1851.46
3.6-rc6+tip/auto-latest, kill select_idle_sibling():
                        354.619  534.714  900.069  1969.35  1955.91  1940.84
3.6-rc6+tip/auto-latest-revert-the-revert
                        114.001  223.171  408.507  1771.48  1757.08  1736.12
3.6-rc7+tip/auto-latest-select_idle_sibling-lists
                         107.39  222.439  435.255  1659.42  1697.43  1685.92

3.6-rc6+tip/auto-latest
-----------------------
Throughput 115.91 MB/sec    1 clients   1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec   2 clients   2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec   4 clients   4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec   8 clients   8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest, kill select_idle_sibling()
---------------------------------------------------
Throughput 354.619 MB/sec   1 clients   1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec   2 clients   2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec   4 clients   4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec   8 clients   8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

3.6-rc6+tip/auto-latest-revert-the-revert
-----------------------------------------
Throughput 114.001 MB/sec   1 clients   1 procs  max_latency=0.352 ms
Throughput 223.171 MB/sec   2 clients   2 procs  max_latency=0.348 ms
Throughput 408.507 MB/sec   4 clients   4 procs  max_latency=0.388 ms
Throughput 1771.48 MB/sec   8 clients   8 procs  max_latency=0.280 ms
Throughput 1757.08 MB/sec  12 clients  12 procs  max_latency=3.280 ms
Throughput 1736.12 MB/sec  16 clients  16 procs  max_latency=0.333 ms

3.6-rc7+tip/auto-latest-select_idle_sibling-lists
-------------------------------------------------
Throughput 107.39 MB/sec    1 clients   1 procs  max_latency=0.372 ms
Throughput 222.439 MB/sec   2 clients   2 procs  max_latency=0.345 ms
Throughput 435.255 MB/sec   4 clients   4 procs  max_latency=0.346 ms
Throughput 1659.42 MB/sec   8 clients   8 procs  max_latency=3.497 ms
Throughput 1697.43 MB/sec  12 clients  12 procs  max_latency=3.205 ms
Throughput 1685.92 MB/sec  16 clients  16 procs  max_latency=0.331 ms

-- 
Regards/Gruss,
Boris.
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov <b...@alien8.de> wrote:
>
> 3.6-rc6+tip/auto-latest, kill select_idle_sibling()

Is this literally just removing it entirely? Because apart from the
latency spike at 4 procs (and the latency numbers look very noisy, so
that's probably just noise), it looks clearly superior to everything
else. On that benchmark, at least.

How does pgbench look? That's the one that apparently really wants to
spread out, possibly due to user-level spinlocks. So I assume it will
show the reverse pattern, with "kill select_idle_sibling" being the
worst case.

Sad, because it really would be lovely to just remove that thing ;)

                Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov <b...@alien8.de> wrote:
>>
>> 3.6-rc6+tip/auto-latest, kill select_idle_sibling()
>
> Is this literally just removing it entirely?

Basically yes:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a14b990..016ba387c7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2640,6 +2640,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
         struct sched_group *sg;
         int i;
 
+        goto done;
+
         /*
          * If the task is going to be woken-up on this cpu and if it is
          * already idle, then it is the right target.

> Because apart from the latency spike at 4 procs (and the latency
> numbers look very noisy, so that's probably just noise), it looks
> clearly superior to everything else. On that benchmark, at least.

Yep, I need more results for a more reliable say here.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case.

Let me run pgbench tomorrow (I had run it only on an older family 0x10
single-node box) on Bulldozer to check that out. And we haven't started
the multi-node measurements at all.

> Sad, because it really would be lovely to just remove that thing ;)

Right, so why did we need it at all, in the first place? There has to
be some reason for it.

Thanks.

-- 
Regards/Gruss,
Boris.
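In other words (as Linus notes in the follow-up), the unconditional
goto reduces the function to the equivalent of the sketch below, which
paraphrases the effect rather than quoting the patched source:

    static int select_idle_sibling(struct task_struct *p, int target)
    {
            /* idle-sibling search skipped entirely: trust the wake target */
            return target;
    }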
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Tue, Sep 25, 2012 at 11:42 AM, Borislav Petkov <b...@alien8.de> wrote:
>> Is this literally just removing it entirely?
>
> Basically yes:

Ok, so you make it just always select 'target'. Fine. I wondered if you
just removed the calling logic entirely.

>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case.
>
> Let me run pgbench tomorrow (I had run it only on an older family 0x10
> single-node box) on Bulldozer to check that out. And we haven't
> started the multi-node measurements at all.

Ack, this clearly needs much more testing. That said, I really would
*love* to just get rid of the function entirely.

>> Sad, because it really would be lovely to just remove that thing ;)
>
> Right, so why did we need it at all, in the first place? There has to
> be some reason for it.

I'm not entirely convinced. Looking at the history of that thing, it's
long and tortuous, and has a few commits completely fixing the logic of
it (eg see commit 99bd5e2f245d). To the point where I don't think it
necessarily even matches what the original cause for it was.

So it's *possible* that we have a case of historical code that may have
improved performance originally on at least some machines, but that has
(a) been changed due to it being broken and (b) CPU's have changed too,
so it may well be that it simply doesn't help any more.

And we've had problems with this function before. See for example:

 - 4dcfe1025b51: "sched: Avoid SMT siblings in select_idle_sibling()
   if possible"

 - 518cd6234178: "sched: Only queue remote wakeups when crossing cache
   boundaries"

so we've basically had odd special-case tuning of this function from
the original. I do not think that there is any solid reason to believe
that it does what it used to do, or that what it used to do makes sense
any more. It's entirely possible that prev_cpu basically ends up being
the better choice for spreading things out.

That said, my *guess* is that when you run pgbench, you'll see the same
regression that we saw due to Mike's patch too. It simply looks like
tbench wants to have minimal cpu selection and avoid moving things
around, while pgbench probably wants to spread out maximally.

                Linus
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith <efa...@gmx.de> wrote:
>> Aside from the cache pollution I recall having been mentioned, on my
>> E5620, cross core is a tbench win over affine, cross thread is not.
>
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad. It's more a question of if we have real cores
> with separate L1's but shared L2's, go with those first, before we
> start distributing it out to separate L2's.

There is one issue though. If the tasks continue to run in this state
and the periodic balance notices an idle L2, it will force migrate
(using active migration) one of the tasks to the idle L2, as the
periodic balance tries to spread the load as far as possible to take
maximum advantage of the available resources (the perf advantage of
this really depends on the workload, cache usage/memory bw, the upside
of turbo etc). But I am not sure if this was the reason why we chose to
spread it out to separate L2's during wakeup.

Anyways, this is one of the places where Paul Turner's task load
average tracking patches will be useful. Depending on how long a task
typically runs, we can probably even choose between an SMT sibling and
a separate L2 to run on.

thanks,
suresh
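To make that last idea concrete, here is a purely hypothetical sketch
(the helper name, its parameters and the 100us threshold are all
invented, not taken from Paul Turner's patches) of how a tracked
average runtime could steer that choice at wakeup:

    #include <stdint.h>

    #define SHORT_RUN_NS 100000ULL  /* hypothetical cutoff: 100 microseconds */

    /*
     * Pick a wakeup CPU given a per-task average runtime: long runners
     * get a fully idle core (separate L2) when one exists, short
     * runners settle for a nearby idle SMT sibling.
     */
    static int pick_wake_cpu(uint64_t avg_runtime_ns, int idle_core_cpu,
                             int idle_smt_cpu, int prev_cpu)
    {
            if (idle_core_cpu >= 0 && avg_runtime_ns > SHORT_RUN_NS)
                    return idle_core_cpu;   /* long runner: avoid sibling contention */
            if (idle_smt_cpu >= 0)
                    return idle_smt_cpu;    /* short runner: a sibling will do */
            return prev_cpu;                /* nothing idle: stay put */
    }

The design point being that a task which historically runs for only a
few microseconds won't disturb a sibling for long, while a long runner
is worth placing on a fully idle core up front, which may also save the
periodic balancer from force-migrating it to an idle L2 later.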