Re: vmstat: On demand vmstat workers V3
On 23 April 2014 22:20, Max Krasnyansky wrote:
> On 04/22/2014 03:32 AM, Viresh Kumar wrote:
>> This vmstat interrupt is disturbing my core isolation :), have you got
>> anywhere with this patchset?
>
> You don't mean an interrupt, right?

Sorry for not being clear enough. I meant the interruption caused by these work items.

> The updates are done via the regular priority workqueue.
>
> I'm playing with isolation as well (has been more or less a background thing
> for the last 6+ years). Our threads that run on the isolated cores are
> SCHED_FIFO and therefore low prio workqueue stuff, like vmstat, doesn't get
> in the way.

Initially I thought that's not enough. These updates are queued as delayed work, so there is a timer plus a work item. Because there is a timer to fire, the kernel wouldn't stop the tick for long with NO_HZ_FULL, as get_next_timer_interrupt() wouldn't return KTIME_MAX. So we will stop the tick for some time but will still queue an hrtimer to fire after, say, 'n' seconds. But the clockevent device has a maximum value for the counter it runs, and it will disturb isolation with an interrupt when the counter runs out; for me that is 90 seconds.

BUT, it looks like there is something else as well here. The first time around this theory would probably hold, but because we wouldn't allow the work to run, the timer wouldn't get queued again. And so things will start working soon after.

While writing this mail, I had another thought on this point. Because there will be one task running and another queued for the work, the tick wouldn't be stopped (nr_running > 1) :( .. And so isolation wouldn't work again.

@Frederic/Kevin: Did we ever have a discussion about stopping the tick even if we have more than one task in the queue but they are SCHED_FIFO?

> I do have a few patches for the workqueues to make things better for
> isolation.

Please share them, even if they aren't mainlinable.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
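The reasoning in the message above (a pending timer, the clockevent limit, and nr_running > 1 each bound how long a CPU can stay undisturbed) can be sketched as a toy userspace model. This is only an illustration of the argument, not kernel code: `cpu_model`, `can_stop_tick_model()`, `next_interrupt_ns_model()` and the constants are hypothetical names, and the 1 ms tick period is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for "no timer pending" (the mail refers to KTIME_MAX). */
#define NO_TIMER_NS    UINT64_MAX
/* Assumed tick period of 1 ms; purely illustrative. */
#define TICK_PERIOD_NS 1000000ULL

/* Hypothetical per-CPU snapshot used only by this model. */
struct cpu_model {
    unsigned int nr_running;   /* runnable tasks on this CPU */
    uint64_t next_timer_ns;    /* delay until next pending timer */
    uint64_t max_delta_ns;     /* longest delay the clockevent can hold */
};

/* The tick can only be stopped while at most one task is runnable. */
static bool can_stop_tick_model(const struct cpu_model *c)
{
    return c->nr_running <= 1;
}

/* Delay until the next interrupt this CPU should expect. */
static uint64_t next_interrupt_ns_model(const struct cpu_model *c)
{
    if (!can_stop_tick_model(c))
        return TICK_PERIOD_NS;          /* tick keeps running */
    /* Tick stopped: bounded by the next timer and the device limit. */
    return c->next_timer_ns < c->max_delta_ns ? c->next_timer_ns
                                              : c->max_delta_ns;
}
```

With a single runnable task and no pending timer the model is bounded only by the clockevent limit (90 seconds in the mail); a queued delayed work shortens that to its timer expiry, and a second runnable task brings the periodic tick back.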
Re: vmstat: On demand vmstat workers V3
On Wed, 23 Apr 2014, Max Krasnyansky wrote:

> The updates are done via the regular priority workqueue.

Yup, so things could be fixed at that level by setting an additional workqueue flag?

> I'm playing with isolation as well (has been more or less a background thing
> for the last 6+ years). Our threads that run on the isolated cores are
> SCHED_FIFO and therefore low prio workqueue stuff, like vmstat, doesn't get
> in the way.
> I do have a few patches for the workqueues to make things better for
> isolation.

Would you share those with us please?
Re: vmstat: On demand vmstat workers V3
Hi Viresh,

On 04/22/2014 03:32 AM, Viresh Kumar wrote:
> On Thu, Oct 3, 2013 at 11:10 PM, Christoph Lameter wrote:
>> V2->V3:
>> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>>   if that is exactly what we want but it is a start. Thomas?
>> - Migrate the shepherd task if the output of
>>   tick_get_housekeeping_cpu() changes.
>> - Fixes recommended by Andrew.
>
> Hi Christoph,
>
> This vmstat interrupt is disturbing my core isolation :), have you got
> anywhere with this patchset?

You don't mean an interrupt, right? The updates are done via the regular priority workqueue.

I'm playing with isolation as well (it has been more or less a background thing for the last 6+ years). Our threads that run on the isolated cores are SCHED_FIFO, and therefore low prio workqueue stuff, like vmstat, doesn't get in the way.

I do have a few patches for the workqueues to make things better for isolation.

Max
Re: vmstat: On demand vmstat workers V3
On 22 April 2014 19:08, Christoph Lameter wrote:
> Sorry, no, too much other stuff. Would be glad if you could improve on it.
> Should have some time on Friday to look at it.

Really busy with other activities for improving core isolation; it doesn't look like I will get enough time to get this done :(
Re: vmstat: On demand vmstat workers V3
On Tue, 22 Apr 2014, Viresh Kumar wrote:

> On Thu, Oct 3, 2013 at 11:10 PM, Christoph Lameter wrote:
> > V2->V3:
> > - Introduce a new tick_get_housekeeping_cpu() function. Not sure
> >   if that is exactly what we want but it is a start. Thomas?
> > - Migrate the shepherd task if the output of
> >   tick_get_housekeeping_cpu() changes.
> > - Fixes recommended by Andrew.
>
> This vmstat interrupt is disturbing my core isolation :), have you got
> anywhere with this patchset?

Sorry, no, too much other stuff. Would be glad if you could improve on it. Should have some time on Friday to look at it.
Re: vmstat: On demand vmstat workers V3
On Thu, Oct 3, 2013 at 11:10 PM, Christoph Lameter wrote:
> V2->V3:
> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>   if that is exactly what we want but it is a start. Thomas?
> - Migrate the shepherd task if the output of
>   tick_get_housekeeping_cpu() changes.
> - Fixes recommended by Andrew.

Hi Christoph,

This vmstat interrupt is disturbing my core isolation :), have you got anywhere with this patchset?

--
viresh
Re: vmstat: On demand vmstat workers V3
On Sat, 16 Nov 2013, Frederic Weisbecker wrote:

> Not really. Thomas suggested an infrastructure to offload the handling of
> CPU-local periodic jobs to a set of remote housekeeping CPUs.

As I said in my reply to that proposal, this is not possible since the cpu local jobs rely on cpu local operations in order to reduce the impact of statistics keeping on vm operations.

> Now the problem is that vmstat updates use pure local lockless
> operations. It may be possible to offload this update to remote CPUs
> but then we need to convert vmstat updates to use locks. Which is
> potentially costly. Unless we can find some clever lockless update
> scheme. Do you think this can be possible?

We got to these per cpu operations for vm statistics because they can have a significant influence on kernel performance. Experiments in this area have usually led to significant performance degradations. We have code in the VM that fine tunes the limits of when global data is updated due to the performance impact that these limits have.

> > + schedule_delayed_work_on(s, d,
> > +         __round_jiffies_relative(sysctl_stat_interval, s));
>
> Note that on dynticks idle (CONFIG_NO_HZ_IDLE=y), the timekeeper CPU can
> change quickly and often.
>
> I can imagine a nasty race there: CPU 0 is the timekeeper. It schedules the
> vmstat shepherd work in 2 seconds. But CPU 0 goes to sleep for a long while
> and some other CPU takes the timekeeping duty. The shepherd timer won't be
> processed until CPU 0 wakes up although we may have CPUs to monitor.
>
> CONFIG_NO_HZ_FULL may work incidentally because CPU 0 is the only timekeeper
> there but this is a temporary limitation. Expect the timekeeper to be dynamic
> in the future under that config.

Could we stabilize the timekeeper? It's not really productive to move time and other processing between different cores. Low latency configurations mean that processes are bound to certain processors. Moving processing between cores causes cache disturbances and therefore more latencies. Also timekeeping tunes its clock depending on the performance of a core. Timekeeping could be thrown off.

I could make this depend on CONFIG_NO_HZ_FULL or we can introduce another config option that keeps the timekeeper constant.

> So such a system that dynamically schedules timers on demand is enough if we
> want to _minimize_ timers. But what we want is a strong guarantee that the
> CPU won't be disturbed at least while it runs in userland, right?

Sure, if we could have that then we'd want it.

> I mean, we are not only interested in optimizations but also in guarantees if
> we have an extreme workload that strongly depends on the CPU not being
> disturbed at all. I know that some people in realtime want that. And I
> thought it's also what you want, maybe I misunderstood your usecase?

Sure, I want that too if it's possible. I do not know of any design that would be acceptable performance wise that would allow us to do that. Failing that, I think that what I proposed is the best way to get rid of as much OS noise as possible.

Also, if a process invokes a system call then there are numerous reasons for the OS to enable the tick. E.g. any network actions may require softirq processing, block operations may need something else. So this is not the only reason that the OS would have to interrupt the application. The lesson here is that a low latency application should avoid using system calls that require deferred processing.

I can refine this approach if we agree to go forward with the basic idea here of switching the folding of differentials on and off.
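The per cpu differentials and the tuned limits described in the reply above can be illustrated with a small userspace sketch. It is a single-threaded model under stated assumptions, not the code in mm/vmstat.c: `mod_state_model()`, `fold_diff_model()`, `STAT_THRESHOLD_MODEL` and the fixed CPU count are hypothetical, and the threshold value is arbitrary. Each modeled CPU updates its own small differential without touching shared data, and the shared atomic counter is only written when the differential exceeds the threshold or when a fold is requested.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS_MODEL        4   /* hypothetical CPU count */
#define STAT_THRESHOLD_MODEL 32  /* arbitrary fold threshold */

static atomic_long global_count;        /* shared, expensive to touch */
static int8_t cpu_diff[NR_CPUS_MODEL];  /* byte-sized per-CPU diffs */

/* Local update; assumes small deltas so x fits before folding. */
static void mod_state_model(int cpu, int delta)
{
    int x = cpu_diff[cpu] + delta;

    if (x > STAT_THRESHOLD_MODEL || x < -STAT_THRESHOLD_MODEL) {
        /* Differential overflowed the limit: fold into global. */
        atomic_fetch_add(&global_count, x);
        x = 0;
    }
    cpu_diff[cpu] = (int8_t)x;
}

/* Periodic fold; returns 1 if the global counter was updated. */
static int fold_diff_model(int cpu)
{
    int x = cpu_diff[cpu];

    if (!x)
        return 0;
    cpu_diff[cpu] = 0;
    atomic_fetch_add(&global_count, x);
    return 1;
}
```

The return value of `fold_diff_model()` plays the role that the patch gives to the reworked fold function: a caller can tell whether there was anything to fold and decide whether periodic folding is still worth scheduling.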
Re: vmstat: On demand vmstat workers V3
On Thu, Oct 03, 2013 at 05:40:40PM +, Christoph Lameter wrote:
> V2->V3:
> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>   if that is exactly what we want but it is a start. Thomas?

Not really. Thomas suggested an infrastructure to offload the handling of CPU-local periodic jobs to a set of remote housekeeping CPUs. This could potentially be useful for many kinds of stats relying on periodic updates, the scheduler tick being a candidate (I have yet to check if we can really apply that in practice though).

Now the problem is that vmstat updates use pure local lockless operations. It may be possible to offload this update to remote CPUs but then we need to convert vmstat updates to use locks. Which is potentially costly. Unless we can find some clever lockless update scheme. Do you think this can be possible?

See below for more detailed review:

[...]
>
>  /*
> @@ -1213,12 +1229,15 @@ static const struct file_operations proc
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
> +static struct cpumask *monitored_cpus;
>
>  static void vmstat_update(struct work_struct *w)
>  {
> -	refresh_cpu_vm_stats();
> -	schedule_delayed_work(&__get_cpu_var(vmstat_work),
> -		round_jiffies_relative(sysctl_stat_interval));
> +	if (refresh_cpu_vm_stats())
> +		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> +			round_jiffies_relative(sysctl_stat_interval));
> +	else
> +		cpumask_set_cpu(smp_processor_id(), monitored_cpus);

That looks racy against other CPUs that may set their own bit and also against the shepherd that clears processed monitored CPUs.

That seems to matter because a CPU could simply be entirely forgotten by vmstat and never processed again.

>  }
>
>  static void start_cpu_timer(int cpu)
> @@ -1226,7 +1245,70 @@ static void start_cpu_timer(int cpu)
>  	struct delayed_work *work = &per_cpu(vmstat_work, cpu);
>
>  	INIT_DEFERRABLE_WORK(work, vmstat_update);
> -	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
> +	schedule_delayed_work_on(cpu, work,
> +		__round_jiffies_relative(sysctl_stat_interval, cpu));
> +}
> +
> +/*
> + * Check if the diffs for a certain cpu indicate that
> + * an update is needed.
> + */
> +static bool need_update(int cpu)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone) {
> +		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
> +
> +		/*
> +		 * The fast way of checking if there are any vmstat diffs.
> +		 * This works because the diffs are byte sized items.
> +		 */
> +		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static void vmstat_shepherd(struct work_struct *w)
> +{
> +	int cpu;
> +	int s = tick_get_housekeeping_cpu();
> +	struct delayed_work *d = per_cpu_ptr(&vmstat_work, s);
> +
> +	refresh_cpu_vm_stats();
> +
> +	for_each_cpu(cpu, monitored_cpus)
> +		if (need_update(cpu)) {
> +			cpumask_clear_cpu(cpu, monitored_cpus);
> +			start_cpu_timer(cpu);
> +		}
> +
> +	if (s != smp_processor_id()) {
> +		/* Timekeeping was moved. Move the shepherd worker */
> +		cpumask_set_cpu(smp_processor_id(), monitored_cpus);
> +		cpumask_clear_cpu(s, monitored_cpus);
> +		cancel_delayed_work_sync(d);
> +		INIT_DEFERRABLE_WORK(d, vmstat_shepherd);
> +	}
> +
> +	schedule_delayed_work_on(s, d,
> +		__round_jiffies_relative(sysctl_stat_interval, s));

Note that on dynticks idle (CONFIG_NO_HZ_IDLE=y), the timekeeper CPU can change quickly and often.

I can imagine a nasty race there: CPU 0 is the timekeeper. It schedules the vmstat shepherd work in 2 seconds. But CPU 0 goes to sleep for a long while and some other CPU takes the timekeeping duty. The shepherd timer won't be processed until CPU 0 wakes up although we may have CPUs to monitor.

CONFIG_NO_HZ_FULL may work incidentally because CPU 0 is the only timekeeper there but this is a temporary limitation. Expect the timekeeper to be dynamic in the future under that config.

> +
> +}
> +
> +static void __init start_shepherd_timer(void)
> +{
> +	int cpu = tick_get_housekeeping_cpu();
> +	struct delayed_work *d = per_cpu_ptr(&vmstat_work, cpu);
> +
> +	INIT_DEFERRABLE_WORK(d, vmstat_shepherd);
> +	monitored_cpus = kmalloc(BITS_TO_LONGS(nr_cpu_ids) * sizeof(long),
> +		GFP_KERNEL);
> +	cpumask_copy(monitored_cpus, cpu_online_mask);
> +	cpumask_clear_cpu(cpu, monitored_cpus);
> +	schedule_delayed_work_on(cpu, d,
> +		__round_jiffies_relative(sysctl_stat_interval, cpu));
>  }

So another issue with the whole design of this patch, outside its
races, is that a CPU can run full dynticks, do some quick system call at some
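The need_update() check in the patch under review relies on memchr_inv() to find the first byte that differs from a given value, which works because the diffs are byte-sized. A minimal userspace sketch of that idea follows; `memchr_inv_model()` is a plain byte loop written for illustration (not the kernel's implementation), and `need_update_model()` and `NR_ITEMS_MODEL` are hypothetical stand-ins for the per-zone loop and NR_VM_ZONE_STAT_ITEMS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ITEMS_MODEL 8  /* hypothetical number of counters */

/*
 * Return a pointer to the first byte in [start, start + bytes) that
 * is NOT equal to c, or NULL if every byte equals c.
 */
static const void *memchr_inv_model(const void *start, int c, size_t bytes)
{
    const unsigned char *p = start;

    for (size_t i = 0; i < bytes; i++)
        if (p[i] != (unsigned char)c)
            return p + i;
    return NULL;
}

/* True if any byte-sized differential is nonzero. */
static bool need_update_model(const int8_t diff[NR_ITEMS_MODEL])
{
    return memchr_inv_model(diff, 0, NR_ITEMS_MODEL) != NULL;
}
```

Scanning the array as raw bytes avoids inspecting each counter individually; it would not be a valid shortcut if the differentials were wider than one byte and the caller cared about which counter changed.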
Re: vmstat: On demand vmstat workers V3
Hmmm... This has been sitting there for over a month. What can I do to make progress on merging this?
vmstat: On demand vmstat workers V3
V2->V3:
- Introduce a new tick_get_housekeeping_cpu() function. Not sure
  if that is exactly what we want but it is a start. Thomas?
- Migrate the shepherd task if the output of
  tick_get_housekeeping_cpu() changes.
- Fixes recommended by Andrew.

V1->V2:
- Optimize the need_update check by using memchr_inv.
- Clean up.

vmstat workers are used for folding counter differentials into the
zone, per node and global counters at certain time intervals.
They currently run at defined intervals on all processors, which will
cause some holdoff for processors that need minimal intrusion by the
OS.

The current vmstat_update mechanism depends on a deferrable timer
firing every other second by default, which registers a work queue item
that runs on the local CPU, with the result that we have 1 interrupt
and one additional schedulable task on each CPU every 2 seconds.
If a workload indeed causes VM activity or multiple tasks are running
on a CPU, then there are probably bigger issues to deal with.

However, some workloads dedicate a CPU to a single CPU bound task.
This is done in high performance computing, in high frequency
financial applications, and in networking (Intel DPDK, EZchip NPS).
With the advent of systems with more and more CPUs over time, this may
become more and more common, since when one has enough CPUs
one cares less about efficiently sharing a CPU with other tasks and
more about efficiently monopolizing a CPU per task.

The difference of having this timer firing and workqueue kernel thread
scheduled per second can be enormous. An artificial test measuring the
worst case time to do a simple "i++" in an endless loop on a bare metal
system and under Linux on an isolated CPU with dynticks, with and
without this patch, has Linux match the bare metal performance
(~700 cycles) with this patch and lose by a couple of orders of magnitude
(~200k cycles) without it[*]. The loss occurs for something that just
calculates statistics. For networking applications, for example, this
could be the difference between dropping packets and sustaining line rate.

Statistics are important and useful, but it would be great if there
were a way to keep statistics gathering from producing a huge
performance difference. This patch does just that.

This patch creates a vmstat shepherd worker that monitors the
per cpu differentials on all processors. If there are differentials
on a processor then a vmstat worker local to the processor
with the differentials is created. That worker will then start
folding the diffs at regular intervals. Should the worker
find that there is no work to be done then it will make the shepherd
worker monitor the differentials again.

With this patch it is then possible to have periods longer than
2 seconds without any OS event on a "cpu" (hardware thread).

Reviewed-by: Gilad Ben-Yossef <gi...@benyossef.com>
Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c	2013-10-03 12:06:06.501932283 -0500
+++ linux/mm/vmstat.c	2013-10-03 12:27:49.403384459 -0500
@@ -14,12 +14,14 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
 #include <linux/writeback.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/tick.h>
 
 #include "internal.h"
 
@@ -417,13 +419,22 @@ void dec_zone_page_state(struct page *pa
 EXPORT_SYMBOL(dec_zone_page_state);
 #endif
 
-static inline void fold_diff(int *diff)
+
+/*
+ * Fold a differential into the global counters.
+ * Returns the number of counters updated.
+ */
+static inline int fold_diff(int *diff)
 {
 	int i;
+	int changes = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i])
+		if (diff[i]) {
 			atomic_long_add(diff[i], &vm_stat[i]);
+			changes++;
+		}
+	return changes;
 }
 
 /*
@@ -439,12 +450,15 @@ static inline void fold_diff(int *diff)
  * statistics in the remote zone struct as well as the global cachelines
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
+ *
+ * The function returns the number of global counters updated.
  */
-static void refresh_cpu_vm_stats(void)
+static int refresh_cpu_vm_stats(void)
 {
 	struct zone *zone;
 	int i;
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int changes = 0;
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -484,15 +498,17 @@ static void refresh_cpu_vm_stats(void)
 			continue;
 		}
 
-
 		if (__this_cpu_dec_return(p->expire))
 			continue;
 
-		if (__this_cpu_read(p->pcp.count))
+		if (__this_cpu_read(p->pcp.count)) {
 			drain_zone_pages(zone, __this_cpu_ptr(&p->pcp));
+			changes++;
+		}
 #endif
 	}
-
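The shepherd/worker handoff described in the patch text can be modeled in user space. What follows is a minimal sketch under stated assumptions, not the kernel implementation: plain arrays stand in for per-cpu data, NR_CPUS and NR_ITEMS are made-up sizes, and fold_cpu(), shepherd_pass() and worker_tick() are hypothetical helper names. need_update() is named after the check mentioned in the changelog, but its body here is an assumption; the byte scan only imitates what memchr_inv would do in the kernel. There are no workqueues, timers or locking in this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS  4	/* hypothetical CPU count for this model */
#define NR_ITEMS 8	/* stands in for NR_VM_ZONE_STAT_ITEMS */

static int  cpu_diff[NR_CPUS][NR_ITEMS];	/* per-CPU differentials */
static long global_stat[NR_ITEMS];		/* global counters */
static bool worker_active[NR_CPUS];		/* local worker running? */

/* True if any differential byte on this CPU is nonzero. */
static bool need_update(int cpu)
{
	const unsigned char *p = (const unsigned char *)cpu_diff[cpu];
	size_t n;

	assert(cpu >= 0 && cpu < NR_CPUS);
	for (n = 0; n < sizeof(cpu_diff[cpu]); n++)
		if (p[n])
			return true;
	return false;
}

/* Fold one CPU's differentials; return the number of counters updated. */
static int fold_cpu(int cpu)
{
	int i, changes = 0;

	for (i = 0; i < NR_ITEMS; i++) {
		if (cpu_diff[cpu][i]) {
			global_stat[i] += cpu_diff[cpu][i];
			cpu_diff[cpu][i] = 0;
			changes++;
		}
	}
	return changes;
}

/* Shepherd: start a worker on every idle CPU that has pending diffs. */
static int shepherd_pass(void)
{
	int cpu, started = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!worker_active[cpu] && need_update(cpu)) {
			worker_active[cpu] = true;
			started++;
		}
	}
	return started;
}

/* Worker: fold diffs; with nothing to fold, hand back to the shepherd. */
static int worker_tick(int cpu)
{
	int changes;

	if (!worker_active[cpu])
		return 0;
	changes = fold_cpu(cpu);
	if (!changes)
		worker_active[cpu] = false;
	return changes;
}
```

Calling shepherd_pass() once per shepherd interval and worker_tick(cpu) once per worker interval reproduces the cycle from the description: a CPU without differentials is only inspected by the shepherd and runs no local work until its counters change again.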