Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
On Wed, Aug 3, 2016 at 11:24 AM, Rafael Aquini <aqu...@redhat.com> wrote:
> IIRC one of the issues Linus had with previous attempts was because
> they were utilizing/bringing back a node-memory state based heuristic.
>
> Since Kyle patch is using a global state counter for that matter,
> I think that issue condition might now be sorted out.

It's been a few weeks since the last feedback. Are there any further
questions or concerns I can help out with?

--
Kyle Walker
[PATCH v2] clocksource: Defer override invalidation unless clock is unstable
Clocksources don't get the VALID_FOR_HRES flag until they have been
checked by a watchdog. However, when using an override, the
clocksource_select logic will clear the override value if the
clocksource is not marked VALID_FOR_HRES during that initial check.
When using the "clocksource=" boot argument, this selection can run
before the watchdog, and can cause the override to be incorrectly
cleared.

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Martin Schwidefsky <schwidef...@de.ibm.com>
Cc: linux-kernel@vger.kernel.org
---

Notes:
    Changes from v1:

    * Altered changelog description, many thanks to John Stultz for the
      assist!

 kernel/time/clocksource.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..4c1bb2a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -600,9 +600,18 @@ static void __clocksource_select(bool skipcur)
 		 */
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
 			/* Override clocksource cannot be used. */
-			pr_warn("Override clocksource %s is not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
-				cs->name);
-			override_name[0] = 0;
+			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
+					cs->name);
+				override_name[0] = 0;
+			} else {
+				/*
+				 * The override cannot be currently verified.
+				 * Deferring to let the watchdog check.
+				 */
+				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
+					cs->name);
+			}
 		} else
 			/* Override clocksource can be used. */
 			best = cs;
--
2.5.5
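For readers following the branch structure of the hunk, here is a minimal user-space sketch of the decision it makes. It is only a model: the flag values, the enum, and the helper name override_decision() are hypothetical stand-ins and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative flag bits; NOT the kernel's CLOCK_SOURCE_* values. */
#define CS_VALID_FOR_HRES 0x1u
#define CS_UNSTABLE       0x2u

enum override_action {
	OVERRIDE_USE,        /* the override can be selected now */
	OVERRIDE_DEFER,      /* keep override_name, wait for the watchdog */
	OVERRIDE_INVALIDATE  /* clear override_name */
};

/*
 * Hypothetical helper mirroring the patched branch: an override that is
 * not yet marked valid for high-resolution mode is only dropped when it
 * has already been flagged unstable.
 */
enum override_action override_decision(unsigned int flags, bool oneshot)
{
	if (!(flags & CS_VALID_FOR_HRES) && oneshot) {
		if (flags & CS_UNSTABLE)
			return OVERRIDE_INVALIDATE;
		return OVERRIDE_DEFER;
	}
	return OVERRIDE_USE;
}
```

Under these assumptions, a clocksource that has not yet been checked (no flags set) while the tick code is in oneshot mode yields OVERRIDE_DEFER, which corresponds to the new pr_info() path.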
Re: [PATCH resend] clocksource: Defer override invalidation unless clock is unstable
Good evening John,

On Wed, Jul 27, 2016 at 10:29 AM, Kyle Walker <kwal...@redhat.com> wrote:
> The issue I'm running into is that the override is not HRT compatible yet.
> Though it will be later in the boot process, unless the clocksource watchdog
> marks the clocksource as unstable.
>
> The issue with the current implementation is that the override_name value is
> disabled when the tsc is first checked, before the watchdog has a chance to
> check it and mark it stable or unstable.
>
> Without patch:
> $ dmesg | grep -e clocksource
> clocksource: refined-jiffies:
> Kernel command line: clocksource=tsc
> clocksource: hpet:
> clocksource: xen:
> clocksource: jiffies:
> clocksource: Switched to clocksource xen
> clocksource: acpi_pm:
> tsc: Refined TSC clocksource calibration: 2394.399 MHz
> clocksource: tsc:
> clocksource: Override clocksource tsc is not HRT compatible -
>
> With patch:
> $ dmesg | grep -e clocksource
> clocksource: refined-jiffies:
> Kernel command line: clocksource=tsc
> clocksource: hpet:
> clocksource: xen:
> clocksource: jiffies:
> clocksource: Switched to clocksource xen
> clocksource: acpi_pm:
> tsc: Refined TSC clocksource calibration: 2394.461 MHz
> clocksource: tsc:
> clocksource: Override clocksource tsc is not currently HRT compatible - deferring
> clocksource: Switched to clocksource tsc

Is there anything else needed from my end? Please let me know if there is
any further information or clarification I can provide.

Have a great evening!

--
Kyle Walker
Re: [PATCH resend] clocksource: Defer override invalidation unless clock is unstable
I'm so sorry for the duplicate, gmail managed to sneak in some HTML.
Resending due to the mailing list correctly blocking the initial send.

On Tue, Jul 26, 2016 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
> Sorry for not getting back to you. This has been in my to-look-at list.

No problem at all. Thanks for taking a look!

> On Tue, Jul 26, 2016 at 2:24 PM, Kyle Walker <kwal...@redhat.com> wrote:
>> The clocksource_select() operation will attempt to use the clocksource override
>> to apply the desired clocksource when the "clocksource=" boot parameter is
>> supplied. However, in the event that "clocksource=tsc" is used on a system
>> where there is a more desirable clocksource available, the boot parameter
>> fails. This is due to the TSC clocksource being installed unvalidated, but
>> the override being invalidated during the initial run through
>> clocksource_done_booting().
>
> I've read this a few times, and I'm not sure I really understand it.
>
> Can you give an example of a "more desirable clocksource" than the
> TSC? Especially when the TSC was specified as a boot argument?
>

I apologize for the confusion. By "more desirable", I mean that there is
another clocksource that the system would select by default. For example,
on Xen platforms, the "xen" clocksource is the "best" as determined by
clocksource_find_best(), being at the top of the list.

crash> list -s clocksource.name,rating clocksource.list -H clocksource_list
81c0cb00
  name = 0x817ad585 "xen"
  rating = 400
81a98540
  name = 0x817c2873 "tsc"
  rating = 300
81aa6580
  name = 0x817b118a "hpet"
  rating = 250
81b22b00
  name = 0x817c02b7 "acpi_pm"
  rating = 120
81ab48c0
  name = 0x817c05c6 "jiffies"
  rating = 1

In that scenario, the xen clocksource would be used if no override was
specified.

>
> The logic here is confusing as well. So.. if the override is not HRT
> compatible, we check if it's stable or not? Once we're in HRT there's
> not much likelihood of us going into non HRT mode. I'm not sure what
> the stability has to do with it here.
>
> Sorry, could you explain the case you're running into in some further detail?

The issue I'm running into is that the override is not HRT compatible yet.
Though it will be later in the boot process, unless the clocksource watchdog
marks the clocksource as unstable.

The issue with the current implementation is that the override_name value is
disabled when the tsc is first checked, before the watchdog has a chance to
check it and mark it stable or unstable.

Without patch:
$ dmesg | grep -e clocksource
clocksource: refined-jiffies:
Kernel command line: clocksource=tsc
clocksource: hpet:
clocksource: xen:
clocksource: jiffies:
clocksource: Switched to clocksource xen
clocksource: acpi_pm:
tsc: Refined TSC clocksource calibration: 2394.399 MHz
clocksource: tsc:
clocksource: Override clocksource tsc is not HRT compatible -

With patch:
$ dmesg | grep -e clocksource
clocksource: refined-jiffies:
Kernel command line: clocksource=tsc
clocksource: hpet:
clocksource: xen:
clocksource: jiffies:
clocksource: Switched to clocksource xen
clocksource: acpi_pm:
tsc: Refined TSC clocksource calibration: 2394.461 MHz
clocksource: tsc:
clocksource: Override clocksource tsc is not currently HRT compatible - deferring
clocksource: Switched to clocksource tsc

Please let me know if there is any further clarification needed.

Have a good one!

--
Kyle Walker
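To make the "xen wins unless an override is given" point concrete, here is a small user-space sketch of rating-based selection with an optional name override. It is a simplified model, assuming only what the crash output above shows (names and ratings); struct cs_entry and pick_clocksource() are hypothetical and do not exist in the kernel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a registered clocksource. */
struct cs_entry {
	const char *name;
	int rating;
};

/*
 * Return the entry matching override_name if one is given and found,
 * otherwise the entry with the highest rating. Returns NULL when the
 * list is empty.
 */
const struct cs_entry *pick_clocksource(const struct cs_entry *list, size_t n,
					const char *override_name)
{
	const struct cs_entry *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (override_name && override_name[0] != '\0' &&
		    strcmp(list[i].name, override_name) == 0)
			return &list[i];
		if (!best || list[i].rating > best->rating)
			best = &list[i];
	}
	return best;
}
```

With the five entries listed above, an empty override selects "xen" (rating 400), while an override of "tsc" selects "tsc". The model deliberately leaves out the HRT and watchdog checks that the patch is about.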
[PATCH resend] clocksource: Defer override invalidation unless clock is unstable
The clocksource_select() operation will attempt to use the clocksource
override to apply the desired clocksource when the "clocksource=" boot
parameter is supplied. However, in the event that "clocksource=tsc" is used
on a system where there is a more desirable clocksource available, the boot
parameter fails. This is due to the TSC clocksource being installed
unvalidated, but the override being invalidated during the initial run
through clocksource_done_booting().

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
---

Notes:
    Resend due to no feedback on the initial submit. Thank you in advance!

 kernel/time/clocksource.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..4c1bb2a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -600,9 +600,18 @@ static void __clocksource_select(bool skipcur)
 		 */
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
 			/* Override clocksource cannot be used. */
-			pr_warn("Override clocksource %s is not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
-				cs->name);
-			override_name[0] = 0;
+			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
+					cs->name);
+				override_name[0] = 0;
+			} else {
+				/*
+				 * The override cannot be currently verified.
+				 * Deferring to let the watchdog check.
+				 */
+				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
+					cs->name);
+			}
 		} else
 			/* Override clocksource can be used. */
 			best = cs;
--
2.5.5
Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
On Mon, Jul 25, 2016 at 4:47 PM, Andrew Morton <a...@linux-foundation.org> wrote:
>
> Can this suffering be quantified please?
>

The observed suffering is primarily visible within an IBM Qradar
installation. From a high level, the lowered limit on the number of advisory
readahead pages results in a 3-5x increase in the time necessary to complete
an identical query within the application.

Note, all of the below values are with readahead configured to 64 KiB.

Baseline behaviour - Prior to:
    600e19af ("mm: use only per-device readahead limit")
    6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
              nodes and limit readahead pages")

    Result:
        Qradar - "username equals root" query - 57.3s to complete search

New performance - With:
    600e19af ("mm: use only per-device readahead limit")
    6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
              nodes and limit readahead pages")

    Result:
        Qradar - "username equals root" query - 245.7s to complete search

Proposed behaviour - With the proposed patch in place.

    Result:
        Qradar - "username equals root" query - 57s to complete search

In narrowing the source of the performance deficit, it was observed that
the amount of data loaded into pagecache via madvise was quite a bit lower
following the noted commits. As simply reverting those lower limits was not
accepted previously, the proposed alternative strategy seemed like the most
beneficial path forwards.

>
> Linus probably has opinions ;)
>

I understand that changes to readahead that are very similar have been
proposed quite a bit lately. If there are any changes or testing needed,
I'm more than happy to tackle that.

Thank you in advance!

--
Kyle Walker
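One way to observe how much of a file lands in pagecache after an advisory call is to map it, issue madvise(MADV_WILLNEED), and count resident pages with mincore(). The sketch below assumes a Linux system; the helper name resident_after_willneed() is hypothetical, and this is not necessarily how the figures in this thread were gathered. Readahead triggered by MADV_WILLNEED is asynchronous, so a count taken immediately afterwards may understate what eventually gets loaded.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a file read-only, advise the kernel that the whole range will be
 * needed, then count how many of its pages are resident in memory.
 * Returns the resident page count, or -1 on any failure.
 */
long resident_after_willneed(const char *path)
{
	long resident = -1;
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	if (fstat(fd, &st) == 0 && st.st_size > 0) {
		size_t len = (size_t)st.st_size;
		size_t page = (size_t)sysconf(_SC_PAGESIZE);
		size_t pages = (len + page - 1) / page;
		void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

		if (map != MAP_FAILED) {
			unsigned char *vec = malloc(pages);

			if (vec && madvise(map, len, MADV_WILLNEED) == 0 &&
			    mincore(map, len, vec) == 0) {
				resident = 0;
				/* Bit 0 of each vec entry marks a resident page. */
				for (size_t i = 0; i < pages; i++)
					resident += vec[i] & 1;
			}
			free(vec);
			munmap(map, len);
		}
	}
	close(fd);
	return resident;
}
```

Comparing the returned count on a cold file before and after a kernel change gives a rough indication of whether the advisory readahead is being capped.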
[PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls
Java workloads using the MappedByteBuffer library result in the fadvise()
and madvise() syscalls being used extensively. Following recent readahead
limiting alterations, such as 600e19af ("mm: use only per-device readahead
limit") and 6d2be915 ("mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages"), application performance
suffers in instances where small readahead is configured.

By moving this limit outside of the syscall codepaths, the syscalls are
able to advise an inordinately large amount of readahead when desired,
with a cap imposed based on half of NR_INACTIVE_FILE and NR_FREE_PAGES.
In essence, this allows performance tuning efforts to define a small
readahead limit while still benefiting selectively from large sequential
readahead values.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Michal Hocko <mho...@suse.com>
Cc: Geliang Tang <geliangt...@163.com>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: Roman Gushchin <kl...@yandex-team.ru>
Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
---
 mm/readahead.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 65ec288..6f8bb44 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -211,7 +211,9 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
+	nr_to_read = min(nr_to_read, (global_page_state(NR_INACTIVE_FILE) +
+				     (global_page_state(NR_FREE_PAGES)) / 2));
+
 	while (nr_to_read) {
 		int err;
 
@@ -484,6 +486,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 
 	/* be dumb */
 	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
 		force_page_cache_readahead(mapping, filp, offset, req_size);
 		return;
 	}
--
2.5.5
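The arithmetic of the new cap can be modelled in user space. The sketch below uses plain unsigned long inputs in place of the kernel counters, and all three function names are hypothetical. One detail it makes visible: because division binds tighter than addition, the expression as written in the hunk evaluates to NR_INACTIVE_FILE plus half of NR_FREE_PAGES, which differs from half of the sum that the changelog wording could be read to describe. Which of the two was intended is not stated in the thread.

```c
/* Smaller of two page counts. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Cap as the hunk is written: inactive_file + (free_pages / 2). */
unsigned long cap_as_written(unsigned long nr_to_read,
			     unsigned long inactive_file,
			     unsigned long free_pages)
{
	return min_ul(nr_to_read, inactive_file + free_pages / 2);
}

/* Alternative reading of the changelog: (inactive_file + free_pages) / 2. */
unsigned long cap_half_of_sum(unsigned long nr_to_read,
			      unsigned long inactive_file,
			      unsigned long free_pages)
{
	return min_ul(nr_to_read, (inactive_file + free_pages) / 2);
}

/* Per-device limit kept for the FMODE_RANDOM path in the second hunk. */
unsigned long cap_per_device(unsigned long req_size, unsigned long ra_pages)
{
	return min_ul(req_size, ra_pages);
}
```

For example, with 1000 inactive file pages and 2000 free pages, a request for 100000 pages is capped at 2000 pages by the first form and at 1500 by the second. A 64 KiB readahead setting with 4 KiB pages corresponds to ra_pages of 16, which still bounds the random-access path.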
[PATCH] clocksource: Defer override invalidation unless clock is unstable
The clocksource_select() operation will attempt to use the clocksource
override to apply the desired clocksource when the "clocksource=" boot
parameter is supplied. However, in the event that "clocksource=tsc" is used
on a system where there is a more desirable clocksource available, the boot
parameter fails. This is due to the TSC clocksource being installed
unvalidated, but the override being invalidated during the initial run
through clocksource_done_booting().

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
---
 kernel/time/clocksource.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..4c1bb2a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -600,9 +600,18 @@ static void __clocksource_select(bool skipcur)
 		 */
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
 			/* Override clocksource cannot be used. */
-			pr_warn("Override clocksource %s is not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
-				cs->name);
-			override_name[0] = 0;
+			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
+					cs->name);
+				override_name[0] = 0;
+			} else {
+				/*
+				 * The override cannot be currently verified.
+				 * Deferring to let the watchdog check.
+				 */
+				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
+					cs->name);
+			}
 		} else
 			/* Override clocksource can be used. */
 			best = cs;
--
2.5.5
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes <rient...@google.com> wrote:
>
> I struggle to understand how the approach of randomly continuing to kill
> more and more processes in the hope that it slows down usage of memory
> reserves or that we get lucky is better.

Thank you to one and all for the feedback.

I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
and omitting them from the oom selection process, continuing the
carnage is likely to result in more unpredictable results. At this time,
I believe Oleg's solution of zapping the process memory use while
it sleeps with the fatal signal en route is ideal.

Kyle Walker
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
Currently, the oom killer will attempt to kill a process that is in
TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
period of time, such as processes writing to a frozen filesystem during
a lengthy backup operation, this can result in a deadlock condition as
related processes' memory accesses will stall within the page fault
handler.

Within oom_unkillable_task(), check for processes in
TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
move on to another task.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
---
 mm/oom_kill.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..66f03f8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (memcg && !task_in_mem_cgroup(p, memcg))
 		return true;
 
+	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
+	if (p->state == TASK_UNINTERRUPTIBLE)
+		return true;
+
 	/* p may not have freeable memory in nodemask */
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
--
2.4.3
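The "(TASK_KILLABLE omitted)" remark follows from the exact-equality test in the hunk. A small user-space sketch can show why. The MODEL_* constants below are my recollection of the task state bits in kernels of that era and should be treated as assumptions, as should the helper name skipped_by_patch().

```c
#include <stdbool.h>

/* Assumed state bits, modelled on kernels of that era; verify before relying on them. */
#define MODEL_TASK_RUNNING         0L
#define MODEL_TASK_INTERRUPTIBLE   1L
#define MODEL_TASK_UNINTERRUPTIBLE 2L
#define MODEL_TASK_WAKEKILL        128L
/* A killable sleep is an uninterruptible sleep that fatal signals can wake. */
#define MODEL_TASK_KILLABLE (MODEL_TASK_WAKEKILL | MODEL_TASK_UNINTERRUPTIBLE)

/*
 * Hypothetical model of the added check: only a plain uninterruptible
 * sleep compares equal, so a killable sleeper is still eligible.
 */
bool skipped_by_patch(long state)
{
	return state == MODEL_TASK_UNINTERRUPTIBLE;
}
```

Because the comparison uses == rather than a bit test, a task in the combined killable state does not match and remains a candidate for the oom killer, while a task in a plain uninterruptible sleep is passed over.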