Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls

2016-08-25 Thread Kyle Walker
On Wed, Aug 3, 2016 at 11:24 AM, Rafael Aquini <aqu...@redhat.com> wrote:
> IIRC one of the issues Linus had with previous attempts was that
> they were utilizing/bringing back a node-memory state based heuristic.
>
> Since Kyle's patch is using a global state counter for that matter,
> I think that issue might now be sorted out.

It's been a few weeks since the last feedback. Are there any further
questions or concerns I can help out with?

--
Kyle Walker


[PATCH v2] clocksource: Defer override invalidation unless clock is unstable

2016-08-06 Thread Kyle Walker
Clocksources don't get the VALID_FOR_HRES flag until they have been
checked by a watchdog. However, when using an override, the
clocksource_select logic will clear the override value if the
clocksource is not marked VALID_FOR_HRES during that initial check.
When using the boot argument clocksource=, this selection can
run before the watchdog, and can cause the override to be incorrectly
cleared.

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Martin Schwidefsky <schwidef...@de.ibm.com>
Cc: linux-kernel@vger.kernel.org
---

Notes:
Changes from v1:
* Altered changelog description, many thanks to John Stultz for the assist!

 kernel/time/clocksource.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..4c1bb2a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -600,9 +600,18 @@ static void __clocksource_select(bool skipcur)
 		 */
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
 			/* Override clocksource cannot be used. */
-			pr_warn("Override clocksource %s is not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
-				cs->name);
-			override_name[0] = 0;
+			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
+					cs->name);
+				override_name[0] = 0;
+			} else {
+				/*
+				 * The override cannot be currently verified.
+				 * Deferring to let the watchdog check.
+				 */
+				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
+					cs->name);
+			}
 		} else
 			/* Override clocksource can be used. */
 			best = cs;
-- 
2.5.5
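
The decision the patch above makes can be modelled outside the kernel as a small three-way choice. This is only a sketch for illustration: the enum, the function name check_override() and the CS_* flag constants are hypothetical stand-ins for the kernel's CLOCK_SOURCE_* flags and the logic in __clocksource_select(), not kernel code.

```c
#include <stdbool.h>

/* Illustrative stand-ins for CLOCK_SOURCE_VALID_FOR_HRES and
 * CLOCK_SOURCE_UNSTABLE; the values here are assumptions. */
#define CS_VALID_FOR_HRES 0x20u
#define CS_UNSTABLE       0x40u

/* Hypothetical result type: what to do with the requested override. */
enum override_action {
	OVERRIDE_USE,        /* switch to the override now */
	OVERRIDE_DEFER,      /* keep override_name, wait for the watchdog */
	OVERRIDE_INVALIDATE  /* clear override_name, the clock is unstable */
};

/* Mirrors the branch structure of the patched hunk. */
static enum override_action check_override(unsigned int flags, bool oneshot)
{
	if (!(flags & CS_VALID_FOR_HRES) && oneshot) {
		if (flags & CS_UNSTABLE)
			return OVERRIDE_INVALIDATE;
		return OVERRIDE_DEFER;
	}
	return OVERRIDE_USE;
}
```

With these definitions, a clocksource that has not yet been verified (no flags set) while the system is in oneshot mode yields OVERRIDE_DEFER. Before the patch, that same case cleared the override.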



Re: [PATCH resend] clocksource: Defer override invalidation unless clock is unstable

2016-08-05 Thread Kyle Walker
Good evening John,

On Wed, Jul 27, 2016 at 10:29 AM, Kyle Walker <kwal...@redhat.com> wrote:
> The issue I'm running into is that the override is not HRT compatible yet,
> though it will be later in the boot process unless the clocksource watchdog
> marks the clocksource as unstable.
>
> The issue with the current implementation is that the override_name value is
> disabled when the tsc is first checked, before the watchdog has a chance to
> check it and mark it stable or unstable.
>
> Without patch:
> $ dmesg | grep -e clocksource
> 
> clocksource: refined-jiffies: 
> Kernel command line:  clocksource=tsc
> clocksource: hpet: 
> clocksource: xen: 
> clocksource: jiffies: 
> clocksource: Switched to clocksource xen
> clocksource: acpi_pm: 
> tsc: Refined TSC clocksource calibration: 2394.399 MHz
>
> clocksource: tsc: 
> clocksource: Override clocksource tsc is not HRT compatible - 
>
>
> With patch:
> $ dmesg | grep -e clocksource
> 
> clocksource: refined-jiffies:
> Kernel command line:  clocksource=tsc
> clocksource: hpet: 
> clocksource: xen: 
>
> clocksource: jiffies: 
> clocksource: Switched to clocksource xen
> clocksource: acpi_pm: 
> tsc: Refined TSC clocksource calibration: 2394.461 MHz
> clocksource: tsc: 
> clocksource: Override clocksource tsc is not currently HRT compatible - deferring
> clocksource: Switched to clocksource tsc
>

Is there anything else needed from my end? Please let me know if there is
any further information or clarification I can provide.

Have a great evening!
-- 
Kyle Walker


Re: [PATCH resend] clocksource: Defer override invalidation unless clock is unstable

2016-07-27 Thread Kyle Walker
I'm so sorry for the duplicate; Gmail managed to sneak in some HTML.
Resending due to the mailing list correctly blocking the initial send.

On Tue, Jul 26, 2016 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
> Sorry for not getting back to you. This has been in my to-look-at list.


No problem at all. Thanks for taking a look!


> On Tue, Jul 26, 2016 at 2:24 PM, Kyle Walker <kwal...@redhat.com> wrote:
>> The clock_select() operation will attempt to use the clocksource override
>> to apply the desired clocksource when the "clocksource=" boot parameter is
>> supplied. However, in the event that "clocksource=tsc" is used on a system
>> where there is a more desirable clocksource available, the boot parameter
>> fails. This is due to the TSC clocksource being installed unvalidated, but
>> the override being invalidated during the initial run through
>> clocksource_done_booting().
>
> I've read this a few times, and I'm not sure I really understand it.
>
> Can you give an example of a "more desirable clocksource" than the
> TSC?  Especially when the TSC was specified as a boot argument?
>

I apologize for the confusion. By "more desirable", I mean that there is
another clocksource that the system would use by default. For
example, on Xen platforms, the "xen" clocksource is the "best" as
determined by clocksource_find_best(), being at the top of the list.

crash> list -s clocksource.name,rating clocksource.list -H clocksource_list
81c0cb00
  name = 0x817ad585 "xen"
  rating = 400
81a98540
  name = 0x817c2873 "tsc"
  rating = 300
81aa6580
  name = 0x817b118a "hpet"
  rating = 250
81b22b00
  name = 0x817c02b7 "acpi_pm"
  rating = 120
81ab48c0
  name = 0x817c05c6 "jiffies"
  rating = 1


In that scenario, the xen clocksource would be used if no override was
specified.
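
As a rough illustration of the selection described above, the following sketch picks the highest-rated entry unless an override name matches one. The struct, the function names and the array are hypothetical. The names and ratings are copied from the crash output above. The kernel's own list handling differs in detail.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct clocksource. */
struct clocksource_entry {
	const char *name;
	int rating;
};

/* Names and ratings as shown in the crash output above. */
static const struct clocksource_entry example_list[] = {
	{ "xen",     400 },
	{ "tsc",     300 },
	{ "hpet",    250 },
	{ "acpi_pm", 120 },
	{ "jiffies",   1 },
};

/* Return the entry with the highest rating, or NULL for an empty list. */
static const struct clocksource_entry *
find_best(const struct clocksource_entry *list, size_t n)
{
	const struct clocksource_entry *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || list[i].rating > best->rating)
			best = &list[i];
	return best;
}

/* Prefer the entry named by a non-empty override, else fall back to
 * the best-rated entry. */
static const struct clocksource_entry *
select_clocksource(const struct clocksource_entry *list, size_t n,
		   const char *override_name)
{
	if (override_name && override_name[0])
		for (size_t i = 0; i < n; i++)
			if (strcmp(list[i].name, override_name) == 0)
				return &list[i];
	return find_best(list, n);
}
```

With an empty override this returns "xen". With "tsc" as the override it returns the tsc entry. That is the outcome "clocksource=tsc" is expected to produce, provided the override name survives until the selection that can honour it.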


>
> The logic here is confusing as well. So.. if the override is not HRT
> compatible, we check if it's stable or not?  Once we're in HRT there's
> not much likelihood of us going into non HRT mode. I'm not sure what
> the stability has to do with it here.
>
> Sorry, could you explain the case you're running into in some further detail?

The issue I'm running into is that the override is not HRT compatible yet,
though it will be later in the boot process unless the clocksource watchdog
marks the clocksource as unstable.

The issue with the current implementation is that the override_name value is
disabled when the tsc is first checked, before the watchdog has a chance to
check it and mark it stable or unstable.

Without patch:
$ dmesg | grep -e clocksource

clocksource: refined-jiffies: 
Kernel command line:  clocksource=tsc
clocksource: hpet: 
clocksource: xen: 
clocksource: jiffies: 
clocksource: Switched to clocksource xen
clocksource: acpi_pm: 
tsc: Refined TSC clocksource calibration: 2394.399 MHz

clocksource: tsc: 
clocksource: Override clocksource tsc is not HRT compatible - 


With patch:
$ dmesg | grep -e clocksource

clocksource: refined-jiffies:
Kernel command line:  clocksource=tsc
clocksource: hpet: 
clocksource: xen: 

clocksource: jiffies: 
clocksource: Switched to clocksource xen
clocksource: acpi_pm: 
tsc: Refined TSC clocksource calibration: 2394.461 MHz
clocksource: tsc: 
clocksource: Override clocksource tsc is not currently HRT compatible - deferring
clocksource: Switched to clocksource tsc


Please let me know if there is any further clarification needed. Have a good
one!

--
Kyle Walker


[PATCH resend] clocksource: Defer override invalidation unless clock is unstable

2016-07-26 Thread Kyle Walker
The clock_select() operation will attempt to use the clocksource override
to apply the desired clocksource when the "clocksource=" boot parameter is
supplied. However, in the event that "clocksource=tsc" is used on a system
where there is a more desirable clocksource available, the boot parameter
fails. This is due to the TSC clocksource being installed unvalidated, but
the override being invalidated during the initial run through
clocksource_done_booting().

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
---

Notes:
Resend due to no feedback on the initial submit. Thank you in advance!

 kernel/time/clocksource.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..4c1bb2a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -600,9 +600,18 @@ static void __clocksource_select(bool skipcur)
 		 */
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
 			/* Override clocksource cannot be used. */
-			pr_warn("Override clocksource %s is not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
-				cs->name);
-			override_name[0] = 0;
+			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
+					cs->name);
+				override_name[0] = 0;
+			} else {
+				/*
+				 * The override cannot be currently verified.
+				 * Deferring to let the watchdog check.
+				 */
+				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
+					cs->name);
+			}
 		} else
 			/* Override clocksource can be used. */
 			best = cs;
-- 
2.5.5



Re: [PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls

2016-07-26 Thread Kyle Walker
On Mon, Jul 25, 2016 at 4:47 PM, Andrew Morton
<a...@linux-foundation.org> wrote:
>
> Can this suffering be quantified please?
>

The observed suffering is primarily visible within an IBM QRadar
installation. From a high level, the lower limit on the number of advisory
readahead pages results in a 3-5x increase in the time necessary to complete
an identical query within the application.

Note, all of the below values are with readahead configured to 64 KiB.

Baseline behaviour - Prior to:
 600e19af ("mm: use only per-device readahead limit")
 6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
   nodes and limit readahead pages")

Result:
 QRadar - "username equals root" query - 57.3s to complete search


New performance - With:
 600e19af ("mm: use only per-device readahead limit")
 6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
   nodes and limit readahead pages")

Result:
 QRadar - "username equals root" query - 245.7s to complete search


Proposed behaviour - With the proposed patch in place.

Result:
 QRadar - "username equals root" query - 57s to complete search


In narrowing the source of the performance deficit, it was observed that
the amount of data loaded into pagecache via madvise was quite a bit lower
following the noted commits. As simply reverting those lower limits was
not accepted previously, the proposed alternative strategy seemed like the
most beneficial path forward.
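
For reference, the arithmetic behind the figures above can be written out as a short sketch. The helper names are hypothetical, and the 4 KiB page size is an assumption, since the page size of the test system is not stated in this thread.

```c
/* Pages covered by a readahead window given in KiB.
 * The page size in KiB is passed in because it is an assumption here. */
static unsigned long readahead_pages(unsigned long window_kib,
				     unsigned long page_kib)
{
	return window_kib / page_kib;
}

/* Ratio of the regressed runtime to the baseline runtime. */
static double slowdown_factor(double baseline_s, double regressed_s)
{
	return regressed_s / baseline_s;
}
```

With a 64 KiB readahead setting and 4 KiB pages, readahead_pages(64, 4) gives 16 pages per advisory request under the per-device limit. slowdown_factor(57.3, 245.7) comes to roughly 4.3, which sits inside the 3-5x range quoted above.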

>
> Linus probably has opinions ;)
>

I understand that changes to readahead that are very similar have been
proposed quite a bit lately. If there are any changes or testing needed,
I'm more than happy to tackle that.


Thank you in advance!
-- 
Kyle Walker


[PATCH] mm: Move readahead limit outside of readahead, and advisory syscalls

2016-07-25 Thread Kyle Walker
Java workloads using the MappedByteBuffer class result in the fadvise()
and madvise() syscalls being used extensively. Following recent readahead
limiting alterations, such as 600e19af ("mm: use only per-device readahead
limit") and 6d2be915 ("mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages"), application performance
suffers in instances where small readahead is configured.

By moving this limit outside of the syscall codepaths, the syscalls are
able to advise an inordinately large amount of readahead when desired,
with a cap imposed based on half of NR_INACTIVE_FILE and NR_FREE_PAGES.
In essence, this allows performance tuning efforts to define a small
readahead limit while still benefiting from large sequential readahead
values selectively.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Michal Hocko <mho...@suse.com>
Cc: Geliang Tang <geliangt...@163.com>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: Roman Gushchin <kl...@yandex-team.ru>
Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
---
 mm/readahead.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 65ec288..6f8bb44 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -211,7 +211,9 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
+	nr_to_read = min(nr_to_read, (global_page_state(NR_INACTIVE_FILE) +
+				     (global_page_state(NR_FREE_PAGES)) / 2));
+
 	while (nr_to_read) {
 		int err;
 
@@ -484,6 +486,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 
 	/* be dumb */
 	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
 		force_page_cache_readahead(mapping, filp, offset, req_size);
 		return;
 	}
-- 
2.5.5
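
The cap in the first hunk can be tried out with plain numbers. The sketch below uses hypothetical helper names and takes page counts as ordinary arguments instead of reading kernel counters. It shows two readings of "half of NR_INACTIVE_FILE and NR_FREE_PAGES". By C operator precedence, the expression as written in the hunk halves only the free-page count. Whether half of the sum was intended is not stated in the patch.

```c
/* Smaller of two page counts. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Cap as the expression in the hunk evaluates: inactive + free / 2. */
static unsigned long cap_as_written(unsigned long inactive_file,
				    unsigned long free_pages)
{
	return inactive_file + free_pages / 2;
}

/* Alternative reading of the changelog: (inactive + free) / 2. */
static unsigned long cap_half_of_sum(unsigned long inactive_file,
				     unsigned long free_pages)
{
	return (inactive_file + free_pages) / 2;
}

/* Clamp a requested readahead size, in pages, to a cap. */
static unsigned long clamp_readahead(unsigned long nr_to_read,
				     unsigned long cap)
{
	return min_ul(nr_to_read, cap);
}
```

For made-up inputs of 1000 inactive file pages and 2000 free pages, the first form allows up to 2000 pages and the second up to 1500. A small request such as 16 pages passes through unchanged under either cap.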



[PATCH] clocksource: Defer override invalidation unless clock is unstable

2016-07-13 Thread Kyle Walker
The clock_select() operation will attempt to use the clocksource override
to apply the desired clocksource when the "clocksource=" boot parameter is
supplied. However, in the event that "clocksource=tsc" is used on a system
where there is a more desirable clocksource available, the boot parameter
fails. This is due to the TSC clocksource being installed unvalidated, but
the override being invalidated during the initial run through
clocksource_done_booting().

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Signed-off-by: Kyle Walker <kwal...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
---
 kernel/time/clocksource.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..4c1bb2a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -600,9 +600,18 @@ static void __clocksource_select(bool skipcur)
 		 */
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
 			/* Override clocksource cannot be used. */
-			pr_warn("Override clocksource %s is not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
-				cs->name);
-			override_name[0] = 0;
+			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
+					cs->name);
+				override_name[0] = 0;
+			} else {
+				/*
+				 * The override cannot be currently verified.
+				 * Deferring to let the watchdog check.
+				 */
+				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
+					cs->name);
+			}
 		} else
 			/* Override clocksource can be used. */
 			best = cs;
-- 
2.5.5



Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-23 Thread Kyle Walker
On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes <rient...@google.com> wrote:
>
> I struggle to understand how the approach of randomly continuing to kill
> more and more processes in the hope that it slows down usage of memory
> reserves or that we get lucky is better.

Thank you to one and all for the feedback.

I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
and omitting them from the oom selection process, continuing the
carnage is likely to result in more unpredictable results. At this
time, I believe Oleg's solution of zapping the process's memory
while it sleeps with the fatal signal en route is ideal.

Kyle Walker
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-17 Thread Kyle Walker
Currently, the oom killer will attempt to kill a process that is in
TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
period of time, such as processes writing to a frozen filesystem during
a lengthy backup operation, this can result in a deadlock condition as
related processes' memory accesses stall within the page fault
handler.

Within oom_unkillable_task(), check for processes in
TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted) and report them as
unkillable, so that the oom killer moves on to another task.

Signed-off-by: Kyle Walker 
---
 mm/oom_kill.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..66f03f8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (memcg && !task_in_mem_cgroup(p, memcg))
 		return true;
 
+	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
+	if (p->state == TASK_UNINTERRUPTIBLE)
+		return true;
+
 	/* p may not have freeable memory in nodemask */
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
-- 
2.4.3
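
The "TASK_KILLABLE omitted" remark depends on the check being an
equality test rather than a mask test. A minimal sketch of the
difference follows, assuming TASK_KILLABLE is the combination of
TASK_WAKEKILL and TASK_UNINTERRUPTIBLE. The DEMO_* constants use
illustrative bit values and the two helper functions are hypothetical;
neither is taken from the kernel headers.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel defines its own constants. */
#define DEMO_TASK_UNINTERRUPTIBLE 0x0002L
#define DEMO_TASK_WAKEKILL        0x0080L
#define DEMO_TASK_KILLABLE \
	(DEMO_TASK_WAKEKILL | DEMO_TASK_UNINTERRUPTIBLE)

/*
 * Same shape as the test added by the patch: only a state that is exactly
 * UNINTERRUPTIBLE matches, so a killable sleep (which also carries the
 * WAKEKILL bit) is not skipped.
 */
static bool demo_skipped_by_equality(long state)
{
	return state == DEMO_TASK_UNINTERRUPTIBLE;
}

/*
 * A mask test, shown for contrast: it would also match killable sleeps
 * and exclude them from oom selection.
 */
static bool demo_skipped_by_mask(long state)
{
	return (state & DEMO_TASK_UNINTERRUPTIBLE) != 0;
}
```

With the equality form, a task sleeping in the killable state stays
eligible for selection, which is what the comment in the patch
describes.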


