Re: [RFC 0/6] the big khugepaged redesign

2015-03-08 Thread Vlastimil Babka

On 02/23/2015 11:46 PM, Davidlohr Bueso wrote:
> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
>> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
>> THP allocation attempts on page faults are a good performance trade-off.
>>
>> - THP allocations add to page fault latency, as high-order allocations are
>>    notoriously expensive. Page allocation slowpath now does extra checks for
>>    GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>>    compaction for user page faults. But even async compaction can be expensive.
>> - During the first page fault in a 2MB range we cannot predict how much of the
>>    range will be actually accessed - we can theoretically waste as much as 511
>>    worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>>    from different NUMA nodes and while base pages could be all local, THP could
>>    be remote to all but one CPU. The cost of remote accesses due to this false
>>    sharing would be higher than any savings on the TLB.
>> - The interaction with memcg are also problematic [1].
>>
>> Now I don't have any hard data to show how big these problems are, and I
>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>> for performance reasons.
>
> There are plenty of examples of this, ie for Oracle:
>
> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
> http://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64.php


Just stumbled upon more references when catching up on lwn:

http://lwn.net/Articles/634797/



Re: [RFC 0/6] the big khugepaged redesign

2015-03-05 Thread Vlastimil Babka
On 03/06/2015 01:21 AM, Andres Freund wrote:
> Long mail ahead, sorry for that.

No problem, thanks a lot!

> TL;DR: THP is still noticeable, but not nearly as bad.
> 
> On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
>> That however means the workload is based on hugetlbfs and shouldn't trigger THP
>> page fault activity, which is the aim of this patchset. Some more googling made
>> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
>> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
>> patchset should help, but I obviously won't be able to measure this before LSF/MM...
> 
> Just as a reference, this is how some the more extreme profiles looked
> like in the past:
> 
>> 96.50%  postmaster  [kernel.kallsyms]  [k] _spin_lock_irq
>>   |
>>   --- _spin_lock_irq
>>  |
>>  |--99.87%-- compact_zone
>>  |  compact_zone_order
>>  |  try_to_compact_pages
>>  |  __alloc_pages_nodemask
>>  |  alloc_pages_vma
>>  |  do_huge_pmd_anonymous_page
>>  |  handle_mm_fault
>>  |  __do_page_fault
>>  |  do_page_fault
>>  |  page_fault
>>  |  0x631d98
>>   --0.13%-- [...]
> 
> That specific profile is from a rather old kernel as you probably
> recognize.

Yeah, sounds like synchronous compaction before it was forbidden for THP page
faults...

>> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
>> psql performance with THPs enabled/disabled on recent kernels, where e.g.
>> compaction is no longer synchronous for THP page faults?
> 
> So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
> RAM.
> 
> First of: It's noticeably harder to trigger problems than it used to
> be. But, I can still trigger various problems that are much worse with
> THP enabled than without.
> 
> There seem to be various different bottlenecks; I can get somewhat
> different profiles.
> 
> In a somewhat artificial workload, that tries to simulate what I've seen
> trigger the problem at a customer, I can quite easily trigger large
> differences between THP=enable and THP=never.  There's two types of
> tasks running, one purely OLTP, another doing somewhat more complex
> statements that require a fair amount of process local memory.
> 
> (ignore the absolute numbers for progress, I just waited for somewhat
> stable results while doing other stuff)
> 
> THP off:
> Task 1 solo:
> progress: 200.0 s, 391442.0 tps, 0.654 ms lat
> progress: 201.0 s, 394816.1 tps, 0.683 ms lat
> progress: 202.0 s, 409722.5 tps, 0.625 ms lat
> progress: 203.0 s, 384794.9 tps, 0.665 ms lat
> 
> combined:
> Task 1:
> progress: 144.0 s, 25430.4 tps, 10.067 ms lat
> progress: 145.0 s, 22260.3 tps, 11.500 ms lat
> progress: 146.0 s, 24089.9 tps, 10.627 ms lat
> progress: 147.0 s, 25888.8 tps, 9.888 ms lat
> 
> Task 2:
> progress: 24.4 s, 30.0 tps, 2134.043 ms lat
> progress: 26.5 s, 29.8 tps, 2150.487 ms lat
> progress: 28.4 s, 29.7 tps, 2151.557 ms lat
> progress: 30.4 s, 28.5 tps, 2245.304 ms lat
> 
> flat profile:
>  6.07%  postgres  postgres[.] heap_form_minimal_tuple
>  4.36%  postgres  postgres[.] heap_fill_tuple
>  4.22%  postgres  postgres[.] ExecStoreMinimalTuple
>  4.11%  postgres  postgres[.] AllocSetAlloc
>  3.97%  postgres  postgres[.] advance_aggregates
>  3.94%  postgres  postgres[.] advance_transition_function
>  3.94%  postgres  postgres[.] ExecMakeTableFunctionResult
>  3.33%  postgres  postgres[.] heap_compute_data_size
>  3.30%  postgres  postgres[.] MemoryContextReset
>  3.28%  postgres  postgres[.] ExecScan
>  3.04%  postgres  postgres[.] ExecProject
>  2.96%  postgres  postgres[.] generate_series_step_int4
>  2.94%  postgres  [kernel.kallsyms]   [k] clear_page_c
> 
> (i.e. most of it postgres, cache miss bound)
> 
> THP on:
> Task 1 solo:
> progress: 140.0 s, 390458.1 tps, 0.656 ms lat
> progress: 141.0 s, 391174.2 tps, 0.654 ms lat
> progress: 142.0 s, 394828.8 tps, 0.648 ms lat
> progress: 143.0 s, 398156.2 tps, 0.643 ms lat
> 
> Task 1:
> progress: 179.0 s, 23963.1 tps, 10.683 ms lat
> progress: 180.0 s, 22712.9 tps, 11.271 ms lat
> progress: 181.0 s, 21211.4 tps, 12.069 ms lat
> progress: 182.0 s, 23207.8 tps, 11.031 ms lat
> 
> Task 2:
> progress: 28.2 s, 19.1 tps, 3349.747 ms lat
> progress: 31.0 s, 19.8 tps, 3230.589 ms lat
> progress: 34.3 s, 21.5 tps, 2979.113 ms lat
> progress: 37.4 s, 20.9 tps, 3055.143 ms lat

So that's 1/3 worse tps for task 2? Not very nice...

> flat 

Re: [RFC 0/6] the big khugepaged redesign

2015-03-05 Thread Andres Freund
Long mail ahead, sorry for that.

TL;DR: THP is still noticeable, but not nearly as bad.

On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
> That however means the workload is based on hugetlbfs and shouldn't trigger THP
> page fault activity, which is the aim of this patchset. Some more googling made
> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
> patchset should help, but I obviously won't be able to measure this before LSF/MM...

Just as a reference, this is how some of the more extreme profiles looked
in the past:

> 96.50%  postmaster  [kernel.kallsyms]  [k] _spin_lock_irq
>   |
>   --- _spin_lock_irq
>  |
>  |--99.87%-- compact_zone
>  |  compact_zone_order
>  |  try_to_compact_pages
>  |  __alloc_pages_nodemask
>  |  alloc_pages_vma
>  |  do_huge_pmd_anonymous_page
>  |  handle_mm_fault
>  |  __do_page_fault
>  |  do_page_fault
>  |  page_fault
>  |  0x631d98
>   --0.13%-- [...]

That specific profile is from a rather old kernel as you probably
recognize.

> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
> psql performance with THPs enabled/disabled on recent kernels, where e.g.
> compaction is no longer synchronous for THP page faults?

So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
RAM.

First off: It's noticeably harder to trigger problems than it used to
be. But, I can still trigger various problems that are much worse with
THP enabled than without.

There seem to be various different bottlenecks; I can get somewhat
different profiles.

In a somewhat artificial workload that tries to simulate what I've seen
trigger the problem at a customer, I can quite easily trigger large
differences between THP=enable and THP=never.  There's two types of
tasks running, one purely OLTP, another doing somewhat more complex
statements that require a fair amount of process local memory.

(ignore the absolute numbers for progress, I just waited for somewhat
stable results while doing other stuff)

THP off:
Task 1 solo:
progress: 200.0 s, 391442.0 tps, 0.654 ms lat
progress: 201.0 s, 394816.1 tps, 0.683 ms lat
progress: 202.0 s, 409722.5 tps, 0.625 ms lat
progress: 203.0 s, 384794.9 tps, 0.665 ms lat

combined:
Task 1:
progress: 144.0 s, 25430.4 tps, 10.067 ms lat
progress: 145.0 s, 22260.3 tps, 11.500 ms lat
progress: 146.0 s, 24089.9 tps, 10.627 ms lat
progress: 147.0 s, 25888.8 tps, 9.888 ms lat

Task 2:
progress: 24.4 s, 30.0 tps, 2134.043 ms lat
progress: 26.5 s, 29.8 tps, 2150.487 ms lat
progress: 28.4 s, 29.7 tps, 2151.557 ms lat
progress: 30.4 s, 28.5 tps, 2245.304 ms lat

flat profile:
 6.07%  postgres  postgres[.] heap_form_minimal_tuple
 4.36%  postgres  postgres[.] heap_fill_tuple
 4.22%  postgres  postgres[.] ExecStoreMinimalTuple
 4.11%  postgres  postgres[.] AllocSetAlloc
 3.97%  postgres  postgres[.] advance_aggregates
 3.94%  postgres  postgres[.] advance_transition_function
 3.94%  postgres  postgres[.] ExecMakeTableFunctionResult
 3.33%  postgres  postgres[.] heap_compute_data_size
 3.30%  postgres  postgres[.] MemoryContextReset
 3.28%  postgres  postgres[.] ExecScan
 3.04%  postgres  postgres[.] ExecProject
 2.96%  postgres  postgres[.] generate_series_step_int4
 2.94%  postgres  [kernel.kallsyms]   [k] clear_page_c

(i.e. most of it postgres, cache miss bound)

THP on:
Task 1 solo:
progress: 140.0 s, 390458.1 tps, 0.656 ms lat
progress: 141.0 s, 391174.2 tps, 0.654 ms lat
progress: 142.0 s, 394828.8 tps, 0.648 ms lat
progress: 143.0 s, 398156.2 tps, 0.643 ms lat

Task 1:
progress: 179.0 s, 23963.1 tps, 10.683 ms lat
progress: 180.0 s, 22712.9 tps, 11.271 ms lat
progress: 181.0 s, 21211.4 tps, 12.069 ms lat
progress: 182.0 s, 23207.8 tps, 11.031 ms lat

Task 2:
progress: 28.2 s, 19.1 tps, 3349.747 ms lat
progress: 31.0 s, 19.8 tps, 3230.589 ms lat
progress: 34.3 s, 21.5 tps, 2979.113 ms lat
progress: 37.4 s, 20.9 tps, 3055.143 ms lat

flat profile:
21.36%  postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
 4.93%  postgres  postgres[.] ExecStoreMinimalTuple
 4.02%  postgres  postgres[.] heap_form_minimal_tuple
 3.55%  postgres  [kernel.kallsyms]   [k] clear_page_c
 2.85%  postgres  postgres[.] heap_fill_tuple
 2.60%  postgres  postgres[.] ExecMakeTableF

Re: [RFC 0/6] the big khugepaged redesign

2015-03-05 Thread Andres Freund
On 2015-03-05 18:01:08 +0100, Vlastimil Babka wrote:
> On 03/05/2015 05:52 PM, Andres Freund wrote:
> > What exactly counts as "recent" in this context? Most of the bigger
> > installations where we found THP to be absolutely prohibitive (slowdowns
> > on the order of a magnitude, huge latency spikes) unfortunately run
> > quite old kernels...  I guess 3.11 does *not* count :/? That'd be a
> 
> Yeah that's too old :/

Guessed so.

> I also noticed that you now support hugetlbfs. That could be also interesting
> data point, if the hugetlbfs usage helped because THP code wouldn't
> trigger.

Well, mmap(MAP_HUGETLB), but yea.

Will let you know once I know whether it's possible to get a newer kernel.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [RFC 0/6] the big khugepaged redesign

2015-03-05 Thread Vlastimil Babka
On 03/05/2015 05:52 PM, Andres Freund wrote:
> Hi,
> 
> On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
>> That however means the workload is based on hugetlbfs and shouldn't trigger THP
>> page fault activity, which is the aim of this patchset. Some more googling made
>> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
>> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
>> patchset should help, but I obviously won't be able to measure this before LSF/MM...
>> 
>> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
>> psql performance with THPs enabled/disabled on recent kernels, where e.g.
>> compaction is no longer synchronous for THP page faults?
> 
> What exactly counts as "recent" in this context? Most of the bigger
> installations where we found THP to be absolutely prohibitive (slowdowns
> on the order of a magnitude, huge latency spikes) unfortunately run
> quite old kernels...  I guess 3.11 does *not* count :/? That'd be a

Yeah, that's too old :/ 3.17 has patches to make compaction less aggressive on
THP page faults, and 3.18 prevents khugepaged from holding mmap_sem during
compaction, which could also be relevant.
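
For reference, the 3.18 change means the collapse path now does roughly the
following (a paraphrased sketch from memory, not the verbatim mm/huge_memory.c
code):

	/*
	 * Drop mmap_sem before the potentially long THP allocation, which
	 * may involve compaction, and only retake it (for write) for the
	 * actual collapse.
	 */
	up_read(&mm->mmap_sem);
	new_page = khugepaged_alloc_page(hpage, mm, vma, address, node);
	if (!new_page)
		return;			/* retry from the scan loop later */

	down_write(&mm->mmap_sem);
	/* mmap_sem was dropped: revalidate the vma and pmd before collapsing */

So tasks waiting for mmap_sem for write (and readers queued behind them) are no
longer held back for the whole duration of khugepaged's allocation attempt.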

> bigger machine where I could relatively quickly reenable THP to check
> whether it's still bad. I might be able to trigger it to be rebooted
> onto a newer kernel, will ask.

Thanks, that would be great if you could do that.
I also noticed that you now support hugetlbfs. That could also be an interesting
data point, if the hugetlbfs usage helped because the THP code wouldn't trigger.

Vlastimil

> Greetings,
> 
> Andres Freund
> 



Re: [RFC 0/6] the big khugepaged redesign

2015-03-05 Thread Andres Freund
Hi,

On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
> That however means the workload is based on hugetlbfs and shouldn't trigger THP
> page fault activity, which is the aim of this patchset. Some more googling made
> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
> patchset should help, but I obviously won't be able to measure this before LSF/MM...
> 
> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
> psql performance with THPs enabled/disabled on recent kernels, where e.g.
> compaction is no longer synchronous for THP page faults?

What exactly counts as "recent" in this context? Most of the bigger
installations where we found THP to be absolutely prohibitive (slowdowns
on the order of a magnitude, huge latency spikes) unfortunately run
quite old kernels...  I guess 3.11 does *not* count :/? That'd be a
bigger machine where I could relatively quickly reenable THP to check
whether it's still bad. I might be able to trigger it to be rebooted
onto a newer kernel, will ask.

Greetings,

Andres Freund


Re: [RFC 0/6] the big khugepaged redesign

2015-03-05 Thread Vlastimil Babka
On 02/24/2015 11:32 AM, Vlastimil Babka wrote:
> On 02/23/2015 11:56 PM, Andrew Morton wrote:
>> On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso  wrote:
>>
>>> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
>>>> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
>>>> THP allocation attempts on page faults are a good performance trade-off.
>>>>
>>>> - THP allocations add to page fault latency, as high-order allocations are
>>>>   notoriously expensive. Page allocation slowpath now does extra checks for
>>>>   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>>>>   compaction for user page faults. But even async compaction can be expensive.
>>>> - During the first page fault in a 2MB range we cannot predict how much of the
>>>>   range will be actually accessed - we can theoretically waste as much as 511
>>>>   worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>>>>   from different NUMA nodes and while base pages could be all local, THP could
>>>>   be remote to all but one CPU. The cost of remote accesses due to this false
>>>>   sharing would be higher than any savings on the TLB.
>>>> - The interaction with memcg are also problematic [1].
>>>>
>>>> Now I don't have any hard data to show how big these problems are, and I
>>>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>>>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>>>> for performance reasons.
>>>
>>> There are plenty of examples of this, ie for Oracle:
>>>
>>> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
>>
>> hm, five months ago and I don't recall seeing any followup to this.
> 
> Actually it's year + five months, but nevertheless...
> 
>> Does anyone know what's happening?

So I think that post was actually about THP support enabled in .config slowing
down hugetlbfs. I found a followup post at
https://blogs.oracle.com/linuxkernel/entry/performance_impact_of_transparent_huge
and that was after all solved in 3.12. Sasha also mentioned that the split PTL
patchset helped as well; the degradation in IOPS due to THP being enabled is now
limited to 5%, and possibly the refcounting redesign could help.

That however means the workload is based on hugetlbfs and shouldn't trigger THP
page fault activity, which is the aim of this patchset. Some more googling made
me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
patchset should help, but I obviously won't be able to measure this before 
LSF/MM...

I'm CCing the psql guys from last year LSF/MM - do you have any insight about
psql performance with THPs enabled/disabled on recent kernels, where e.g.
compaction is no longer synchronous for THP page faults?

Thanks,
Vlastimil


Re: [RFC 0/6] the big khugepaged redesign

2015-02-25 Thread Vlastimil Babka

On 02/24/2015 12:24 PM, Andrea Arcangeli wrote:

> Hi everyone,


Hi,


> On Tue, Feb 24, 2015 at 11:32:30AM +0100, Vlastimil Babka wrote:
>> I would suspect mmap_sem being held during whole THP page fault
>> (including the needed reclaim and compaction), which I forgot to mention
>> in the first e-mail - it's not just the problem page fault latency, but
>> also potentially holding back other processes, why we should allow
>> shifting from THP page faults to deferred collapsing.
>> Although the attempts for opportunistic page faults without mmap_sem
>> would also help in this particular case.
>>
>> Khugepaged also used to hold mmap_sem (for read) during the allocation
>> attempt, but that was fixed since then. It could be also zone lru_lock
>> pressure.


> I'm traveling and I didn't have much time to read the code yet but if
> I understood well the proposal, I've some doubt boosting khugepaged
> CPU utilization is going to provide a better universal trade off. I
> think the low overhead background scan is safer default.


Making the background scanning more efficient should be a win in any case.


> If we want to do more async background work and less "synchronous work
> at fault time", what may be more interesting is to generate
> transparent hugepages in the background and possibly not to invoke
> compaction (or much compaction) in the page faults.


Steps in that direction are in fact part of the patchset :)


> I'd rather move compaction to a background kernel thread, and to
> invoke compaction synchronously only in khugepaged. I like it more if
> nothing else because it is a kind of background load that can come to
> a full stop, once enough THP have been created.


Yes, we agree here.


> Unlike khugepaged that can never stop to scan and it better be lightweight
> kind of background load, as it'd be running all the time.


IMHO it doesn't hurt if the scanning can focus on mm's where it's more 
likely to succeed, and tune its activity according to how successful it 
is. Then you don't need to achieve the "lightweightness" by setting the 
existing tunables to very long sleeps and very short scans, which 
increases the delay until the good collapse candidates are actually 
found by khugepaged.
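
Just to illustrate what I mean by tuning the activity (a purely hypothetical
sketch, all names invented, nothing like this is in the patchset yet):

	/*
	 * Hypothetical: shorten the per-mm scan period while collapses keep
	 * succeeding, back off exponentially while they keep failing.
	 * thp_scan_period_ms and the MIN/MAX constants do not exist.
	 */
	static void thp_scan_update_period(struct mm_struct *mm,
					   unsigned int collapsed,
					   unsigned int scanned)
	{
		unsigned int period = mm->thp_scan_period_ms;

		if (scanned && collapsed * 4 >= scanned)	/* >= 25% success */
			period = max(period / 2, THP_SCAN_PERIOD_MIN);
		else
			period = min(period * 2, THP_SCAN_PERIOD_MAX);

		mm->thp_scan_period_ms = period;
	}

That way a mostly-idle or already fully collapsed mm converges to very rare
scans on its own, without having to set the global tunables to extremely long
sleeps.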



> Creating THP through khugepaged is much more expensive than creating
> them on page faults. khugepaged will need to halt the userland access
> on the range once more and it'll have to copy the 2MB.


Well, Mel also suggested another thing that I didn't mention yet - 
in-place collapsing, where the base pages would be allocated on page 
faults with such a layout as to allow later collapse without the copying. I 
think that Kiryl's refcounting changes could potentially allow this by 
allocating a hugepage, but mapping it using pte's so it could still be 
tracked which pages are actually accessed, and from which nodes. If 
after some time it looks like a good candidate, just switch it to pmd, 
otherwise break the hugepage and free the unused base pages.



> Overall I agree with Andi we need more data collected for various
> workloads before embarking into big changes, at least so we can proof
> the changes to be beneficial to those workloads.


OK. I mainly wanted to stir some discussion at this point.


> I would advise not to make changes for app that are already the
> biggest users ever of hugetlbfs (like Oracle). Those already are
> optimized by other means. THP target are apps that have several
> benefit in not ever using hugetlbfs, so apps that are more dynamic
> workloads that don't fit well with NUMA hard pinning with numactl or
> other static placements of memory and CPU.
>
> There are also other corner cases to optimize, that have nothing to do
> with khugepaged nor compaction: for example redis has issues in the
> way it forks() and then uses the child memory as a snapshot while the
> parent keeps running and writing to the memory. If THP is enabled, the
> parent that writes to the memory will allocate and copy 2MB objects
> instead of 4k objects. That means more memory utilization but
> especially the problem are those copy_user of 2MB instead of 4k hurting
> the parent runtime.
>
> For redis we need a more finegrined thing than MADV_NOHUGEPAGE. It
> needs a MADV_COW_NOHUGEPAGE (please think at a better name) that will
> only prevent THP creation during COW faults but still maximize THP
> utilization for every other case. Once such a madvise will become
> available, redis will run faster with THP enabled (currently redis
> recommends THP disabled because of the higher latencies in the 2MB COW
> faults while the child process is snapshotting). When the snapshot is
> finished and the child quits, khugepaged will recreate THP for those
> fragmented cows.


Hm, sounds like Kiryl's patchset could also help here? In the parent, split
only the pmd and do COW on 4k pages, while the child keeps the whole THP.
Later khugepaged can recreate the THP for the parent, as you say. That
should be a better default behavior than the current 2MB copies, not just
for redis? And no new madvise needed. Or maybe with MADV_HUGEPAGE you
can assume that the call

Re: [RFC 0/6] the big khugepaged redesign

2015-02-24 Thread Andrea Arcangeli
On Tue, Feb 24, 2015 at 12:24:12PM +0100, Andrea Arcangeli wrote:
> I would advise not to make changes for app that are already the
> biggest users ever of hugetlbfs (like Oracle). Those already are
> optimized by other means. THP target are apps that have several

Before somebody misunderstands, perhaps I should clarify
further: what I meant is that if the khugepaged boost helps Oracle or
other heavy users of hugetlbfs, but it _hurts_ everything else as I'd
guess, then I'd advise against it. That's because if an app can deal with
hugetlbfs, it's much simpler to optimize it by other means and it's not
the primary target of THP, so the priority for THP default behavior
should be biased towards those apps that can't easily fit into
hugetlbfs and NUMA hard-pinning static placement models.

Of course it'd be perfectly fine to make THP changes that help even
the biggest hugetlbfs users out there, as long as these changes don't
hurt all other normal use cases (where THP is always guaranteed to
provide a significant performance boost if enabled). Chances are the
benchmarks are also comparing "hugetlbfs+THP" vs "hugetlbfs" without
THP, and not "nothing" vs "THP".

Clearly I'd like to optimize for all apps including the biggest
hugetlbfs users, and this is why I'd like to optimize redis as well,
considering it's simple enough to do it with just one madvise to
change the behavior of COW faults and it'd be guaranteed not to hurt
any other common usage. If we were to instead change the default
behavior of COW faults, we'd first need to collect data for a variety
of apps, and personally I doubt such a change would be a good universal
tradeoff, while it's fine as an opt-in behavioral change through
madvise.


Re: [RFC 0/6] the big khugepaged redesign

2015-02-24 Thread Andrea Arcangeli
Hi everyone,

On Tue, Feb 24, 2015 at 11:32:30AM +0100, Vlastimil Babka wrote:
> I would suspect mmap_sem being held during whole THP page fault 
> (including the needed reclaim and compaction), which I forgot to mention 
> in the first e-mail - it's not just the problem page fault latency, but 
> also potentially holding back other processes, why we should allow 
> shifting from THP page faults to deferred collapsing.
> Although the attempts for opportunistic page faults without mmap_sem 
> would also help in this particular case.
> 
> Khugepaged also used to hold mmap_sem (for read) during the allocation 
> attempt, but that was fixed since then. It could be also zone lru_lock 
> pressure.

I'm traveling and I didn't have much time to read the code yet, but if
I understood the proposal well, I have some doubt that boosting khugepaged
CPU utilization is going to provide a better universal trade-off. I
think the low-overhead background scan is a safer default.

If we want to do more async background work and less "synchronous work
at fault time", what may be more interesting is to generate
transparent hugepages in the background and possibly not to invoke
compaction (or much compaction) in the page faults.

I'd rather move compaction to a background kernel thread, and invoke
compaction synchronously only in khugepaged. I like it, if nothing
else, because it is a kind of background load that can come to a full
stop once enough THPs have been created, unlike khugepaged, which can
never stop scanning and had better be a lightweight kind of background
load, as it'd be running all the time.

Creating THP through khugepaged is much more expensive than creating
them on page faults. khugepaged will need to halt the userland access
on the range once more and it'll have to copy the 2MB.

Overall I agree with Andi that we need more data collected for various
workloads before embarking on big changes, at least so we can prove
the changes to be beneficial to those workloads.

I would advise not to make changes for apps that are already the
biggest users ever of hugetlbfs (like Oracle). Those are already
optimized by other means. The targets of THP are apps that have several
benefits in not ever using hugetlbfs, i.e. apps with more dynamic
workloads that don't fit well with NUMA hard pinning with numactl or
other static placements of memory and CPU.

There are also other corner cases to optimize that have nothing to do
with khugepaged or compaction: for example, redis has issues in the
way it forks() and then uses the child memory as a snapshot while the
parent keeps running and writing to the memory. If THP is enabled, the
parent that writes to the memory will allocate and copy 2MB objects
instead of 4k objects. That means more memory utilization, but the
bigger problem is those copy_user calls of 2MB instead of 4k hurting
the parent's runtime.

For redis we need a more fine-grained thing than MADV_NOHUGEPAGE. It
needs a MADV_COW_NOHUGEPAGE (please think of a better name) that will
only prevent THP creation during COW faults but still maximize THP
utilization for every other case. Once such a madvise becomes
available, redis will run faster with THP enabled (currently redis
recommends THP disabled because of the higher latencies in the 2MB COW
faults while the child process is snapshotting). When the snapshot is
finished and the child quits, khugepaged will recreate THP for those
fragmented cows.
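
To make the intended usage concrete, the userland side would be something like
this (the flag obviously does not exist; its name and value here are
placeholders):

#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COW_NOHUGEPAGE
#define MADV_COW_NOHUGEPAGE	99	/* placeholder value, not a real flag */
#endif

/*
 * Hypothetical redis-style snapshot: keep THP for normal faults, but ask
 * the kernel to COW 4k pages instead of 2MB ones while a forked child
 * walks the memory.
 */
static void snapshot(void *heap, size_t len)
{
	madvise(heap, len, MADV_COW_NOHUGEPAGE);	/* hypothetical flag */

	if (fork() == 0) {
		/* child: the heap is a stable snapshot, dump it and exit */
		_exit(0);
	}
	/* parent keeps writing; COW now copies 4k at a time, and khugepaged
	 * re-collapses the fragmented ranges after the child exits */
}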

OTOH redis could also use userfaultfd to do the snapshotting, and
it could avoid fork in the first place, after I add a UFFDIO_WP ioctl to
mark and unmark the memory as wrprotected without altering the
vma, while catching the faults with read or POLLIN on the ufd to copy
the memory off before removing the wrprotection. The real problem in
fully implementing UFFDIO_WP will be the swapcache and swapouts: swap
entries have no wrprotection bit to know whether to fire wrprotected
userfaults on write faults, if the range is registered with
uffdio_register.mode & UFFDIO_REGISTER_MODE_WP. So far I have only
implemented in full the UFFDIO_REGISTER_MODE_MISSING tracking mode, so
I didn't need to attack the wrprotected swapentry thingy, but the new
userfaultfd API is already prepared to implement all write protection (or
any other faulting reason) as well, and it can be incrementally
extended to different memory types (tmpfs etc.) without backwards
compatibility issues.
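
For clarity, the registration side in userland would look about like this
(UFFDIO_WP does not exist yet and none of this is finalized, so treat it as a
sketch of the proposed API rather than working code):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Sketch: register a range for write-protect tracking. The snapshotting
 * thread would then wrprotect the range (the future UFFDIO_WP ioctl),
 * poll()/read() the uffd for write faults, copy the old page contents off,
 * and remove the wrprotection so the faulting thread can continue.
 */
static int track_writes(void *area, unsigned long len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	return uffd;
}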


Re: [RFC 0/6] the big khugepaged redesign

2015-02-24 Thread Vlastimil Babka

On 02/23/2015 11:56 PM, Andrew Morton wrote:
> On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso  wrote:
>> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
>>> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
>>> THP allocation attempts on page faults are a good performance trade-off.
>>>
>>> - THP allocations add to page fault latency, as high-order allocations are
>>>   notoriously expensive. Page allocation slowpath now does extra checks for
>>>   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>>>   compaction for user page faults. But even async compaction can be expensive.
>>> - During the first page fault in a 2MB range we cannot predict how much of the
>>>   range will be actually accessed - we can theoretically waste as much as 511
>>>   worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>>>   from different NUMA nodes and while base pages could be all local, THP could
>>>   be remote to all but one CPU. The cost of remote accesses due to this false
>>>   sharing would be higher than any savings on the TLB.
>>> - The interaction with memcg are also problematic [1].
>>>
>>> Now I don't have any hard data to show how big these problems are, and I
>>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>>> for performance reasons.
>>
>> There are plenty of examples of this, ie for Oracle:
>>
>> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
>
> hm, five months ago and I don't recall seeing any followup to this.

Actually it's a year + five months, but nevertheless...

> Does anyone know what's happening?

I would suspect mmap_sem being held during the whole THP page fault
(including the needed reclaim and compaction), which I forgot to mention
in the first e-mail - it's not just the page fault latency problem, but
also potentially holding back other processes, which is why we should allow
shifting from THP page faults to deferred collapsing.
Although the attempts for opportunistic page faults without mmap_sem
would also help in this particular case.

Khugepaged also used to hold mmap_sem (for read) during the allocation
attempt, but that was fixed since then. It could also be zone lru_lock
pressure.




Re: [RFC 0/6] the big khugepaged redesign

2015-02-23 Thread Sasha Levin
On 02/23/2015 05:56 PM, Andrew Morton wrote:
>>> Now I don't have any hard data to show how big these problems are, and I
>>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>>> for performance reasons.
>>
>> There are plenty of examples of this, ie for Oracle:
>>
>> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
> hm, five months ago and I don't recall seeing any followup to this.
> Does anyone know what's happening?

I'll dig it up.


Re: [RFC 0/6] the big khugepaged redesign

2015-02-23 Thread Andrew Morton
On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso  wrote:

> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
> > Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
> > THP allocation attempts on page faults are a good performance trade-off.
> > 
> > - THP allocations add to page fault latency, as high-order allocations are
> >   notoriously expensive. Page allocation slowpath now does extra checks for
> >   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
> >   compaction for user page faults. But even async compaction can be expensive.
> > - During the first page fault in a 2MB range we cannot predict how much of the
> >   range will be actually accessed - we can theoretically waste as much as 511
> >   worth of pages [2]. Or, the pages in the range might be accessed from CPUs
> >   from different NUMA nodes and while base pages could be all local, THP could
> >   be remote to all but one CPU. The cost of remote accesses due to this false
> >   sharing would be higher than any savings on the TLB.
> > - The interaction with memcg are also problematic [1].
> > 
> > Now I don't have any hard data to show how big these problems are, and I
> > expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
> > But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
> > for performance reasons.
> 
> There are plenty of examples of this, ie for Oracle:
> 
> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

hm, five months ago and I don't recall seeing any followup to this. 
Does anyone know what's happening?



Re: [RFC 0/6] the big khugepaged redesign

2015-02-23 Thread Davidlohr Bueso
On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
> THP allocation attempts on page faults are a good performance trade-off.
> 
> - THP allocations add to page fault latency, as high-order allocations are
>   notoriously expensive. Page allocation slowpath now does extra checks for
>   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>   compaction for user page faults. But even async compaction can be expensive.
> - During the first page fault in a 2MB range we cannot predict how much of the
>   range will be actually accessed - we can theoretically waste as much as 511
>   worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>   from different NUMA nodes and while base pages could be all local, THP could
>   be remote to all but one CPU. The cost of remote accesses due to this false
>   sharing would be higher than any savings on the TLB.
> - The interaction with memcg are also problematic [1].
> 
> Now I don't have any hard data to show how big these problems are, and I
> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
> for performance reasons.

There are plenty of examples of this, ie for Oracle:

https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
http://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64.php

Thanks,
Davidlohr



Re: [RFC 0/6] the big khugepaged redesign

2015-02-23 Thread Andi Kleen
Vlastimil Babka  writes:

> This has been already discussed as a good
> idea and a RFC has been posted by Alex Thorlton last October [5].

In my opinion it's a very bad idea. It heavily penalizes the single
threaded application case, which is quite important. And it
would likely lead to even larger latencies on the application
base, even for the multithreaded case, as there is no good way
anymore to hide blocking latencies in the process.

The current single threaded khugepaged has various issues, but this would
just make it much worse.

IMHO it's useless to do much here without a lot of data first
to identify the actual problems. Doing things without analysis first
seems totally backwards.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


[RFC 0/6] the big khugepaged redesign

2015-02-23 Thread Vlastimil Babka
Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
THP allocation attempts on page faults are a good performance trade-off.

- THP allocations add to page fault latency, as high-order allocations are
  notoriously expensive. The page allocation slowpath now does extra checks for
  GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
  compaction for user page faults (see the sketch below this list). But even
  async compaction can be expensive.
- During the first page fault in a 2MB range we cannot predict how much of the
  range will be actually accessed - we can theoretically waste as much as 511
  worth of pages [2]. Or, the pages in the range might be accessed from CPUs
  from different NUMA nodes and while base pages could be all local, THP could
  be remote to all but one CPU. The cost of remote accesses due to this false
  sharing would be higher than any savings on the TLB.
- The interactions with memcg are also problematic [1].
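
To illustrate the first point above, the check in question looks roughly like
this in the current allocation slowpath (paraphrased; see mm/page_alloc.c for
the real code):

	/*
	 * migration_mode starts as MIGRATE_ASYNC; only non-THP allocations,
	 * or THP allocations coming from khugepaged (PF_KTHREAD), may be
	 * upgraded to the more expensive sync-light compaction on retry.
	 */
	if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
	    (current->flags & PF_KTHREAD))
		migration_mode = MIGRATE_SYNC_LIGHT;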

Now I don't have any hard data to show how big these problems are, and I
expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
for performance reasons.

One might think that instead of fully disabling THP's it should be possible to
only disable (or make less aggressive, or limit to MADV_HUGEPAGE regions) THP's
for page faults and leave the collapsing up to khugepaged, which would hide the
latencies and allow better decisions based on how many base pages were faulted
in and from which nodes. However, looking more closely gives the impression
that khugepaged was meant rather as a rarely needed fallback for cases where
the THP page fault fails due to e.g. low memory. There are some tunables under
/sys/kernel/mm/transparent_hugepage/ but it doesn't seem sufficient for moving
the bulk of the THP work to khugepaged as it is.

- setting "defrag" to "madvise" or "never", while leaving khugepaged/defrag=1
  will result in very lightweight THP allocation attempts during page faults.
  This is nice and solves the latency problem, but not the other problems
  described above. It doesn't seem possible to disable page fault THP's
  completely without also disabling khugepaged.
- even if it was possible, the default settings for khugepaged are to scan up
  to 8 PMD's and collapse up to 1 THP per 10 seconds. That's probably too slow
  for some workloads and machines, but if one was to set this to be more
  aggressive, it would become quite inefficient. Khugepaged has a single global
  list of mm's to scan, which may include a lot of tasks where scanning won't
  yield anything. It should rather focus on tasks that are actually running (and thus
  could benefit) and where collapses were recently successful. The activity
  should also be ideally accounted to the task that benefits from it.
- scanning on NUMA systems will proceed even when actual THP allocations fail,
  e.g. during memory pressure. In such a case it should be better to save the
  scanning time until memory is available.
- there were some limitations on which PMD's khugepaged can collapse. Thanks to
  Ebru's recent patches, this should be fixed soon. With the
  khugepaged/max_ptes_none tunable, one can limit the potential memory wasted.
  But limiting NUMA false sharing is performed just in zone reclaim mode and
  could be made stricter.

This RFC patchset doesn't try to solve everything mentioned above - the full
motivation is included for the bigger picture discussion, including LSF/MM.
The main part of this patchset is the move of collapse scanning from
khugepaged to the task_work context. This has already been discussed as a good
idea, and an RFC was posted by Alex Thorlton last October [5]. In that
prototype, scanning has been driven from __khugepaged_enter(), which is called
on events such as vma_merge, fork, and THP page fault attempts, i.e. events
that are not exactly periodic. The main difference in my patchset is that it is
modeled after the automatic NUMA balancing scanning, i.e. using the scheduler's
task_tick_fair(). The second difference is that khugepaged is not disabled
entirely, but repurposed for the costly hugepage allocations. There is a
nodemask for indicating which nodes should have hugepages easily available.
The idea is that hugepage allocation attempts from the process context
(either page fault or the task_work collapsing) would not attempt
reclaim/compaction, and on failure will clear the nodemask bit and wake up
khugepaged to do the hard work and flip the bit back on. If it appears that
there are no hugepages available, attempts to page fault THP or scan are
suspended.
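
In pseudo-code, the intended split of work between process context and
khugepaged is roughly the following (a simplified sketch of the idea; names
and details differ from the actual patches):

	static nodemask_t thp_avail_nodes = NODE_MASK_ALL;

	/* THP allocation attempt from process context (page fault or the
	 * task_work collapse scan): cheap attempt only. */
	static struct page *thp_alloc_from_process(int nid, gfp_t gfp)
	{
		struct page *page;

		if (!node_isset(nid, thp_avail_nodes))
			return NULL;	/* suspended until khugepaged succeeds */

		/* strip __GFP_WAIT: no reclaim/compaction from process context */
		page = alloc_pages_node(nid, gfp & ~__GFP_WAIT, HPAGE_PMD_ORDER);
		if (!page) {
			/* let khugepaged do the hard work and re-set the bit */
			node_clear(nid, thp_avail_nodes);
			wake_up_interruptible(&khugepaged_wait);
		}
		return page;
	}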

I have done only light testing so far to see that it works as intended, but
not to prove it's "better" than current state. I wanted to post the RFC before
LSF/MM.

There are known limitations and TODO/FIXME's in the code, for example:
- the scanning period doesn't yet adjust itself based on recent collapse
  success/failure. The idea is