Re: [PATCH 11/11] sysctl: treewide: constify the ctl_table argument of handlers

2024-03-15 Thread Dave Chinner
>   int write,
>  void *buffer, size_t *lenp, loff_t *ppos)
>  {
>   int ret;

And this.

> @@ -474,8 +475,10 @@ int perf_event_max_sample_rate_handler(struct ctl_table 
> *table, int write,
>  
>  int sysctl_perf_cpu_time_max_percent __read_mostly = 
> DEFAULT_CPU_TIME_MAX_PERCENT;
>  
> -int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
> - void *buffer, size_t *lenp, loff_t *ppos)
> +int perf_cpu_time_max_percent_handler(const struct ctl_table *table,
> +   int write,
> +   void *buffer, size_t *lenp,
> +   loff_t *ppos)
>  {
>   int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>  

And this.

> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b2fc2727d654..003f0f5cb111 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -239,9 +239,10 @@ static long hung_timeout_jiffies(unsigned long 
> last_checked,
>  /*
>   * Process updating of timeout sysctl
>   */
> -static int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
> -   void *buffer,
> -   size_t *lenp, loff_t *ppos)
> +static int proc_dohung_task_timeout_secs(const struct ctl_table *table,
> +  int write,
> +  void *buffer,
> +  size_t *lenp, loff_t *ppos)
>  {
>   int ret;
>  

And this.

> diff --git a/kernel/latencytop.c b/kernel/latencytop.c
> index 781249098cb6..0a5c22b19821 100644
> --- a/kernel/latencytop.c
> +++ b/kernel/latencytop.c
> @@ -65,8 +65,9 @@ static struct latency_record latency_record[MAXLR];
>  int latencytop_enabled;
>  
>  #ifdef CONFIG_SYSCTL
> -static int sysctl_latencytop(struct ctl_table *table, int write, void 
> *buffer,
> -     size_t *lenp, loff_t *ppos)
> +static int sysctl_latencytop(const struct ctl_table *table, int write,
> +  void *buffer,
> +  size_t *lenp, loff_t *ppos)
>  {
>   int err;
>  

And this.

I could go on, but there are so many examples of this in the patch
that I think it needs to be tossed away and regenerated in a
way that doesn't trash the existing function parameter formatting.
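
To be clear about what I'm asking for: the conversion itself is fine,
it's the churn that isn't. A minimal sketch of the sort of hunk I'd
expect (illustrative only, not taken from the patch) just adds the
qualifier and leaves the existing argument layout alone:

-int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
-		void *buffer, size_t *lenp, loff_t *ppos)
+int perf_cpu_time_max_percent_handler(const struct ctl_table *table, int write,
+		void *buffer, size_t *lenp, loff_t *ppos)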

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [powerpc] kernel BUG fs/xfs/xfs_message.c:102! [4k block]

2023-10-12 Thread Dave Chinner
>
> xfs/238 test was executed when the crash was encountered.
> Devices were formatted with 4k block size.

Yeah, I've seen this once before. I think I know what the problem is
from analysis of that failure, but I've been unable to reproduce it
again, so I've not been able to confirm the diagnosis or test a fix.

tl;dr: we just unlinked an inode whose cluster buffer has been
invalidated by xfs_icluster_free(). We go to log the inode, but this
is the first time we've logged the inode since it was last cleaned,
so it goes to read the cluster buffer to attach it. It finds the
cluster buffer already marked stale in the transaction, so the DONE
flag is not set and the ASSERT fires.

i.e. it appears to me that this requires inode cluster buffer
writeback between the unlink() operation and the inodegc
inactivation process to set the initial conditions for the problem
to trigger, and then have just a single inode in the inobt chunk
that triggers freeing of the chunk whilst the inode itself is clean. 

I need to confirm that this is the case before trying to fix it,
because this inode log item vs stale inode cluster buffer path is
tricky and nasty and there might be something else going on.
However, I haven't been able to reproduce this to be able to confirm
this hypothesis yet.

I suspect the fix may well be to use xfs_trans_buf_get() in the
xfs_inode_item_precommit() path if XFS_ISTALE is already set on the
inode we are trying to log. We don't need a populated cluster buffer
to read data out of or write data into in this path - all we need to
do is attach the inode to the buffer so that when the buffer
invalidation is committed to the journal it will also correctly
finish the stale inode log item.
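
To make that a bit more concrete, here's a very rough, untested sketch
of the shape of change I have in mind in xfs_inode_item_precommit()
(the helper names and signatures here are written from memory, so
treat them as assumptions rather than a patch):

	/*
	 * Sketch only: if the inode is already stale, the cluster buffer
	 * has been invalidated in this transaction, so don't read it back
	 * in - just grab the in-memory buffer and attach the inode to it
	 * so the stale inode log item is completed correctly when the
	 * buffer invalidation commits.
	 */
	if (xfs_iflags_test(ip, XFS_ISTALE))
		error = xfs_trans_get_buf(tp, ip->i_mount->m_ddev_targp,
				ip->i_imap.im_blkno, ip->i_imap.im_len,
				0, &bp);
	else
		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
	if (error)
		return error;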

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: BUG xfs_buf while running tests/xfs/435 (next-20220715)

2022-07-18 Thread Dave Chinner
116409] REGS: c0002985be80 TRAP: 0c00   Tainted: GB   E
>(5.19.0-rc6-next-20220715)
> [  111.116414] MSR:  8280f033   
> CR: 24008282  XER: 
> [  111.116430] IRQMASK: 0 
> [  111.116430] GPR00: 0081 7e17dff0 7fff8c227300 
> 01003f2f0c18 
> [  111.116430] GPR04: 0800 000a 1999 
>  
> [  111.116430] GPR08: 7fff8c1b7830   
>  
> [  111.116430] GPR12:  7fff8c72ca50 00013adba650 
> 00013adba648 
> [  111.116430] GPR16:  0001  
> 00013adba428 
> [  111.116430] GPR20: 00013ade0068  7e17f948 
> 01003f2f02a0 
> [  111.116430] GPR24:  7e17f948 01003f2f0c18 
>  
> [  111.116430] GPR28:  01003f2f0bb0 01003f2f0c18 
> 01003f2f0bb0 
> [  111.116488] NIP [7fff8c158b88] 0x7fff8c158b88
> [  111.116492] LR [00013adb0398] 0x13adb0398
> [  111.116496] --- interrupt: c00
> [  111.116504] Object 0x2b93c535 @offset=5376
> [  111.116508] Object 0x9be4058b @offset=16896
> [  111.116511] Object 0xc1d5c895 @offset=24960
> [  111.116515] Object 0x97fb6f84 @offset=30336
> [  111.116518] Object 0x213fb535 @offset=43008
> [  111.116521] Object 0x45473fa3 @offset=43392
> [  111.116525] Object 0x6462ef89 @offset=44160
> [  111.116528] Object 0x0c85ce0b @offset=44544
> [  111.116531] Object 0x59166af4 @offset=45312
> [  111.116535] Object 0xe7b40b45 @offset=46848
> [  111.116538] Object 0xbc6ce716 @offset=54528
> [  111.116541] Object 0x5f7be1fa @offset=64512
> [  111.116546] [ cut here ]

Yup, Darrick reported this once and couldn't reproduce it. We know
it's a result of converting the xfs_buffer cache to rcu-protected
lockless lookups, and for some reason the rcu callbacks that free
these objects seem not to have been processed before the module is
removed. We have an rcu_barrier() in xfs_destroy_caches() to avoid
this...

Wait. What is xfs_buf_terminate()? I don't recall that function...

Yeah, there's the bug.

exit_xfs_fs(void)
{

xfs_buf_terminate();
xfs_mru_cache_uninit();
xfs_destroy_workqueues();
xfs_destroy_caches();


xfs_buf_terminate() calls kmem_cache_destroy() before the
rcu_barrier() call in xfs_destroy_caches().
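
The general rule (a minimal sketch of the teardown ordering, not the
patch itself): any slab whose objects are freed via call_rcu() or
kfree_rcu() needs an rcu_barrier() before kmem_cache_destroy(), so
that all queued callbacks have run before the slab they free into
goes away:

	/*
	 * Wait for all outstanding RCU callbacks - which may still free
	 * xfs_bufs back into the cache - to run before destroying it.
	 */
	rcu_barrier();
	kmem_cache_destroy(xfs_buf_cache);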

Try the (slightly smoke tested only) patch below.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: xfs_buf cache destroy isn't RCU safe

From: Dave Chinner 

Darrick and Sachin Sant reported that xfs/435 and xfs/436 would
report a non-empty xfs_buf slab on module remove. This isn't easy
to reproduce, but is clearly a side effect of converting the buffer
cache to RCU freeing and lockless lookups. Sachin bisected and
Darrick hit it when testing the patchset directly.

Turns out that the xfs_buf slab is not destroyed when all the other
XFS slab caches are destroyed. Instead, it's got its own little
wrapper function that gets called separately, and so it doesn't have
an rcu_barrier() call in it that is needed to drain all the rcu
callbacks before the slab is destroyed.

Fix it by removing the xfs_buf_init/terminate wrappers that just
allocate and destroy the xfs_buf slab, and move them to the same
place that all the other slab caches are set up and destroyed.

Reported-by: Sachin Sant 
Fixes: 298f34224506 ("xfs: lockless buffer lookup")
Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_buf.c   | 25 +
 fs/xfs/xfs_buf.h   |  6 ++
 fs/xfs/xfs_super.c | 22 +-
 3 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4affba7c6669..f8bdc4698492 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -21,7 +21,7 @@
 #include "xfs_error.h"
 #include "xfs_ag.h"
 
-static struct kmem_cache *xfs_buf_cache;
+struct kmem_cache *xfs_buf_cache;
 
 /*
  * Locking orders
@@ -2300,29 +2300,6 @@ xfs_buf_delwri_pushbuf(
return error;
 }
 
-int __init
-xfs_buf_init(void)
-{
-   xfs_buf_cache = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
-SLAB_HWCACHE_ALIGN |
-SLAB_RECLAIM_ACCOUNT |
-SLAB_MEM_SPREAD,
-NULL);
-   if (!xfs_buf_cache)
-   goto out;
-
-   return 0;
-
- out:
-   return -ENOMEM;
-}
-
-void
-xfs_buf_terminate(void)
-{
-   kmem_cache_destroy(xfs_buf_cache);
-}
-
 void xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref)
 {
/*
diff --git a/fs/xfs/xfs_

Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Dave Chinner
On Sun, Dec 17, 2017 at 04:28:44PM -0800, Joe Perches wrote:
> Some functions definitions have either the initial open brace and/or
> the closing brace outside of column 1.
> 
> Move those braces to column 1.
> 
> This allows various function analyzers like gnu complexity to work
> properly for these modified functions.
> 
> Miscellanea:
> 
> o Remove extra trailing ; and blank line from xfs_agf_verify
> 
> Signed-off-by: Joe Perches <j...@perches.com>
> ---
....

XFS bits look fine.

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993

2017-09-18 Thread Dave Chinner
On Mon, Sep 18, 2017 at 05:00:58PM -0500, Eric Sandeen wrote:
> On 9/18/17 4:31 PM, Dave Chinner wrote:
> > On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
> >> On 09/18/2017 09:27 AM, Christoph Hellwig wrote:
> >>> On Mon, Sep 18, 2017 at 08:26:05PM +0530, Abdul Haleem wrote:
> >>>> Hi,
> >>>>
> >>>> A warning is triggered from:
> >>>>
> >>>> file fs/iomap.c in function iomap_dio_rw
> >>>>
> >>>> if (ret)
> >>>> goto out_free_dio;
> >>>>
> >>>> ret = invalidate_inode_pages2_range(mapping,
> >>>> start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >>>>>>  WARN_ON_ONCE(ret);
> >>>> ret = 0;
> >>>>
> >>>> inode_dio_begin(inode);
> >>>
> >>> This is expected and an indication of a problematic workload - which
> >>> may be triggered by a fuzzer.
> >>
> >> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
> >> the time running xfstests as well.
> > 
> > Because when a user reports a data corruption, the only evidence we
> > have that they are running an app that does something stupid is this
> > warning in their syslogs.  Tracepoints are not useful for replacing
> > warnings about data corruption vectors being triggered.
> 
> Is the full WARN_ON spew really helpful to us, though?  Certainly
> the user has no idea what it means, and will come away terrified
> but none the wiser.
> 
> Would a more informative printk_once() still give us the evidence
> without the ZOMG I THINK I OOPSED that a WARN_ON produces?  Or do we 
> want/need the backtrace?

backtrace is actually useful - that's how I recently learnt that
splice now supports direct IO.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993

2017-09-18 Thread Dave Chinner
On Mon, Sep 18, 2017 at 09:51:29AM -0600, Jens Axboe wrote:
> On 09/18/2017 09:43 AM, Al Viro wrote:
> > On Mon, Sep 18, 2017 at 05:39:47PM +0200, Christoph Hellwig wrote:
> >> On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
> >>> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
> >>> the time running xfstests as well.
> >>
> >> Dave insisted on it to decourage users/applications from mixing
> >> mmap and direct I/O.
> >>
> >> In many ways a tracepoint might be the better way to diagnose these.
> > 
> > sysctl suppressing those two, perhaps?
> 
> I'd rather just make it a trace point, but don't care too much.
> 
> The code doesn't even have a comment as to why that WARN_ON() is
> there or expected.

The big comment about how bad cache invalidation failures are is on
the second, post-io invocation of the page cache flush. That's the
failure that exposes the data coherency problem to userspace:

/*
 * Try again to invalidate clean pages which might have been cached by
 * non-direct readahead, or faulted in by get_user_pages() if the source
 * of the write was an mmap'ed region of the file we're writing.  Either
 * one is a pretty crazy thing to do, so we don't support it 100%.  If
 * this invalidation fails, tough, the write still worked...
 */
if (iov_iter_rw(iter) == WRITE) {
int err = invalidate_inode_pages2_range(mapping,
start >> PAGE_SHIFT, end >> PAGE_SHIFT);
WARN_ON_ONCE(err);
}

IOWs, the first warning is a "bad things might be about to
happen" warning, the second is "bad things have happened".

> Seems pretty sloppy to me, not a great way
> to "discourage" users to mix mmap/dio.

Again, it has nothing to do with "discouraging users" and everything
about post-bug report problem triage.

Yes, the first invalidation should also have a comment like the post
IO invalidation - the comment probably got dropped and not noticed
when the changeover from internal XFS code to generic iomap code was
made...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993

2017-09-18 Thread Dave Chinner
On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
> On 09/18/2017 09:27 AM, Christoph Hellwig wrote:
> > On Mon, Sep 18, 2017 at 08:26:05PM +0530, Abdul Haleem wrote:
> >> Hi,
> >>
> >> A warning is triggered from:
> >>
> >> file fs/iomap.c in function iomap_dio_rw
> >>
> >> if (ret)
> >> goto out_free_dio;
> >>
> >> ret = invalidate_inode_pages2_range(mapping,
> >> start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >>>>  WARN_ON_ONCE(ret);
> >> ret = 0;
> >>
> >> inode_dio_begin(inode);
> > 
> > This is expected and an indication of a problematic workload - which
> > may be triggered by a fuzzer.
> 
> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
> the time running xfstests as well.

Because when a user reports a data corruption, the only evidence we
have that they are running an app that does something stupid is this
warning in their syslogs.  Tracepoints are not useful for replacing
warnings about data corruption vectors being triggered.

It needs to be on by default, but I'm sure we can wrap it with
something like an xfs_alert_tag() type of construct so the tag can
be set in /proc/fs/xfs/panic_mask to suppress it if testers so
desire.
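
Something like this (purely illustrative - the tag name is
hypothetical, and the warning actually lives in generic iomap code
rather than XFS, so the plumbing would need more thought):

	/* hypothetical tag bit, settable via the panic_mask sysctl */
	#define XFS_PTAG_DIO_INVALIDATE	(1 << 8)

	ret = invalidate_inode_pages2_range(mapping,
			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
	if (!(xfs_panic_mask & XFS_PTAG_DIO_INVALIDATE))
		WARN_ON_ONCE(ret);
	ret = 0;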

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: Linux 4.8: Reported regressions as of Sunday, 2016-09-18

2016-09-18 Thread Dave Chinner
On Sun, Sep 18, 2016 at 03:20:53PM +0200, Thorsten Leemhuis wrote:
> Hi! Here is my fourth regression report for Linux 4.8. It lists 14
> regressions I'm aware of. 5 of them are new; 1 mentioned in last 
> weeks report got fixed.
> 
> As always: Are you aware of any other regressions? Then please let me
> know (simply CC regressi...@leemhuis.info). And pls tell me if there
> is anything in the report that shouldn't be there.
> 
> Ciao, Thorsten
> 
> == Current regressions ==
> 
> Desc: genirq: Flags mismatch irq 8, 0088 (mmc0) vs. 0080 (rtc0). 
> mmc0: Failed to request irq 8: -16
> Repo: 2016-08-01 https://bugzilla.kernel.org/show_bug.cgi?id=150881
> Stat: 2016-09-09 https://bugzilla.kernel.org/show_bug.cgi?id=150881#c34
> Note: stalled; root cause somewhere in the main gpio merge for 4.8, but 
> problematic commit still unknown
> 
> Desc: [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
> Repo: 2016-08-09 http://www.spinics.net/lists/kernel/msg2317052.html
> Stat: 2016-09-09 https://marc.info/?t=14734151953=1=2
> Note: looks like post-4.8 material at this point: Mel working on it in his 
> spare time, but "The progression of this series has been unsatisfactory."

Actually, what Mel was working on (mapping lock contention) was not
related to the reported XFS regression. The regression was an XFS
sub-page write issue introduced by the new iomap infrastructure,
and nobody has been able to reproduce it exactly
outside of the reaim benchmark. We've reproduced other, similar
issues, and the fixes for those are queued for the 4.9 window.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-24 Thread Dave Chinner
On Mon, Mar 23, 2015 at 12:24:00PM +, Mel Gorman wrote:
 These are three follow-on patches based on the xfsrepair workload Dave
 Chinner reported was problematic in 4.0-rc1 due to changes in page table
 management -- https://lkml.org/lkml/2015/3/1/226.
 
 Much of the problem was reduced by commit 53da3bc2ba9e (mm: fix up numa
 read-only thread grouping logic) and commit ba68bc0115eb (mm: thp:
 Return the correct value for change_huge_pmd). It was known that the 
 performance
 in 3.19 was still better even if is far less safe. This series aims to
 restore the performance without compromising on safety.
 
 Dave, you already tested patch 1 on its own but it would be nice to test
 patches 1+2 and 1+2+3 separately just to be certain.

                     3.19     4.0-rc4  +p1      +p2      +p3
mm_migrate_pages     266,750  572,839  558,632  223,706  201,429
run time             4m54s    7m50s    7m20s    5m07s    4m31s

numa stats from p1+p2:

numa_hit 8436537
numa_miss 0
numa_foreign 0
numa_interleave 30765
numa_local 8409240
numa_other 27297
numa_pte_updates 46109698
numa_huge_pte_updates 0
numa_hint_faults 44756389
numa_hint_faults_local 11841095
numa_pages_migrated 4868674
pgmigrate_success 4868674
pgmigrate_fail 0


numa stats from p1+p2+p3:

numa_hit 6991596
numa_miss 0
numa_foreign 0
numa_interleave 10336
numa_local 6983144
numa_other 8452
numa_pte_updates 24460492
numa_huge_pte_updates 0
numa_hint_faults 23677262
numa_hint_faults_local 5952273
numa_pages_migrated 3557928
pgmigrate_success 3557928
pgmigrate_fail 0

OK, the summary with all patches applied:

config                         3.19    4.0-rc1   4.0-rc4   4.0-rc5+
defaults                       8m08s   9m34s     9m14s     6m57s
-o ag_stride=-1                4m04s   4m38s     4m11s     4m06s
-o bhash=101073                6m04s   17m43s    7m35s     6m13s
-o ag_stride=-1,bhash=101073   4m54s   9m58s     7m50s     4m31s

So it looks like the patch set fixes the remaining regression and in
two of the four cases actually improves performance.

Thanks, Linus and Mel, for tracking this tricky problem down! 

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
 On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  So I think there's something I'm missing. For non-shared mappings, I
  still have the idea that pte_dirty should be the same as pte_write.
  And yet, your testing of 3.19 shows that it's a big difference.
  There's clearly something I'm completely missing.
 
 Ahh. The normal page table scanning and page fault handling both clear
 and set the dirty bit together with the writable one. But fork()
 will clear the writable bit without clearing dirty. For some reason I
 thought it moved the dirty bit into the struct page like the VM
 scanning does, but that was just me having a brainfart. So yeah,
 pte_dirty doesn't have to match pte_write even under perfectly normal
 circumstances. Maybe there are other cases.
 
 Not that I see a lot of forking in the xfs repair case either, so..
 
 Dave, mind re-running the plain 3.19 numbers to really verify that the
 pte_dirty/pte_write change really made that big of a difference. Maybe
 your recollection of ~55,000 migrate_pages events was faulty. If the
 pte_write -> pte_dirty change is the *only* difference, it's still very
 odd how that one difference would make migrate_rate go from ~55k to
 471k. That's an order of magnitude difference, for what really
 shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email.
That said, the original measurement might have been faulty. I ran
the numbers again on the 3.19 kernel I saved away from the original
testing. That came up at 235k, which is pretty much the same as
yesterday's test. The runtime, however, is unchanged from my original
measurements of 4m54s (pte_hack came in at 5m20s).

Wondering where the 55k number came from, I played around with when
I started the measurement - all the numbers since I did the bisect
have come from starting it at roughly 130AGs into phase 3 where the
memory footprint stabilises and the tlb flush overhead kicks in.

However, if I start the measurement at the same time as the repair
test, I get something much closer to the 55k number. I also note
that my original 4.0-rc1 numbers were much lower than the more
recent steady state measurements (360k vs 470k), so I'd say the
original numbers weren't representative of the steady state
behaviour and so can be ignored...

 Maybe a system update has changed libraries and memory allocation
 patterns, and there is something bigger than that one-liner
 pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but
otherwise the system libraries are unchanged.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
 
  My recollection wasn't faulty - I pulled it from an earlier email.
  That said, the original measurement might have been faulty. I ran
  the numbers again on the 3.19 kernel I saved away from the original
  testing. That came up at 235k, which is pretty much the same as
  yesterday's test. The runtime,however, is unchanged from my original
  measurements of 4m54s (pte_hack came in at 5m20s).
 
 Ok. Good. So the more than an order of magnitude difference was
 really about measurement differences, not quite as real. Looks like
 more a factor of two than a factor of 20.
 
 Did you do the profiles the same way? Because that would explain the
 differences in the TLB flush percentages too (the 1.4% from
 tlb_invalidate_range() vs pretty much everything from migration).

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of
memory it allocates). It's not until the cache reaches full and it
starts recycling old buffers rather than allocating new ones that
the tlb flush problem dominates the profiles.

 The runtime variation does show that there's some *big* subtle
 difference for the numa balancing in the exact TNF_NO_GROUP details.
 It must be *very* unstable for it to make that big of a difference.
 But I feel at least a *bit* better about unstable algorithm changes a
 small varioation into a factor-of-two vs that crazy factor-of-20.
 
 Can you try Mel's change to make it use
 
  if (!(vma->vm_flags & VM_WRITE))
 
 instead of the pte details? Again, on otherwise plain 3.19, just so
 that we have a baseline. I'd be *so* much happer with checking the vma
 details over per-pte details, especially ones that change over the
 lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. might take an hour or two before I get to it, though...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 Can you try Mel's change to make it use
 
  if (!(vma->vm_flags & VM_WRITE))
 
 instead of the pte details? Again, on otherwise plain 3.19, just so
 that we have a baseline. I'd be *so* much happer with checking the vma
 details over per-pte details, especially ones that change over the
 lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

266,750  migrate:mm_migrate_pages ( +-  7.43% )

  10.002032292 seconds time elapsed ( +-  0.00% )

Bit more variance there than the pte checking, but runtime
difference is in the noise - 5m4s vs 4m54s - and profiles are
identical to the pte checking version.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
 
  Bit more variance there than the pte checking, but runtime
  difference is in the noise - 5m4s vs 4m54s - and profiles are
  identical to the pte checking version.
 
 Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
 as the original !pte_write() test.
 
 Now, can you check that on top of rc4? If I've gotten everything
 right, we now have:
 
  - plain 3.19 (pte_write): 4m54s
  - 3.19 with vm_flags  VM_WRITE: 5m4s
  - 3.19 with pte_dirty: 5m20s

*nod*

 so the pte_dirty version seems to have been a bad choice indeed.
 
 For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
 _much_ worse, but I'm wondering whether that VM_WRITE test will at
 least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
7m36s. IOWs, it's a bit better, but not significantly. Page migrations
are pretty much unchanged, too:

   558,632  migrate:mm_migrate_pages ( +-  6.38% )

 And the VM_WRITE test should be stable and not have any subtle
 interaction with the other changes that the numa pte things
 introduced. It would be good to see if the profiles then pop something
 *else* up as the performance difference (which I'm sure will remain,
 since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU
time is still being spent sending IPIs on the tlb flush path.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-18 Thread Dave Chinner
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
 On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  So why am I wrong? Why is testing for dirty not the same as testing
  for writable?
 
  I can see a few cases:
 
   - your load has lots of writable (but not written-to) shared memory
 
 Hmm. I tried to look at the xfsprog sources, and I don't see any
 MAP_SHARED activity.  It looks like it's just using pread64/pwrite64,
 and the only MAP_SHARED is for the xfsio mmap test thing, not for
 xfsrepair.
 
 So I don't see any shared mappings, but I don't know the code-base.

Right - all the mmap activity in the xfs_repair test is coming from
memory allocation through glibc - we don't use mmap() directly
anywhere in xfs_repair. FWIW, all the IO into these pages that are
allocated is being done via direct IO, if that makes any
difference...

   - something completely different that I am entirely missing
 
 So I think there's something I'm missing. For non-shared mappings, I
 still have the idea that pte_dirty should be the same as pte_write.
 And yet, your testing of 3.19 shows that it's a big difference.
 There's clearly something I'm completely missing.

This level of pte interactions is beyond my level of knowledge, so
I'm afraid at this point I'm not going to be much help other than to
test patches and report the result.

FWIW, here's the distribution of the hash table we are iterating
over. There are a lot of search misses, which means we are doing a
lot of pointer chasing, but the distribution is centred directly
around the goal of 8 entries per chain and there is no long tail:

libxfs_bcache: 0x67e110
Max supported entries = 808584
Max utilized entries = 808584
Active entries = 808583
Hash table size = 101073
Hits = 9789987
Misses = 8224234
Hit ratio = 54.35
MRU 0 entries  =   4667 (  0%)
MRU 1 entries  =      0 (  0%)
MRU 2 entries  =      4 (  0%)
MRU 3 entries  = 797447 ( 98%)
MRU 4 entries  =    653 (  0%)
MRU 5 entries  =      0 (  0%)
MRU 6 entries  =   2755 (  0%)
MRU 7 entries  =   1518 (  0%)
MRU 8 entries  =   1518 (  0%)
MRU 9 entries  =      0 (  0%)
MRU 10 entries =     21 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Hash buckets with  0 entries     30 (  0%)
Hash buckets with  1 entries    241 (  0%)
Hash buckets with  2 entries   1019 (  0%)
Hash buckets with  3 entries   2787 (  1%)
Hash buckets with  4 entries   5838 (  2%)
Hash buckets with  5 entries   9144 (  5%)
Hash buckets with  6 entries  12165 (  9%)
Hash buckets with  7 entries  14194 ( 12%)
Hash buckets with  8 entries  14387 ( 14%)
Hash buckets with  9 entries  12742 ( 14%)
Hash buckets with 10 entries  10253 ( 12%)
Hash buckets with 11 entries   7308 (  9%)
Hash buckets with 12 entries   4872 (  7%)
Hash buckets with 13 entries   2869 (  4%)
Hash buckets with 14 entries   1578 (  2%)
Hash buckets with 15 entries    894 (  1%)
Hash buckets with 16 entries    430 (  0%)
Hash buckets with 17 entries    188 (  0%)
Hash buckets with 18 entries     88 (  0%)
Hash buckets with 19 entries     24 (  0%)
Hash buckets with 20 entries     11 (  0%)
Hash buckets with 21 entries     10 (  0%)
Hash buckets with 22 entries      1 (  0%)


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-17 Thread Dave Chinner
On Tue, Mar 17, 2015 at 02:30:57PM -0700, Linus Torvalds wrote:
 On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner da...@fromorbit.com wrote:
 
  On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
  was using during steady state shows:
 
   471,752  migrate:mm_migrate_pages ( +-  7.38% )
 
  The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
  and 3.19 (~55,000), so that looks like even more of a problem than
  before.
 
 Hmm. How stable are those numbers boot-to-boot?

I've run the test several times but only profiles once so far.
runtimes were 7m45, 7m50, 7m44s, 8m2s, and the profiles came from
the 8m2s run.

reboot, run again:

$ sudo perf stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

   572,839  migrate:mm_migrate_pages( +-  3.15% )

  10.001664694 seconds time elapsed ( +-  0.00% )
$

And just to confirm, a minute later, still in phase 3:

590,974  migrate:mm_migrate_pages   ( +-  2.86% )

Reboot, run again:

575,344  migrate:mm_migrate_pages   ( +-  0.70% )

So there is boot-to-boot variation, but it doesn't look like it
gets any better...

 That kind of extreme spread makes me suspicious. It's also interesting
 that if the numbers really go up even more (and by that big amount),
 then why does there seem to be almost no correlation with performance
 (which apparently went up since rc1, despite migrate_pages getting
 even _worse_).
 
  And the profile looks like:
 
  -   43.73%    0.05%  [kernel]            [k] native_flush_tlb_others
 
 Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
 And flush_tlb_page() does seem to be called about ten times more
 (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
 at 0.13%)
 
 Damn. From a performance number standpoint, it looked like we zoomed
 in on the right thing. But now it's migrating even more pages than
 before. Odd.

Throttling problem, like Mel originally suspected?

  And the vmstats are:
 
  3.19:
 
  numa_hit 5163221
  numa_local 5153127
 
  4.0-rc1:
 
  numa_hit 36952043
  numa_local 36927384
 
  4.0-rc4:
 
  numa_hit 23447345
  numa_local 23438564
 
  Page migrations are still up by a factor of ~20 on 3.19.
 
 The thing is, those numa_hit things come from the zone_statistics()
 call in buffered_rmqueue(), which in turn is simple from the memory
 allocator. That has *nothing* to do with virtual memory, and
 everything to do with actual physical memory allocations.  So the load
 is simply allocating a lot more pages, presumably for those stupid
 migration events.
 
 But then it doesn't correlate with performance anyway..

 Can you do a simple stupid test? Apply that commit 53da3bc2ba9e (mm:
 fix up numa read-only thread grouping logic) to 3.19, so that it uses
 the same pte_dirty() logic as 4.0-rc4. That *should* make the 3.19
 and 4.0-rc4 numbers comparable.

patched 3.19 numbers on this test are slightly worse than stock
3.19, but nowhere near as bad as 4.0-rc4:

241,718  migrate:mm_migrate_pages   ( +-  5.17% )

So that pte_write -> pte_dirty change makes this go from ~55k to 240k,
and runtime go from 4m54s to 5m20s. vmstats:

numa_hit 9162476
numa_miss 0
numa_foreign 0
numa_interleave 10685
numa_local 9153740
numa_other 8736
numa_pte_updates 49582103
numa_huge_pte_updates 0
numa_hint_faults 48075098
numa_hint_faults_local 12974704
numa_pages_migrated 5748256
pgmigrate_success 5748256
pgmigrate_fail 0

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-10 Thread Dave Chinner
On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
 On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner da...@fromorbit.com wrote:
 
  Also, is there some sane way for me to actually see this behavior on a
  regular machine with just a single socket? Dave is apparently running
  in some fake-numa setup, I'm wondering if this is easy enough to
  reproduce that I could see it myself.
 
  Should be - I don't actually use 500TB of storage to generate this -
  50GB on an SSD is all you need from the storage side. I just use a
  sparse backing file to make it look like a 500TB device. :P
 
 What's your virtual environment setup? Kernel config, and
 virtualization environment to actually get that odd fake NUMA thing
 happening?

I don't have the exact .config with me (test machines at home
are shut down because I'm half a world away), but it's pretty much
this (copied and munged from a similar test vm on my laptop):

$ cat run-vm-4.sh
sudo qemu-system-x86_64 \
-machine accel=kvm \
-no-fd-bootchk \
-localtime \
-boot c \
-serial pty \
-nographic \
-alt-grab \
-smp 16 -m 16384 \
-hda /data/vm-2/root.img \
-drive file=/vm/vm-4/vm-4-test.img,if=virtio,cache=none \
-drive file=/vm/vm-4/vm-4-scratch.img,if=virtio,cache=none \
-drive file=/vm/vm-4/vm-4-500TB.img,if=virtio,cache=none \
-kernel /vm/vm-4/vmlinuz \
-append console=ttyS0,115200 root=/dev/sda1,numa=fake=4
$

And on the host I have /vm on a ssd that is an XFS filesystem, and
I've created /vm/vm-4/vm-4-500TB.img by doing:

$ xfs_io -f -c truncate 500t -c extsize 1m /vm/vm-4/vm-4-500TB.img

and in the guest the filesystem is created with:

# mkfs.xfs -f -mcrc=1,finobt=1 /dev/vdc

And that will create a 500TB filesystem that you can then mount and
run fsmark on it, then unmount and run xfs_repair on it.

the .config I have on my laptop is from 3.18-rc something, but it
should work just with a make oldconfig update. It's attached below.

Hopefully this will be sufficient for you, otherwise it'll have to
wait until I get home to get the exact configs for you.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.18.0-rc1 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT=elf64-x86-64
CONFIG_ARCH_DEFCONFIG=arch/x86/configs/x86_64_defconfig
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME=(none)
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_FHANDLE is not set
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-09 Thread Dave Chinner
On Sun, Mar 08, 2015 at 11:35:59AM -0700, Linus Torvalds wrote:
 On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar mi...@kernel.org wrote:
 But:
 
  As a second hack (not to be applied), could we change:
 
   #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
 
  to:
 
   #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
 
  to double check that the position of the bit does not matter?
 
 Agreed. We should definitely try that.
 
 Dave?

As Mel has already mentioned, I'm in Boston for LSFMM and don't have
access to the test rig I've used to generate this.

 Also, is there some sane way for me to actually see this behavior on a
 regular machine with just a single socket? Dave is apparently running
 in some fake-numa setup, I'm wondering if this is easy enough to
 reproduce that I could see it myself.

Should be - I don't actually use 500TB of storage to generate this -
50GB on an SSD is all you need from the storage side. I just use a
sparse backing file to make it look like a 500TB device. :P

i.e. create an XFS filesystem on a 500TB sparse file with mkfs.xfs
-d size=500t,file=1 /path/to/file.img, mount it on loopback or as a
virtio,cache=none device for the guest vm and then use fsmark to
generate several million files spread across many, many directories
such as:

$  fs_mark -D 1 -S0 -n 10 -s 1 -L 32 -d \
/mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \
/mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \
/mnt/scratch/6 -d /mnt/scratch/7

That should only take a few minutes to run - if you throw 8p at it
then it should run at 100k files/s being created.

Then unmount and run xfs_repair -o bhash=101703 /path/to/file.img
on the resultant image file.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 2/2] mm: numa: Do not clear PTEs or PMDs for NUMA hinting faults

2015-03-05 Thread Dave Chinner
On Thu, Mar 05, 2015 at 11:54:52PM +, Mel Gorman wrote:
 Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
 
Across the board the 4.0-rc1 numbers are much slower, and the
degradation is far worse when using the large memory footprint
configs. Perf points straight at the cause - this is from 4.0-rc1
on the -o bhash=101073 config:
 
-   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
  - 99.99% physflat_send_IPI_mask
 - 99.37% native_send_call_func_ipi
  smp_call_function_many
- native_flush_tlb_others
   - 99.85% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
  - handle_mm_fault
 - 99.73% __do_page_fault
  trace_do_page_fault
  do_async_page_fault
+ async_page_fault
   0.63% native_send_call_func_single_ipi
  generic_exec_single
  smp_call_function_single
 
 This was bisected to commit 4d9424669946 (mm: convert p[te|md]_mknonnuma
 and remaining page table manipulations) which clears PTEs and PMDs to make
 them PROT_NONE. This is tidy but tests on some benchmarks indicate that
 there are many more hinting faults trapped resulting in excessive migration.
 This is the result for the old autonuma benchmark for example.

[snip]

Doesn't fix the problem. Runtime is slightly improved (16m45s vs 17m35s)
but it's still much slower that 3.19 (6m5s).

Stats and profiles still roughly the same:

360,228  migrate:mm_migrate_pages ( +-  4.28% )

-   52.69%    52.69%  [kernel]            [k] default_send_IPI_mask_sequence_phys
 default_send_IPI_mask_sequence_phys
   - physflat_send_IPI_mask
  - 97.28% native_send_call_func_ipi
   smp_call_function_many
   native_flush_tlb_others
   flush_tlb_page
   ptep_clear_flush
   try_to_unmap_one
   rmap_walk
   try_to_unmap
   migrate_pages
   migrate_misplaced_page
 - handle_mm_fault
- 99.59% __do_page_fault
 trace_do_page_fault
 do_async_page_fault
   + async_page_fault
  + 2.72% native_send_call_func_single_ipi

numa_hit 36678767
numa_miss 905234
numa_foreign 905234
numa_interleave 14802
numa_local 36656791
numa_other 927210
numa_pte_updates 92168450
numa_huge_pte_updates 0
numa_hint_faults 87573926
numa_hint_faults_local 29730293
numa_pages_migrated 30195890
pgmigrate_success 30195890
pgmigrate_fail 0

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com