Re: [PATCH 11/11] sysctl: treewide: constify the ctl_table argument of handlers
> 	int write, void *buffer, size_t *lenp, loff_t *ppos)
> {
> 	int ret;

And this.

> @@ -474,8 +475,10 @@ int perf_event_max_sample_rate_handler(struct ctl_table *table, int write,
>
>  int sysctl_perf_cpu_time_max_percent __read_mostly = DEFAULT_CPU_TIME_MAX_PERCENT;
>
> -int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
> -		void *buffer, size_t *lenp, loff_t *ppos)
> +int perf_cpu_time_max_percent_handler(const struct ctl_table *table,
> +				      int write,
> +				      void *buffer, size_t *lenp,
> +				      loff_t *ppos)
> {
> 	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

And this.

> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b2fc2727d654..003f0f5cb111 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -239,9 +239,10 @@ static long hung_timeout_jiffies(unsigned long last_checked,
>  /*
>   * Process updating of timeout sysctl
>   */
> -static int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
> -					 void *buffer,
> -					 size_t *lenp, loff_t *ppos)
> +static int proc_dohung_task_timeout_secs(const struct ctl_table *table,
> +					 int write,
> +					 void *buffer,
> +					 size_t *lenp, loff_t *ppos)
> {
> 	int ret;

And this.

> diff --git a/kernel/latencytop.c b/kernel/latencytop.c
> index 781249098cb6..0a5c22b19821 100644
> --- a/kernel/latencytop.c
> +++ b/kernel/latencytop.c
> @@ -65,8 +65,9 @@ static struct latency_record latency_record[MAXLR];
>  int latencytop_enabled;
>
>  #ifdef CONFIG_SYSCTL
> -static int sysctl_latencytop(struct ctl_table *table, int write, void *buffer,
> -			     size_t *lenp, loff_t *ppos)
> +static int sysctl_latencytop(const struct ctl_table *table, int write,
> +			     void *buffer,
> +			     size_t *lenp, loff_t *ppos)
> {
> 	int err;

And this.

I could go on, but there are so many examples of this in the patch that I
think that it needs to be tossed away and regenerated in a way that doesn't
trash the existing function parameter formatting.

-Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [powerpc] kernel BUG fs/xfs/xfs_message.c:102! [4k block]
> > xfs/238 test was executed when the crash was encountered.
> > Devices were formatted with 4k block size.

Yeah, I've seen this once before. I think I know what the problem is from
analysis of that failure, but I've been unable to reproduce it again so
I've not been able to confirm the diagnosis nor test a fix.

tl;dr: we just unlinked an inode whose cluster buffer has been invalidated
by xfs_ifree_cluster(). We go to log the inode, but this is the first time
we've logged the inode since it was last cleaned, so it goes to read the
cluster buffer to attach it. It finds the cluster buffer already marked
stale in the transaction, so the DONE flag is not set and the ASSERT fires.

i.e. it appears to me that this requires inode cluster buffer writeback
between the unlink() operation and the inodegc inactivation process to set
the initial conditions for the problem to trigger, and then have just a
single inode in the inobt chunk that triggers freeing of the chunk whilst
the inode itself is clean.

I need to confirm that this is the case before trying to fix it, because
this inode log item vs stale inode cluster buffer path is tricky and nasty
and there might be something else going on. However, I haven't been able to
reproduce this to be able to confirm this hypothesis yet.

I suspect the fix may well be to use xfs_trans_get_buf() in the
xfs_inode_item_precommit() path if XFS_ISTALE is already set on the inode
we are trying to log. We don't need a populated cluster buffer to read data
out of or write data into in this path - all we need to do is attach the
inode to the buffer so that when the buffer invalidation is committed to
the journal it will also correctly finish the stale inode log item.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: BUG xfs_buf while running tests/xfs/435 (next-20220715)
> 116409] REGS: c0002985be80 TRAP: 0c00  Tainted: G B E (5.19.0-rc6-next-20220715)
> [  111.116414] MSR: 8280f033  CR: 24008282  XER:
> [  111.116430] IRQMASK: 0
> [  111.116430] GPR00: 0081 7e17dff0 7fff8c227300 01003f2f0c18
> [  111.116430] GPR04: 0800 000a 1999
> [  111.116430] GPR08: 7fff8c1b7830
> [  111.116430] GPR12: 7fff8c72ca50 00013adba650 00013adba648
> [  111.116430] GPR16: 0001 00013adba428
> [  111.116430] GPR20: 00013ade0068 7e17f948 01003f2f02a0
> [  111.116430] GPR24: 7e17f948 01003f2f0c18
> [  111.116430] GPR28: 01003f2f0bb0 01003f2f0c18 01003f2f0bb0
> [  111.116488] NIP [7fff8c158b88] 0x7fff8c158b88
> [  111.116492] LR [00013adb0398] 0x13adb0398
> [  111.116496] --- interrupt: c00
> [  111.116504] Object 0x2b93c535 @offset=5376
> [  111.116508] Object 0x9be4058b @offset=16896
> [  111.116511] Object 0xc1d5c895 @offset=24960
> [  111.116515] Object 0x97fb6f84 @offset=30336
> [  111.116518] Object 0x213fb535 @offset=43008
> [  111.116521] Object 0x45473fa3 @offset=43392
> [  111.116525] Object 0x6462ef89 @offset=44160
> [  111.116528] Object 0x0c85ce0b @offset=44544
> [  111.116531] Object 0x59166af4 @offset=45312
> [  111.116535] Object 0xe7b40b45 @offset=46848
> [  111.116538] Object 0xbc6ce716 @offset=54528
> [  111.116541] Object 0x5f7be1fa @offset=64512
> [  111.116546] [ cut here ]

Yup, Darrick reported this once and couldn't reproduce it. We know it's a
result of converting the xfs_buf cache to rcu-protected lockless lookups;
for some reason the rcu callbacks that free these objects seem not to have
been processed before the module is removed. We have an rcu_barrier() in
xfs_destroy_caches() to avoid this...

Wait. What is xfs_buf_terminate()? I don't recall that function.

Yeah, there's the bug.

exit_xfs_fs(void)
{
	xfs_buf_terminate();
	xfs_mru_cache_uninit();
	xfs_destroy_workqueues();
	xfs_destroy_caches();

xfs_buf_terminate() calls kmem_cache_destroy() before the rcu_barrier()
call in xfs_destroy_caches().
Try the (slightly smoke tested only) patch below.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com

xfs: xfs_buf cache destroy isn't RCU safe

From: Dave Chinner

Darrick and Sachin Sant reported that xfs/435 and xfs/436 would report a
non-empty xfs_buf slab on module remove. This isn't easy to reproduce, but
is clearly a side effect of converting the buffer cache to RCU freeing and
lockless lookups. Sachin bisected and Darrick hit it when testing the
patchset directly.

Turns out that the xfs_buf slab is not destroyed when all the other XFS
slab caches are destroyed. Instead, it's got its own little wrapper
function that gets called separately, and so it doesn't have an
rcu_barrier() call in it that is needed to drain all the rcu callbacks
before the slab is destroyed.

Fix it by removing the xfs_buf_init/terminate wrappers that just allocate
and destroy the xfs_buf slab, and move them to the same place that all the
other slab caches are set up and destroyed.

Reported-by: Sachin Sant
Fixes: 298f34224506 ("xfs: lockless buffer lookup")
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_buf.c   | 25 +
 fs/xfs/xfs_buf.h   |  6 ++
 fs/xfs/xfs_super.c | 22 +-
 3 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4affba7c6669..f8bdc4698492 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -21,7 +21,7 @@
 #include "xfs_error.h"
 #include "xfs_ag.h"

-static struct kmem_cache *xfs_buf_cache;
+struct kmem_cache *xfs_buf_cache;

 /*
  * Locking orders
@@ -2300,29 +2300,6 @@ xfs_buf_delwri_pushbuf(
 	return error;
 }

-int __init
-xfs_buf_init(void)
-{
-	xfs_buf_cache = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
-					  SLAB_HWCACHE_ALIGN |
-					  SLAB_RECLAIM_ACCOUNT |
-					  SLAB_MEM_SPREAD,
-					  NULL);
-	if (!xfs_buf_cache)
-		goto out;
-
-	return 0;
-
- out:
-	return -ENOMEM;
-}
-
-void
-xfs_buf_terminate(void)
-{
-	kmem_cache_destroy(xfs_buf_cache);
-}
-
 void
 xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref)
 {
 	/*
diff --git a/fs/xfs/xfs_
Re: [trivial PATCH] treewide: Align function definition open/close braces
On Sun, Dec 17, 2017 at 04:28:44PM -0800, Joe Perches wrote:
> Some functions definitions have either the initial open brace and/or
> the closing brace outside of column 1.
>
> Move those braces to column 1.
>
> This allows various function analyzers like gnu complexity to work
> properly for these modified functions.
>
> Miscellanea:
>
> o Remove extra trailing ; and blank line from xfs_agf_verify
>
> Signed-off-by: Joe Perches <j...@perches.com>
> ---

....

XFS bits look fine.

Acked-by: Dave Chinner <dchin...@redhat.com>
--
Dave Chinner
da...@fromorbit.com
Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993
On Mon, Sep 18, 2017 at 05:00:58PM -0500, Eric Sandeen wrote:
> On 9/18/17 4:31 PM, Dave Chinner wrote:
> > On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
> >> On 09/18/2017 09:27 AM, Christoph Hellwig wrote:
> >>> On Mon, Sep 18, 2017 at 08:26:05PM +0530, Abdul Haleem wrote:
> >>>> Hi,
> >>>>
> >>>> A warning is triggered from:
> >>>>
> >>>> file fs/iomap.c in function iomap_dio_rw
> >>>>
> >>>> 	if (ret)
> >>>> 		goto out_free_dio;
> >>>>
> >>>> 	ret = invalidate_inode_pages2_range(mapping,
> >>>> 			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >>>> 	WARN_ON_ONCE(ret);
> >>>> 	ret = 0;
> >>>>
> >>>> 	inode_dio_begin(inode);
> >>>
> >>> This is expected and an indication of a problematic workload - which
> >>> may be triggered by a fuzzer.
> >>
> >> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
> >> the time running xfstests as well.
> >
> > Because when a user reports a data corruption, the only evidence we
> > have that they are running an app that does something stupid is this
> > warning in their syslogs. Tracepoints are not useful for replacing
> > warnings about data corruption vectors being triggered.
>
> Is the full WARN_ON spew really helpful to us, though? Certainly
> the user has no idea what it means, and will come away terrified
> but none the wiser.
>
> Would a more informative printk_once() still give us the evidence
> without the ZOMG I THINK I OOPSED that a WARN_ON produces? Or do we
> want/need the backtrace?

The backtrace is actually useful - that's how I recently learnt that splice
now supports direct IO.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993
On Mon, Sep 18, 2017 at 09:51:29AM -0600, Jens Axboe wrote:
> On 09/18/2017 09:43 AM, Al Viro wrote:
> > On Mon, Sep 18, 2017 at 05:39:47PM +0200, Christoph Hellwig wrote:
> >> On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
> >>> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
> >>> the time running xfstests as well.
> >>
> >> Dave insisted on it to discourage users/applications from mixing
> >> mmap and direct I/O.
> >>
> >> In many ways a tracepoint might be the better way to diagnose these.
> >
> > sysctl suppressing those two, perhaps?
>
> I'd rather just make it a trace point, but don't care too much.
>
> The code doesn't even have a comment as to why that WARN_ON() is
> there or expected.

The big comment about how bad cache invalidation failures are is on the
second, post-io invocation of the page cache flush. That's the failure that
exposes the data coherency problem to userspace:

	/*
	 * Try again to invalidate clean pages which might have been cached by
	 * non-direct readahead, or faulted in by get_user_pages() if the source
	 * of the write was an mmap'ed region of the file we're writing.  Either
	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
	 * this invalidation fails, tough, the write still worked...
	 */
	if (iov_iter_rw(iter) == WRITE) {
		int err = invalidate_inode_pages2_range(mapping,
				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
		WARN_ON_ONCE(err);
	}

IOWs, the first warning is a "bad things might be about to happen" warning,
the second is "bad things have happened".

> Seems pretty sloppy to me, not a great way
> to "discourage" users to mix mmap/dio.

Again, it has nothing to do with "discouraging users" and everything about
post-bug report problem triage.

Yes, the first invalidation should also have a comment like the post IO
invalidation - the comment probably got dropped and not noticed when the
changeover from internal XFS code to generic iomap code was made...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993
On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
> On 09/18/2017 09:27 AM, Christoph Hellwig wrote:
> > On Mon, Sep 18, 2017 at 08:26:05PM +0530, Abdul Haleem wrote:
> >> Hi,
> >>
> >> A warning is triggered from:
> >>
> >> file fs/iomap.c in function iomap_dio_rw
> >>
> >> 	if (ret)
> >> 		goto out_free_dio;
> >>
> >> 	ret = invalidate_inode_pages2_range(mapping,
> >> 			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >> 	WARN_ON_ONCE(ret);
> >> 	ret = 0;
> >>
> >> 	inode_dio_begin(inode);
> >
> > This is expected and an indication of a problematic workload - which
> > may be triggered by a fuzzer.
>
> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
> the time running xfstests as well.

Because when a user reports a data corruption, the only evidence we have
that they are running an app that does something stupid is this warning in
their syslogs. Tracepoints are not useful for replacing warnings about data
corruption vectors being triggered.

It needs to be on by default, but I'm sure we can wrap it with something
like an xfs_alert_tag() type of construct so the tag can be set in
/proc/fs/xfs/panic_mask to suppress it if testers so desire.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Linux 4.8: Reported regressions as of Sunday, 2016-09-18
On Sun, Sep 18, 2016 at 03:20:53PM +0200, Thorsten Leemhuis wrote:
> Hi! Here is my fourth regression report for Linux 4.8. It lists 14
> regressions I'm aware of. 5 of them are new; 1 mentioned in last week's
> report got fixed.
>
> As always: Are you aware of any other regressions? Then please let me
> know (simply CC regressi...@leemhuis.info). And pls tell me if there
> is anything in the report that shouldn't be there.
>
> Ciao, Thorsten
>
> == Current regressions ==
>
> Desc: genirq: Flags mismatch irq 8, 0088 (mmc0) vs. 0080 (rtc0).
>       mmc0: Failed to request irq 8: -16
> Repo: 2016-08-01 https://bugzilla.kernel.org/show_bug.cgi?id=150881
> Stat: 2016-09-09 https://bugzilla.kernel.org/show_bug.cgi?id=150881#c34
> Note: stalled; root cause somewhere in the main gpio merge for 4.8, but
>       problematic commit still unknown
>
> Desc: [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
> Repo: 2016-08-09 http://www.spinics.net/lists/kernel/msg2317052.html
> Stat: 2016-09-09 https://marc.info/?t=14734151953=1=2
> Note: looks like post-4.8 material at this point: Mel working on it in
>       his spare time, but "The progression of this series has been
>       unsatisfactory."

Actually, what Mel was working on (mapping lock contention) was not related
to the reported XFS regression. The regression was an XFS sub-page write
issue introduced by the new iomap infrastructure, and nobody has been able
to reproduce it exactly outside of the reaim benchmark. We've reproduced
other, similar issues, and the fixes for those are queued for the 4.9
window.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
On Mon, Mar 23, 2015 at 12:24:00PM +0000, Mel Gorman wrote:
> These are three follow-on patches based on the xfsrepair workload Dave
> Chinner reported was problematic in 4.0-rc1 due to changes in page
> table management -- https://lkml.org/lkml/2015/3/1/226.
>
> Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> Return the correct value for change_huge_pmd"). It was known that the
> performance in 3.19 was still better even if is far less safe. This
> series aims to restore the performance without compromising on safety.
>
> Dave, you already tested patch 1 on its own but it would be nice to
> test patches 1+2 and 1+2+3 separately just to be certain.

                  3.19     4.0-rc4  +p1      +p2      +p3
mm_migrate_pages  266,750  572,839  558,632  223,706  201,429
run time          4m54s    7m50s    7m20s    5m07s    4m31s

numa stats from p1+p2:

numa_hit 8436537
numa_miss 0
numa_foreign 0
numa_interleave 30765
numa_local 8409240
numa_other 27297
numa_pte_updates 46109698
numa_huge_pte_updates 0
numa_hint_faults 44756389
numa_hint_faults_local 11841095
numa_pages_migrated 4868674
pgmigrate_success 4868674
pgmigrate_fail 0

numa stats from p1+p2+p3:

numa_hit 6991596
numa_miss 0
numa_foreign 0
numa_interleave 10336
numa_local 6983144
numa_other 8452
numa_pte_updates 24460492
numa_huge_pte_updates 0
numa_hint_faults 23677262
numa_hint_faults_local 5952273
numa_pages_migrated 3557928
pgmigrate_success 3557928
pgmigrate_fail 0

OK, the summary with all patches applied:

config                        3.19   4.0-rc1  4.0-rc4  4.0-rc5+
defaults                      8m08s  9m34s    9m14s    6m57s
-o ag_stride=-1               4m04s  4m38s    4m11s    4m06s
-o bhash=101073               6m04s  17m43s   7m35s    6m13s
-o ag_stride=-1,bhash=101073  4m54s  9m58s    7m50s    4m31s

So it looks like the patch set fixes the remaining regression and in two of
the four cases actually improves performance.

Thanks, Linus and Mel, for tracking this tricky problem down!

Cheers,
Dave.
-- Dave Chinner da...@fromorbit.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> >
> > So I think there's something I'm missing. For non-shared mappings, I
> > still have the idea that pte_dirty should be the same as pte_write.
> > And yet, your testing of 3.19 shows that it's a big difference.
> > There's clearly something I'm completely missing.
>
> Ahh. The normal page table scanning and page fault handling both clear
> and set the dirty bit together with the writable one. But fork() will
> clear the writable bit without clearing dirty. For some reason I
> thought it moved the dirty bit into the struct page like the VM
> scanning does, but that was just me having a brainfart. So yeah,
> pte_dirty doesn't have to match pte_write even under perfectly normal
> circumstances. Maybe there are other cases.
>
> Not that I see a lot of forking in the xfs repair case either, so..
>
> Dave, mind re-running the plain 3.19 numbers to really verify that the
> pte_dirty/pte_write change really made that big of a difference. Maybe
> your recollection of ~55,000 migrate_pages events was faulty. If the
> pte_write -> pte_dirty change is the *only* difference, it's still
> very odd how that one difference would make migrate_rate go from ~55k
> to 471k. That's an order of magnitude difference, for what really
> shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email. That
said, the original measurement might have been faulty. I ran the numbers
again on the 3.19 kernel I saved away from the original testing. That came
up at 235k, which is pretty much the same as yesterday's test. The runtime,
however, is unchanged from my original measurements of 4m54s (pte_hack came
in at 5m20s).

Wondering where the 55k number came from, I played around with when I
started the measurement - all the numbers since I did the bisect have come
from starting it at roughly 130 AGs into phase 3, where the memory
footprint stabilises and the tlb flush overhead kicks in. However, if I
start the measurement at the same time as the repair test, I get something
much closer to the 55k number.

I also note that my original 4.0-rc1 numbers were much lower than the more
recent steady state measurements (360k vs 470k), so I'd say the original
numbers weren't representative of the steady state behaviour and so can be
ignored...

> Maybe a system update has changed libraries and memory allocation
> patterns, and there is something bigger than that one-liner
> pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but otherwise
the system libraries are unchanged.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <da...@fromorbit.com> wrote:
> >
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my
> > original measurements of 4m54s (pte_hack came in at 5m20s).
>
> Ok. Good. So the more than an order of magnitude difference was really
> about measurement differences, not quite as real. Looks like more a
> factor of two than a factor of 20.
>
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the 1.4% from
> tlb_invalidate_range() vs pretty much everything from migration).

No, the profiles all came from steady state. The profiles from the initial
startup phase hammer the mmap_sem because of page fault vs mprotect
contention (glibc runs mprotect() on every chunk of memory it allocates).
It's not until the cache reaches full and it starts recycling old buffers
rather than allocating new ones that the tlb flush problem dominates the
profiles.

> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.
> It must be *very* unstable for it to make that big of a difference.
> But I feel at least a *bit* better about "unstable algorithm changes a
> small variation into a factor-of-two" vs that crazy factor-of-20.
>
> Can you try Mel's change to make it use
>
> 	if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the
> vma details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. Might take an hour or two before I get to it, though...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> Can you try Mel's change to make it use
>
> 	if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the
> vma details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

           266,750      migrate:mm_migrate_pages      ( +- 7.43% )

      10.002032292 seconds time elapsed               ( +- 0.00% )

Bit more variance there than the pte checking, but runtime difference is in
the noise - 5m4s vs 4m54s - and profiles are identical to the pte checking
version.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner <da...@fromorbit.com> wrote:
> >
> > Bit more variance there than the pte checking, but runtime difference
> > is in the noise - 5m4s vs 4m54s - and profiles are identical to the
> > pte checking version.
>
> Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
> as the original !pte_write() test.
>
> Now, can you check that on top of rc4? If I've gotten everything
> right, we now have:
>
>  - plain 3.19 (pte_write): 4m54s
>  - 3.19 with vm_flags & VM_WRITE: 5m4s
>  - 3.19 with pte_dirty: 5m20s

*nod* so the pte_dirty version seems to have been a bad choice indeed.

> For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
> _much_ worse, but I'm wondering whether that VM_WRITE test will at
> least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and 7m36s.
IOWs it's a bit better, but not significantly. Page migrations are pretty
much unchanged, too:

           558,632      migrate:mm_migrate_pages      ( +- 6.38% )

> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU time is
still being spent sending IPIs on the tlb flush path.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> >
> > So why am I wrong? Why is testing for dirty not the same as testing
> > for writable?
> >
> > I can see a few cases:
> >
> >  - your load has lots of writable (but not written-to) shared memory
>
> Hmm. I tried to look at the xfsprogs sources, and I don't see any
> MAP_SHARED activity. It looks like it's just using pread64/pwrite64,
> and the only MAP_SHARED is for the xfs_io mmap test thing, not for
> xfs_repair.
>
> So I don't see any shared mappings, but I don't know the code-base.

Right - all the mmap activity in the xfs_repair test is coming from memory
allocation through glibc - we don't use mmap() directly anywhere in
xfs_repair. FWIW, all the IO into these pages that are allocated is being
done via direct IO, if that makes any difference...

> >  - something completely different that I am entirely missing
> >
> > So I think there's something I'm missing. For non-shared mappings, I
> > still have the idea that pte_dirty should be the same as pte_write.
> > And yet, your testing of 3.19 shows that it's a big difference.
> > There's clearly something I'm completely missing.

This level of pte interactions is beyond my level of knowledge, so I'm
afraid at this point I'm not going to be much help other than to test
patches and report the result.

FWIW, here's the distribution of the hash table we are iterating over.
There are a lot of search misses, which means we are doing a lot of pointer
chasing, but the distribution is centred directly around the goal of 8
entries per chain and there is no long tail:

libxfs_bcache: 0x67e110
Max supported entries = 808584
Max utilized entries = 808584
Active entries = 808583
Hash table size = 101073
Hits = 9789987
Misses = 8224234
Hit ratio = 54.35
MRU  0 entries =   4667 ( 0%)
MRU  1 entries =      0 ( 0%)
MRU  2 entries =      4 ( 0%)
MRU  3 entries = 797447 (98%)
MRU  4 entries =    653 ( 0%)
MRU  5 entries =      0 ( 0%)
MRU  6 entries =   2755 ( 0%)
MRU  7 entries =   1518 ( 0%)
MRU  8 entries =   1518 ( 0%)
MRU  9 entries =      0 ( 0%)
MRU 10 entries =     21 ( 0%)
MRU 11 entries =      0 ( 0%)
MRU 12 entries =      0 ( 0%)
MRU 13 entries =      0 ( 0%)
MRU 14 entries =      0 ( 0%)
MRU 15 entries =      0 ( 0%)
Hash buckets with  0 entries     30 ( 0%)
Hash buckets with  1 entries    241 ( 0%)
Hash buckets with  2 entries   1019 ( 0%)
Hash buckets with  3 entries   2787 ( 1%)
Hash buckets with  4 entries   5838 ( 2%)
Hash buckets with  5 entries   9144 ( 5%)
Hash buckets with  6 entries  12165 ( 9%)
Hash buckets with  7 entries  14194 (12%)
Hash buckets with  8 entries  14387 (14%)
Hash buckets with  9 entries  12742 (14%)
Hash buckets with 10 entries  10253 (12%)
Hash buckets with 11 entries   7308 ( 9%)
Hash buckets with 12 entries   4872 ( 7%)
Hash buckets with 13 entries   2869 ( 4%)
Hash buckets with 14 entries   1578 ( 2%)
Hash buckets with 15 entries    894 ( 1%)
Hash buckets with 16 entries    430 ( 0%)
Hash buckets with 17 entries    188 ( 0%)
Hash buckets with 18 entries     88 ( 0%)
Hash buckets with 19 entries     24 ( 0%)
Hash buckets with 20 entries     11 ( 0%)
Hash buckets with 21 entries     10 ( 0%)
Hash buckets with 22 entries      1 ( 0%)

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 17, 2015 at 02:30:57PM -0700, Linus Torvalds wrote:
> On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner <da...@fromorbit.com> wrote:
> >
> > On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
> > was using during steady state shows:
> >
> >     471,752      migrate:mm_migrate_pages      ( +- 7.38% )
> >
> > The migrate pages rate is even higher than in 4.0-rc1 (~360,000) and
> > 3.19 (~55,000), so that looks like even more of a problem than
> > before.
>
> Hmm. How stable are those numbers boot-to-boot?

I've run the test several times but only profiled once so far. Runtimes
were 7m45s, 7m50s, 7m44s, 8m2s, and the profiles came from the 8m2s run.

Reboot, run again:

$ sudo perf stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

           572,839      migrate:mm_migrate_pages      ( +- 3.15% )

      10.001664694 seconds time elapsed               ( +- 0.00% )
$

And just to confirm, a minute later, still in phase 3:

           590,974      migrate:mm_migrate_pages      ( +- 2.86% )

Reboot, run again:

           575,344      migrate:mm_migrate_pages      ( +- 0.70% )

So there is boot-to-boot variation, but it doesn't look like it gets any
better.

> That kind of extreme spread makes me suspicious. It's also interesting
> that if the numbers really go up even more (and by that big amount),
> then why does there seem to be almost no correlation with performance
> (which apparently went up since rc1, despite migrate_pages getting
> even _worse_).

And the profile looks like:

-   43.73%  0.05%  [kernel]  [k] native_flush_tlb_others

> Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
> And flush_tlb_page() does seem to be called about ten times more
> (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
> at 0.13%)
>
> Damn. From a performance number standpoint, it looked like we zoomed
> in on the right thing. But now it's migrating even more pages than
> before. Odd.

Throttling problem, like Mel originally suspected?

And the vmstats are:

3.19:
numa_hit 5163221
numa_local 5153127

4.0-rc1:
numa_hit 36952043
numa_local 36927384

4.0-rc4:
numa_hit 23447345
numa_local 23438564

Page migrations are still up by a factor of ~20 on 3.19.

> The thing is, those numa_hit things come from the zone_statistics()
> call in buffered_rmqueue(), which in turn is simply from the memory
> allocator. That has *nothing* to do with virtual memory, and
> everything to do with actual physical memory allocations. So the load
> is simply allocating a lot more pages, presumably for those stupid
> migration events. But then it doesn't correlate with performance
> anyway..
>
> Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
> fix up numa read-only thread grouping logic") to 3.19, so that it uses
> the same pte_dirty() logic as 4.0-rc4. That *should* make the 3.19 and
> 4.0-rc4 numbers comparable.

Patched 3.19 numbers on this test are slightly worse than stock 3.19, but
nowhere near as bad as 4.0-rc4:

           241,718      migrate:mm_migrate_pages      ( +- 5.17% )

So that pte_write -> pte_dirty change makes this go from ~55k to 240k, and
runtime go from 4m54s to 5m20s.

vmstats:

numa_hit 9162476
numa_miss 0
numa_foreign 0
numa_interleave 10685
numa_local 9153740
numa_other 8736
numa_pte_updates 49582103
numa_huge_pte_updates 0
numa_hint_faults 48075098
numa_hint_faults_local 12974704
numa_pages_migrated 5748256
pgmigrate_success 5748256
pgmigrate_fail 0

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner <da...@fromorbit.com> wrote:
> > > Also, is there some sane way for me to actually see this behavior
> > > on a regular machine with just a single socket? Dave is apparently
> > > running in some fake-numa setup, I'm wondering if this is easy
> > > enough to reproduce that I could see it myself.
> >
> > Should be - I don't actually use 500TB of storage to generate this -
> > 50GB on an SSD is all you need from the storage side. I just use a
> > sparse backing file to make it look like a 500TB device. :P
>
> What's your virtual environment setup? Kernel config, and
> virtualization environment to actually get that odd fake NUMA thing
> happening?

I don't have the exact .config with me (the test machines at home are
shut down because I'm half a world away), but it's pretty much this
(copied and munged from a similar test vm on my laptop):

$ cat run-vm-4.sh
sudo qemu-system-x86_64 \
        -machine accel=kvm \
        -no-fd-bootchk \
        -localtime \
        -boot c \
        -serial pty \
        -nographic \
        -alt-grab \
        -smp 16 -m 16384 \
        -hda /data/vm-2/root.img \
        -drive file=/vm/vm-4/vm-4-test.img,if=virtio,cache=none \
        -drive file=/vm/vm-4/vm-4-scratch.img,if=virtio,cache=none \
        -drive file=/vm/vm-4/vm-4-500TB.img,if=virtio,cache=none \
        -kernel /vm/vm-4/vmlinuz \
        -append "console=ttyS0,115200 root=/dev/sda1 numa=fake=4"
$

And on the host I have /vm on an SSD that is an XFS filesystem, and
I've created /vm/vm-4/vm-4-500TB.img by doing:

$ xfs_io -f -c "truncate 500t" -c "extsize 1m" /vm/vm-4/vm-4-500TB.img

and in the guest the filesystem is created with:

# mkfs.xfs -f -m crc=1,finobt=1 /dev/vdc

And that will create a 500TB filesystem that you can then mount and
run fsmark on, then unmount and run xfs_repair on.

The .config I have on my laptop is from 3.18-rc something, but it
should work with just a "make oldconfig" update. It's attached below.
Hopefully this will be sufficient for you, otherwise it'll have to
wait until I get home to get the exact configs for you.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.18.0-rc1 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_FHANDLE is not set
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
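[Aside, not part of the original mail: the "500TB filesystem on 50GB of
SSD" trick above relies on sparse files - a huge apparent size with
almost no blocks actually allocated until data is written. A minimal
standalone demonstration, scaled down to 1TB and leaving out the
xfs_io/mkfs.xfs/qemu steps since those need the real tools and root:]

```shell
# Create a sparse backing file: huge logical size, near-zero disk usage.
# (1T here instead of the 500t used in the mail, so it runs anywhere.)
img=$(mktemp)
truncate -s 1T "$img"
apparent=$(stat -c %s "$img")   # logical size in bytes
blocks=$(stat -c %b "$img")     # 512-byte blocks actually allocated
echo "apparent=$apparent blocks=$blocks"
rm -f "$img"
```

The guest then sees a device whose size matches the apparent size, while
the host only ever allocates blocks that fsmark and xfs_repair actually
touch.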
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 08, 2015 at 11:35:59AM -0700, Linus Torvalds wrote:
> On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar <mi...@kernel.org> wrote:
> > But: As a second hack (not to be applied), could we change:
> >
> >     #define _PAGE_BIT_PROTNONE      _PAGE_BIT_GLOBAL
> >
> > to:
> >
> >     #define _PAGE_BIT_PROTNONE      (_PAGE_BIT_GLOBAL+1)
> >
> > to double check that the position of the bit does not matter?
>
> Agreed. We should definitely try that. Dave?

As Mel has already mentioned, I'm in Boston for LSFMM and don't have
access to the test rig I've used to generate this.

> Also, is there some sane way for me to actually see this behavior on
> a regular machine with just a single socket? Dave is apparently
> running in some fake-numa setup, I'm wondering if this is easy enough
> to reproduce that I could see it myself.

Should be - I don't actually use 500TB of storage to generate this -
50GB on an SSD is all you need from the storage side. I just use a
sparse backing file to make it look like a 500TB device. :P

i.e. create an XFS filesystem on a 500TB sparse file with:

# mkfs.xfs -d size=500t,file=1 /path/to/file.img

mount it on loopback or as a virtio,cache=none device for the guest vm
and then use fsmark to generate several million files spread across
many, many directories, such as:

$ fs_mark -D 1 -S0 -n 10 -s 1 -L 32 -d \
        /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \
        /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \
        /mnt/scratch/6 -d /mnt/scratch/7

That should only take a few minutes to run - if you throw 8p at it
then it should run at 100k files/s being created. Then unmount and
run:

# xfs_repair -o bhash=101703 /path/to/file.img

on the resultant image file.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
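[Aside, not part of the original mail: Ingo's quoted #define hack just
moves the software PROT_NONE bit to the adjacent position, on the theory
that code which only goes through mask helpers cannot care which bit is
used. The bit numbers below mirror x86's _PAGE_BIT_GLOBAL (bit 8) purely
for illustration; this is plain userspace arithmetic, not kernel code.]

```shell
# Illustration of the quoted experiment: PROT_NONE is a software PTE
# bit, and mask-based helpers are insensitive to its position.
PAGE_BIT_GLOBAL=8
PAGE_BIT_PROTNONE=$((PAGE_BIT_GLOBAL + 1))    # the "second hack": moved by one
PAGE_PROTNONE=$((1 << PAGE_BIT_PROTNONE))
pte=$((0x1))                                  # present bit only, for illustration
pte=$((pte | PAGE_PROTNONE))                  # analogue of pte_mkprotnone()
printf 'protnone bit = %d, pte = %#x\n' "$PAGE_BIT_PROTNONE" "$pte"
[ $((pte & PAGE_PROTNONE)) -ne 0 ] && echo "pte_protnone: true"
```

If the regression changed with the bit moved, that would point at
something depending on the *numeric* bit position rather than on the
helpers.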
Re: [PATCH 2/2] mm: numa: Do not clear PTEs or PMDs for NUMA hinting faults
On Thu, Mar 05, 2015 at 11:54:52PM +0000, Mel Gorman wrote:
> Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
>
> > Across the board the 4.0-rc1 numbers are much slower, and the
> > degradation is far worse when using the large memory footprint
> > configs. Perf points straight at the cause - this is from 4.0-rc1
> > on the -o bhash=101073 config:
> >
> > -   56.07%    56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> >    - default_send_IPI_mask_sequence_phys
> >       - 99.99% physflat_send_IPI_mask
> >          - 99.37% native_send_call_func_ipi
> >               smp_call_function_many
> >             - native_flush_tlb_others
> >                - 99.85% flush_tlb_page
> >                     ptep_clear_flush
> >                     try_to_unmap_one
> >                     rmap_walk
> >                     try_to_unmap
> >                     migrate_pages
> >                     migrate_misplaced_page
> >                   - handle_mm_fault
> >                      - 99.73% __do_page_fault
> >                           trace_do_page_fault
> >                           do_async_page_fault
> >                         + async_page_fault
> >            0.63% native_send_call_func_single_ipi
> >                  generic_exec_single
> >                  smp_call_function_single
>
> This was bisected to commit 4d9424669946 ("mm: convert
> p[te|md]_mknonnuma and remaining page table manipulations") which
> clears PTEs and PMDs to make them PROT_NONE. This is tidy but tests
> on some benchmarks indicate that there are many more hinting faults
> trapped, resulting in excessive migration. This is the result for the
> old autonuma benchmark for example.

[snip]

Doesn't fix the problem. Runtime is slightly improved (16m45s vs
17m35s) but it's still much slower than 3.19 (6m5s).
Stats and profiles still roughly the same:

           360,228      migrate:mm_migrate_pages              ( +-  4.28% )

-   52.69%    52.69%  [kernel]  [k] default_send_IPI_mask_sequence_phys
     default_send_IPI_mask_sequence_phys
   - physflat_send_IPI_mask
      - 97.28% native_send_call_func_ipi
           smp_call_function_many
           native_flush_tlb_others
           flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 99.59% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               + async_page_fault
      + 2.72% native_send_call_func_single_ipi

    numa_hit 36678767
    numa_miss 905234
    numa_foreign 905234
    numa_interleave 14802
    numa_local 36656791
    numa_other 927210
    numa_pte_updates 92168450
    numa_huge_pte_updates 0
    numa_hint_faults 87573926
    numa_hint_faults_local 29730293
    numa_pages_migrated 30195890
    pgmigrate_success 30195890
    pgmigrate_fail 0

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com